fix: index html-only Gmail message bodies #2

Open
RitwijParmar wants to merge 1 commit into zeroentropy-ai:master from RitwijParmar:codex/zemail-html-body-fallback

@RitwijParmar

Summary

  • preserve searchable body text for HTML-only Gmail payloads by falling back from text/plain to cleaned text/html
  • keep text/plain as the preferred source when multipart alternatives include both formats
  • add focused regression tests for HTML-only, multipart alternative, and nested Gmail payloads
  • remove an existing unused urlparse import so the touched file passes Ruff
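
The selection order described above can be sketched roughly as follows. This is not the code from the diff: `decode_part`, `find_body`, and `pick_body` are hypothetical names, and the sketch assumes only the Gmail API message shape, where each part carries a `mimeType`, an optional `body.data` field in URL-safe base64, and optional nested `parts`.

```python
import base64
from typing import Tuple


def decode_part(data: str) -> str:
    """Decode URL-safe base64 body data, tolerating missing padding."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


def find_body(payload: dict, mime_type: str) -> str:
    """Depth-first search for the first non-empty part of the given MIME type."""
    if payload.get("mimeType") == mime_type:
        data = payload.get("body", {}).get("data")
        if data:
            return decode_part(data)
    # Recurse into nested multipart containers (hypothetical traversal order).
    for part in payload.get("parts", []):
        text = find_body(part, mime_type)
        if text:
            return text
    return ""


def pick_body(payload: dict) -> Tuple[str, str]:
    """Return (mime_type, text), preferring text/plain over text/html."""
    for mime_type in ("text/plain", "text/html"):
        text = find_body(payload, mime_type)
        if text:
            return mime_type, text
    return "", ""
```

Returning the MIME type alongside the text lets the caller decide whether the HTML cleanup step is needed before embedding.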

Why

Zemail's README positions the tool as full-inbox semantic search, but many real Gmail messages are HTML-only. Before this change, _decode_body() returned an empty string for those messages, so indexing fell back to snippets rather than the actual email body. That weakens retrieval and reranking on exactly the long-tail emails where semantic search should help most.

The fallback is intentionally dependency-free: it strips script/style content, preserves block boundaries as newlines, unescapes entities, and normalizes whitespace before the body is embedded.
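
A minimal standard-library sketch of that kind of cleanup is shown below. It is an illustration under assumptions, not the implementation in this PR: `html_to_text`, `_TextExtractor`, and the `BLOCK_TAGS` set are hypothetical, and the real fallback may use a different approach or a different tag list.

```python
from html.parser import HTMLParser

# Hypothetical tag sets; the actual PR may treat a different set as block-level.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "ul", "ol", "table", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True unescapes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Drop anything inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html: str) -> str:
    """Strip tags, keep block boundaries as newlines, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace (including non-breaking spaces) per line.
    lines = (" ".join(line.split()) for line in parser.text().split("\n"))
    return "\n".join(line for line in lines if line)
```

Using `html.parser` rather than regular expressions keeps the sketch dependency-free while still handling entities and unclosed tags reasonably; a production version would likely need more care around malformed markup.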

Validation

  • .venv/bin/python -m pytest tests/test_gmail_client.py -q -> 3 passed
  • .venv/bin/python -m ruff check zemail/gmail_client.py tests/test_gmail_client.py
  • .venv/bin/python -m compileall -q zemail tests
  • git diff --check
