Detect bare-host/root URLs as HTML #6
Conversation
For a target like `https://example.com` or `https://example.com/`, the parsed path is `''` or `'/'`, and the existing `is_html` heuristic fell through to `False`. The downloader then saved the page as a binary blob and never extracted `<img>`/`<a>`/`<link>` references, leaving the archive with only the index file and no assets. Treat empty and `'/'` paths as HTML.
Pull request overview
Fixes the downloader’s HTML detection heuristic so bare-host/root URLs (e.g., https://example.com or https://example.com/) are classified as HTML, enabling link extraction and asset downloads for index pages.
Changes:
- Extend `is_html` fallback logic to treat empty and `/` paths as HTML when `content_type` cannot be determined.
- Add an inline comment documenting the bare-host/root URL behavior.
| url.endswith(".html") or | ||
| url.endswith(".htm") or | ||
| # Bare-host or root URLs (empty path or "/") are HTML. | ||
| (not parsed.path or parsed.path == "/") or | ||
| (parsed.path and not os.path.splitext(parsed.path)[1] and "?" not in url and not any(parsed.path.lower().endswith(ext) for ext in [".css", ".js", ".json", ".xml", ".txt"])) |
The new fallback that treats empty/root paths as HTML is a regression-prone heuristic change, but there's no unit test covering `download()`'s `is_html` classification for bare-host URLs. Please add a test that simulates a bare-host/root URL with `content_type` unresolved (e.g., content starting with an HTML comment/whitespace so signature sniffing doesn't set `text/html`) and asserts that HTML processing/link extraction runs (e.g., `_process_html` is invoked / new links are queued).
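A sketch of the kind of fixture this is asking for; the fixture bytes are invented, and the assertion only shows that classification must come from the URL heuristic, not how the project's test harness would invoke `download()`:

```python
from urllib.parse import urlparse

# Bytes that defeat signature sniffing: no leading <html>/<!DOCTYPE>,
# e.g. a Wayback-style injected comment before the real markup.
content = b"<!-- WAYBACK TOOLBAR INSERT -->\n<html><body>index</body></html>"
print(content.startswith((b"<!DOCTYPE", b"<html", b"<HTML")))  # False

# With sniffing defeated, only the URL fallback can classify this page:
url = "https://example.com"
parsed = urlparse(url)
assert not parsed.path or parsed.path == "/"  # the new bare-host/root case
```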
GeiserX left a comment
Thanks for the contribution — the bug is real and the direction is right.
However, the following issues need to be addressed before this can be merged:
1. Add a regression test
The heuristic change is untested. Please add a test that covers bare-host/root URL classification. Example:
```python
def test_bare_host_url_classified_as_html(self):
    from urllib.parse import urlparse
    import os, mimetypes

    for url in ["https://example.com", "https://example.com/"]:
        parsed = urlparse(url)
        content_type, _ = mimetypes.guess_type(parsed.path)
        assert content_type is None
        is_html = (
            url.endswith(".html") or url.endswith(".htm") or
            (not parsed.path or parsed.path == "/") or
            (parsed.path and not os.path.splitext(parsed.path)[1])
        )
        assert is_html, f"{url} (path={parsed.path!r}) should be HTML"
```

2. Fix the content sniffing root cause
The real reason bare-host HTML goes undetected is that the content signature check at line 1981:
```python
if content.startswith(b'<!DOCTYPE') or content.startswith(b'<html') or content.startswith(b'<HTML'):
```

...does not `.lstrip()` before checking. If the response has a BOM, leading whitespace, or Wayback-injected scripts before `<html>`, this silently fails. Meanwhile, `download_file()` at line 542 does strip: `content[:200].strip()`.
Please fix this to be consistent — it makes detection robust for all URLs, not just bare-host:
```python
content_stripped = content.lstrip()
if content_stripped.startswith((b'<!DOCTYPE', b'<!doctype', b'<html', b'<HTML')):
```

Note the sniffing also misses lowercase `<!doctype`, which `download_file()` line 543 already checks.
3. Unify the duplicate heuristic
There are two separate `is_html` heuristics that can drift independently:

- `download_file()` lines 522-528 (`is_html_page`) — already handles `not path_lower`
- `download()` lines 2017-2028 (`is_html`) — the one this PR fixes

Please extract a shared `_is_html_url()` method and use it in both places.
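A minimal sketch of what that shared helper could look like, combining the clauses quoted in the diff above (the name `_is_html_url` follows this suggestion; it is written as a free function here so the sketch runs standalone, where the real code would make it a `@staticmethod` on the downloader class):

```python
import os
from urllib.parse import urlparse

# Asset extensions taken from the diff above.
ASSET_EXTS = (".css", ".js", ".json", ".xml", ".txt")

def _is_html_url(url: str) -> bool:
    """Heuristic: does this URL likely name an HTML page?"""
    parsed = urlparse(url)
    path = parsed.path
    # Explicit .html/.htm extensions are HTML.
    if url.endswith((".html", ".htm")):
        return True
    # Bare-host or root URLs (empty path or "/") are HTML.
    if not path or path == "/":
        return True
    # Extensionless, query-free paths that don't end in a known
    # asset extension default to HTML.
    return (
        not os.path.splitext(path)[1]
        and "?" not in url
        and not path.lower().endswith(ASSET_EXTS)
    )

assert _is_html_url("https://example.com")            # bare host
assert _is_html_url("https://example.com/")           # root
assert _is_html_url("https://example.com/docs/page")  # extensionless path
assert not _is_html_url("https://example.com/a.css")  # asset
```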
4. Note on partial redundancy
parsed.path == "/" is already covered by the existing fallback on line 2025 (parsed.path and not os.path.splitext(parsed.path)[1] and ...). Only not parsed.path (empty string, i.e. https://example.com without trailing slash) is genuinely new. The == "/" clause is harmless but redundant — feel free to keep it for clarity or drop it.
Address all review comments from PR #6:

- Extract shared `_is_html_url()` static method used by both `download_file()` and `download()`, preventing heuristic drift
- Fix content sniffing to `.lstrip()` before checking signatures, handling BOM/whitespace/Wayback-injected scripts before `<html>`
- Add `<!doctype` lowercase to content signature detection
- Add regression tests for bare-host, root, extension, and extensionless URL classification
Summary
Targets like `https://example.com` or `https://example.com/` weren't being classified as HTML by the downloader's `is_html` heuristic, because `urlparse().path` is either empty or `/` and every branch of the fallback required a non-empty path with no extension. The page was saved as a binary blob and link extraction never ran, so the resulting archive contained only the index HTML and zero assets.
This PR adds a case for empty and root paths.
Repro
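The exact downloader invocation is environment-specific, but the classification fallthrough itself reproduces in a few lines of plain Python:

```python
from urllib.parse import urlparse
import mimetypes, os

url = "https://example.com"
parsed = urlparse(url)

print(repr(parsed.path))                  # '' (empty path for a bare host)
print(mimetypes.guess_type(parsed.path))  # (None, None): no content-type hint

# Every branch of the old fallback required a non-empty path,
# so is_html fell through to False:
print(bool(parsed.path and not os.path.splitext(parsed.path)[1]))  # False
```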
Test plan
🤖 Generated with Claude Code