Merged
4 changes: 3 additions & 1 deletion wayback_archive/downloader.py
@@ -2018,8 +2018,10 @@ def download(self):
not is_google_fonts_css and (
content_type == "text/html" or
(not content_type and (
- url.endswith(".html") or
+ url.endswith(".html") or
url.endswith(".htm") or
+ # Bare-host or root URLs (empty path or "/") are HTML.
+ (not parsed.path or parsed.path == "/") or
(parsed.path and not os.path.splitext(parsed.path)[1] and "?" not in url and not any(parsed.path.lower().endswith(ext) for ext in [".css", ".js", ".json", ".xml", ".txt"]))
Comment on lines +2021 to 2025
Copilot AI Apr 14, 2026

The new fallback that treats empty/root paths as HTML is a regression-prone heuristic change, but there’s no unit test covering download()’s is_html classification for bare-host URLs. Please add a test that simulates a bare-host/root URL with content_type unresolved (e.g., content starting with an HTML comment/whitespace so signature sniffing doesn’t set text/html) and asserts that HTML processing/link extraction runs (e.g., _process_html is invoked / new links are queued).
))
)
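
A minimal sketch of the kind of test the review comment asks for, under stated assumptions. `looks_like_html` is a hypothetical standalone helper, not a function in `wayback_archive/downloader.py`; it mirrors only the part of the condition visible in this hunk, so that the bare-host and root-URL cases can be checked without running `download()`. A test against the real code would more likely patch `_process_html` and assert that it is invoked.

```python
import os
from urllib.parse import urlparse

# Extensions the visible condition explicitly excludes from the HTML fallback.
NON_HTML_EXTS = (".css", ".js", ".json", ".xml", ".txt")


def looks_like_html(url, content_type=None, is_google_fonts_css=False):
    """Hypothetical mirror of the is_html condition shown in the hunk."""
    if is_google_fonts_css:
        return False
    if content_type == "text/html":
        return True
    if content_type:
        # A resolved, non-HTML content type disables every URL heuristic.
        return False
    parsed = urlparse(url)
    if url.endswith(".html") or url.endswith(".htm"):
        return True
    # New fallback from this change: bare-host or root URLs are HTML.
    if not parsed.path or parsed.path == "/":
        return True
    # Pre-existing fallback: extensionless path with no query string.
    return (
        not os.path.splitext(parsed.path)[1]
        and "?" not in url
        and not parsed.path.lower().endswith(NON_HTML_EXTS)
    )


def test_bare_host_and_root_urls_are_html():
    # content_type unresolved, as in the scenario the reviewer describes.
    assert looks_like_html("https://example.com")
    assert looks_like_html("https://example.com/")


def test_resolved_non_html_content_type_wins():
    assert not looks_like_html("https://example.com/", content_type="image/png")
    assert not looks_like_html("https://example.com/style.css")
```

Keeping the heuristic in a pure function like this is one way to make it unit-testable; whether the project wants that refactor is a separate design decision.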