Detect bare-host/root URLs as HTML #6
Conversation
For a target like `https://example.com` or `https://example.com/`, the parsed path is `''` or `'/'`, and the existing `is_html` heuristic fell through to `False`. The downloader then saved the page as a binary blob and never extracted `<img>`/`<a>`/`<link>` references, leaving the archive with only the index file and no assets. Treat empty and `'/'` paths as HTML.
Pull request overview
Fixes the downloader’s HTML detection heuristic so bare-host/root URLs (e.g., https://example.com or https://example.com/) are classified as HTML, enabling link extraction and asset downloads for index pages.
Changes:
- Extend `is_html` fallback logic to treat empty and `/` paths as HTML when `content_type` cannot be determined.
- Add an inline comment documenting the bare-host/root URL behavior.
| url.endswith(".html") or | ||
| url.endswith(".htm") or | ||
| # Bare-host or root URLs (empty path or "/") are HTML. | ||
| (not parsed.path or parsed.path == "/") or | ||
| (parsed.path and not os.path.splitext(parsed.path)[1] and "?" not in url and not any(parsed.path.lower().endswith(ext) for ext in [".css", ".js", ".json", ".xml", ".txt"])) |
The new fallback that treats empty/root paths as HTML is a regression-prone heuristic change, but there's no unit test covering `download()`'s `is_html` classification for bare-host URLs. Please add a test that simulates a bare-host/root URL with `content_type` unresolved (e.g., content starting with an HTML comment/whitespace so signature sniffing doesn't set `text/html`) and asserts that HTML processing/link extraction runs (e.g., `_process_html` is invoked / new links are queued).
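A sketch of the kind of fixture this is asking for; the fixture bytes are invented, and the assertion only shows that classification must come from the URL heuristic, not how the project's test harness would invoke `download()`:

```python
from urllib.parse import urlparse

# Bytes that defeat signature sniffing: no leading <html>/<!DOCTYPE>,
# e.g. a Wayback-style injected comment before the real markup.
content = b"<!-- WAYBACK TOOLBAR INSERT -->\n<html><body>index</body></html>"
print(content.startswith((b"<!DOCTYPE", b"<html", b"<HTML")))  # False

# With sniffing defeated, only the URL fallback can classify this page:
url = "https://example.com"
parsed = urlparse(url)
assert not parsed.path or parsed.path == "/"  # the new bare-host/root case
```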
GeiserX left a comment
Thanks for the contribution — the bug is real and the direction is right.
However, the following issues need to be addressed before this can be merged:
1. Add a regression test
The heuristic change is untested. Please add a test that covers bare-host/root URL classification. Example:
```python
def test_bare_host_url_classified_as_html(self):
    from urllib.parse import urlparse
    import os, mimetypes

    for url in ["https://example.com", "https://example.com/"]:
        parsed = urlparse(url)
        content_type, _ = mimetypes.guess_type(parsed.path)
        assert content_type is None
        is_html = (
            url.endswith(".html") or url.endswith(".htm") or
            (not parsed.path or parsed.path == "/") or
            (parsed.path and not os.path.splitext(parsed.path)[1])
        )
        assert is_html, f"{url} (path={parsed.path!r}) should be HTML"
```

2. Fix the content sniffing root cause
The real reason bare-host HTML goes undetected is that the content signature check at line 1981:
```python
if content.startswith(b'<!DOCTYPE') or content.startswith(b'<html') or content.startswith(b'<HTML'):
```

...does not `.lstrip()` before checking. If the response has a BOM, leading whitespace, or Wayback-injected scripts before `<html>`, this silently fails. Meanwhile, `download_file()` at line 542 does strip: `content[:200].strip()`.
Please fix this to be consistent — it makes detection robust for all URLs, not just bare-host:
```python
content_stripped = content.lstrip()
if content_stripped.startswith((b'<!DOCTYPE', b'<!doctype', b'<html', b'<HTML')):
```

Note the sniffing also misses lowercase `<!doctype`, which `download_file()` line 543 already checks.
3. Unify the duplicate heuristic
There are two separate `is_html` heuristics that can drift independently:

- `download_file()` lines 522-528 (`is_html_page`) — already handles `not path_lower`
- `download()` lines 2017-2028 (`is_html`) — the one this PR fixes

Please extract a shared `_is_html_url()` method and use it in both places.
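A minimal sketch of what that shared helper could look like, combining the clauses quoted in the diff above (the name `_is_html_url` follows this suggestion; it is written as a free function here so the sketch runs standalone, where the real code would make it a `@staticmethod` on the downloader class):

```python
import os
from urllib.parse import urlparse

# Asset extensions taken from the diff above.
ASSET_EXTS = (".css", ".js", ".json", ".xml", ".txt")

def _is_html_url(url: str) -> bool:
    """Heuristic: does this URL likely name an HTML page?"""
    parsed = urlparse(url)
    path = parsed.path
    # Explicit .html/.htm extensions are HTML.
    if url.endswith((".html", ".htm")):
        return True
    # Bare-host or root URLs (empty path or "/") are HTML.
    if not path or path == "/":
        return True
    # Extensionless, query-free paths that don't end in a known
    # asset extension default to HTML.
    return (
        not os.path.splitext(path)[1]
        and "?" not in url
        and not path.lower().endswith(ASSET_EXTS)
    )

assert _is_html_url("https://example.com")            # bare host
assert _is_html_url("https://example.com/")           # root
assert _is_html_url("https://example.com/docs/page")  # extensionless path
assert not _is_html_url("https://example.com/a.css")  # asset
```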
4. Note on partial redundancy
parsed.path == "/" is already covered by the existing fallback on line 2025 (parsed.path and not os.path.splitext(parsed.path)[1] and ...). Only not parsed.path (empty string, i.e. https://example.com without trailing slash) is genuinely new. The == "/" clause is harmless but redundant — feel free to keep it for clarity or drop it.
Address all review comments from PR #6:

- Extract shared `_is_html_url()` static method used by both `download_file()` and `download()`, preventing heuristic drift
- Fix content sniffing to `.lstrip()` before checking signatures, handling BOM/whitespace/Wayback-injected scripts before `<html>`
- Add `<!doctype` lowercase to content signature detection
- Add regression tests for bare-host, root, extension, and extensionless URL classification
Summary
Targets like `https://example.com` or `https://example.com/` weren't being classified as HTML by the downloader's `is_html` heuristic, because `urlparse().path` is either empty or `/` and every branch of the fallback required a non-empty path with no extension. The page was saved as a binary blob and link extraction never ran, so the resulting archive contained only the index HTML and zero assets.
This PR adds a case for empty and root paths.
Repro
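The exact downloader invocation is environment-specific, but the classification fallthrough itself reproduces in a few lines of plain Python:

```python
from urllib.parse import urlparse
import mimetypes, os

url = "https://example.com"
parsed = urlparse(url)

print(repr(parsed.path))                  # '' (empty path for a bare host)
print(mimetypes.guess_type(parsed.path))  # (None, None): no content-type hint

# Every branch of the old fallback required a non-empty path,
# so is_html fell through to False:
print(bool(parsed.path and not os.path.splitext(parsed.path)[1]))  # False
```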
Test plan
🤖 Generated with Claude Code