
Detect bare-host/root URLs as HTML #6

Merged

GeiserX merged 1 commit into GeiserX:main from pacnpal:fix/empty-path-html-detection on May 4, 2026

Conversation


@pacnpal pacnpal commented Apr 14, 2026

Summary

Targets like `https://example.com` or `https://example.com/` weren't being classified as HTML by the downloader's `is_html` heuristic: `urlparse().path` is either empty or `/`, and every branch of the fallback required a non-empty path with no extension. The page was saved as a binary blob and link extraction never ran, so the resulting archive contained only the index HTML and zero assets.

This PR adds a case for empty and root paths.
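
For reference, here is how urlparse sees the affected URL shapes (a standalone illustration, not project code):

from urllib.parse import urlparse

# Why bare hosts slipped through: every branch of the old fallback
# required a non-empty parsed.path.
for url in ("https://example.com",
            "https://example.com/",
            "https://example.com/foo/bar.html"):
    print(f"{url} -> path={urlparse(url).path!r}")

# https://example.com              -> path=''
# https://example.com/             -> path='/'
# https://example.com/foo/bar.html -> path='/foo/bar.html'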

Repro

  1. Set `WAYBACK_URL=https://web.archive.org/web/19970605013037/https://www.compaq.com` (bare host, no path).
  2. Run the downloader. The log shows `✓ Downloaded` for `https://www.compaq.com` and then `Download Complete!` with `Total files processed: 1` — no `Processing HTML and extracting links...` line.
  3. With this fix, link extraction runs and all `<img>`/`<a>`/`<link>` references get queued.

Test plan

  • Archive a bare-host URL — assets now download.
  • Archive a URL with a deep path (`/foo/bar/baz.html`) — still works as before.
  • Archive a URL ending in `/path/` — still works as before.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved detection of HTML pages when content type information is unavailable. Root-level and empty-path URLs are now correctly classified as HTML pages, preventing misidentification as assets.

For a target like https://example.com or https://example.com/ the parsed
path is '' or '/', and the existing is_html heuristic fell through to
False. The downloader then saved the page as a binary blob and never
extracted <img>/<a>/<link> references, leaving the archive with only
the index file and no assets. Treat empty and '/' paths as HTML.
Copilot AI review requested due to automatic review settings April 14, 2026 13:32

coderabbitai Bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 201abe59-6d40-4b9c-ad32-79cbe0252dfd

📥 Commits

Reviewing files that changed from the base of the PR and between 2c8b3be and 5e3756e.

📒 Files selected for processing (1)
  • wayback_archive/downloader.py

📝 Walkthrough

Modified HTML-vs-asset detection in WaybackDownloader.download() to explicitly treat URLs with empty or root paths as HTML when content_type is missing, refining the classification logic for edge cases.

Changes

Cohort / File(s): Content Type Detection (wayback_archive/downloader.py)
Summary: Updated HTML classification logic to explicitly handle URLs with empty or root paths (/) as HTML when the content_type header is absent, tightening the detection boundary for edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main change: adding explicit detection of bare-host and root URLs as HTML in the downloader.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Comment @coderabbitai help to get the list of available commands and usage tips.


Copilot AI left a comment


Pull request overview

Fixes the downloader’s HTML detection heuristic so bare-host/root URLs (e.g., https://example.com or https://example.com/) are classified as HTML, enabling link extraction and asset downloads for index pages.

Changes:

  • Extend is_html fallback logic to treat empty and / paths as HTML when content_type cannot be determined.
  • Add an inline comment documenting the bare-host/root URL behavior.


Comment on lines +2021 to 2025
url.endswith(".html") or
url.endswith(".htm") or
# Bare-host or root URLs (empty path or "/") are HTML.
(not parsed.path or parsed.path == "/") or
(parsed.path and not os.path.splitext(parsed.path)[1] and "?" not in url and not any(parsed.path.lower().endswith(ext) for ext in [".css", ".js", ".json", ".xml", ".txt"]))

Copilot AI Apr 14, 2026


The new fallback that treats empty/root paths as HTML is a regression-prone heuristic change, but there’s no unit test covering download()’s is_html classification for bare-host URLs. Please add a test that simulates a bare-host/root URL with content_type unresolved (e.g., content starting with an HTML comment/whitespace so signature sniffing doesn’t set text/html) and asserts that HTML processing/link extraction runs (e.g., _process_html is invoked / new links are queued).

GeiserX

This comment was marked as outdated.

@GeiserX GeiserX (Owner) left a comment


Thanks for the contribution — the bug is real and the direction is right.

However, the following issues need to be addressed before this can be merged:

1. Add a regression test

The heuristic change is untested. Please add a test that covers bare-host/root URL classification. Example:

def test_bare_host_url_classified_as_html(self):
    from urllib.parse import urlparse
    import os, mimetypes

    for url in ["https://example.com", "https://example.com/"]:
        parsed = urlparse(url)
        content_type, _ = mimetypes.guess_type(parsed.path)
        assert content_type is None
        is_html = (
            url.endswith(".html") or url.endswith(".htm") or
            (not parsed.path or parsed.path == "/") or
            (parsed.path and not os.path.splitext(parsed.path)[1])
        )
        assert is_html, f"{url} (path={parsed.path!r}) should be HTML"

2. Fix the content sniffing root cause

The real reason bare-host HTML goes undetected is that the content signature check at line 1981:

if content.startswith(b'<!DOCTYPE') or content.startswith(b'<html') or content.startswith(b'<HTML'):

...does not .lstrip() before checking. If the response has a BOM, leading whitespace, or Wayback-injected scripts before <html>, this silently fails. Meanwhile, download_file() at line 542 does strip: content[:200].strip().

Please fix this to be consistent — it makes detection robust for all URLs, not just bare-host:

content_stripped = content.lstrip()
if content_stripped.startswith((b'<!DOCTYPE', b'<!doctype', b'<html', b'<HTML')):

Note the sniffing also misses <!doctype lowercase, which download_file() line 543 already checks.
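
As a sketch, the consistent check might look like the following (the helper name is hypothetical, and the explicit BOM strip goes beyond the snippet above, since bytes.lstrip() removes only ASCII whitespace, not a BOM):

import codecs

HTML_SIGNATURES = (b'<!DOCTYPE', b'<!doctype', b'<html', b'<HTML')

def looks_like_html(content: bytes) -> bool:
    # Drop a UTF-8 BOM if present; bytes.lstrip() alone will not remove it.
    if content.startswith(codecs.BOM_UTF8):
        content = content[len(codecs.BOM_UTF8):]
    return content.lstrip().startswith(HTML_SIGNATURES)

# Cases the unstripped check at line 1981 would miss:
assert looks_like_html(b'\n  <!DOCTYPE html><html>')
assert looks_like_html(codecs.BOM_UTF8 + b'<html lang="en">')
assert looks_like_html(b'<!doctype html>')
assert not looks_like_html(b'\x89PNG\r\n\x1a\n')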

3. Unify the duplicate heuristic

There are two separate is_html heuristics that can drift independently:

  • download_file() line 522-528 (is_html_page) — already handles not path_lower
  • download() line 2017-2028 (is_html) — the one this PR fixes

Please extract a shared _is_html_url() method and use it in both places.
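
For illustration, the shared helper might look like this (a sketch assembled from the two snippets quoted in this review; everything beyond the _is_html_url name and the quoted extension list is an assumption, not the merged code):

import os
from urllib.parse import urlparse

# Extension list copied from the download() fallback quoted earlier.
ASSET_EXTENSIONS = (".css", ".js", ".json", ".xml", ".txt")

def _is_html_url(url: str) -> bool:
    # Shown as a plain function for brevity; the suggestion is a static
    # method on WaybackDownloader used by download_file() and download().
    if url.endswith((".html", ".htm")):
        return True
    parsed = urlparse(url)
    if not parsed.path or parsed.path == "/":  # bare host or root URL
        return True
    # Extensionless path, no query string, and not a known asset type.
    return (not os.path.splitext(parsed.path)[1]
            and "?" not in url
            and not parsed.path.lower().endswith(ASSET_EXTENSIONS))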

4. Note on partial redundancy

parsed.path == "/" is already covered by the existing fallback on line 2025 (parsed.path and not os.path.splitext(parsed.path)[1] and ...). Only not parsed.path (empty string, i.e. https://example.com without trailing slash) is genuinely new. The == "/" clause is harmless but redundant — feel free to keep it for clarity or drop it.
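
A quick check of that claim:

import os
assert os.path.splitext("/") == ("/", "")  # "/" has no extension, so the old fallback already matched it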

@GeiserX GeiserX merged commit e07676c into GeiserX:main May 4, 2026
9 of 11 checks passed
GeiserX added a commit that referenced this pull request May 4, 2026
Address all review comments from PR #6:

- Extract shared _is_html_url() static method used by both
  download_file() and download(), preventing heuristic drift
- Fix content sniffing to .lstrip() before checking signatures,
  handling BOM/whitespace/Wayback-injected scripts before <html>
- Add <!doctype lowercase to content signature detection
- Add regression tests for bare-host, root, extension, and
  extensionless URL classification

GeiserX commented May 4, 2026

Thanks for the contribution @pacnpal! Your fix for bare-host URLs has been incorporated and expanded in #24, which unified the HTML detection heuristic into a single shared method. Released in v1.4.1.
