fix: unify is_html heuristic, fix content sniffing, add tests #24
Address all review comments from PR #6:
- Extract shared `_is_html_url()` static method used by both `download_file()` and `download()`, preventing heuristic drift
- Fix content sniffing to `.lstrip()` before checking signatures, handling BOM/whitespace/Wayback-injected scripts before `<html>`
- Add `<!doctype` lowercase to content signature detection
- Add regression tests for bare-host, root, extension, and extensionless URL classification
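As a minimal sketch of the content-sniffing fix described above (not the PR's actual code): the function name `looks_like_html`, the signature tuple, and the 1024-byte sniff window are assumptions made for illustration. Note that Python's argument-less `lstrip()` removes whitespace only, so this sketch strips a UTF-8 BOM explicitly; it does not attempt to skip over Wayback-injected scripts.

```python
import codecs

# Hypothetical signature list; the PR mentions <html> and lowercase <!doctype.
HTML_SIGNATURES = (b"<!doctype", b"<html")


def looks_like_html(content: bytes, window: int = 1024) -> bool:
    """Guess whether a response body is HTML from its first bytes."""
    # Strip leading ASCII whitespace; bytes.lstrip() does not remove a BOM.
    head = content[:window].lstrip()
    # Remove a UTF-8 BOM explicitly, then any whitespace that followed it.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):].lstrip()
    # Lowercase once so "<!DOCTYPE" and "<!doctype" both match.
    return head.lower().startswith(HTML_SIGNATURES)
```

Checking a lowercased prefix keeps the signature list short, at the cost of only recognizing documents whose first non-whitespace bytes are one of the listed markers.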
📝 Walkthrough
This PR refactors HTML-URL detection in the downloader by introducing a centralized …
Changes: HTML Detection Refactoring
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
@coderabbitai review
✅ Actions performed: Review triggered.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main      #24      +/-   ##
    ==========================================
    + Coverage   90.59%   90.69%   +0.09%
    ==========================================
      Files           5        5
      Lines        1425     1440      +15
    ==========================================
    + Hits         1291     1306      +15
      Misses        134      134
Actionable comments posted: 1
🧹 Nitpick comments (1)
wayback_archive/downloader.py (1)
156-162: 💤 Low value. Dead code: `non_html` set is never used.
The logic at lines 157-158 returns `False` for any URL with an extension, so lines 159-162 can never change the result:
- If `ext` is truthy → returns `False` at line 158
- If `ext` is falsy → the path cannot end with `.css`, `.js`, etc., so the `non_html` check always returns `True`

♻️ Suggested simplification

     ext = os.path.splitext(path_lower)[1]
    -if ext:
    -    return False
    -non_html = {'.css', '.js', '.jpg', '.jpeg', '.png', '.gif', '.svg',
    -            '.woff', '.woff2', '.ttf', '.eot', '.otf', '.ico',
    -            '.json', '.xml', '.txt', '.pdf'}
    -return not any(path_lower.endswith(e) for e in non_html)
    +# Any extension means not HTML; no extension means likely HTML
    +return not ext

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wayback_archive/downloader.py` around lines 156 - 162, The current logic declares non_html but never uses it because the early "if ext: return False" makes the later check unreachable; change the logic to use ext against the non_html set instead: compute ext = os.path.splitext(path_lower)[1], then if ext: return ext not in non_html (where non_html = {'.css', '.js', ...}), else return True (no extension => treat as HTML). Update references to path_lower, ext and non_html accordingly and remove the unreachable code path.
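To make the nitpick concrete, here is a standalone sketch of the two candidate behaviors: the simplification from the suggested diff (`not ext`) and the alternative from the AI-agent prompt (`ext not in non_html`). The function names are hypothetical, `path_lower` is assumed to be the already-lowercased URL path, and whatever `downloader.py` does before line 156 is not modeled here.

```python
import os

NON_HTML = {'.css', '.js', '.jpg', '.jpeg', '.png', '.gif', '.svg',
            '.woff', '.woff2', '.ttf', '.eot', '.otf', '.ico',
            '.json', '.xml', '.txt', '.pdf'}


def is_html_simplified(path_lower: str) -> bool:
    # Suggested simplification: any extension means "not HTML".
    return not os.path.splitext(path_lower)[1]


def is_html_by_set(path_lower: str) -> bool:
    # Prompt's alternative: only the listed extensions mean "not HTML".
    ext = os.path.splitext(path_lower)[1]
    return ext not in NON_HTML if ext else True


# Compare the two variants on a few illustrative paths.
for path in ("/about", "/style.css", "/index.php"):
    print(path, is_html_simplified(path), is_html_by_set(path))
```

The two variants agree on `/about` and `/style.css` but differ on an unlisted extension such as `/index.php`, which only the set-based version treats as HTML. Which behavior is wanted is a decision for the PR author.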
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b0f67133-1cc9-4fc0-a660-cff3c98a6dc0
📒 Files selected for processing (5)
- .coderabbit.yaml
- CLAUDE.md
- tests/test_downloader.py
- wayback_archive/__init__.py
- wayback_archive/downloader.py
    filePatterns:
      - "AGENTS.md"
      - "CLAUDE.md"
      - "CLAUDE.md"
Duplicate "CLAUDE.md" entry in filePatterns.
Lines 44 and 45 are identical. The rename of AGENTS.md → CLAUDE.md landed on top of an already-existing CLAUDE.md entry.
🛠️ Proposed fix

     filePatterns:
    -  - "CLAUDE.md"
       - "CLAUDE.md"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.coderabbit.yaml around lines 43 - 45, Remove the duplicate entry in the
filePatterns array so "CLAUDE.md" only appears once; locate the filePatterns
block in .coderabbit.yaml and delete the redundant "CLAUDE.md" line (the
duplicate entries under filePatterns) so the array contains a single "CLAUDE.md"
entry.
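A duplicate like this is easy to catch mechanically. The sketch below is illustrative only: `find_duplicates` and `dedupe` are hypothetical helpers, and the list is typed in by hand rather than parsed from `.coderabbit.yaml`.

```python
from collections import Counter


def find_duplicates(patterns: list[str]) -> list[str]:
    """Return entries that appear more than once, in first-seen order."""
    counts = Counter(patterns)
    return [p for p in dict.fromkeys(patterns) if counts[p] > 1]


def dedupe(patterns: list[str]) -> list[str]:
    """Drop repeated entries while preserving the original order."""
    return list(dict.fromkeys(patterns))


file_patterns = ["CLAUDE.md", "CLAUDE.md"]
print(find_duplicates(file_patterns))
print(dedupe(file_patterns))
```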
Summary
Addresses all review comments from PR #6 (bare-host URL detection):
- `_is_html_url()` method: Extracted shared static method replacing two independent inline heuristics in `download_file()` and `download()` that could drift independently
- Content sniffing: applies `.lstrip()` before signature checks so BOM, leading whitespace, or Wayback-injected scripts before `<html>` don't cause false negatives. Also added `<!doctype` lowercase detection (consistent with `download_file()` line 543)

Test plan
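The four URL classes named in this PR (bare-host, root, extension, extensionless) lend themselves to a table-driven regression test. The following is a sketch, not the contents of `tests/test_downloader.py`: the stand-in `is_html_url` function, the example URLs, and the expected values are assumptions chosen to match the heuristic described in the review (no extension means likely HTML).

```python
import os
import unittest
from urllib.parse import urlparse


def is_html_url(url: str) -> bool:
    """Stand-in heuristic: a path without an extension is treated as HTML."""
    path_lower = urlparse(url).path.lower()
    return not os.path.splitext(path_lower)[1]


class IsHtmlUrlTest(unittest.TestCase):
    # (url, expected) pairs, one per URL class named in the PR description.
    CASES = [
        ("http://example.com", True),             # bare host, empty path
        ("http://example.com/", True),            # root
        ("http://example.com/style.css", False),  # extension
        ("http://example.com/about", True),       # extensionless
    ]

    def test_classification(self):
        for url, expected in self.CASES:
            # subTest reports each failing URL separately.
            with self.subTest(url=url):
                self.assertEqual(is_html_url(url), expected)
```

Saved as a test module, this can be run with `python -m unittest`. Keeping the cases in one table makes it cheap to add a new URL shape whenever the heuristic changes.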