Fix downloading a linked css resource with no extension by jaltekruse · Pull Request #636 · learningequality/ricecooker

jaltekruse · 2025-10-24T21:00:50Z

Summary

While trying to download a textbook written with PreteXt, I ran into an issue with some google icons that they use not rendering properly. I found two bugs, one was incorrect filtering of css resources in link tags. The second bug involved the ArchiveDownloader incorrectly placing a generated filename of index.css at the root of the download, when hitting a linked resource with no extension, instead of nesting it into sub-directories relevant for the domain and other url paths of the resource. This caused further resources requested below the CSS file to incorrectly reference outside of the download directory entirely.

I saw this in the run of the ArchiveDownloader, which meant that this URL was not being downloaded/rewritten:
Skipping node with src https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0

This is the relevant line from the source site:
<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0">

I found a bug in css_node_filter, it was excluding this tag even though it had a "rel" attribute. While the attributes can be accessed from the root Tag object like node["rel"], the in operator doesn't work as they are in a sub-property that gets re-exposed by the class.

Observed results of download with this first change to fix the css resource filtering

downloads/active_calc_2e_again_823433/
├── activecalculus.org
│   └── single2e
│       ├── external
│       ├── frontmatter.html
├── fonts.googleapis.com
├── fonts.gstatic.com
│   └── s
│       └── materialsymbolsoutlined
│           └── v290
│               └── kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOel.woff
├── index.css

The index.css file that was placed at the root had these contents, meaning it was making a relative path reference outside of the download directory:

@font-face {
  font-family: 'Material Symbols Outlined';
  font-style: normal;
  font-weight: 400;
  src: url("../../fonts.gstatic.com/s/materialsymbolsoutlined/v290/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOel.woff") format('woff');
}

It seemed like the correct fix was to get this index.css to be placed down into the file-system hierarchy based on the resource URL, rather than change how this relative path was being generated. The fix was simple once I found the relevant line of code to change. Currently the diff is a little bigger than needed as I assumed I might want to do some escaping of the URL to turn it into a file path, but thinking a bit longer on it I thought it might be reasonably likely anything that can be in an HTTP URL might be fine to be in a unix path (assuming this tool is only going to be run in unix environments), and that all slashes are used to create sub-directories. I can simplify it down the the one line version of the fix and remove the todo if you agree.

References

Page I was downloading:
https://activecalculus.org/single2e/frontmatter.html

Reviewer guidance

I'm interested in adding a automated test, I don't necessarily want to hot link to the source site in a test, but as this is just for the scraping library I assume these tests aren't run that often. Considering if it might be useful to add some kind of test tool that would enable just checking in a directory of test web resources that could be "downloaded" by exposing them through a little local webserver or something. Happy to hear any thoughts or guidance on testing.

jaltekruse · 2025-10-31T16:28:41Z

@rtibbles I added a test, I did have a thought that this test could be a little more complete, as I took a folder of the downloaded result and served it from a local webserver. That just means that I am already including relative links in the test "source site", so I could be a little more thorough and start up two little servers on different ports that would exercise some of the logic around downloading from different servers/domains.

I'm very likely going to be hanging around in this code more, so I'm likely to write more tests and can do that as a follow up. I'm already working on fixing the code for capturing a site with executing javascript on a headless browser. The default PhantomJS browser that it is invoking when I run it is no longer supported, but swapping in headless chrome is easy, just currently hunting down further resulting things that are coming up when exercising that code.

jaltekruse · 2025-11-07T18:11:03Z

@rtibbles I fixed a few issues with the test, this should now be good to go. I didn't do the thing I mentioned in my last comment about serving the html and CSS/font from different servers/domains, but I don't think that is really necessary.

…ension issue

rtibbles · 2025-11-26T16:16:49Z

Hi @jaltekruse - sorry for the delay in review here, I've taken a quick read through, and I'm not seeing anything alarming, and the test coverage looks it provides a very realistic test! One possibility would have been to mock the requests responses instead of running a full server.

If you can run pre-commit run --all-files on the PR that should fix the linting issues. And lastly, it looks like the paths you've used are not working well on Windows, I can't quite tell from the error traceback if it's the paths of the files themselves, or how they're being composed in the test, but I suspect the former.

Also note that I rebased this onto develop as the tests on develop were broken previously! Feel free to do your own rebase if you like :)

Introduces ricecooker/utils/url_utils.py with pure-function utilities for detecting and rewriting external URL references in HTML, CSS, and H5P JSON content. These will be used by the archive processor (Phase 3) to download external resources and bundle them into archives for offline use. Functions: is_external_url, derive_local_filename, extract_urls_from_html, extract_urls_from_css, extract_urls_from_h5p_json, rewrite_urls_in_html, rewrite_urls_in_css, rewrite_urls_in_h5p_json. Includes bug fixes from PRs learningequality#636 (css_node_filter attr check) and learningequality#639 (CSS @import bare string regex) incorporated into the new utilities. 64 new tests in tests/test_url_utils.py — all pure-function, no I/O. Ref: learningequality#233, supersedes learningequality#303 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…egexes - Import _CSS_IMPORT_RE from url_utils instead of defining locally, consolidating the @import regex from PR learningequality#639 into the shared module - Use _CSS_URL_RE.search() for background-image extraction instead of a separate inline regex - Clean up srcset parsing to match url_utils conventions - Fix linting issues in test_downloader.py from PR learningequality#636 merge - Make parse_srcset public (renamed from _parse_srcset) - Relax _CSS_IMPORT_RE to allow optional whitespace (\s*) for compat Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jaltekruse · 2026-02-19T00:26:55Z

Edited - I'm pretty sure I'm being dumb, I shouldn't check in files to source control with names that are incompatible with some filesystems. I'm going to just change this to have one version of the test that should run on windows and unix.

Further Edit: I have updated the test and squashed the bad file paths out of the git history.

Also a note on the thing about the follow up issue I had mentioned below, I'm actually not confident that the underlying bug I assumed was there is present, I'll investigate further and if needed file a follow up issue.

Original comment:
Hey @rtibbles , it's my turn to apologize for the slow follow up. I believe I have fixed the lint issues and tests, it looks like the CI stuff may need manual approval, but the tests and lint checks are running okay in my Ubuntu environment. I "fixed" the test by splitting it in 2, one version acts like the existing one with the URL that has a colon, which Windows didn't like, and I put a conditional skip in Windows environments. I also made a second version of the test that runs in all environments with a manual removal of the colon symbol.

I'll file a follow up issue about the need to sanitize the URLs if this code is actually expected to be run in a native windows environment at some point. But seeing as how there are a number of places in the code that deal with these URLs, and check if a file is already present to prevent re-downloading, seems like I want to do a little more thorough thinking about the best place to enforce that constraint.

- for posterity this test was created based on a real-world case where a URL had a colon in it, which cannot appear in a file path on windows - this commit is a squashed version of interating on the test, eventually realizing that no files that will cause filesystem issues on any platform should get checked into version control, and I wanted to squash the bad path out of the git history

rtibbles self-assigned this Oct 28, 2025

jaltekruse added 2 commits November 25, 2025 11:25

Fix CSS resource filtering for extensionless URLs

3810d32

One possible fix for the downloading a linked resource without an ext…

73e5877

…ension issue

rtibbles force-pushed the fix-pretext-downloads branch from a56febf to 9daa7e4 Compare November 25, 2025 19:25

rtibbles mentioned this pull request Feb 13, 2026

Handle external URL references in archives to enable offline use in Kolibri #233

Open

rtibblesbot mentioned this pull request Feb 13, 2026

Handle external URL references in archives for offline Kolibri use #655

Closed

6 tasks

jaltekruse force-pushed the fix-pretext-downloads branch from c6b9a91 to 6536751 Compare February 20, 2026 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Fix downloading a linked css resource with no extension#636

Fix downloading a linked css resource with no extension#636
jaltekruse wants to merge 3 commits intolearningequality:mainfrom
jaltekruse:fix-pretext-downloads

jaltekruse commented Oct 24, 2025 •

edited

Loading

Uh oh!

jaltekruse commented Oct 31, 2025

Uh oh!

jaltekruse commented Nov 7, 2025

Uh oh!

rtibbles commented Nov 26, 2025

Uh oh!

jaltekruse commented Feb 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

jaltekruse commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

References

Reviewer guidance

Uh oh!

jaltekruse commented Oct 31, 2025

Uh oh!

jaltekruse commented Nov 7, 2025

Uh oh!

rtibbles commented Nov 26, 2025

Uh oh!

jaltekruse commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jaltekruse commented Oct 24, 2025 •

edited

Loading

jaltekruse commented Feb 19, 2026 •

edited

Loading