Conversation
Some EPUBs store UTF-8 filename bytes without setting the ZIP UTF-8 flag (bit 11), causing Python's zipfile to decode entries as CP437 mojibake and fail cover lookup. Recover such names by re-encoding CP437→UTF-8 in a new _resolve_archive_path helper. Also normalize cover paths with posixpath.normpath so OPF hrefs that use '..'/'.' segments (relative to a nested OPF) resolve to the actual archive entry — restoring behavior the previous endswith-basename fallback masked. Removes the # pragma: no cover on epub_cover and adds tests for every branch (cover-detection methods, error paths, both new fixes).
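The two fixes described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `recover_mojibake_name` and `resolve_href` are hypothetical, and the sketch only assumes that `zipfile` decoded the raw name bytes as CP437 because bit 11 was unset.

```python
import posixpath


def recover_mojibake_name(name: str) -> str:
    """Undo a CP437 mis-decoding of UTF-8 filename bytes.

    Returns the name unchanged when it cannot be round-tripped,
    e.g. when it was never mojibake in the first place.
    """
    try:
        # Back to the raw bytes stored in the archive, then decode as UTF-8.
        return name.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def resolve_href(opf_path: str, href: str) -> str:
    """Resolve an OPF href against the OPF's directory and collapse '..'/'.'."""
    base = posixpath.dirname(opf_path)
    return posixpath.normpath(posixpath.join(base, href))
```

ZIP entry names always use forward slashes, which is why `posixpath` is used here rather than `os.path`.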
Photoshop-exported PNGs commonly carry zTXt chunks (e.g. tiff:37724 ImageSourceData) that decompress past PIL's 1 MB MAX_TEXT_CHUNK guard, causing thumbnail extraction to fail for valid EPUB covers. Raise the limit to 4 MB.
Tika parse failures (e.g. TIKA-237 SAXException on EPUBs with deeply nested XHTML) surfaced as bare TypeError, breaking the IsccExtractionError contract that callers rely on. text_extract and text_meta_extract now catch the TypeError and re-raise as IsccExtractionError with the original Tika message preserved. Upstream tracked in iscc/iscc-tika#7.
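The re-raise pattern described here might look roughly like the sketch below. The `IsccExtractionError` class is redefined locally as a stand-in, and `extract_text` with its `parse` callable is a hypothetical wrapper; the real `text_extract` signature is not shown in this conversation.

```python
class IsccExtractionError(Exception):
    """Local stand-in for the library's extraction error (assumption)."""


def extract_text(parse, data: bytes) -> str:
    """Call a parser and convert its bare TypeError into IsccExtractionError."""
    try:
        return parse(data)
    except TypeError as exc:
        # Keep the original parser message and chain the cause for debugging.
        raise IsccExtractionError(f"Text extraction failed: {exc}") from exc
```

Chaining with `from exc` keeps the original traceback available to callers that want it, while callers that only catch `IsccExtractionError` keep working.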
zuban 0.7.1's default "typed" mode flags every io.BytesIO(...) call as "Cannot instantiate abstract class" because typeshed declares BytesIO's parents (BinaryIO/IOBase via Generic[AnyStr]) with abstract methods. mypy mode does not flag these. Disable the abstract error code globally since we don't define abstract classes ourselves.
Prevents _resolve_archive_path from matching an unrelated UTF-8 entry whose correctly decoded name happens to collide with the CP437 re-encoding of the target path.
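One way to implement such a guard is to accept the CP437 fallback only for entries whose UTF-8 flag (general-purpose bit 11, `0x800`) is unset. The sketch below assumes that approach; `find_entry` is a hypothetical name and the actual helper in the PR may differ.

```python
import zipfile
from typing import Optional

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def find_entry(zf: zipfile.ZipFile, target: str) -> Optional[str]:
    """Return the archive name matching target, or None if there is none."""
    infos = {info.filename: info for info in zf.infolist()}
    if target in infos:
        return target
    try:
        # How zipfile would present UTF-8 bytes that were decoded as CP437.
        mojibake = target.encode("utf-8").decode("cp437")
    except UnicodeError:
        return None
    info = infos.get(mojibake)
    # Only trust the fallback when the entry did not declare UTF-8:
    # a flagged entry with this name really is named that way.
    if info is not None and not (info.flag_bits & UTF8_NAME_FLAG):
        return mojibake
    return None
```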
- Support SVG cover images in EPUBs by rasterizing via resvg
- Add IsccThumbExtractionError for recoverable thumbnail failures
- Generate thumbnails early in code_iscc() and continue without one on failure instead of raising
- Fix EPUB3 cover-image detection for multi-token properties attributes
- Remove fallback to first manifest image (only explicit cover refs)
- Rename _resolve_archive_path to resolve_archive_path (public API)
- Improve API docstrings for code_iscc, code_iscc_mt, code_content, code_text options
- Update dependencies (pydantic, onnxruntime, huggingface-hub, etc.)
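The multi-token `properties` fix can be illustrated with a short sketch. In EPUB3 the `properties` attribute is a space-separated list, so an equality or substring check can miss or mis-match `cover-image`. The function name `find_cover_href` is hypothetical and the sketch uses only the standard library, which may not match how the PR parses the OPF.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OPF_NS = "{http://www.idpf.org/2007/opf}"


def find_cover_href(opf_xml: str) -> Optional[str]:
    """Return the href of the manifest item declaring the cover-image property."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(f"{OPF_NS}item"):
        # Split on whitespace so 'cover-image' is matched as a whole token.
        if "cover-image" in item.get("properties", "").split():
            return item.get("href")
    return None
```

Returning `None` when no item declares the property matches the commit's decision to rely on explicit cover references only.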
…untime dep Refactored code_iscc_mt() for better parallelism by extracting text upfront, overlapping thumbnail generation with sum/meta computation, and aligning result merge order with code_iscc(). Removed redundant onnxruntime from sci/sct optional dependency groups since it is already a transitive dependency. Skipped semantic tests on macOS Python 3.12 where onnxruntime 1.26.0 lacks wheels.
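The general shape of such a refactor, with independent steps running concurrently on pre-extracted input and results merged in a fixed order, might look like the sketch below. It is a generic illustration with hypothetical helpers (`run_sequential`, `run_parallel`, and the `steps` list), not the structure of `code_iscc_mt()` itself.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], object]]


def run_sequential(text: str, steps: List[Step]) -> Dict[str, object]:
    """Run every step one after another."""
    return {name: fn(text) for name, fn in steps}


def run_parallel(text: str, steps: List[Step]) -> Dict[str, object]:
    """Run steps concurrently, merging results in submission order."""
    with ThreadPoolExecutor() as pool:
        futures = [(name, pool.submit(fn, text)) for name, fn in steps]
        # Collect in the order of `steps`, not completion order, so the
        # merged result has the same key order as the sequential variant.
        return {name: future.result() for name, future in futures}
```

Collecting results in submission order is what keeps the threaded output identical to the single-threaded one regardless of which task finishes first.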
Add a parametrized equivalence test covering all processing modes (image, audio, video, text via PDF and DOCX). Asserts the multithreaded code_iscc_mt produces output identical to the single-threaded code_iscc, with add_units and granular enabled to guard ISCC-UNIT and feature ordering through the parallelism refactor.
iscc-schema 0.7.0 version-pins the embedded @context/$schema URLs, which now resolve to 0.7.0. Update test assertions and changelog accordingly.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff (main vs. #158)
            main     #158     +/-
Coverage    99.72%   99.73%   +0.01%
Files       23       23
Lines       1809     1894     +85
Hits        1804     1889     +85
Misses      5        5
Hard-coded 0.7.0 in the @context/$schema URL assertions broke when iscc-schema bumped to 0.8.0. Derive the version from iscc_schema.__version__ so these tests stay green across schema releases while still verifying serialization structure.
Finalize the v0.9.3 release:
- Bump version 0.9.2 -> 0.9.3 (pyproject, uv.lock, version test)
- Date the changelog heading and correct the stale iscc-schema floor bullet (>=0.7.0 -> >=0.8.0)

Also harden thumbnail error handling surfaced in the release review: code_iscc()/code_iscc_mt() now re-raise fatal IsccExtractionError (corrupt/invalid source files) from the optional thumbnail step instead of swallowing it, while missing-cover and other thumbnailer errors stay recoverable. epub_cover() raises the recoverable IsccThumbExtractionError when a declared cover is absent from the archive.
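The fatal-versus-recoverable split described above could be sketched as follows. Both exception classes are local stand-ins, their relationship to each other in the real library is an assumption, and `try_thumbnail` with its `make_thumb` callable is a hypothetical wrapper rather than the PR's code.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


class IsccExtractionError(Exception):
    """Stand-in: fatal, the source file is corrupt or invalid (assumption)."""


class IsccThumbExtractionError(Exception):
    """Stand-in: recoverable, e.g. a declared cover is missing (assumption)."""


def try_thumbnail(make_thumb: Callable[[str], bytes], path: str) -> Optional[bytes]:
    """Return a thumbnail, None on recoverable failure, or re-raise fatal errors."""
    try:
        return make_thumb(path)
    except IsccThumbExtractionError as exc:
        # Checked first so it stays recoverable under either class hierarchy.
        log.warning("No thumbnail for %s: %s", path, exc)
        return None
    except IsccExtractionError:
        raise  # corrupt/invalid source: do not swallow
    except Exception as exc:
        log.warning("Thumbnailer failed for %s: %s", path, exc)
        return None
```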
Summary
Release branch for v0.9.3, collecting EPUB cover/thumbnail robustness fixes, code_iscc_mt() parallelism improvements, and dependency updates (including the iscc-schema>=0.7.0 bump).

Changes
- Support SVG cover images in EPUBs (rasterized via resvg)
- Add IsccThumbExtractionError for recoverable thumbnail extraction failures
- Update code_iscc() to handle thumbnail extraction failures gracefully (logs warning, continues without thumbnail instead of raising) and to generate thumbnails early, before heavy content processing
- Normalize EPUB cover paths that contain '..'/'.' segments
- Surface iscc-tika parse failures as IsccExtractionError in text_extract/text_meta_extract
- Refactor code_iscc_mt() for improved parallelism; verified output matches code_iscc()
- Remove redundant onnxruntime from sci/sct optional dependency groups
- Bump iscc-schema floor to >=0.7.0 (version-pinned @context/$schema URLs now resolve to 0.7.0)
- Skip semantic tests on macOS Python 3.12 (no onnxruntime 1.26.0 wheels)

Testing
uv run poe all passes locally (lint, build-docs, tests @ 100% coverage). CI green across Python 3.11–3.14 on Windows, Ubuntu, and macOS.

Note
pyproject.toml version is still 0.9.2 and the changelog entry reads 0.9.3 - Unreleased; the version bump and release-date finalization are not yet done.