fix(images): don't overwrite relative images that share a basename#122
fix(images): don't overwrite relative images that share a basename#122jichaowang02-lang wants to merge 2 commits into
Conversation
`copy_relative_images` named each destination after `src.name` (the basename
only). Two image references to different files that share a basename — e.g.
`` and `` — both copied to
`images_dir/logo.png`, so the second `shutil.copy2` silently overwrote the
first (one image's bytes are lost) and both links were rewritten to the same
path, rendering the same image in both places.
Track an assigned destination per source: reuse it when the identical source
is referenced more than once (no duplicate copy), and disambiguate with a
`{stem}_{n}{suffix}` suffix when a different source would collide on a name
already taken. (`extract_base64_images` already avoids this via its
`img_{counter:03d}` scheme.)
Adds regression tests for the same-basename collision and the
identical-source-referenced-twice cases.
There was a problem hiding this comment.
Pull request overview
Fixes a data-loss bug in the Markdown conversion pipeline where copy_relative_images could overwrite one relative image with another when different source paths share the same basename (e.g., a/logo.png and b/logo.png), causing both links to render the same copied file.
Changes:
- Add per-source destination tracking to dedupe repeated references to the same image and disambiguate basename collisions via
{stem}_{n}{suffix}. - Add tests covering same-basename/different-source behavior and same-source referenced twice (copy once).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openkb/images.py | Prevent basename collisions and dedupe repeated references when copying relative images. |
| tests/test_images.py | Add regression tests for basename collisions and deduplication behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assigned: dict[Path, str] = {} | ||
| taken: set[str] = set() |
…g prior files Review feedback: copy_relative_images relied on the in-call taken set to disambiguate basenames, but a file already present in images_dir (e.g. from a prior conversion) was not in taken, so a same-basename source could still overwrite it. Seed taken from the existing directory contents.
There was a problem hiding this comment.
Core fix (first commit) is correct.
The second commit — seeding taken from images_dir, per Copilot's note — is the problem. This dir is per-doc_name, so it only ever holds this doc's own images from a prior run; overwriting them on re-convert is the desired refresh, not data loss. The real bug (two different sources colliding in one pass) is already prevented by taken starting empty. Seeding from disk instead makes re-convert non-idempotent — stale orphans + _1/_2 link churn on every edit.
Suggest dropping the second commit — the first one alone is the right fix.
Summary
copy_relative_imagescould silently lose an image and render the wrongone. It named every destination after
src.name(the basename only), so tworeferences to different files that happen to share a basename collide.
Root cause
For
+:images_dir/logo.png→ the secondcopy2overwrites the first(image A's bytes are gone), and
sources/images/<doc>/logo.png, so the rendereddocument shows the same image in both places.
extract_base64_imagesalready avoids this by numbering destinations(
img_{counter:03d}); only the relative-image path was affected.a/logo.png+b/logo.png(distinct)logo.png,logo_1.png), distinct linkslogo.pngreferenced twice (same file)Fix
Track the destination assigned to each source: reuse it when the identical
source is referenced more than once (no duplicate copy), and disambiguate with
a
{stem}_{n}{suffix}suffix when a different source would collide on aname already taken.
Testing
Adds
test_same_basename_different_dirs_no_overwrite(both images preserved,links distinct) and
test_same_image_referenced_twice_is_copied_once(dedup).