Skip to content

fix: post-merge follow-ups (font dedup + /process retention)#3

Open
Whateverdoa wants to merge 2 commits into
mainfrom
fix/post-merge-followups
Open

fix: post-merge follow-ups (font dedup + /process retention)#3
Whateverdoa wants to merge 2 commits into
mainfrom
fix/post-merge-followups

Conversation

@Whateverdoa
Copy link
Copy Markdown
Owner

@Whateverdoa Whateverdoa commented May 20, 2026

Two loose ends flagged during the color-analysis review (PR #2), both unrelated to that feature.

1. Remove duplicate font methods in PDFUtils (commit 311fc50)

embed_all_fonts, outline_all_fonts, and has_unembedded_fonts were each defined twice in the class. Python keeps the second copy, which lacked the q/Q graphics-state imbalance fix (and a less thorough font-descriptor scan). The module-level aliases added in PR #2 therefore bound to the inferior versions. Removed the second set so the q/Q-fixing implementations win.

Verified: inspect.getsource(PDFUtils.outline_all_fonts) now contains fix_q_Q_imbalance.

2. Wire retention storage into /process (commit d97ec05)

FileManager already had save_original / save_processed / cleanup_old_files, but no endpoint called them — retention was dead code. Now /api/pdf/process and /process-with-json-file persist the original upload + processed output into settings.storage_dir and prune past the retention window.

  • New _retain_pdf_files helper: best-effort (save_original + save_processed + cleanup_old_files), logs on failure, never raises — retention cannot break a successful processing response.
  • Un-xfail test_pdf_retention_flow now that the storage half is implemented (the tuple→FileResponse TypeError was already fixed in PR feat: L02 color metadata analysis for /api/pdf/analyze #2).
  • .gitignore pdf_storage/ and test_pdf_storage/ so retained PDFs are never committed.

Test plan

  • uv run pytest tests/ → 129 passed (was 128 + 1 xfail; retention test now passes)
  • import main succeeds
  • No stray storage dirs left in git status after the suite
  • Reviewer sanity check

🤖 Generated with Claude Code


Note

Medium Risk
Adds best-effort persistence of uploaded and processed PDFs during /api/pdf/process*, which changes runtime storage behavior and cleanup timing. Also alters font processing utilities by removing duplicate method definitions, which could affect Ghostscript post-processing results.

Overview
Enables runtime PDF retention for processing requests. The /api/pdf/process and /api/pdf/process-with-json-file endpoints now best-effort save the original upload and processed output via FileManager and prune files past the retention window (logging failures without breaking successful responses).

Cleans up font utility duplication. Removes the duplicate PDFUtils implementations of embed_all_fonts, outline_all_fonts, and has_unembedded_fonts so the intended versions are used.

Also updates .gitignore to exclude pdf_storage/ and test_pdf_storage/, and unmarks the retention test as expected-fail now that the persistence path is implemented.

Reviewed by Cursor Bugbot for commit d97ec05. Bugbot is set up for automated code reviews on this repo. Configure here.

PDF Dieline Processor and others added 2 commits May 20, 2026 17:26
embed_all_fonts, outline_all_fonts and has_unembedded_fonts were each defined
twice in the class. Python kept the second (later) copies, which lacked the
q/Q graphics-state imbalance fix and the more thorough font-descriptor scan.
The module-level aliases therefore bound to the inferior versions. Removed the
second set so the q/Q-fixing implementations win.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
/api/pdf/process and /process-with-json-file now persist the original upload
and the processed output into settings.storage_dir via FileManager, and prune
files past the retention window. Retention is best-effort: failures are logged
and never break a successful processing response.

- Add _retain_pdf_files helper (best-effort save_original + save_processed +
  cleanup_old_files; never raises).
- Un-xfail test_pdf_retention_flow now that the storage half is implemented.
- gitignore pdf_storage/ and test_pdf_storage/ so retained PDFs are never
  accidentally committed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Retention paths allow traversal
    • Added _sanitize_filename() helper in FileManager that strips path components and traversal sequences using os.path.basename() before constructing destination paths, preventing client-controlled filenames from escaping storage directories.

Create PR

Or push these changes by commenting:

@cursor push 151bbedc17
Preview (151bbedc17)
diff --git a/app/utils/file_manager.py b/app/utils/file_manager.py
--- a/app/utils/file_manager.py
+++ b/app/utils/file_manager.py
@@ -8,6 +8,20 @@
 
 logger = logging.getLogger(__name__)
 
+
+def _sanitize_filename(filename: str) -> str:
+    """Strip path components and traversal sequences to prevent directory escape.
+
+    Returns a safe base filename. Falls back to 'file' if the result is empty
+    or consists only of dots.
+    """
+    base = os.path.basename(filename.replace("\\", "/"))
+    base = base.lstrip(".")
+    if not base or base == "..":
+        return "file"
+    return base
+
+
 class FileManager:
     """
     Manages storage and cleanup of PDF files.
@@ -37,7 +51,7 @@
         try:
             # Add timestamp to filename to avoid collisions and track time
             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-            safe_filename = f"{timestamp}_{filename}"
+            safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
             dest_path = self.original_dir / safe_filename
             
             shutil.copy2(file_path, dest_path)
@@ -61,7 +75,7 @@
         try:
             # Add timestamp if not already present (though processed files usually have unique names)
             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-            safe_filename = f"{timestamp}_{filename}"
+            safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
             dest_path = self.processed_dir / safe_filename
             
             shutil.copy2(file_path, dest_path)

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit d97ec05. Configure here.

Comment thread app/api/endpoints/pdf.py
if input_path and os.path.exists(input_path):
fm.save_original(input_path, input_filename)
if output_path and os.path.exists(output_path):
fm.save_processed(output_path, f"{reference or 'processed'}.pdf")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retention paths allow traversal

High Severity

_retain_pdf_files passes client-controlled job_config reference and upload filename into FileManager without sanitizing path separators. FileManager joins those strings with Path, so values containing .. or / can write retained PDFs outside original/processed or into the sibling retention folder.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d97ec05. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant