fix: post-merge follow-ups (font dedup + /process retention) by Whateverdoa · Pull Request #3 · Whateverdoa/OGOS_fastapi_pdfmodule

Whateverdoa · 2026-05-20T15:27:19Z

Two loose ends flagged during the color-analysis review (PR #2), both unrelated to that feature.

1. Remove duplicate font methods in `PDFUtils` (commit `311fc50`)

embed_all_fonts, outline_all_fonts, and has_unembedded_fonts were each defined twice in the class. Python keeps the second copy, which lacked the q/Q graphics-state imbalance fix (and a less thorough font-descriptor scan). The module-level aliases added in PR #2 therefore bound to the inferior versions. Removed the second set so the q/Q-fixing implementations win.

Verified: inspect.getsource(PDFUtils.outline_all_fonts) now contains fix_q_Q_imbalance.

2. Wire retention storage into `/process` (commit `d97ec05`)

FileManager already had save_original / save_processed / cleanup_old_files, but no endpoint called them — retention was dead code. Now /api/pdf/process and /process-with-json-file persist the original upload + processed output into settings.storage_dir and prune past the retention window.

New _retain_pdf_files helper: best-effort (save_original + save_processed + cleanup_old_files), logs on failure, never raises — retention cannot break a successful processing response.
Un-xfail test_pdf_retention_flow now that the storage half is implemented (the tuple→FileResponse TypeError was already fixed in PR feat: L02 color metadata analysis for /api/pdf/analyze #2).
.gitignore pdf_storage/ and test_pdf_storage/ so retained PDFs are never committed.

Test plan

uv run pytest tests/ → 129 passed (was 128 + 1 xfail; retention test now passes)
import main succeeds
No stray storage dirs left in git status after the suite
Reviewer sanity check

🤖 Generated with Claude Code

Note

Medium Risk
Adds best-effort persistence of uploaded and processed PDFs during /api/pdf/process*, which changes runtime storage behavior and cleanup timing. Also alters font processing utilities by removing duplicate method definitions, which could affect Ghostscript post-processing results.

Overview
Enables runtime PDF retention for processing requests. The /api/pdf/process and /api/pdf/process-with-json-file endpoints now best-effort save the original upload and processed output via FileManager and prune files past the retention window (logging failures without breaking successful responses).

Cleans up font utility duplication. Removes the duplicate PDFUtils implementations of embed_all_fonts, outline_all_fonts, and has_unembedded_fonts so the intended versions are used.

Also updates .gitignore to exclude pdf_storage/ and test_pdf_storage/, and unmarks the retention test as expected-fail now that the persistence path is implemented.

^{Reviewed by Cursor Bugbot for commit d97ec05. Bugbot is set up for automated code reviews on this repo. Configure here.}

embed_all_fonts, outline_all_fonts and has_unembedded_fonts were each defined twice in the class. Python kept the second (later) copies, which lacked the q/Q graphics-state imbalance fix and the more thorough font-descriptor scan. The module-level aliases therefore bound to the inferior versions. Removed the second set so the q/Q-fixing implementations win. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

/api/pdf/process and /process-with-json-file now persist the original upload and the processed output into settings.storage_dir via FileManager, and prune files past the retention window. Retention is best-effort: failures are logged and never break a successful processing response. - Add _retain_pdf_files helper (best-effort save_original + save_processed + cleanup_old_files; never raises). - Un-xfail test_pdf_retention_flow now that the storage half is implemented. - gitignore pdf_storage/ and test_pdf_storage/ so retained PDFs are never accidentally committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Retention paths allow traversal
- Added _sanitize_filename() helper in FileManager that strips path components and traversal sequences using os.path.basename() before constructing destination paths, preventing client-controlled filenames from escaping storage directories.

Or push these changes by commenting:

@cursor push 151bbedc17

Preview (151bbedc17)

diff --git a/app/utils/file_manager.py b/app/utils/file_manager.py
--- a/app/utils/file_manager.py
+++ b/app/utils/file_manager.py
@@ -8,6 +8,20 @@
 
 logger = logging.getLogger(__name__)
 
+
+def _sanitize_filename(filename: str) -> str:
+    """Strip path components and traversal sequences to prevent directory escape.
+
+    Returns a safe base filename. Falls back to 'file' if the result is empty
+    or consists only of dots.
+    """
+    base = os.path.basename(filename.replace("\\", "/"))
+    base = base.lstrip(".")
+    if not base or base == "..":
+        return "file"
+    return base
+
+
 class FileManager:
     """
     Manages storage and cleanup of PDF files.
@@ -37,7 +51,7 @@
         try:
             # Add timestamp to filename to avoid collisions and track time
             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-            safe_filename = f"{timestamp}_{filename}"
+            safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
             dest_path = self.original_dir / safe_filename
             
             shutil.copy2(file_path, dest_path)
@@ -61,7 +75,7 @@
         try:
             # Add timestamp if not already present (though processed files usually have unique names)
             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-            safe_filename = f"{timestamp}_{filename}"
+            safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
             dest_path = self.processed_dir / safe_filename
             
             shutil.copy2(file_path, dest_path)

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit d97ec05. Configure here.}

cursor · 2026-05-20T15:29:01Z

+        if input_path and os.path.exists(input_path):
+            fm.save_original(input_path, input_filename)
+        if output_path and os.path.exists(output_path):
+            fm.save_processed(output_path, f"{reference or 'processed'}.pdf")


Retention paths allow traversal

High Severity

_retain_pdf_files passes client-controlled job_config reference and upload filename into FileManager without sanitizing path separators. FileManager joins those strings with Path, so values containing .. or / can write retained PDFs outside original/processed or into the sibling retention folder.

^{Reviewed by Cursor Bugbot for commit d97ec05. Configure here.}

PDF Dieline Processor and others added 2 commits May 20, 2026 17:26

cursor Bot reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: post-merge follow-ups (font dedup + /process retention)#3

fix: post-merge follow-ups (font dedup + /process retention)#3
Whateverdoa wants to merge 2 commits into
mainfrom
fix/post-merge-followups

Whateverdoa commented May 20, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Whateverdoa commented May 20, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Remove duplicate font methods in PDFUtils (commit 311fc50)

2. Wire retention storage into /process (commit d97ec05)

Test plan

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 20, 2026

Choose a reason for hiding this comment

Retention paths allow traversal

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Whateverdoa commented May 20, 2026 •

edited by cursor Bot

Loading

1. Remove duplicate font methods in `PDFUtils` (commit `311fc50`)

2. Wire retention storage into `/process` (commit `d97ec05`)

cursor Bot left a comment •

edited

Loading