This document is the canonical, exhaustive roadmap for the production hardening and evolution of the PY_Txt_Extractor toolchain. It consolidates the modular architecture, format-routing/OCR lanes, and post-processing/compression/splitting lifecycle mandates.
- R1: No Regression: Preservation of merged TXT, per-file TXT, per-folder merged TXT, LLM export, run logs, and JSON job persistence is mandatory.
- R2: No Direct Engine Rewrites: Move old logic into modules first; optimize incrementally (RAM reduction, checkpoints, streaming) only after.
- R3: Sequential First: Multi-job execution is sequential for stability and Tkinter responsiveness.
- R4: State to Disk: No future features shall depend on giant in-memory buffers (e.g., `llm_text_parts` joins).
- R5: Additive Merge Only: Preservation of all existing JSON hooks and top-level keys.
- R6: Lossless Only: Compression must be lossless (`gzip`, `zip`, `lzma`) and never replace source text silently.
- R7: Split First, Compress Second: Split plain text first, then optionally compress parts.
- R8: Safe Isolation: All development happens in isolated version folders (V9B, V9C, etc.) to protect V9A production.
- `main.py`: Bootstrapping.
- `guiPy.py`: Widgets and UI handlers.
- `coreenginePy.py`: Extraction logic and routing.
- `jobconfigPy.py`: JSON config handling (Schema v2).
- `stateDbPy.py`: SQLite runtime state (checkpoints/metrics).
- `globrulesPy.py`: `fnmatch` wildcard matching engine.
- `compressionPy.py`: Lossless compression drivers.
- `splitterBridgePy.py`: Wrapper for existing splitter tools.
- `metadataContractPy.py`: Shared constants (e.g., `=== FILE META NEW FILE STARTING ===`).
- `PyTxtExtConfig.json`: Saved jobs and settings.
- `PyTxtExtState.sqlite3`: Runtime history and checkpoints.
- Lane 1 (Native): `.txt`, `.md`, digital `.pdf`, `.docx`, `.xlsx`, `.pptx`. (Primary).
- Lane 2 (OCR Fallback): Scanned PDFs, images. Use `OCRmyPDF` or `PyMuPDF` + `Tesseract`.
- Lane 3 (Legacy Conversion): Binary `.doc`, `.xls`, `.ppt` (Phase 8+).
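The three lanes can be pictured as a small extension-based router. This is only a sketch: the `Lane` enum, the extension sets (including `.jpeg` and `.tiff`), and the optional `pdf_is_scanned` flag are illustrative assumptions, and the real `classify_input_file(path)` may use different rules.

```python
from enum import Enum
from pathlib import Path


class Lane(Enum):
    """Hypothetical lane labels; the names are assumptions for illustration."""
    NATIVE = "native"
    OCR = "ocr"
    LEGACY = "legacy"
    UNSUPPORTED = "unsupported"


# Assumed extension sets, derived from the lane list above.
NATIVE_EXTS = {".txt", ".md", ".pdf", ".docx", ".xlsx", ".pptx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}
LEGACY_EXTS = {".doc", ".xls", ".ppt"}


def classify_input_file(path, pdf_is_scanned=False):
    """Pick a lane from the file extension.

    pdf_is_scanned is an assumed extra parameter: a PDF is routed to the
    OCR lane only when a separate detector has flagged it as scanned.
    """
    ext = Path(path).suffix.lower()
    if ext == ".pdf" and pdf_is_scanned:
        return Lane.OCR
    if ext in NATIVE_EXTS:
        return Lane.NATIVE
    if ext in IMAGE_EXTS:
        return Lane.OCR
    if ext in LEGACY_EXTS:
        return Lane.LEGACY
    return Lane.UNSUPPORTED
```

Keeping the scanned-PDF decision outside the router means the extension check stays cheap and the more expensive content inspection runs only for PDFs.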
EXTRACT -> WRITE PLAIN -> SPLIT -> COMPRESS -> LOG EVERYTHING
- A001: Freeze `TxtExtPyV_9A` as read-only.
- A002: Create `TxtExtPyV_9B_GlobSupport` isolated folder.
- A003: Record source baseline, timestamps, and known bug/mismatch lists.
- B001: Migrate v1 dict jobs to v2 list jobs.
- B002: Structure: `excludes` (files/folders/globs), `targets` (files/folders), `destinations`.
- B003: Add `enabled`, `job_order`, `extractor_mode`, `ocr_mode`, and `postprocess` blocks.
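A v1-to-v2 migration along these lines might look like the sketch below. The v1 layout (a `jobs` dict keyed by job name), the `schema_version` and `name` keys, and every default value are assumptions made for illustration; only the v2 field names come from the items above. Unknown keys are copied through untouched, in keeping with R5.

```python
import copy


def migrate_v1_to_v2(config):
    """Turn an assumed v1 config ({"jobs": {name: job}}) into a v2 job list."""
    out = copy.deepcopy(config)  # never mutate the caller's config
    jobs = out.get("jobs", {})
    if isinstance(jobs, list):
        return out  # already a list, treated as v2

    migrated = []
    for order, (name, job) in enumerate(jobs.items()):
        entry = dict(job)  # keep any per-job keys we do not know about
        entry.setdefault("name", name)
        entry.setdefault("enabled", True)
        entry.setdefault("job_order", order)
        entry.setdefault("extractor_mode", "auto")  # assumed default
        entry.setdefault("ocr_mode", "off")         # assumed default
        entry.setdefault("excludes", {"files": [], "folders": [], "globs": []})
        entry.setdefault("targets", {"files": [], "folders": []})
        entry.setdefault("destinations", [])
        entry.setdefault("postprocess", {})
        migrated.append(entry)

    out["jobs"] = migrated
    out["schema_version"] = 2  # hypothetical version marker
    return out
```

Because every new field is added with `setdefault`, values already present in a v1 job win over the defaults, and running the migration on an already-migrated config is a no-op.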
- C001: Create modular files (main, gui, core, jobconfig, stateDb, globrules).
- C002: Implement `metadataContractPy.py` with shared constants.
- C003: Implement `classify_input_file(path)` router in `coreenginePy.py`.
- D001: Add `python-docx`, `openpyxl`, and `python-pptx` extractors.
- D002: Implement image intake classifier for `.png`, `.jpg`, `.tif`, `.webp`, etc.
- D003: Define `ExtractionResult` object for routing.
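One possible shape for the `ExtractionResult` object is a small dataclass. Every field name here, and the `next_step` helper, is a hypothetical choice for illustration. The result points at text on disk rather than carrying it in memory, which is one way to stay consistent with R4.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionResult:
    """Hypothetical routing record returned by each extractor."""
    source_path: str
    lane: str                        # e.g. "native", "ocr", "legacy"
    ok: bool
    text_path: Optional[str] = None  # where the extracted text was written
    char_count: int = 0
    needs_ocr: bool = False          # extractor found little or no text
    error: Optional[str] = None
    warnings: List[str] = field(default_factory=list)


def next_step(result):
    """Decide what the pipeline does with a result (illustrative only)."""
    if not result.ok:
        return "log_failure"
    if result.needs_ocr:
        return "route_to_ocr"
    return "write_plain"
```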
- E001: Add optional `OCRmyPDF` and `PyMuPDF` + `Tesseract` pipelines.
- E002: Implement scanned-PDF detector to route to OCR only when needed.
- E003: Office fallback: OCR embedded images in docx/pptx.
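A scanned-PDF detector can be as simple as a text-density heuristic. The sketch below assumes the per-page character counts have already been collected (for instance from the text PyMuPDF's `page.get_text()` returns); the function name and both thresholds are illustrative assumptions, not tuned values.

```python
def looks_scanned(page_char_counts, min_chars_per_page=25, max_text_page_ratio=0.2):
    """Guess whether a PDF is scanned from its per-page extracted-text sizes.

    A page counts as a "text page" when native extraction yielded at least
    min_chars_per_page characters. The document is treated as scanned when
    the share of text pages is at or below max_text_page_ratio.
    """
    if not page_char_counts:
        return False  # empty document: nothing to send to OCR
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    return text_pages / len(page_char_counts) <= max_text_page_ratio
```

For example, `looks_scanned([0, 0, 3, 0])` flags the file for OCR, while `looks_scanned([1200, 900, 0])` keeps it in the native lane even though one page is empty.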
- F001: `compressionPy`: Implement `gzip` (reproducible), `zip`, and `lzma`.
- F002: `splitterBridge`: Integrate existing `splitter_merger_console_v1_2` and `marker_tool`.
- F003: Implement order matrix: Split plain first, then compress parts.
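The split-first, compress-second order can be sketched with the standard library alone. The helper names, the `.partNNN` naming scheme, and the line-boundary splitting rule are assumptions; the real splitter is the existing tooling named in F002. Reproducibility here comes from writing the gzip header with `mtime=0` and an empty filename, so the same input yields the same bytes.

```python
import gzip
import shutil
from pathlib import Path


def split_plain(src, max_bytes):
    """Split a text file into numbered parts, cutting only at line ends."""
    src = Path(src)
    parts, out, written = [], None, 0
    try:
        with src.open("rb") as fh:
            for line in fh:
                # Start a new part when the current one would overflow.
                if out is None or (written and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    name = f"{src.stem}.part{len(parts) + 1:03d}{src.suffix}"
                    parts.append(src.with_name(name))
                    out = parts[-1].open("wb")
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts


def gzip_reproducible(path):
    """Write path + '.gz' with no timestamp or name in the gzip header."""
    path = Path(path)
    target = path.with_name(path.name + ".gz")
    with path.open("rb") as src, target.open("wb") as raw:
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            shutil.copyfileobj(src, gz)
    return target  # the plain part is left in place


def split_then_compress(src, max_bytes):
    """Apply the order from F003: split plain text, then compress each part."""
    return [gzip_reproducible(part) for part in split_plain(src, max_bytes)]
```

Neither helper deletes anything, so the source file and the plain parts survive alongside the archives.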
- G001: Replace `llm_text_parts` joins with streaming chunk writers.
- G002: Replace in-memory `processed_entries` with temp file references or incremental appends.
- G003: Add size guardrails (`max_file_bytes`, `max_pdf_pages`, `max_image_pixels`).
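To illustrate the idea behind G001 and G003, the sketch below writes parts to disk as they arrive instead of joining them in memory, and checks inputs against the three guardrail keys. Both function names and the shape of the `limits` dict are assumptions.

```python
def write_streaming(parts, out_path):
    """Write an iterable of text parts to disk one at a time.

    Only a single part is held in memory at any moment, unlike
    "".join(parts). Returns the number of bytes written.
    """
    written = 0
    with open(out_path, "wb") as out:
        for part in parts:
            data = part.encode("utf-8")
            out.write(data)
            written += len(data)
    return written


def guardrail_violations(limits, file_bytes, pdf_pages=None, image_pixels=None):
    """Return the names of the configured limits that an input exceeds.

    limits is an assumed dict such as {"max_file_bytes": 50_000_000};
    a missing key or a None measurement means "do not check".
    """
    checks = [
        ("max_file_bytes", file_bytes),
        ("max_pdf_pages", pdf_pages),
        ("max_image_pixels", image_pixels),
    ]
    return [
        key
        for key, value in checks
        if value is not None and key in limits and value > limits[key]
    ]
```

Passing a generator to `write_streaming` keeps peak memory proportional to the largest single part rather than to the whole export.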
- H001: `stateDb`: WAL-mode SQLite for runs, file status, metrics, and checkpoints.
- H002: Persistence: Enable resume-by-run-id after failure.
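A minimal sketch of WAL-mode state with resume-by-run-id, using only the standard `sqlite3` module. The table and column names, the status strings, and the helper functions are all hypothetical; metrics and checkpoint tables are left out for brevity.

```python
import sqlite3

# Assumed schema: one row per run, one row per (run, file).
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id     TEXT PRIMARY KEY,
    started_at TEXT NOT NULL,
    status     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS file_status (
    run_id TEXT NOT NULL,
    path   TEXT NOT NULL,
    status TEXT NOT NULL,
    PRIMARY KEY (run_id, path)
);
"""


def open_state_db(db_path):
    """Open (or create) the state database in write-ahead-log mode."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def start_run(conn, run_id):
    """Register a run; calling it again for the same id changes nothing."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO runs VALUES (?, datetime('now'), 'running')",
            (run_id,),
        )


def mark_file(conn, run_id, path, status):
    """Record the latest status of one file; each call commits."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO file_status VALUES (?, ?, ?)",
            (run_id, path, status),
        )


def pending_files(conn, run_id, all_paths):
    """Return the files a resumed run still has to process, in order."""
    rows = conn.execute(
        "SELECT path FROM file_status WHERE run_id = ? AND status = 'done'",
        (run_id,),
    )
    done = {row[0] for row in rows}
    return [p for p in all_paths if p not in done]
```

Committing after every file means a crash loses at most the file that was in flight, and a resumed run simply skips whatever is already marked `done`.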
- I001: Multi-job batch execution (Run Selected, Run All).
- I002: Add Job Management UI: Enable/Disable, Reorder, Delete.
- J001: Optional Rust CLI for fast hashing, manifest generation, and PDF preflight.
- Plain only: Standard text output.
- Plain + Compressed copy: Source remains + archive copy.
- Compressed only: Space-saving mode (source deleted after verification).
- Plain + Splitter: Source text divided into chunks.
- Plain + Compressed + Splitter: Chunks archived individually.
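The "Compressed only" mode is the one place where source text is removed, so the verification step deserves a concrete picture. The sketch below is one assumed way to do it: compress with `lzma`, decompress the archive again, compare SHA-256 digests, and delete the source only on a match. The function names and the `.xz` suffix are illustrative choices.

```python
import hashlib
import lzma
import os
import shutil


def _sha256(stream):
    """Hash a binary stream in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_only(src_path):
    """Compress src_path to src_path + '.xz', verify, then delete the source."""
    dst_path = src_path + ".xz"
    with open(src_path, "rb") as src, lzma.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Round-trip check: the decompressed archive must match the source.
    with open(src_path, "rb") as src, lzma.open(dst_path, "rb") as back:
        verified = _sha256(src) == _sha256(back)

    if not verified:
        os.remove(dst_path)  # drop the bad archive, keep the source
        raise OSError(f"round-trip check failed for {src_path}; source kept")

    os.remove(src_path)
    return dst_path
```

If verification fails, the archive is discarded and the plain file is left untouched, so the failure mode is wasted work rather than lost text.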
- `splitter_merger_console_v1_2`: Primary engine for pattern/size/meta-block splits.
- `split_txt_by_line_marker.py`: Specialist tool for exact-cut manual boundaries.
- `.cmd` wrappers: Launchers only; do not own business logic.
Codified 2026-05-14 | Sources: Q#46, Q#47, Q#48 | Mandate: NO DATA LOSS