PY_Txt_Extractor: The Mega Master Evolution Plan (V9B to V9J+)

This document is the canonical, exhaustive roadmap for the production hardening and evolution of the PY_Txt_Extractor toolchain. It consolidates the modular architecture, format-routing/OCR lanes, and post-processing/compression/splitting lifecycle mandates.


1. CORE MANDATES & NON-NEGOTIABLE RULES

  • R1: No Regression: Merged TXT, per-file TXT, per-folder merged TXT, LLM export, run logs, and JSON job persistence must all keep working unchanged.
  • R2: No Direct Engine Rewrites: Move old logic into modules first; optimize incrementally (RAM reduction, checkpoints, streaming) only after the move is complete.
  • R3: Sequential First: Multi-job execution is sequential for stability and Tkinter responsiveness.
  • R4: State to Disk: No future features shall depend on giant in-memory buffers (e.g., llm_text_parts joins).
  • R5: Additive Merge Only: Schema changes may only add fields; all existing JSON hooks and top-level keys must be preserved.
  • R6: Lossless Only: Compression must be lossless (gzip, zip, lzma) and never replace source text silently.
  • R7: Split First, Compress Second: Split plain text first, then optionally compress parts.
  • R8: Safe Isolation: All development happens in isolated version folders (V9B, V9C, etc.) to protect V9A production.

2. ARCHITECTURAL TARGET MODEL

A. Target File System

  • main.py: Bootstrapping.
  • guiPy.py: Widgets and UI handlers.
  • coreenginePy.py: Extraction logic and routing.
  • jobconfigPy.py: JSON config handling (Schema v2).
  • stateDbPy.py: SQLite runtime state (checkpoints/metrics).
  • globrulesPy.py: fnmatch wildcard matching engine.
  • compressionPy.py: Lossless compression drivers.
  • splitterBridgePy.py: Wrapper for existing splitter tools.
  • metadataContractPy.py: Shared constants (e.g., === FILE META NEW FILE STARTING ===).
  • PyTxtExtConfig.json: Saved jobs and settings.
  • PyTxtExtState.sqlite3: Runtime history and checkpoints.
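
As an illustration of the fnmatch-based matching planned for globrulesPy.py, here is a minimal sketch. The function name is_excluded, the case-insensitive comparison, and the rule that a pattern may match either the full relative path or just the file name are assumptions made for this example, not confirmed behavior of the tool.

```python
from fnmatch import fnmatchcase


def is_excluded(rel_path: str, globs: list[str]) -> bool:
    """Return True if rel_path (or its file name) matches any exclude glob.

    Hypothetical helper: separators are normalized to "/" and matching is
    case-insensitive so rules behave the same on Windows and POSIX.
    """
    normalized = rel_path.replace("\\", "/").lower()
    name = normalized.rsplit("/", 1)[-1]
    for pattern in globs:
        pat = pattern.replace("\\", "/").lower()
        # fnmatchcase does no OS-specific normalization; "*" also matches "/".
        if fnmatchcase(normalized, pat) or fnmatchcase(name, pat):
            return True
    return False
```

Using fnmatchcase on pre-lowered strings, rather than fnmatch, keeps results identical across platforms, which may matter if job configs are shared between machines.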

B. Format Extraction Lanes

  • Lane 1 (Native, primary): .txt, .md, digital .pdf, .docx, .xlsx, .pptx.
  • Lane 2 (OCR Fallback): Scanned PDFs, Images. Use OCRmyPDF or PyMuPDF+Tesseract.
  • Lane 3 (Legacy Conversion): Binary .doc, .xls, .ppt (Phase 8+).
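
The three lanes can be sketched as an extension-based router. This is only a first-pass illustration: the lane labels, the "unsupported" fallback, and the extra .jpeg/.tiff extensions are assumptions. A scanned PDF cannot be recognized from its extension, so PDFs start on the native lane here and would be rerouted to OCR by a separate detector.

```python
from pathlib import Path

# Extension sets taken from the lane list; .jpeg and .tiff are assumed aliases.
NATIVE_EXTS = {".txt", ".md", ".pdf", ".docx", ".xlsx", ".pptx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}
LEGACY_EXTS = {".doc", ".xls", ".ppt"}


def classify_input_file(path: str) -> str:
    """Return a lane label: 'native', 'ocr', 'legacy', or 'unsupported'."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE_EXTS:
        return "native"  # PDFs may still be rerouted to OCR after inspection
    if ext in IMAGE_EXTS:
        return "ocr"
    if ext in LEGACY_EXTS:
        return "legacy"
    return "unsupported"
```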

C. Output Post-Processing Lifecycle

EXTRACT -> WRITE PLAIN -> SPLIT -> COMPRESS -> LOG EVERYTHING


3. MEGA FAT TINY ATOMIC ROADMAP

PHASE A: FREEZE AND MIRROR (DONE)

  • A001: Freeze TxtExtPyV_9A as read-only.
  • A002: Create TxtExtPyV_9B_GlobSupport isolated folder.
  • A003: Record source baseline, timestamps, and known bug/mismatch lists.

PHASE B: SCHEMA HARDENING (DONE)

  • B001: Migrate v1 dict jobs to v2 list jobs.
  • B002: Structure: excludes (files/folders/globs), targets (files/folders), destinations.
  • B003: Add enabled, job_order, extractor_mode, ocr_mode, and postprocess blocks.
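
The v1-to-v2 migration might look like the sketch below. The v1 shape (a dict keyed by job name under a top-level "jobs" key), the "schema_version" key, and every default value are assumptions for illustration; only the field names come from B002 and B003. In keeping with R5, the sketch adds keys and never removes existing ones.

```python
import copy


def migrate_jobs_v1_to_v2(config: dict) -> dict:
    """Return a copy of config with dict-style jobs converted to a list.

    Hypothetical layout: v1 stores {"jobs": {name: {...}}}; v2 stores
    {"jobs": [{"name": name, ...}]}. Unknown keys are carried over as-is.
    """
    out = copy.deepcopy(config)  # never mutate the caller's config
    jobs = out.get("jobs", {})
    if isinstance(jobs, list):
        return out  # already v2; nothing to do
    migrated = []
    for order, (name, job) in enumerate(jobs.items()):
        entry = dict(job)
        # setdefault keeps any value the v1 job already carried (additive only).
        entry.setdefault("name", name)
        entry.setdefault("enabled", True)
        entry.setdefault("job_order", order)
        entry.setdefault("extractor_mode", "native")  # assumed default
        entry.setdefault("ocr_mode", "off")  # assumed default
        entry.setdefault("postprocess", {})
        entry.setdefault("excludes", {"files": [], "folders": [], "globs": []})
        entry.setdefault("targets", {"files": [], "folders": []})
        entry.setdefault("destinations", [])
        migrated.append(entry)
    out["jobs"] = migrated
    out["schema_version"] = 2
    return out
```

Running the function on an already migrated config returns an equal copy, so it is safe to call on every load.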

PHASE C: MODULAR SKELETONS (NEXT)

  • C001: Create modular files (main, gui, core, jobconfig, stateDb, globrules).
  • C002: Implement metadataContractPy.py with shared constants.
  • C003: Implement classify_input_file(path) router in coreenginePy.py.

PHASE D: NATIVE DRIVERS & OFFICE LANES

  • D001: Add python-docx, openpyxl, and python-pptx extractors.
  • D002: Implement image intake classifier for .png, .jpg, .tif, .webp, etc.
  • D003: Define ExtractionResult object for routing.
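
One possible shape for the ExtractionResult object from D003 is sketched below. Every field name and the needs_ocr_fallback helper are hypothetical. The result carries a path to the extracted text instead of the text itself, which keeps it in line with R4 (state to disk).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionResult:
    """Outcome of one extraction attempt (illustrative field set)."""

    source_path: str
    lane: str  # e.g. "native", "ocr", "legacy"
    ok: bool = False
    text_path: Optional[str] = None  # where the extracted text was written
    char_count: int = 0
    error: Optional[str] = None
    warnings: list = field(default_factory=list)

    def needs_ocr_fallback(self, min_chars: int = 1) -> bool:
        """True when a native extraction succeeded but yielded too little text."""
        return self.lane == "native" and self.ok and self.char_count < min_chars
```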

PHASE E: OCR LANE INTEGRATION

  • E001: Add optional OCRmyPDF and PyMuPDF+Tesseract pipelines.
  • E002: Implement scanned-PDF detector to route to OCR only when needed.
  • E003: Office fallback: OCR embedded images in docx/pptx.
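
The scanned-PDF detector from E002 could start as a simple text-density heuristic. The sketch below assumes the per-page text has already been extracted by whichever PDF library the native lane uses; the function name and both thresholds are illustrative guesses that would need tuning on real documents.

```python
def looks_scanned(
    page_texts: list[str],
    min_chars: int = 20,
    max_text_ratio: float = 0.1,
) -> bool:
    """Guess whether a PDF is scanned from its per-page extracted text.

    A page counts as "having text" when it holds at least min_chars
    non-whitespace-trimmed characters. The PDF is treated as scanned when
    the share of such pages is at or below max_text_ratio.
    """
    if not page_texts:
        return False  # nothing to judge; leave the file on the native lane
    pages_with_text = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    return pages_with_text / len(page_texts) <= max_text_ratio
```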

PHASE F: POST-PROCESSING (COMPRESS & SPLIT)

  • F001: compressionPy: Implement gzip (reproducible), zip, and lzma.
  • F002: splitterBridge: Integrate existing splitter_merger_console_v1_2 and marker_tool.
  • F003: Implement order matrix: Split plain first, then compress parts.
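
The split-first, compress-second order and the reproducible gzip requirement can be sketched with the standard library alone. The function names and the line-boundary splitting rule are assumptions; the real splitting is delegated to the existing splitter tools. Setting mtime=0 and an empty filename removes the two header fields that normally vary between runs; byte-identical output still assumes the same compression level and zlib build.

```python
import gzip
import io


def split_text(text: str, max_chars: int) -> list[str]:
    """Split on line boundaries, starting a new part before max_chars is exceeded."""
    parts, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            parts.append("".join(current))
            current, size = [], 0
        current.append(line)  # a single over-long line still becomes its own part
        size += len(line)
    if current:
        parts.append("".join(current))
    return parts


def gzip_reproducible(data: bytes) -> bytes:
    """Compress with a fixed header (no timestamp, no file name)."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()


def split_then_compress(text: str, max_chars: int) -> list[bytes]:
    """Apply R7: split the plain text first, then compress each part."""
    return [gzip_reproducible(p.encode("utf-8")) for p in split_text(text, max_chars)]
```

Because each part is compressed independently, any single part can be decompressed and read without the others.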

PHASE G: PERFORMANCE & RAM HARDENING

  • G001: Replace llm_text_parts joins with streaming chunk writers.
  • G002: Replace in-memory processed_entries with temp file references or incremental appends.
  • G003: Add size guardrails (max_file_bytes, max_pdf_pages, max_image_pixels).
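
A streaming writer in the spirit of G001 and G003 might look like the sketch below: sources are copied into the merged output in fixed-size chunks, and files above a size limit are skipped and reported. The function name, the chunk size, and the skip-and-report behavior are assumptions; only the max_file_bytes name comes from the roadmap.

```python
import os


def stream_merge(sources, dest, max_file_bytes, chunk_chars=64 * 1024):
    """Append each source to dest chunk by chunk; return the skipped paths.

    Memory use is bounded by chunk_chars, not by total input size.
    """
    skipped = []
    with open(dest, "w", encoding="utf-8", newline="") as out:
        for src in sources:
            if os.path.getsize(src) > max_file_bytes:
                skipped.append(src)  # guardrail: caller decides how to log this
                continue
            with open(src, "r", encoding="utf-8", errors="replace", newline="") as fh:
                while True:
                    chunk = fh.read(chunk_chars)
                    if not chunk:
                        break
                    out.write(chunk)
    return skipped
```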

PHASE H: SQLITE RUNTIME STATE

  • H001: stateDb: WAL-mode SQLite for runs, file status, metrics, and checkpoints.
  • H002: Persistence: Enable resume-by-run-id after failure.
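
A minimal sketch of the WAL-mode state store and resume-by-run-id idea follows. The table name, column names, and status values are hypothetical; the real stateDbPy.py schema is not specified in this plan.

```python
import sqlite3


def open_state_db(path: str) -> sqlite3.Connection:
    """Open (or create) the state database in WAL journal mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_status ("
        " run_id TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " PRIMARY KEY (run_id, path))"
    )
    conn.commit()
    return conn


def mark_file(conn: sqlite3.Connection, run_id: str, path: str, status: str) -> None:
    """Record the latest status for one file; committing acts as the checkpoint."""
    conn.execute(
        "INSERT OR REPLACE INTO file_status (run_id, path, status) VALUES (?, ?, ?)",
        (run_id, path, status),
    )
    conn.commit()


def pending_files(conn: sqlite3.Connection, run_id: str, all_paths: list) -> list:
    """Return the paths not yet marked 'done' for this run, in input order."""
    rows = conn.execute(
        "SELECT path FROM file_status WHERE run_id = ? AND status = 'done'",
        (run_id,),
    )
    done = {row[0] for row in rows}
    return [p for p in all_paths if p not in done]
```

After a crash, reopening the database and calling pending_files with the original run id yields the files that still need work.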

PHASE I: JOB SYSTEM EXPANSION

  • I001: Multi-job batch execution (Run Selected, Run All).
  • I002: Add Job Management UI: Enable/Disable, Reorder, Delete.
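
Under R3, a "Run All" batch could be a plain sequential loop over the enabled jobs in job_order. The sketch below assumes jobs are dicts with the v2 keys name, enabled, and job_order; run_one is a hypothetical callable standing in for the real extraction entry point. One job failing does not stop the remaining jobs.

```python
def run_jobs_sequentially(jobs, run_one):
    """Run enabled jobs one at a time in job_order; return (name, status) pairs."""
    enabled = [j for j in jobs if j.get("enabled", True)]
    results = []
    for job in sorted(enabled, key=lambda j: j.get("job_order", 0)):
        name = job.get("name")
        try:
            run_one(job)
            results.append((name, "ok"))
        except Exception as exc:  # record the failure and continue with the next job
            results.append((name, f"failed: {exc}"))
    return results
```

In a Tkinter application this loop would typically run on a worker thread so the UI stays responsive; that wiring is outside the scope of the sketch.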

PHASE J: NATIVE HELPERS (RUST)

  • J001: Optional Rust CLI for fast hashing, manifest generation, and PDF preflight.

4. DETAILED POST-PROCESSING POLICY

Supported Modes

  • Plain only: Standard text output.
  • Plain + Compressed copy: Source remains + archive copy.
  • Compressed only: Space-saving mode (source deleted only after the archive has been verified).
  • Plain + Splitter: Source text divided into chunks.
  • Plain + Compressed + Splitter: Chunks archived individually.
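
For the "Compressed only" mode, the verification step could be a round-trip hash comparison before the source is removed. This is a sketch under assumptions: gzip as the format, SHA-256 as the check, and the function name are all illustrative. Both files are streamed in chunks so the check does not load a large output into memory.

```python
import gzip
import hashlib
import os
import shutil


def _sha256(stream) -> str:
    """Hash a binary stream in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_then_delete_source(path: str) -> str:
    """Write path + '.gz', verify it decompresses to the same bytes, then delete path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open(path, "rb") as src:
        expected = _sha256(src)
    with gzip.open(gz_path, "rb") as gz:
        actual = _sha256(gz)
    if expected != actual:
        os.remove(gz_path)  # keep the source, discard the bad archive
        raise ValueError(f"verification failed for {gz_path}")
    os.remove(path)  # safe: the archive round-trips to identical content
    return gz_path
```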

Splitter Feature Mapping

  • splitter_merger_console_v1_2: Primary engine for pattern/size/meta-block splits.
  • split_txt_by_line_marker.py: Specialist tool for exact-cut manual boundaries.
  • .cmd wrappers: Launchers only; do not own business logic.

Codified 2026-05-14 | Sources: Q#46, Q#47, Q#48 | Mandate: NO DATA LOSS