PY_Txt_Extractor: The Mega Master Evolution Plan (V9B to V9J+)

This document is the canonical, exhaustive roadmap for the production hardening and evolution of the PY_Txt_Extractor toolchain. It consolidates the modular architecture, format-routing/OCR lanes, and post-processing/compression/splitting lifecycle mandates.

1. CORE MANDATES & NON-NEGOTIABLE RULES

R1: No Regression: Every existing output must keep working: merged TXT, per-file TXT, per-folder merged TXT, LLM export, run logs, and JSON job persistence.
R2: No Direct Engine Rewrites: Move old logic into modules first; optimize incrementally (RAM reduction, checkpoints, streaming) only after.
R3: Sequential First: Multi-job execution is sequential for stability and Tkinter responsiveness.
R4: State to Disk: No future features shall depend on giant in-memory buffers (e.g., llm_text_parts joins).
R5: Additive Merge Only: Config changes may only add fields; all existing JSON hooks and top-level keys are preserved.
R6: Lossless Only: Compression must be lossless (gzip, zip, lzma) and never replace source text silently.
R7: Split First, Compress Second: Split plain text first, then optionally compress parts.
R8: Safe Isolation: All development happens in isolated version folders (V9B, V9C, etc.) to protect V9A production.

2. ARCHITECTURAL TARGET MODEL

A. Target File System

main.py: Bootstrapping.
guiPy.py: Widgets and UI handlers.
coreenginePy.py: Extraction logic and routing.
jobconfigPy.py: JSON config handling (Schema v2).
stateDbPy.py: SQLite runtime state (checkpoints/metrics).
globrulesPy.py: fnmatch wildcard matching engine.
compressionPy.py: Lossless compression drivers.
splitterBridgePy.py: Wrapper for existing splitter tools.
metadataContractPy.py: Shared constants (e.g., === FILE META NEW FILE STARTING ===).
PyTxtExtConfig.json: Saved jobs and settings.
PyTxtExtState.sqlite3: Runtime history and checkpoints.
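
As an illustration of the globrulesPy.py entry above, here is a minimal sketch of fnmatch-based exclusion matching. The helper name is_excluded and its semantics are assumptions, not part of the plan: each pattern is tried against both the full relative path and the bare file name, case-insensitively.

```python
from __future__ import annotations

from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(rel_path: str, globs: list[str]) -> bool:
    """Return True if rel_path matches any exclusion glob.

    Hypothetical helper: separators are normalized to "/" and both the
    path and the patterns are lowercased before matching.
    """
    normalized = rel_path.replace("\\", "/").lower()
    name = PurePosixPath(normalized).name
    for pattern in globs:
        pat = pattern.replace("\\", "/").lower()
        # Note: fnmatch's "*" also matches "/" characters.
        if fnmatchcase(normalized, pat) or fnmatchcase(name, pat):
            return True
    return False
```

Because fnmatch lets "*" cross directory separators, a pattern such as build/* excludes everything below build, not only its direct children. Whether that is the desired behavior is a design decision this sketch does not settle.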

B. Format Extraction Lanes

Lane 1 (Native, primary): .txt, .md, digital .pdf, .docx, .xlsx, .pptx.
Lane 2 (OCR Fallback): Scanned PDFs, Images. Use OCRmyPDF or PyMuPDF+Tesseract.
Lane 3 (Legacy Conversion): Binary .doc, .xls, .ppt (Phase 8+).

C. Output Post-Processing Lifecycle

EXTRACT -> WRITE PLAIN -> SPLIT -> COMPRESS -> LOG EVERYTHING

3. MEGA FAT TINY ATOMIC ROADMAP

PHASE A: FREEZE AND MIRROR (DONE)

A001: Freeze TxtExtPyV_9A as read-only.
A002: Create TxtExtPyV_9B_GlobSupport isolated folder.
A003: Record source baseline, timestamps, and known bug/mismatch lists.

PHASE B: SCHEMA HARDENING (DONE)

B001: Migrate v1 dict jobs to v2 list jobs.
B002: Structure: excludes (files/folders/globs), targets (files/folders), destinations.
B003: Add enabled, job_order, extractor_mode, ocr_mode, and postprocess blocks.
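
The B001 to B003 steps can be sketched as a small migration function. The v1 shape (jobs stored as a dict keyed by job name), the schema_version key, and the default values "auto" and "off" are assumptions made for illustration; only the field names come from the plan.

```python
import copy


def migrate_v1_to_v2(config: dict) -> dict:
    """Convert a v1 config (jobs keyed by name) into v2 (jobs as a list).

    Assumed v1 shape: {"jobs": {"<name>": {...}}}. Other top-level keys
    are copied through untouched, in the spirit of rule R5.
    """
    migrated = copy.deepcopy(config)    # never mutate the caller's config
    jobs = migrated.get("jobs", {})
    if isinstance(jobs, list):          # already v2: nothing to convert
        return migrated
    job_list = []
    for order, (name, old_job) in enumerate(jobs.items()):
        job = dict(old_job)             # keep every existing per-job key
        job.setdefault("name", name)
        job.setdefault("enabled", True)
        job.setdefault("job_order", order)
        job.setdefault("extractor_mode", "auto")   # placeholder default
        job.setdefault("ocr_mode", "off")          # placeholder default
        job.setdefault("excludes", {"files": [], "folders": [], "globs": []})
        job.setdefault("targets", {"files": [], "folders": []})
        job.setdefault("destinations", [])
        job.setdefault("postprocess", {})
        job_list.append(job)
    migrated["jobs"] = job_list
    migrated["schema_version"] = 2      # hypothetical version marker
    return migrated
```

Using setdefault throughout means existing values always win over the new defaults, so running the migration twice leaves a v2 config unchanged.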

PHASE C: MODULAR SKELETONS (NEXT)

C001: Create modular files (main.py, guiPy.py, coreenginePy.py, jobconfigPy.py, stateDbPy.py, globrulesPy.py).
C002: Implement metadataContractPy.py with shared constants.
C003: Implement classify_input_file(path) router in coreenginePy.py.
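
A minimal sketch of what the classify_input_file(path) router from C003 might look like, assuming it routes on file extension alone and returns a lane name as a string. The return values, the "unsupported" fallback, and the .jpeg/.tiff additions are assumptions. A PDF is routed to the native lane here; telling digital from scanned PDFs needs a content check that an extension lookup cannot provide.

```python
from pathlib import Path

# Extension sets follow the three lanes described in section 2B.
NATIVE_EXTS = {".txt", ".md", ".pdf", ".docx", ".xlsx", ".pptx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}
LEGACY_EXTS = {".doc", ".xls", ".ppt"}


def classify_input_file(path: str) -> str:
    """Return "native", "ocr", "legacy", or "unsupported" for a path."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE_EXTS:
        return "native"
    if ext in IMAGE_EXTS:
        return "ocr"
    if ext in LEGACY_EXTS:
        return "legacy"
    return "unsupported"
```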

PHASE D: NATIVE DRIVERS & OFFICE LANES

D001: Add python-docx, openpyxl, and python-pptx extractors.
D002: Implement image intake classifier for .png, .jpg, .tif, .webp, etc.
D003: Define ExtractionResult object for routing.
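
The plan names an ExtractionResult object but does not define its fields. The dataclass below is one possible shape, with every field name an assumption. In keeping with R4 it stores a path to the extracted text rather than the text itself.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ExtractionResult:
    """Hypothetical per-file result passed between router and drivers."""

    source_path: Path
    lane: str                           # "native", "ocr", or "legacy"
    ok: bool
    text_path: Optional[Path] = None    # extracted text lives on disk
    error: Optional[str] = None
    warnings: list = field(default_factory=list)

    @property
    def needs_ocr_fallback(self) -> bool:
        """True when a native-lane attempt failed and OCR may be tried."""
        return self.lane == "native" and not self.ok
```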

PHASE E: OCR LANE INTEGRATION

E001: Add optional OCRmyPDF and PyMuPDF+Tesseract pipelines.
E002: Implement scanned-PDF detector to route to OCR only when needed.
E003: Office fallback: OCR embedded images in docx/pptx.
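
One way to approach the scanned-PDF detector in E002 is a simple heuristic over the text that native extraction returns for each page. The function name looks_scanned and both thresholds are assumptions. The per-page strings would come from whichever native PDF driver is in use (for example PyMuPDF's page.get_text()).

```python
def looks_scanned(page_texts, min_chars_per_page=25, max_empty_ratio=0.5):
    """Guess whether a PDF is scanned from its natively extracted text.

    page_texts: list of strings, one per page. A page counts as "empty"
    when it yields fewer than min_chars_per_page non-blank characters.
    The PDF is treated as scanned when more than max_empty_ratio of its
    pages are empty. Both thresholds are illustrative guesses.
    """
    if not page_texts:
        return False
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return empty / len(page_texts) > max_empty_ratio
```

A heuristic like this keeps OCR off the hot path for digital PDFs, which is the point of routing to OCR "only when needed". Mixed documents (some scanned pages, some digital) would need a per-page decision instead of a per-file one.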

PHASE F: POST-PROCESSING (COMPRESS & SPLIT)

F001: compressionPy.py: Implement gzip (reproducible), zip, and lzma.
F002: splitterBridgePy.py: Integrate existing splitter_merger_console_v1_2 and marker_tool.
F003: Implement order matrix: Split plain first, then compress parts.
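
The "split plain first, then compress parts" order, together with reproducible gzip output, can be sketched as follows. The line-boundary splitter here is a stand-in for the existing splitter tools, and the function names and part-naming scheme are assumptions. Reproducibility comes from writing the gzip header with mtime=0 and no embedded file name.

```python
from __future__ import annotations

import gzip
from pathlib import Path


def split_text_file(src: Path, out_dir: Path, max_bytes: int) -> list[Path]:
    """Split src into parts of roughly max_bytes, cutting only at line ends."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    buf: list[bytes] = []
    size = 0

    def flush() -> None:
        nonlocal buf, size
        if buf:
            part = out_dir / f"{src.stem}.part{len(parts) + 1:04d}.txt"
            part.write_bytes(b"".join(buf))
            parts.append(part)
            buf, size = [], 0

    with src.open("rb") as fh:
        for line in fh:
            if buf and size + len(line) > max_bytes:
                flush()
            buf.append(line)
            size += len(line)
    flush()
    return parts


def gzip_reproducible(part: Path) -> Path:
    """Write <part>.gz with a fixed header; the plain part is kept."""
    target = part.with_name(part.name + ".gz")
    with part.open("rb") as src, target.open("wb") as raw:
        # filename="" and mtime=0 keep the header identical across runs.
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            for chunk in iter(lambda: src.read(1 << 20), b""):
                gz.write(chunk)
    return target
```

Compressing after splitting means each part can be decompressed on its own, and the plain parts stay on disk unless a separate, explicit step removes them.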

PHASE G: PERFORMANCE & RAM HARDENING

G001: Replace llm_text_parts joins with streaming chunk writers.
G002: Replace in-memory processed_entries with temp file references or incremental appends.
G003: Add size guardrails (max_file_bytes, max_pdf_pages, max_image_pixels).
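
G001 and G003 can be illustrated together: instead of joining a list of strings in memory, each source is appended to the output in fixed-size chunks, and files above a size guardrail are skipped and reported. The function names are assumptions; max_file_bytes is the guardrail name from G003.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, Iterator


def iter_file_chunks(path: Path, chunk_chars: int = 64 * 1024) -> Iterator[str]:
    """Yield the file's text in chunks so it is never fully in memory."""
    # newline="" disables newline translation, so text is copied as-is.
    with path.open("r", encoding="utf-8", errors="replace", newline="") as fh:
        while True:
            chunk = fh.read(chunk_chars)
            if not chunk:
                return
            yield chunk


def append_streaming(out_path: Path, sources: Iterable[Path],
                     max_file_bytes: int) -> list[Path]:
    """Append each source to out_path chunk by chunk.

    Returns the sources skipped for exceeding max_file_bytes so the
    caller can log them instead of dropping them silently.
    """
    skipped: list[Path] = []
    with out_path.open("a", encoding="utf-8", newline="") as out:
        for src in sources:
            if src.stat().st_size > max_file_bytes:
                skipped.append(src)
                continue
            for chunk in iter_file_chunks(src):
                out.write(chunk)
    return skipped
```

Peak memory is bounded by the chunk size, not by the size of the merged output.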

PHASE H: SQLITE RUNTIME STATE

H001: stateDb: WAL-mode SQLite for runs, file status, metrics, and checkpoints.
H002: Persistence: Enable resume-by-run-id after failure.
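
A minimal sketch of WAL-mode state tracking with resume-by-run-id, using the standard sqlite3 module. The table name, columns, and status values are assumptions; the plan specifies only that runs, file status, metrics, and checkpoints are stored.

```python
import sqlite3


def open_state_db(path: str) -> sqlite3.Connection:
    """Open the state database in WAL mode and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS file_status (
               run_id    TEXT NOT NULL,
               file_path TEXT NOT NULL,
               status    TEXT NOT NULL,
               PRIMARY KEY (run_id, file_path)
           )"""
    )
    conn.commit()
    return conn


def mark_file(conn: sqlite3.Connection, run_id: str, file_path: str, status: str) -> None:
    """Record the latest status for one file; commit so it survives a crash."""
    conn.execute(
        "INSERT OR REPLACE INTO file_status (run_id, file_path, status) VALUES (?, ?, ?)",
        (run_id, file_path, status),
    )
    conn.commit()


def pending_files(conn: sqlite3.Connection, run_id: str, all_files: list) -> list:
    """Return the files of a run that are not yet marked 'done'."""
    rows = conn.execute(
        "SELECT file_path FROM file_status WHERE run_id = ? AND status = 'done'",
        (run_id,),
    )
    done = {row[0] for row in rows}
    return [f for f in all_files if f not in done]
```

On resume, the engine would call pending_files with the original run id and process only what it returns. Files previously marked as failed are retried because only 'done' rows are filtered out.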

PHASE I: JOB SYSTEM EXPANSION

I001: Multi-job batch execution (Run Selected, Run All).
I002: Add Job Management UI: Enable/Disable, Reorder, Delete.
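
Sequential batch execution (R3) over the v2 job list could look like the sketch below. It assumes each job is a dict carrying the name, enabled, and job_order fields, and that a failing job should be recorded without stopping the batch. That failure policy is an assumption, as are the function names.

```python
def select_jobs(jobs, selected_names=None):
    """Return enabled jobs sorted by job_order.

    selected_names=None models "Run All"; a set of names models
    "Run Selected".
    """
    chosen = [
        job for job in jobs
        if job.get("enabled", True)
        and (selected_names is None or job.get("name") in selected_names)
    ]
    return sorted(chosen, key=lambda job: job.get("job_order", 0))


def run_batch(jobs, run_one, selected_names=None):
    """Run jobs one at a time; collect (name, outcome) pairs."""
    results = []
    for job in select_jobs(jobs, selected_names):
        try:
            run_one(job)
            results.append((job["name"], "ok"))
        except Exception as exc:        # keep going; the caller logs failures
            results.append((job["name"], f"failed: {exc}"))
    return results
```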

PHASE J: NATIVE HELPERS (RUST)

J001: Optional Rust CLI for fast hashing, manifest generation, and PDF preflight.

4. DETAILED POST-PROCESSING POLICY

Supported Modes

Plain only: Standard text output.
Plain + Compressed copy: Source remains + archive copy.
Compressed only: Space-saving mode; the plain source is deleted only after the archive has been verified, and the deletion is logged.
Plain + Splitter: Source text divided into chunks.
Plain + Compressed + Splitter: Chunks archived individually.
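
For the "Compressed only" mode, the verification step can be sketched as a round-trip hash comparison: the source is removed only when the decompressed archive hashes to the same value. The function names and the choice of lzma with SHA-256 are assumptions. A real implementation would also log the deletion so the source is never replaced silently.

```python
import hashlib
import lzma
import shutil
from pathlib import Path


def _sha256_of(fileobj) -> str:
    """Hash a binary file object in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_only(src: Path) -> Path:
    """Create <src>.xz, verify it round-trips, then delete src."""
    target = src.with_name(src.name + ".xz")
    with src.open("rb") as fin, lzma.open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)

    with src.open("rb") as fin:
        original = _sha256_of(fin)
    with lzma.open(target, "rb") as fin:
        restored = _sha256_of(fin)

    if original != restored:
        target.unlink()                 # discard the bad archive, keep source
        raise ValueError(f"verification failed for {src}")
    src.unlink()                        # safe: archive proven lossless
    return target
```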

Splitter Feature Mapping

splitter_merger_console_v1_2: Primary engine for pattern/size/meta-block splits.
split_txt_by_line_marker.py: Specialist tool for exact-cut manual boundaries.
.cmd wrappers: Launchers only; do not own business logic.

Codified 2026-05-14 | Sources: Q#46, Q#47, Q#48 | Mandate: NO DATA LOSS
