Feat/validation driven extraction #416

Open
Calebnzm wants to merge 4 commits into fireform-core:main from Calebnzm:feat/validation-driven-extraction

@Calebnzm Calebnzm commented Apr 7, 2026

Summary

This PR introduces a validation-driven extraction pipeline — replacing the previous single-pass, prompt-per-field approach with a multi-phase extraction agent that performs both syntactic and semantic validation of extracted values, with iterative self-correction.

It builds directly on feat: robust report management and configuration architecture + API, which introduced the Report Schema system with field-level configuration (data types, word limits, allowed values, descriptions). That is why this PR looks large: it is stacked on top of the previous one, and only modifies 2 files from it — kindly review it with that in mind. This PR puts that configuration to work — the extraction pipeline now enforces those constraints at inference time and uses them to drive accurate, validated extraction.

In a nutshell: Instead of extracting each field independently with no quality checks, the LLM now extracts all canonical fields in a single structured pass, validates every result against the schema's constraints, and iteratively re-prompts itself to correct failures — producing verified, schema-compliant output.


Previous State

The extraction logic (LLM class) had the following limitations:

  1. One field at a time. main_loop iterated over fields and fired a separate Ollama prompt for each one. For a schema with 30 fields, that meant 30 independent inference calls with no shared context between them.

  2. No validation whatsoever. The LLM's raw text response was stored directly. If it returned "fourteen" for an int field, or a 200-word paragraph for a field with a 10-word limit, the system accepted it without question.

  3. Hardcoded to a single provider. The Ollama URL and model name (mistral) were embedded directly in the extraction loop. Switching to a cloud provider required code changes.

  4. No structured output. The LLM returned a raw string. There was no confidence signal, no reasoning trace, and no machine-parseable structure to validate against.

  5. Field configuration was unused. The previous PR introduced data_type, word_limit, allowed_values, and description on SchemaField — but the extraction pipeline didn't use any of them.


What This PR Introduces

1. Multi-Provider LLM Abstraction

The LLM class now supports Ollama (local) and Gemini (cloud) through a unified interface, configured via environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| `LLM_PROVIDER` | `ollama` | Provider selection (`ollama` or `gemini`) |
| `LLM_MODEL` | `mistral` | Model name passed to the provider |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `GEMINI_API_KEY` | — | Google Gemini API key |

set_model_config() can be called with explicit arguments to override env vars, or with no arguments to use defaults. The previous hardcoded provider="gemini", model_name="gemini-2.5-flash" in file_manipulator.py has been removed — the system now respects the environment configuration.

inference() is a single classmethod that accepts an OpenAI-style messages array and routes to the correct provider API, translating message formats as needed (e.g. Gemini's system_instruction field, "model" role naming, responseMimeType for JSON mode).
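The message translation can be illustrated with a minimal sketch. This is not the code in `src/llm.py` — `to_gemini_payload` and `pick_provider` are hypothetical helpers showing the shape of the translation the PR describes (system messages folded into `system_instruction`, `assistant` renamed to `model`):

```python
import os

def to_gemini_payload(messages):
    """Translate an OpenAI-style messages array into the shape the
    Gemini API expects: system messages become system_instruction,
    and the 'assistant' role is renamed to 'model'. Sketch only --
    the real inference() also handles responseMimeType for JSON mode
    and performs the HTTP call itself."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    contents = [
        {"role": "model" if m["role"] == "assistant" else "user",
         "parts": [{"text": m["content"]}]}
        for m in messages if m["role"] != "system"
    ]
    payload = {"contents": contents}
    if system_parts:
        payload["system_instruction"] = {
            "parts": [{"text": "\n".join(system_parts)}]}
    return payload

def pick_provider():
    # Environment-driven routing, mirroring the set_model_config() defaults.
    return os.environ.get("LLM_PROVIDER", "ollama")
```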

2. Syntactic Validator

A static method that validates an extracted value against its CanonicalFieldEntry descriptor:

  • Data type enforcement — checks int parseability, string/date type conformance
  • Word limit — rejects strings exceeding word_limit
  • Allowed values — for enum fields, checks membership in the configured value set

Returns a list of structured error objects (e.g. {"data_type_error": "expected: int, however 'fourteen' is: str"}) or None if valid.
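The checks above can be sketched as a plain function. Here the descriptor is a plain dict standing in for `CanonicalFieldEntry`, and the key names (`data_type`, `word_limit`, `allowed_values`) follow the schema fields named earlier; the real validator is a static method on the LLM class:

```python
def syntactic_validator(value, field):
    """Validate an extracted value against a field descriptor (sketch).
    Returns a list of structured error dicts, or None when valid."""
    errors = []
    # Data type enforcement: e.g. int parseability.
    if field.get("data_type") == "int":
        try:
            int(value)
        except (TypeError, ValueError):
            errors.append({"data_type_error":
                f"expected: int, however '{value}' is: {type(value).__name__}"})
    # Word limit: reject strings exceeding the configured limit.
    limit = field.get("word_limit")
    if limit is not None and isinstance(value, str):
        count = len(value.split())
        if count > limit:
            errors.append({"word_limit_error":
                f"word count {count} exceeds limit of {limit}"})
    # Allowed values: membership check for enum-style fields.
    allowed = field.get("allowed_values")
    if allowed and value not in allowed:
        errors.append({"allowed_values_error":
            f"'{value}' is not in the allowed set {allowed}"})
    return errors or None
```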

3. Semantic Validator

An LLM-as-judge pattern: a second inference call receives the extracted values alongside the original source text, field descriptions, and the extractor's reasoning — then evaluates whether each extraction is semantically correct.

Returns a dict mapping field names to error descriptions for any fields that fail validation. Fields not in the dict are considered semantically valid.
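That acceptance rule — absence from the error dict means valid — can be expressed in a few lines. `apply_semantic_verdict` is a hypothetical helper for illustration, not a function in the PR:

```python
def apply_semantic_verdict(pending, verdict):
    """Split pending extractions by the judge's verdict: fields named in
    the verdict dict failed semantic validation; all others are accepted.
    Sketch of the interpretation rule, not the actual pipeline code."""
    accepted = {k: v for k, v in pending.items() if k not in verdict}
    failed = {k: verdict[k] for k in pending if k in verdict}
    return accepted, failed
```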

4. Validation-Driven Extraction Pipeline (extractor)

The core addition. A multi-phase agent loop that replaces main_loop for schema-driven extraction:

```mermaid
flowchart TD
    START["Receive canonical field descriptors"] --> EXTRACT

    EXTRACT["Send all pending fields to LLM in a single prompt
    Each field includes: description, expected type,
    word limit, required status, and allowed values"]

    EXTRACT --> PARSE["LLM responds with a JSON map:
    one entry per field containing the extracted value,
    its reasoning, and a confidence score"]

    PARSE --> SYNTACTIC["Syntactic Validation
    Check each value against its field descriptor:
    - Does the type match? e.g. int, string, date
    - Is the word count within the limit?
    - Is the value in the allowed set, if constrained?"]

    SYNTACTIC --> SYN_OK{"All fields
    syntactically valid?"}
    SYN_OK -- No --> CORRECT["Send failed fields back to the LLM
    with their specific errors, asking it
    to re-extract only those fields"]
    CORRECT -- "Retry up to 5 times" --> SYNTACTIC
    SYN_OK -- Yes --> CONFIDENCE

    CONFIDENCE{"Is the LLM's self-reported
    confidence >= 90%?"}
    CONFIDENCE -- Yes --> ACCEPT["Accept value"]
    CONFIDENCE -- No --> SEMANTIC

    SEMANTIC["Semantic Validation
    A separate LLM call reviews low-confidence values
    against the original source text and field descriptions
    to check if they actually make sense in context"]

    SEMANTIC --> SEM_OK{"Semantically
    correct?"}
    SEM_OK -- Yes --> ACCEPT
    SEM_OK -- No --> FEEDBACK["Feed the semantic errors back
    into the conversation history
    so the next extraction attempt
    has context on what went wrong"]
    FEEDBACK --> EXTRACT

    ACCEPT --> REMAINING{"Unresolved
    fields left?"}
    REMAINING -- "Yes, up to 10 iterations" --> EXTRACT
    REMAINING -- No --> DONE["Return all validated extracted values"]
```

Phase-by-phase:

Phase 1 — Batch Extraction. All pending canonical fields are sent to the LLM in a single prompt, with their full descriptors (description, expected data type, word limit, required status, allowed values). The LLM responds with a structured JSON map where each field contains candidate_value, reasoning, and confidence. The entire conversation history is maintained across iterations, giving the LLM context from prior attempts.

Phase 2 — Syntactic Validation & Correction. Each extracted value is validated against its field descriptor using syntactic_validator. Fields that fail are collected with their specific errors and re-prompted in a correction loop (up to 5 retries per outer iteration). The correction prompt includes the previous invalid output and the exact errors, allowing the LLM to fix targeted issues without re-extracting valid fields.

Phase 3 — Confidence Filtering. Syntactically valid fields are split by confidence score:

  • ≥ 0.90 — accepted directly into the results dict and removed from the pending set.
  • < 0.90 — forwarded to semantic validation.
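The split is a simple threshold filter. A sketch, assuming each extracted entry carries the self-reported `confidence` score described in Phase 1 (names are illustrative):

```python
def split_by_confidence(extracted, threshold=0.90):
    """Phase 3 as a pair of dict filters: entries at or above the
    threshold are accepted outright; the rest are forwarded to
    semantic validation. 0.90 is the pipeline's documented cutoff."""
    accepted = {k: v for k, v in extracted.items()
                if v["confidence"] >= threshold}
    low = {k: v for k, v in extracted.items()
           if v["confidence"] < threshold}
    return accepted, low
```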

Phase 4 — Semantic Validation. Low-confidence fields are batched and sent to semantic_validator, which invokes a separate LLM call to evaluate correctness against the source text. Fields that pass are accepted. Fields that fail have their error descriptions appended to the conversation history as user feedback, and the outer loop re-extracts them with that context.

Termination. The outer loop continues until all fields are resolved or the max iteration count (10) is reached. Unresolved fields after max retries are set to None.
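The four phases and the termination rule compose into the outer loop below. This is a dependency-injected skeleton, not the code in `src/llm.py`: `extract`, `validate`, `judge`, and `confidence` are hypothetical stand-ins for the batch inference call, the syntactic validator, the semantic judge, and the confidence lookup, and conversation-history management is elided:

```python
def extractor_loop(pending, extract, validate, judge, confidence,
                   max_iterations=10, max_retries=5, threshold=0.90):
    """Skeleton of the validation-driven extraction loop (sketch).
    Unresolved fields after max_iterations come back as None."""
    results = {}
    pending = set(pending)
    for _ in range(max_iterations):
        if not pending:
            break
        extracted = extract(pending)                  # Phase 1: batch extraction
        for _ in range(max_retries):                  # Phase 2: syntactic retries
            failed = {f for f in extracted if validate(f, extracted[f])}
            if not failed:
                break
            extracted.update(extract(failed))         # re-extract only the failures
        for f in failed:
            extracted.pop(f, None)                    # still invalid: retry next pass
        low = {f for f in extracted
               if confidence(extracted[f]) < threshold}
        for f in set(extracted) - low:                # Phase 3: accept high confidence
            results[f] = extracted[f]
            pending.discard(f)
        sem_errors = judge(low) if low else {}        # Phase 4: LLM-as-judge
        for f in low - set(sem_errors):
            results[f] = extracted[f]
            pending.discard(f)
        # Fields named in sem_errors stay pending; in the real pipeline their
        # error descriptions join the conversation history for the next pass.
    return {**{f: None for f in pending}, **results}
```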

5. Refactored Legacy Path

main_loop — used for standalone template filling — has been updated to use the unified inference() method instead of direct HTTP calls. Its behavior is otherwise unchanged.


Files Changed

| File | Change |
| --- | --- |
| `src/llm.py` | +390 −25 — Provider abstraction, inference routing, syntactic/semantic validators, extractor pipeline, main_loop refactor |
| `src/file_manipulator.py` | +1 −1 — Removed hardcoded provider/model from set_model_config() call |

How It Fits Into the Fill Pipeline

The extraction pipeline is invoked during the schema fill flow (POST /schemas/{id}/fill):

```
API Request → Controller.fill_report() → FileManipulator.fill_report()
    → ReportSchemaProcessor.canonize()     # Build canonical target
    → LLM.extractor(canonical_schema)      # THIS PR — validated extraction
    → ReportSchemaProcessor.distribute()   # Map canonical → per-template
    → Filler.fill_form_by_name()           # Write to PDFs
```

The extractor receives a CanonicalSchema — the output of canonization from the previous PR — and returns a dict[str, Any] mapping canonical field names to validated values. These are then distributed to per-template field dicts via the field mappings and written to the PDF forms.


Related Issues

| Issue | Title | How This PR Addresses It |
| --- | --- | --- |
| Closes #403 | Optimize LLM.main_loop() for batch extraction of fields | The new extractor() replaces per-field sequential calls with a single batch extraction prompt. All canonical fields are sent in one structured request, regardless of form size. |
| Closes #399 | Batch LLM extraction for faster PDF processing | Same as above — batch extraction eliminates N round-trips. A 30-field schema now requires one extraction call instead of 30. |
| Closes #391 | Implement Batch LLM Extraction for 10x Performance Gain | Same as above. Additionally, format="json" is now used across both Ollama and Gemini providers to guarantee structured JSON output. |
| Closes #412 | Prompt Engineering Layer for Robust LLM-Based Data Extraction | The extraction prompt is now structured with per-field descriptors: description, expected data type, word limit, required status, and allowed values. This gives the LLM explicit, field-level guidance instead of a generic prompt. |
| Closes #409 | Add Field-Level Validation Feedback (Human-Readable Errors for Extracted Data) | The syntactic validator returns structured error objects per field — e.g. {"data_type_error": "expected: int, however 'fourteen' is: str"}, {"word_limit_error": "word count 25 exceeds limit of 10"}. These are used internally for self-correction and are available for surfacing to users. |
| Closes #256 | Multi-model LLM support | set_model_config() supports Ollama and Gemini via LLM_PROVIDER and LLM_MODEL environment variables. The model is no longer hardcoded to mistral. |
| Related to #408 | Hybrid AI Orchestration & Pydantic Validation Layer | The multi-provider abstraction with environment-based routing is a step toward the proposed hybrid orchestrator. Validation uses Pydantic models (CanonicalFieldEntry) for schema enforcement. Does not implement hardware-aware routing. |
| Related to #410 | Department-Aware LLM Prompt Templating | Canonical field descriptors (description, data_type, constraints) act as dynamic per-schema prompt configuration, achieving the same goal as per-agency prompt templates — but driven by the schema system rather than static YAML files. |
| Related to #404 | Multi Template batch filling | While the previous PR introduced the POST /schemas/{id}/fill endpoint, the new extractor() is the mechanism that makes single-extraction-multi-fill actually work — extracting once against the canonical target and distributing to all templates. |
