Feat/validation driven extraction #416

Open
Calebnzm wants to merge 4 commits into fireform-core:main from Calebnzm:feat/validation-driven-extraction

@Calebnzm Calebnzm commented Apr 7, 2026

Summary

This PR introduces a validation-driven extraction pipeline — replacing the previous single-pass, prompt-per-field approach with a multi-phase extraction agent that performs both syntactic and semantic validation of extracted values, with iterative self-correction.

It builds directly on feat: robust report management and configuration architecture + API, which introduced the Report Schema system with field-level configuration (data types, word limits, allowed values, descriptions). That is why this PR looks large: it is stacked on top of the previous one, and only modifies 2 files from it — kindly review it with that in mind. This PR puts that configuration to work — the extraction pipeline now enforces those constraints at inference time and uses them to drive accurate, validated extraction.

In a nutshell: Instead of extracting each field independently with no quality checks, the LLM now extracts all canonical fields in a single structured pass, validates every result against the schema's constraints, and iteratively re-prompts itself to correct failures — producing verified, schema-compliant output.


Previous State

The extraction logic (LLM class) had the following limitations:

  1. One field at a time. main_loop iterated over fields and fired a separate Ollama prompt for each one. For a schema with 30 fields, that meant 30 independent inference calls with no shared context between them.

  2. No validation whatsoever. The LLM's raw text response was stored directly. If it returned "fourteen" for an int field, or a 200-word paragraph for a field with a 10-word limit, the system accepted it without question.

  3. Hardcoded to a single provider. The Ollama URL and model name (mistral) were embedded directly in the extraction loop. Switching to a cloud provider required code changes.

  4. No structured output. The LLM returned a raw string. There was no confidence signal, no reasoning trace, and no machine-parseable structure to validate against.

  5. Field configuration was unused. The previous PR introduced data_type, word_limit, allowed_values, and description on SchemaField — but the extraction pipeline didn't use any of them.


What This PR Introduces

1. Multi-Provider LLM Abstraction

The LLM class now supports Ollama (local) and Gemini (cloud) through a unified interface, configured via environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| `LLM_PROVIDER` | `ollama` | Provider selection (`ollama` or `gemini`) |
| `LLM_MODEL` | `mistral` | Model name passed to the provider |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `GEMINI_API_KEY` | — | Google Gemini API key |

set_model_config() can be called with explicit arguments to override env vars, or with no arguments to use defaults. The previous hardcoded provider="gemini", model_name="gemini-2.5-flash" in file_manipulator.py has been removed — the system now respects the environment configuration.

inference() is a single classmethod that accepts an OpenAI-style messages array and routes to the correct provider API, translating message formats as needed (e.g. Gemini's system_instruction field, "model" role naming, responseMimeType for JSON mode).
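The message translation can be illustrated with a minimal sketch. This is not the code in `src/llm.py` — `to_gemini_payload` and `pick_provider` are hypothetical helpers showing the shape of the translation the PR describes (system messages folded into `system_instruction`, `assistant` renamed to `model`):

```python
import os

def to_gemini_payload(messages):
    """Translate an OpenAI-style messages array into the shape the
    Gemini API expects: system messages become system_instruction,
    and the 'assistant' role is renamed to 'model'. Sketch only --
    the real inference() also handles responseMimeType for JSON mode
    and performs the HTTP call itself."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    contents = [
        {"role": "model" if m["role"] == "assistant" else "user",
         "parts": [{"text": m["content"]}]}
        for m in messages if m["role"] != "system"
    ]
    payload = {"contents": contents}
    if system_parts:
        payload["system_instruction"] = {
            "parts": [{"text": "\n".join(system_parts)}]}
    return payload

def pick_provider():
    # Environment-driven routing, mirroring the set_model_config() defaults.
    return os.environ.get("LLM_PROVIDER", "ollama")
```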

2. Syntactic Validator

A static method that validates an extracted value against its CanonicalFieldEntry descriptor:

  • Data type enforcement — checks int parseability, string/date type conformance
  • Word limit — rejects strings exceeding word_limit
  • Allowed values — for enum fields, checks membership in the configured value set

Returns a list of structured error objects (e.g. {"data_type_error": "expected: int, however 'fourteen' is: str"}) or None if valid.
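The checks above can be sketched as a plain function. Here the descriptor is a plain dict standing in for `CanonicalFieldEntry`, and the key names (`data_type`, `word_limit`, `allowed_values`) follow the schema fields named earlier; the real validator is a static method on the LLM class:

```python
def syntactic_validator(value, field):
    """Validate an extracted value against a field descriptor (sketch).
    Returns a list of structured error dicts, or None when valid."""
    errors = []
    # Data type enforcement: e.g. int parseability.
    if field.get("data_type") == "int":
        try:
            int(value)
        except (TypeError, ValueError):
            errors.append({"data_type_error":
                f"expected: int, however '{value}' is: {type(value).__name__}"})
    # Word limit: reject strings exceeding the configured limit.
    limit = field.get("word_limit")
    if limit is not None and isinstance(value, str):
        count = len(value.split())
        if count > limit:
            errors.append({"word_limit_error":
                f"word count {count} exceeds limit of {limit}"})
    # Allowed values: membership check for enum-style fields.
    allowed = field.get("allowed_values")
    if allowed and value not in allowed:
        errors.append({"allowed_values_error":
            f"'{value}' is not in the allowed set {allowed}"})
    return errors or None
```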

3. Semantic Validator

An LLM-as-judge pattern: a second inference call receives the extracted values alongside the original source text, field descriptions, and the extractor's reasoning — then evaluates whether each extraction is semantically correct.

Returns a dict mapping field names to error descriptions for any fields that fail validation. Fields not in the dict are considered semantically valid.
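That acceptance rule — absence from the error dict means valid — can be expressed in a few lines. `apply_semantic_verdict` is a hypothetical helper for illustration, not a function in the PR:

```python
def apply_semantic_verdict(pending, verdict):
    """Split pending extractions by the judge's verdict: fields named in
    the verdict dict failed semantic validation; all others are accepted.
    Sketch of the interpretation rule, not the actual pipeline code."""
    accepted = {k: v for k, v in pending.items() if k not in verdict}
    failed = {k: verdict[k] for k in pending if k in verdict}
    return accepted, failed
```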

4. Validation-Driven Extraction Pipeline (extractor)

The core addition. A multi-phase agent loop that replaces main_loop for schema-driven extraction:

```mermaid
flowchart TD
    START["Receive canonical field descriptors"] --> EXTRACT

    EXTRACT["Send all pending fields to LLM in a single prompt
    Each field includes: description, expected type,
    word limit, required status, and allowed values"]

    EXTRACT --> PARSE["LLM responds with a JSON map:
    one entry per field containing the extracted value,
    its reasoning, and a confidence score"]

    PARSE --> SYNTACTIC["Syntactic Validation
    Check each value against its field descriptor:
    - Does the type match? e.g. int, string, date
    - Is the word count within the limit?
    - Is the value in the allowed set, if constrained?"]

    SYNTACTIC --> SYN_OK{"All fields
    syntactically valid?"}
    SYN_OK -- No --> CORRECT["Send failed fields back to the LLM
    with their specific errors, asking it
    to re-extract only those fields"]
    CORRECT -- "Retry up to 5 times" --> SYNTACTIC
    SYN_OK -- Yes --> CONFIDENCE

    CONFIDENCE{"Is the LLM's self-reported
    confidence >= 90%?"}
    CONFIDENCE -- Yes --> ACCEPT["Accept value"]
    CONFIDENCE -- No --> SEMANTIC

    SEMANTIC["Semantic Validation
    A separate LLM call reviews low-confidence values
    against the original source text and field descriptions
    to check if they actually make sense in context"]

    SEMANTIC --> SEM_OK{"Semantically
    correct?"}
    SEM_OK -- Yes --> ACCEPT
    SEM_OK -- No --> FEEDBACK["Feed the semantic errors back
    into the conversation history
    so the next extraction attempt
    has context on what went wrong"]
    FEEDBACK --> EXTRACT

    ACCEPT --> REMAINING{"Unresolved
    fields left?"}
    REMAINING -- "Yes, up to 10 iterations" --> EXTRACT
    REMAINING -- No --> DONE["Return all validated extracted values"]
```

Phase-by-phase:

Phase 1 — Batch Extraction. All pending canonical fields are sent to the LLM in a single prompt, with their full descriptors (description, expected data type, word limit, required status, allowed values). The LLM responds with a structured JSON map where each field contains candidate_value, reasoning, and confidence. The entire conversation history is maintained across iterations, giving the LLM context from prior attempts.

Phase 2 — Syntactic Validation & Correction. Each extracted value is validated against its field descriptor using syntactic_validator. Fields that fail are collected with their specific errors and re-prompted in a correction loop (up to 5 retries per outer iteration). The correction prompt includes the previous invalid output and the exact errors, allowing the LLM to fix targeted issues without re-extracting valid fields.

Phase 3 — Confidence Filtering. Syntactically valid fields are split by confidence score:

  • ≥ 0.90 — accepted directly into the results dict and removed from the pending set.
  • < 0.90 — forwarded to semantic validation.
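The split is a simple threshold filter. A sketch, assuming each extracted entry carries the self-reported `confidence` score described in Phase 1 (names are illustrative):

```python
def split_by_confidence(extracted, threshold=0.90):
    """Phase 3 as a pair of dict filters: entries at or above the
    threshold are accepted outright; the rest are forwarded to
    semantic validation. 0.90 is the pipeline's documented cutoff."""
    accepted = {k: v for k, v in extracted.items()
                if v["confidence"] >= threshold}
    low = {k: v for k, v in extracted.items()
           if v["confidence"] < threshold}
    return accepted, low
```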

Phase 4 — Semantic Validation. Low-confidence fields are batched and sent to semantic_validator, which invokes a separate LLM call to evaluate correctness against the source text. Fields that pass are accepted. Fields that fail have their error descriptions appended to the conversation history as user feedback, and the outer loop re-extracts them with that context.

Termination. The outer loop continues until all fields are resolved or the max iteration count (10) is reached. Unresolved fields after max retries are set to None.
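The four phases and the termination rule compose into the outer loop below. This is a dependency-injected skeleton, not the code in `src/llm.py`: `extract`, `validate`, `judge`, and `confidence` are hypothetical stand-ins for the batch inference call, the syntactic validator, the semantic judge, and the confidence lookup, and conversation-history management is elided:

```python
def extractor_loop(pending, extract, validate, judge, confidence,
                   max_iterations=10, max_retries=5, threshold=0.90):
    """Skeleton of the validation-driven extraction loop (sketch).
    Unresolved fields after max_iterations come back as None."""
    results = {}
    pending = set(pending)
    for _ in range(max_iterations):
        if not pending:
            break
        extracted = extract(pending)                  # Phase 1: batch extraction
        for _ in range(max_retries):                  # Phase 2: syntactic retries
            failed = {f for f in extracted if validate(f, extracted[f])}
            if not failed:
                break
            extracted.update(extract(failed))         # re-extract only the failures
        for f in failed:
            extracted.pop(f, None)                    # still invalid: retry next pass
        low = {f for f in extracted
               if confidence(extracted[f]) < threshold}
        for f in set(extracted) - low:                # Phase 3: accept high confidence
            results[f] = extracted[f]
            pending.discard(f)
        sem_errors = judge(low) if low else {}        # Phase 4: LLM-as-judge
        for f in low - set(sem_errors):
            results[f] = extracted[f]
            pending.discard(f)
        # Fields named in sem_errors stay pending; in the real pipeline their
        # error descriptions join the conversation history for the next pass.
    return {**{f: None for f in pending}, **results}
```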

5. Refactored Legacy Path

main_loop — used for standalone template filling — has been updated to use the unified inference() method instead of direct HTTP calls. Its behavior is otherwise unchanged.


Files Changed

| File | Change |
| --- | --- |
| `src/llm.py` | +390 −25 — Provider abstraction, inference routing, syntactic/semantic validators, extractor pipeline, main_loop refactor |
| `src/file_manipulator.py` | +1 −1 — Removed hardcoded provider/model from set_model_config() call |

How It Fits Into the Fill Pipeline

The extraction pipeline is invoked during the schema fill flow (POST /schemas/{id}/fill):

```
API Request → Controller.fill_report() → FileManipulator.fill_report()
    → ReportSchemaProcessor.canonize()     # Build canonical target
    → LLM.extractor(canonical_schema)      # THIS PR — validated extraction
    → ReportSchemaProcessor.distribute()   # Map canonical → per-template
    → Filler.fill_form_by_name()           # Write to PDFs
```

The extractor receives a CanonicalSchema — the output of canonization from the previous PR — and returns a dict[str, Any] mapping canonical field names to validated values. These are then distributed to per-template field dicts via the field mappings and written to the PDF forms.


Related Issues

| Issue | Title | How This PR Addresses It |
| --- | --- | --- |
| Closes #403 | Optimize LLM.main_loop() for batch extraction of fields | The new extractor() replaces per-field sequential calls with a single batch extraction prompt. All canonical fields are sent in one structured request, regardless of form size. |
| Closes #399 | Batch LLM extraction for faster PDF processing | Same as above — batch extraction eliminates N round-trips. A 30-field schema now requires one extraction call instead of 30. |
| Closes #391 | Implement Batch LLM Extraction for 10x Performance Gain | Same as above. Additionally, format="json" is now used across both Ollama and Gemini providers to guarantee structured JSON output. |
| Closes #412 | Prompt Engineering Layer for Robust LLM-Based Data Extraction | The extraction prompt is now structured with per-field descriptors: description, expected data type, word limit, required status, and allowed values. This gives the LLM explicit, field-level guidance instead of a generic prompt. |
| Closes #409 | Add Field-Level Validation Feedback (Human-Readable Errors for Extracted Data) | The syntactic validator returns structured error objects per field — e.g. {"data_type_error": "expected: int, however 'fourteen' is: str"}, {"word_limit_error": "word count 25 exceeds limit of 10"}. These are used internally for self-correction and are available for surfacing to users. |
| Closes #256 | Multi-model LLM support | set_model_config() supports Ollama and Gemini via LLM_PROVIDER and LLM_MODEL environment variables. The model is no longer hardcoded to mistral. |
| Related to #408 | Hybrid AI Orchestration & Pydantic Validation Layer | The multi-provider abstraction with environment-based routing is a step toward the proposed hybrid orchestrator. Validation uses Pydantic models (CanonicalFieldEntry) for schema enforcement. Does not implement hardware-aware routing. |
| Related to #410 | Department-Aware LLM Prompt Templating | Canonical field descriptors (description, data_type, constraints) act as dynamic per-schema prompt configuration, achieving the same goal as per-agency prompt templates — but driven by the schema system rather than static YAML files. |
| Related to #404 | Multi Template batch filling | While the previous PR introduced the POST /schemas/{id}/fill endpoint, the new extractor() is the mechanism that makes single-extraction-multi-fill actually work — extracting once against the canonical target and distributing to all templates. |
