Feat/validation driven extraction #416
Open
Calebnzm wants to merge 4 commits into fireform-core:main
Summary
This PR introduces a validation-driven extraction pipeline — replacing the previous single-pass, prompt-per-field approach with a multi-phase extraction agent that performs both syntactic and semantic validation of extracted values, with iterative self-correction.
It builds directly on feat: robust report management and configuration architecture + API, which introduced the Report Schema system with field-level configuration (data types, word limits, allowed values, descriptions). That is why this PR looks so large: it sits on top of the previous one and modifies only two of its files, so please review it with that in mind. This PR puts that configuration to work: the extraction pipeline now enforces those constraints at inference time and uses them to drive accurate, validated extraction.
Previous State
The extraction logic (`LLM` class) had the following limitations:

- **One field at a time.** `main_loop` iterated over fields and fired a separate Ollama prompt for each one. For a schema with 30 fields, that meant 30 independent inference calls with no shared context between them.
- **No validation whatsoever.** The LLM's raw text response was stored directly. If it returned `"fourteen"` for an `int` field, or a 200-word paragraph for a field with a 10-word limit, the system accepted it without question.
- **Hardcoded to a single provider.** The Ollama URL and model name (`mistral`) were embedded directly in the extraction loop. Switching to a cloud provider required code changes.
- **No structured output.** The LLM returned a raw string. There was no confidence signal, no reasoning trace, and no machine-parseable structure to validate against.
- **Field configuration was unused.** The previous PR introduced `data_type`, `word_limit`, `allowed_values`, and `description` on `SchemaField`, but the extraction pipeline didn't use any of them.

What This PR Introduces
1. Multi-Provider LLM Abstraction
The `LLM` class now supports Ollama (local) and Gemini (cloud) through a unified interface, configured via environment variables:

- `LLM_PROVIDER` (default `ollama`): `ollama` or `gemini`
- `LLM_MODEL` (default `mistral`)
- `OLLAMA_HOST` (default `http://localhost:11434`)
- `GEMINI_API_KEY`: API key for the Gemini provider

`set_model_config()` can be called with explicit arguments to override env vars, or with no arguments to use defaults. The previous hardcoded `provider="gemini", model_name="gemini-2.5-flash"` in `file_manipulator.py` has been removed; the system now respects the environment configuration.

`inference()` is a single classmethod that accepts an OpenAI-style `messages` array and routes to the correct provider API, translating message formats as needed (e.g. Gemini's `system_instruction` field, `"model"` role naming, `responseMimeType` for JSON mode).

2. Syntactic Validator
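As a rough illustration of the checks this validator performs, here is a minimal sketch that uses a plain dict as a stand-in for the `CanonicalFieldEntry` descriptor (the descriptor shape and error keys mirror the description below, but this is illustrative, not the PR's actual code):

```python
def syntactic_validator(value, descriptor):
    """Check an extracted value against a field descriptor dict with
    optional keys: data_type, word_limit, allowed_values.
    Returns a list of structured error objects, or None if valid."""
    errors = []
    dtype = descriptor.get("data_type")
    if dtype == "int":
        try:
            int(value)
        except (TypeError, ValueError):
            errors.append({"data_type_error":
                           f"expected: int, however {value!r} is: {type(value).__name__}"})
    limit = descriptor.get("word_limit")
    if limit is not None and isinstance(value, str):
        count = len(value.split())
        if count > limit:
            errors.append({"word_limit_error":
                           f"word count {count} exceeds limit of {limit}"})
    allowed = descriptor.get("allowed_values")
    if allowed and value not in allowed:
        errors.append({"enum_error": f"{value!r} not in allowed values"})
    return errors or None
```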
A static method that validates an extracted value against its `CanonicalFieldEntry` descriptor:

- `int` parseability, `string`/`date` type conformance
- word count against `word_limit`
- for `enum` fields, checks membership in the configured value set

Returns a list of structured error objects (e.g. `{"data_type_error": "expected: int, however 'fourteen' is: str"}`) or `None` if valid.

3. Semantic Validator
An LLM-as-judge pattern: a second inference call receives the extracted values alongside the original source text, field descriptions, and the extractor's reasoning — then evaluates whether each extraction is semantically correct.
Returns a dict mapping field names to error descriptions for any fields that fail validation. Fields not in the dict are considered semantically valid.
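The judge call above might be assembled as follows; the prompt wording and helper names here are illustrative stand-ins, not the PR's actual implementation:

```python
import json

def build_judge_messages(source_text, extracted, descriptions, reasoning):
    """Assemble an OpenAI-style messages array asking a second LLM to judge
    the extracted values against the source text. The response contract:
    a JSON object mapping ONLY failing field names to error descriptions."""
    system = ("You are a validator. For each field, decide whether the "
              "extracted value is semantically supported by the source text. "
              "Respond with a JSON object mapping only the failing field "
              "names to a short error description.")
    user = json.dumps({
        "source_text": source_text,
        "extracted_values": extracted,
        "field_descriptions": descriptions,
        "extractor_reasoning": reasoning,
    })
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def parse_judge_response(raw):
    # Fields absent from the returned dict are considered semantically valid.
    errors = json.loads(raw)
    return errors if isinstance(errors, dict) else {}
```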
4. Validation-Driven Extraction Pipeline (`extractor`)

The core addition. A multi-phase agent loop that replaces `main_loop` for schema-driven extraction:

```mermaid
flowchart TD
    START["Receive canonical field descriptors"] --> EXTRACT
    EXTRACT["Send all pending fields to LLM in a single prompt. Each field includes: description, expected type, word limit, required status, and allowed values"]
    EXTRACT --> PARSE["LLM responds with a JSON map: one entry per field containing the extracted value, its reasoning, and a confidence score"]
    PARSE --> SYNTACTIC["Syntactic Validation: check each value against its field descriptor. Does the type match? e.g. int, string, date. Is the word count within the limit? Is the value in the allowed set, if constrained?"]
    SYNTACTIC --> SYN_OK{"All fields syntactically valid?"}
    SYN_OK -- No --> CORRECT["Send failed fields back to the LLM with their specific errors, asking it to re-extract only those fields"]
    CORRECT -- "Retry up to 5 times" --> SYNTACTIC
    SYN_OK -- Yes --> CONFIDENCE
    CONFIDENCE{"Is the LLM's self-reported confidence >= 90%?"}
    CONFIDENCE -- Yes --> ACCEPT["Accept value"]
    CONFIDENCE -- No --> SEMANTIC
    SEMANTIC["Semantic Validation: a separate LLM call reviews low-confidence values against the original source text and field descriptions to check if they actually make sense in context"]
    SEMANTIC --> SEM_OK{"Semantically correct?"}
    SEM_OK -- Yes --> ACCEPT
    SEM_OK -- No --> FEEDBACK["Feed the semantic errors back into the conversation history so the next extraction attempt has context on what went wrong"]
    FEEDBACK --> EXTRACT
    ACCEPT --> REMAINING{"Unresolved fields left?"}
    REMAINING -- "Yes, up to 10 iterations" --> EXTRACT
    REMAINING -- No --> DONE["Return all validated extracted values"]
```

Phase-by-phase:
**Phase 1: Batch Extraction.** All pending canonical fields are sent to the LLM in a single prompt, with their full descriptors (description, expected data type, word limit, required status, allowed values). The LLM responds with a structured JSON map where each field contains `candidate_value`, `reasoning`, and `confidence`. The entire conversation history is maintained across iterations, giving the LLM context from prior attempts.

**Phase 2: Syntactic Validation & Correction.** Each extracted value is validated against its field descriptor using `syntactic_validator`. Fields that fail are collected with their specific errors and re-prompted in a correction loop (up to 5 retries per outer iteration). The correction prompt includes the previous invalid output and the exact errors, allowing the LLM to fix targeted issues without re-extracting valid fields.

**Phase 3: Confidence Filtering.** Syntactically valid fields are split by confidence score: fields at or above the 90% threshold are accepted immediately, while lower-confidence fields proceed to semantic validation.

**Phase 4: Semantic Validation.** Low-confidence fields are batched and sent to `semantic_validator`, which invokes a separate LLM call to evaluate correctness against the source text. Fields that pass are accepted. Fields that fail have their error descriptions appended to the conversation history as user feedback, and the outer loop re-extracts them with that context.

**Termination.** The outer loop continues until all fields are resolved or the max iteration count (10) is reached. Unresolved fields after max retries are set to `None`.

5. Refactored Legacy Path
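Since this path now goes through the unified `inference()` classmethod, provider-specific message translation lives in one place. A hypothetical sketch of the Gemini-side translation described in section 1 (the payload shapes are illustrative; the real method also performs the actual API call):

```python
import os

def pick_provider() -> str:
    # Env-driven routing with the defaults described in section 1.
    return os.environ.get("LLM_PROVIDER", "ollama")

def to_gemini_payload(messages: list) -> dict:
    """Translate an OpenAI-style messages array for Gemini: the system
    message becomes a system instruction, the "assistant" role is renamed
    to "model", and JSON mode is requested via responseMimeType."""
    system_text = None
    contents = []
    for msg in messages:
        if msg["role"] == "system":
            system_text = msg["content"]
        else:
            role = "model" if msg["role"] == "assistant" else "user"
            contents.append({"role": role, "parts": [{"text": msg["content"]}]})
    payload = {
        "contents": contents,
        "generationConfig": {"responseMimeType": "application/json"},
    }
    if system_text is not None:
        payload["systemInstruction"] = {"parts": [{"text": system_text}]}
    return payload
```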
`main_loop` (used for standalone template filling) has been updated to use the unified `inference()` method instead of direct HTTP calls. Its behavior is otherwise unchanged.

Files Changed
- `src/llm.py`: `extractor` pipeline, `main_loop` refactor
- `src/file_manipulator.py`: `set_model_config()` call

How It Fits Into the Fill Pipeline
The extraction pipeline is invoked during the schema fill flow (`POST /schemas/{id}/fill`): the `extractor` receives a `CanonicalSchema` (the output of canonization from the previous PR) and returns a `dict[str, Any]` mapping canonical field names to validated values. These are then distributed to per-template field dicts via the field mappings and written to the PDF forms.

Related Issues
- `LLM.main_loop()` for batch extraction of fields: `extractor()` replaces per-field sequential calls with a single batch extraction prompt. All canonical fields are sent in one structured request, regardless of form size. `format="json"` is now used across both Ollama and Gemini providers to guarantee structured JSON output.
- Structured validation errors, e.g. `{"data_type_error": "expected: int, however 'fourteen' is: str"}`, `{"word_limit_error": "word count 25 exceeds limit of 10"}`. These are used internally for self-correction and are available for surfacing to users.
- `set_model_config()` supports Ollama and Gemini via `LLM_PROVIDER` and `LLM_MODEL` environment variables. The model is no longer hardcoded to `mistral`.
- Field descriptors (`CanonicalFieldEntry`) for schema enforcement. Does not implement hardware-aware routing.
- Schema field settings (`description`, `data_type`, constraints) act as dynamic per-schema prompt configuration, achieving the same goal as per-agency prompt templates, but driven by the schema system rather than static YAML files.
- Together with the `POST /schemas/{id}/fill` endpoint, the new `extractor()` is the mechanism that makes single-extraction-multi-fill actually work: extracting once against the canonical target and distributing to all templates.
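For reviewers, here is an illustrative Python skeleton of the `extractor()` control flow described in section 4. The injected `llm_extract` callable, the validator signatures, and the constants are stand-ins that mirror the description above, not the PR's actual code:

```python
MAX_ITERS = 10          # outer loop cap
MAX_SYNTAX_RETRIES = 5  # per-iteration correction retries
CONFIDENCE_THRESHOLD = 0.9

def extractor(fields, llm_extract, syntactic_validator, semantic_validator):
    """fields: {name: descriptor}. llm_extract(pending, feedback) returns
    {name: {"candidate_value": ..., "confidence": float}} for each field."""
    resolved = {}
    pending = dict(fields)
    feedback = []  # stands in for the accumulated conversation history
    for _ in range(MAX_ITERS):
        if not pending:
            break
        # Phase 1: batch extraction of all pending fields in one prompt
        results = llm_extract(pending, feedback)
        # Phase 2: syntactic correction loop over failing fields only
        for _ in range(MAX_SYNTAX_RETRIES):
            failed = {n: errs for n, r in results.items()
                      if (errs := syntactic_validator(r["candidate_value"], pending[n]))}
            if not failed:
                break
            results.update(
                llm_extract({n: pending[n] for n in failed}, list(failed.values())))
        # Phase 3: confidence split; Phase 4: judge the low-confidence subset
        low = {n: r for n, r in results.items()
               if r["confidence"] < CONFIDENCE_THRESHOLD}
        sem_errors = semantic_validator(low) if low else {}
        for name, r in results.items():
            if name not in sem_errors:
                resolved[name] = r["candidate_value"]
                pending.pop(name, None)
        feedback.extend(sem_errors.values())
    # Termination: unresolved fields after max iterations default to None
    for name in pending:
        resolved[name] = None
    return resolved
```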