Assess whether current large language models can produce correct, end-to-end nanopore metagenomics pipelines through sequential prompting in a way that preserves scientific validity across chained workflow stages. The benchmark uses two independent ground truth pipelines to test whether model competence generalizes across different analytical paradigms.
Standard code benchmarks evaluate isolated units. That misses the main failure surface of bioinformatics pipelines: compositional correctness. In a real workflow, each step constrains the next through file formats, assumptions, tool compatibility, and biological context.
A model can therefore fail even when an individual command looks plausible:
- the basecalling output is not compatible with the downstream trimming step
- the wrong Kraken2 database makes later taxonomic summaries misleading
- the assembly choice invalidates binning assumptions
- the annotation stage loses the read-level branch required by the validated pipeline
These are chaining failures, not syntax failures.
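As a concrete illustration, a chaining failure can be framed as a compatibility check between an upstream output and a downstream step's expected input. The sketch below is illustrative only; the step names, file formats, and rule table are assumptions, not part of the benchmark code.

```python
# Minimal sketch of an inter-step compatibility check (hypothetical helper,
# not part of the benchmark). Step names, formats, and rules are illustrative.
from dataclasses import dataclass

@dataclass
class StepOutput:
    step: str          # e.g. "basecalling"
    file_format: str   # e.g. "fastq", "bam", "pod5"
    metadata: dict     # e.g. {"basecall_mode": "hac"}

# Assumed input expectations of downstream steps.
EXPECTED_INPUTS = {
    "adapter_trimming": {"fastq"},
    "taxonomic_classification": {"fastq", "fasta"},
}

def check_chain(upstream: StepOutput, downstream_step: str) -> bool:
    """Return True if the upstream output format is accepted downstream."""
    accepted = EXPECTED_INPUTS.get(downstream_step, set())
    return upstream.file_format in accepted

# Example: BAM output from basecalling does not satisfy a FASTQ-only trimming
# step, even though each command looks plausible in isolation.
basecalled = StepOutput("basecalling", "bam", {"basecall_mode": "hac"})
assert not check_chain(basecalled, "adapter_trimming")
```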
The benchmark is anchored to two validated workflows:
Pipeline 1 — Aerobiome (7 steps, linear):
Reska T, Pozdniakova S, Urban L. Air monitoring by nanopore sequencing. ISME Communications (2024). DOI: 10.1093/ismeco/ycae058
Reference: pipeline_reference.md | Pipeline: ../pipelines/aerobiome/
Pipeline 2 — Wetland surveillance (10 steps, 4 parallel tracks):
Perlas A*, Reska T*, et al. Real-time genomic pathogen, resistance, and host range characterization from passive water sampling of wetland ecosystems. Applied and Environmental Microbiology (2025/2026).
Reference: pipeline_reference_wetland.md | Pipeline: ../pipelines/wetland-surveillance/
The two ground truth pipelines test fundamentally different analytical dimensions:
| Dimension | Aerobiome | Wetland |
|---|---|---|
| Structure | 7-step linear | 10-step, 4 parallel tracks |
| Nucleic acids | DNA only | DNA + RNA |
| Paradigms | Shotgun metagenomics | Shotgun + amplicon + reference-based + phylogenetic |
| Basecalling | HAC mode (Guppy/Dorado v4.x) | SUP mode (Dorado v5.0.0) |
| Assembly | Single assembler (metaFlye) | Dual assembler (metaFlye + nanoMDBG) |
| Unique tools | — | MEGAN-CE, Prodigal, PlasmidFinder, OBITools4, VSEARCH, MIDORI2, BCFtools, IQ-TREE2 |
| Tool count | ~10 | ~30 |
The wetland pipeline is designed to be harder: models that achieve a fully correct aerobiome pipeline may still fail on the wetland pipeline's multi-omics, multi-paradigm structure. The cross-pipeline comparison directly measures whether model competence generalizes.
For the wetland pipeline, the stateless cumulative protocol is adapted to handle multi-track branching: the evaluator carries forward the track context (which nucleic acid, which analysis paradigm) in addition to the expected output state.
This benchmark does not use one continuous conversation thread per model. Instead, it uses a stateless fresh-chat protocol:
- Individual steps are first evaluated in isolated fresh sessions.
- Integration prompts are then reconstructed cumulatively from the expected prior output state.
- The evaluator manually passes forward the output type and biological context required for the next step.
- Errors are not corrected before the next prompt, allowing upstream mistakes to propagate.
This design isolates a specific scientific question: can a model preserve correctness when prior state must be carried forward explicitly rather than being rediscovered or silently repaired?
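A minimal sketch of what that explicit state-carrying could look like is shown below, covering both the base protocol and the wetland track-context adaptation. The field names (expected_output, biological_context, nucleic_acid, paradigm) and prompt wording are assumptions for illustration; in the benchmark itself the evaluator applies this protocol manually rather than through code.

```python
# Minimal sketch of the stateless state-carrying protocol (illustrative only;
# field names and prompt wording are assumptions, not the benchmark's code).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CarriedState:
    expected_output: str                  # e.g. "filtered FASTQ reads"
    biological_context: str               # e.g. "air-sample DNA metagenome"
    nucleic_acid: Optional[str] = None    # wetland only: "DNA" or "RNA"
    paradigm: Optional[str] = None        # wetland only: e.g. "amplicon"
    notes: list = field(default_factory=list)  # upstream errors, uncorrected

def build_integration_prompt(state: CarriedState, next_step: str) -> str:
    """Reconstruct the next integration prompt from the expected prior state."""
    track = ""
    if state.nucleic_acid and state.paradigm:
        track = f" This is the {state.nucleic_acid} {state.paradigm} track."
    carried = " ".join(state.notes)  # upstream mistakes propagate verbatim
    return (
        f"The previous step produced {state.expected_output} "
        f"({state.biological_context}).{track} {carried} "
        f"Now perform {next_step}."
    )

state = CarriedState(
    expected_output="SUP-basecalled FASTQ reads",
    biological_context="wetland water sample",
    nucleic_acid="RNA", paradigm="amplicon",
)
print(build_integration_prompt(state, "primer trimming"))
```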
- Prompt structure matched across evaluated entries. Public prompt files document the reconstructed prompt shape for each step.
- Benchmark-critical constraints preserved. Score-relevant constraints that were explicit in the benchmark setup are documented in the reconstructed public prompt files.
- No mid-benchmark correction. Once an upstream error appears, it is preserved in the carried state.
- Five-dimensional scoring. Each step is evaluated for tool selection, parameter accuracy, output compatibility, scientific validity, and executability.
| Dimension | What it measures | Why it matters |
|---|---|---|
| Tool Selection | Conceptual workflow choice | Wrong tool selection invalidates the analysis even if the code runs |
| Parameter Accuracy | Domain-specific implementation detail | Correct tool with wrong flags can still produce misleading outputs |
| Output Compatibility | File and pipeline chaining | A pipeline can fail even when every individual command is plausible |
| Scientific Validity | Analytical defensibility | Fluency is not a substitute for domain judgment |
| Executability | Practical utility | Non-running code is unusable regardless of analytical intent |
The detailed rubric is in scoring_criteria.md.
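For orientation, a per-step score under this rubric can be thought of as a small record over the five dimensions. The field names and the example values below are assumptions for illustration; scoring_criteria.md defines the actual scale and rubric.

```python
# Minimal sketch of a per-step rubric record; field names and the example
# values are illustrative assumptions (see scoring_criteria.md for the rubric).
from dataclasses import dataclass

@dataclass
class StepScore:
    model: str
    pipeline: str              # "aerobiome" or "wetland"
    step: str
    tool_selection: int        # conceptual workflow choice
    parameter_accuracy: int    # domain-specific implementation detail
    output_compatibility: int  # file and pipeline chaining
    scientific_validity: int   # analytical defensibility
    executability: int         # practical utility

    def total(self) -> int:
        dims = ("tool_selection", "parameter_accuracy", "output_compatibility",
                "scientific_validity", "executability")
        return sum(getattr(self, d) for d in dims)

s = StepScore("model-a", "aerobiome", "basecalling", 3, 2, 3, 3, 2)
print(s.total())  # 13
```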
This repository contains:
- the validated reference workflow
- reconstructed public prompt documents
- the scored matrix in results/tables/scoring_matrix.csv
- generated summaries and figures derived from that matrix (see the sketch after this list)
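A minimal sketch of how such summaries might be derived from the scored matrix, assuming hypothetical column names (model, pipeline, and one column per scoring dimension); the actual scoring_matrix.csv schema may differ.

```python
# Sketch of deriving summary tables from the scored matrix. Column names are
# assumptions for illustration; check the CSV header for the real schema.
import pandas as pd

DIMS = ["tool_selection", "parameter_accuracy", "output_compatibility",
        "scientific_validity", "executability"]

matrix = pd.read_csv("results/tables/scoring_matrix.csv")

# Mean score per model and pipeline, per dimension and overall.
summary = matrix.groupby(["model", "pipeline"])[DIMS].mean()
summary["overall"] = summary.mean(axis=1)
print(summary.round(2))
```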
This repository does not contain:
- verbatim raw web-interface chat transcripts
- a complete archive of raw model response logs
The responses/ directory is retained as a scaffold, but it is not a public transcript archive in the current checked-in tree.
The prompt reconstructions are benchmark documentation, not transcript surrogates. When the evaluated setup made a score-relevant requirement explicit, the public prompt files record that requirement. This documentation clarification does not modify the matrix, rankings, or rubric outcomes.
The evaluated outputs were collected through public interfaces rather than API-only execution environments. Results should therefore be interpreted as interface-level benchmark behavior rather than as a claim about any one provider's raw model endpoint under fixed API parameters.
- Same research group: both ground truth pipelines share the same first/co-first author, so they reflect similar tool preferences and analytical style.
- Nanopore-only scope: the results should not be generalized automatically to other sequencing modalities.
- Dated snapshot: the scoring matrix reflects tested behavior at specific dates.
- Human scoring: the rubric was applied by a single domain expert.
- Protocol dependence: the benchmark measures performance under stateless state-carrying prompts; other prompting regimes may differ.