Assess whether current large language models can produce correct, end-to-end nanopore metagenomics pipelines through sequential prompting in a way that preserves scientific validity across chained workflow stages. The benchmark uses two independent ground truth pipelines to test whether model competence generalizes across different analytical paradigms.
Standard code benchmarks evaluate isolated units. That misses the main failure surface of bioinformatics pipelines: compositional correctness. In a real workflow, each step constrains the next through file formats, assumptions, tool compatibility, and biological context.
A model can therefore fail even when an individual command looks plausible:
- the basecalling output is not compatible with the downstream trimming step
- the wrong Kraken2 database makes later taxonomic summaries misleading
- the assembly choice invalidates binning assumptions
- the annotation stage loses the read-level branch required by the validated pipeline
These are chaining failures, not syntax failures.
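As a concrete illustration, a chaining failure can be framed as a compatibility check between an upstream output and a downstream step's expected input. The sketch below is illustrative only; the step names, file formats, and rule table are assumptions, not part of the benchmark code.

```python
# Minimal sketch of an inter-step compatibility check (hypothetical helper,
# not part of the benchmark). Step names, formats, and rules are illustrative.
from dataclasses import dataclass

@dataclass
class StepOutput:
    step: str          # e.g. "basecalling"
    file_format: str   # e.g. "fastq", "bam", "pod5"
    metadata: dict     # e.g. {"basecall_mode": "hac"}

# Assumed input expectations of downstream steps.
EXPECTED_INPUTS = {
    "adapter_trimming": {"fastq"},
    "taxonomic_classification": {"fastq", "fasta"},
}

def check_chain(upstream: StepOutput, downstream_step: str) -> bool:
    """Return True if the upstream output format is accepted downstream."""
    accepted = EXPECTED_INPUTS.get(downstream_step, set())
    return upstream.file_format in accepted

# Example: BAM output from basecalling does not satisfy a FASTQ-only trimming
# step, even though each command looks plausible in isolation.
basecalled = StepOutput("basecalling", "bam", {"basecall_mode": "hac"})
assert not check_chain(basecalled, "adapter_trimming")
```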
The benchmark is anchored to two validated workflows:
Pipeline 1 — Aerobiome (7 steps, linear):
Reska T, Pozdniakova S, Urban L. Air monitoring by nanopore sequencing. ISME Communications (2024). DOI: 10.1093/ismeco/ycae058
Reference: pipeline_reference.md | Pipeline: ../pipelines/aerobiome/
Pipeline 2 — Wetland surveillance (10 steps, 4 parallel tracks):
Perlas A*, Reska T*, et al. Real-time genomic pathogen, resistance, and host range characterization from passive water sampling of wetland ecosystems. Applied and Environmental Microbiology (2025/2026).
Reference: pipeline_reference_wetland.md | Pipeline: ../pipelines/wetland-surveillance/
The two ground truth pipelines test fundamentally different analytical dimensions:
| Dimension | Aerobiome | Wetland |
|---|---|---|
| Structure | 7-step linear | 10-step, 4 parallel tracks |
| Nucleic acids | DNA only | DNA + RNA |
| Paradigms | Shotgun metagenomics | Shotgun + amplicon + reference-based + phylogenetic |
| Basecalling | HAC mode (Guppy/Dorado v4.x) | SUP mode (Dorado v5.0.0) |
| Assembly | Single assembler (metaFlye) | Dual assembler (metaFlye + nanoMDBG) |
| Unique tools | — | MEGAN-CE, Prodigal, PlasmidFinder, OBITools4, VSEARCH, MIDORI2, BCFtools, IQ-TREE2 |
| Tool count | ~10 | ~30 |
The wetland pipeline is designed to be harder: models that achieve a fully correct aerobiome pipeline may still fail on the wetland pipeline's multi-omics, multi-paradigm structure. The cross-pipeline comparison directly measures whether model competence generalizes.
For the wetland pipeline, the stateless cumulative protocol is adapted to handle multi-track branching: the evaluator carries forward the track context (which nucleic acid, which analysis paradigm) in addition to the expected output state.
This benchmark does not use one continuous conversation thread per model. Instead, it uses a stateless fresh-chat protocol:
- Individual steps are first evaluated in isolated fresh sessions.
- Integration prompts are then reconstructed cumulatively from the expected prior output state.
- The evaluator manually passes forward the output type and biological context required for the next step.
- Errors are not corrected before the next prompt, allowing upstream mistakes to propagate.
This design isolates a specific scientific question: can a model preserve correctness when prior state must be carried forward explicitly rather than being rediscovered or silently repaired?
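A minimal sketch of what that explicit state-carrying could look like is shown below, covering both the base protocol and the wetland track-context adaptation. The field names (expected_output, biological_context, nucleic_acid, paradigm) and prompt wording are assumptions for illustration; in the benchmark itself the evaluator applies this protocol manually rather than through code.

```python
# Minimal sketch of the stateless state-carrying protocol (illustrative only;
# field names and prompt wording are assumptions, not the benchmark's code).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CarriedState:
    expected_output: str                  # e.g. "filtered FASTQ reads"
    biological_context: str               # e.g. "air-sample DNA metagenome"
    nucleic_acid: Optional[str] = None    # wetland only: "DNA" or "RNA"
    paradigm: Optional[str] = None        # wetland only: e.g. "amplicon"
    notes: list = field(default_factory=list)  # upstream errors, uncorrected

def build_integration_prompt(state: CarriedState, next_step: str) -> str:
    """Reconstruct the next integration prompt from the expected prior state."""
    track = ""
    if state.nucleic_acid and state.paradigm:
        track = f" This is the {state.nucleic_acid} {state.paradigm} track."
    carried = " ".join(state.notes)  # upstream mistakes propagate verbatim
    return (
        f"The previous step produced {state.expected_output} "
        f"({state.biological_context}).{track} {carried} "
        f"Now perform {next_step}."
    )

state = CarriedState(
    expected_output="SUP-basecalled FASTQ reads",
    biological_context="wetland water sample",
    nucleic_acid="RNA", paradigm="amplicon",
)
print(build_integration_prompt(state, "primer trimming"))
```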
- Prompt structure matched across evaluated entries. Public prompt files document the reconstructed prompt shape for each step.
- Benchmark-critical constraints preserved. Score-relevant constraints that were explicit in the benchmark setup are documented in the reconstructed public prompt files.
- No mid-benchmark correction. Once an upstream error appears, it is preserved in the carried state.
- Five-dimensional scoring. Each step is evaluated for tool selection, parameter accuracy, output compatibility, scientific validity, and executability.
| Dimension | What it measures | Why it matters |
|---|---|---|
| Tool Selection | Conceptual workflow choice | Wrong tool selection invalidates the analysis even if the code runs |
| Parameter Accuracy | Domain-specific implementation detail | Correct tool with wrong flags can still produce misleading outputs |
| Output Compatibility | File and pipeline chaining | A pipeline can fail even when every individual command is plausible |
| Scientific Validity | Analytical defensibility | Fluency is not a substitute for domain judgment |
| Executability | Practical utility | Non-running code is unusable regardless of analytical intent |
The detailed rubric is in scoring_criteria.md.
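For orientation, a per-step score under this rubric can be thought of as a small record over the five dimensions. The field names and the example values below are assumptions for illustration; scoring_criteria.md defines the actual scale and rubric.

```python
# Minimal sketch of a per-step rubric record; field names and the example
# values are illustrative assumptions (see scoring_criteria.md for the rubric).
from dataclasses import dataclass

@dataclass
class StepScore:
    model: str
    pipeline: str              # "aerobiome" or "wetland"
    step: str
    tool_selection: int        # conceptual workflow choice
    parameter_accuracy: int    # domain-specific implementation detail
    output_compatibility: int  # file and pipeline chaining
    scientific_validity: int   # analytical defensibility
    executability: int         # practical utility

    def total(self) -> int:
        dims = ("tool_selection", "parameter_accuracy", "output_compatibility",
                "scientific_validity", "executability")
        return sum(getattr(self, d) for d in dims)

s = StepScore("model-a", "aerobiome", "basecalling", 3, 2, 3, 3, 2)
print(s.total())  # 13
```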
This repository contains:
- the validated reference workflow
- reconstructed public prompt documents
- the scored matrix in results/tables/scoring_matrix.csv
- generated summaries and figures derived from that matrix (see the sketch after this list)
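A minimal sketch of how such summaries might be derived from the scored matrix, assuming hypothetical column names (model, pipeline, and one column per scoring dimension); the actual scoring_matrix.csv schema may differ.

```python
# Sketch of deriving summary tables from the scored matrix. Column names are
# assumptions for illustration; check the CSV header for the real schema.
import pandas as pd

DIMS = ["tool_selection", "parameter_accuracy", "output_compatibility",
        "scientific_validity", "executability"]

matrix = pd.read_csv("results/tables/scoring_matrix.csv")

# Mean score per model and pipeline, per dimension and overall.
summary = matrix.groupby(["model", "pipeline"])[DIMS].mean()
summary["overall"] = summary.mean(axis=1)
print(summary.round(2))
```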
This repository does not contain:
- verbatim raw web-interface chat transcripts
- a complete archive of raw model response logs
The responses/ directory is retained as a scaffold, but it is not a public transcript archive in the current checked-in tree.
The prompt reconstructions are benchmark documentation, not transcript surrogates. When the evaluated setup made a score-relevant requirement explicit, the public prompt files record that requirement. This documentation clarification does not modify the matrix, rankings, or rubric outcomes.
The evaluated outputs were collected through public interfaces rather than API-only execution environments. Results should therefore be interpreted as interface-level benchmark behavior rather than as a claim about any one provider's raw model endpoint under fixed API parameters.
- Same research group: both ground truth pipelines share the same first/co-first author, so they reflect similar tool preferences and analytical style.
- Nanopore-only scope: the results should not be generalized automatically to other sequencing modalities.
- Dated snapshot: the scoring matrix reflects tested behavior at specific dates.
- Human scoring: the rubric was applied by a single domain expert.
- Protocol dependence: the benchmark measures performance under stateless state-carrying prompts; other prompting regimes may differ.