- High-Level Overview
- System Architecture
- Component Descriptions
- Data Flow
- Design Decisions
- Technology Stack
LLM TaskBench is a task-specific LLM evaluation framework that enables developers and researchers to objectively compare language models on custom, domain-specific tasks. The framework combines agentic orchestration with LLM-as-judge evaluation to provide accurate, reproducible performance metrics.
- Folder-based use cases: Human-readable USE-CASE.md files with automatic prompt generation from ground truth analysis
- LLM-driven prompt generation: Framework analyzes use case + ground truth to generate task prompts, judge prompts, and rubrics
- Results organization: Auto-saved to results/{usecase-name}/{timestamp}_{datafile}.json
- Robust judge parsing: Handles varied LLM response formats (floats, nested metrics, object violations)
- Timestamp range extraction: Judge receives full context of input timestamp range for accurate evaluation
- Model recommendations: Orchestrator suggests models from OpenRouter based on use-case traits
- Cost visibility: Billing via inline usage data and generation lookups, with per-run and per-use-case rollups
- Parallel execution by default with retry/backoff and rate limiting hooks
- Model-aware chunking: executor derives chunk sizes from selected models' context windows
- Folder-Based Use Cases: Define use cases in human-readable Markdown with automatic prompt generation
- Task-Specific Evaluation: Custom tasks with YAML or folder-based configuration
- Multi-Model Support: Evaluate multiple LLMs simultaneously via OpenRouter API
- LLM-as-Judge: Automated evaluation using Claude Sonnet 4.5 with robust response parsing
- Auto-Generated Prompts: LLM analyzes use case and ground truth to derive optimal prompts and rubrics
- Cost Tracking: Real-time cost calculation and tracking
- Rich CLI: User-friendly command-line interface with progress indicators
- Results Organization: Auto-organized by use case with timestamped filenames
- Declarative Task Definition: Tasks are defined in YAML, separating configuration from code
- Type Safety: Pydantic models ensure data validation throughout the pipeline
- Async-First: Asynchronous I/O for efficient API interactions
- Separation of Concerns: Clear boundaries between API client, executor, judge, and cost tracking
- Observable: Rich console output and detailed logging for transparency
flowchart TD
CLI["CLI Interface<br/>(taskbench/cli/main.py)<br/><br/>Commands: evaluate, models, validate"]
subgraph Orchestration["Orchestration Layer (taskbench/evaluation/*)"]
Executor["Executor<br/>- Builds prompts<br/>- Runs tasks<br/>- Collects results"]
Judge["Judge<br/>- Evaluates outputs<br/>- Scores models<br/>- Detects violations"]
CostTracker["CostTracker<br/>- Calculates costs<br/>- Tracks usage<br/>- Provides stats"]
Orchestrator["Model Recommender<br/>- Use-case aware<br/>- Cost/context filters"]
end
subgraph APIClient["API Client Layer (taskbench/api/*)"]
OpenRouter["OpenRouterClient<br/>- HTTP requests<br/>- Authentication<br/>- Error handling<br/>- JSON mode"]
Retry["Retry Logic<br/>- Exponential backoff<br/>- Rate limiting<br/>- Transient errors"]
end
subgraph CoreModels["Core Models (taskbench/core/*)"]
TaskDef["TaskDefinition<br/>ModelConfig"]
EvalResult["EvaluationResult<br/>CompletionResp."]
JudgeScore["JudgeScore<br/>TaskParser"]
end
subgraph External["External Services"]
OpenRouterAPI["OpenRouter API<br/>(aggregates multiple LLM providers)<br/><br/>• Anthropic (Claude)<br/>• OpenAI (GPT-4)<br/>• Google (Gemini)<br/>• Meta (Llama)<br/>• Alibaba (Qwen)<br/>• And more..."]
end
CLI --> Orchestration
Orchestration --> APIClient
CLI --> UI["FastAPI/Streamlit UI<br/>Runs & Use-cases"]
UI --> Orchestration
APIClient --> CoreModels
CoreModels --> External
style CLI fill:#e1f5ff
style Orchestration fill:#fff4e1
style APIClient fill:#f0e1ff
style CoreModels fill:#e1ffe1
style External fill:#ffe1e1
Purpose: Provides the user-facing command-line interface.
Responsibilities:
- Parse command-line arguments using Typer
- Load environment variables and configuration
- Coordinate between evaluation components
- Display results with Rich formatting
- Handle errors and provide user feedback
Key Commands:
- run: Run evaluation on a folder-based use case (primary command)
- list-usecases: List available use cases in a folder
- generate-prompts: Generate prompts from a use case without running the evaluation
- evaluate: Run multi-model evaluation on a YAML task (legacy)
- models: List available models and pricing
- validate: Validate task definition YAML files
Design Pattern: Command pattern with async/await for I/O operations.
Purpose: Define type-safe data structures using Pydantic.
Key Models:
- TaskDefinition: Represents a user-defined evaluation task
  - Name, description, input/output types
  - Evaluation criteria and constraints
  - Examples and judge instructions
  - Validation rules for input/output formats
- CompletionResponse: API response from an LLM completion
  - Response content and metadata
  - Token usage (input, output, total)
  - Latency metrics
- EvaluationResult: Single model evaluation result
  - Model output and status
  - Token usage and cost
  - Timestamp and error information
- JudgeScore: LLM-as-judge scoring result
  - Multi-dimensional scores (accuracy, format, compliance)
  - Violations list
  - Detailed reasoning
- ModelConfig: Model pricing and configuration
  - Pricing per million tokens
  - Context window size
  - Provider information
Design Pattern: Data Transfer Objects (DTOs) with built-in validation.
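A minimal sketch of what two of these DTOs might look like in Pydantic; field names and constraints are illustrative, not the exact schema:

```python
from datetime import datetime
from typing import Optional
from pydantic import BaseModel, Field


class JudgeScore(BaseModel):
    """Illustrative shape of an LLM-as-judge score (fields assumed)."""
    accuracy_score: int = Field(ge=0, le=100)
    format_score: int = Field(ge=0, le=100)
    compliance_score: int = Field(ge=0, le=100)
    overall_score: int = Field(ge=0, le=100)
    violations: list[str] = []
    reasoning: str = ""


class EvaluationResult(BaseModel):
    """Single model run: output plus usage, cost, and status metadata (assumed)."""
    model: str
    output: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    status: str = "success"
    error: Optional[str] = None
    timestamp: datetime = Field(default_factory=datetime.now)
```

Because these are plain Pydantic models, any out-of-range score or missing field raises a validation error at construction time rather than surfacing later in the pipeline.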
Purpose: Load, validate, and save task definitions from YAML.
Responsibilities:
- Parse YAML files into TaskDefinition objects
- Validate task structure and constraints
- Check for logical errors (e.g., min >= max)
- Save tasks back to YAML format
Validation Rules:
- Required fields must be present
- Input/output types must be valid
- Constraints must be logically consistent
- Min/max pairs must satisfy min < max
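As an illustration, the min/max consistency check might look like this; the function and field names are assumptions, not the actual parser API:

```python
def validate_constraints(constraints: dict) -> list[str]:
    """Return human-readable errors for logically inconsistent constraints."""
    errors = []
    for name, spec in constraints.items():
        low, high = spec.get("min"), spec.get("max")
        # A min/max pair is only valid when min < max
        if low is not None and high is not None and low >= high:
            errors.append(f"constraint '{name}': min ({low}) must be < max ({high})")
    return errors
```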
Purpose: Interface with OpenRouter API for LLM completions.
Responsibilities:
- Manage HTTP connections with httpx
- Handle authentication and headers
- Parse API responses
- Calculate latency metrics
- Support both standard and JSON mode completions
Error Handling:
- AuthenticationError: Invalid API key (401)
- RateLimitError: Rate limit exceeded (429)
- BadRequestError: Malformed request (400)
- OpenRouterError: Server errors (5xx)
Design Pattern: Async context manager for resource management.
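A minimal sketch of an async-context-manager client along these lines; the real client also maps HTTP errors to the exception types above and supports JSON mode, and the method names here are assumptions:

```python
import httpx


class OpenRouterClient:
    """Sketch only: opens an httpx session on enter, closes it on exit."""

    def __init__(self, api_key: str):
        self._api_key = api_key
        self._http: httpx.AsyncClient | None = None

    async def __aenter__(self):
        self._http = httpx.AsyncClient(
            base_url="https://openrouter.ai/api/v1",
            headers={"Authorization": f"Bearer {self._api_key}"},
            timeout=60.0,
        )
        return self

    async def __aexit__(self, *exc):
        await self._http.aclose()

    async def complete(self, model: str, prompt: str) -> str:
        # OpenAI-compatible chat completion request
        resp = await self._http.post(
            "/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

Usage would then be `async with OpenRouterClient(key) as client: ...`, guaranteeing the connection pool is released even when a request fails.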
Purpose: Handle transient errors and rate limiting.
Components:
- RateLimiter: Token bucket rate limiting
  - Tracks requests per minute
  - Sleeps when the limit would be exceeded
  - Thread-safe with async locks
- retry_with_backoff: Decorator for exponential backoff
  - Retries transient errors (rate limits, timeouts, 5xx)
  - Skips non-retryable errors (auth, bad requests)
  - Exponential backoff with jitter
Design Pattern: Decorator pattern for cross-cutting concerns.
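A condensed sketch of such a decorator, with placeholder exception classes standing in for the ones listed earlier; the actual signature and retry policy may differ:

```python
import asyncio
import functools
import random


class RateLimitError(Exception): ...   # placeholder for the real exception types
class OpenRouterError(Exception): ...


def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry transient failures with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except (RateLimitError, TimeoutError, OpenRouterError):
                    if attempt == max_retries:
                        raise
                    # 1s, 2s, 4s, ... plus a little random jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    await asyncio.sleep(delay)
                # AuthenticationError / BadRequestError are not caught and
                # therefore propagate immediately without retrying
        return wrapper
    return decorator
```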
Purpose: Execute tasks on LLM models and collect results.
Responsibilities:
- Build comprehensive prompts from task definitions
- Make API calls with configured parameters
- Calculate costs using CostTracker
- Handle execution errors gracefully
- Display progress with Rich progress bars
Prompt Building Strategy:
- Task description and context
- Output format requirements (emphasized)
- CRITICAL CONSTRAINTS section (bold)
- Examples of good outputs
- Evaluation criteria
- Input data
- Final instructions
Design Pattern: Template method for prompt building.
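A sketch of the template-method idea, assembling the sections listed above in order; the attribute names on the task object are assumptions about TaskDefinition:

```python
def build_prompt(task, input_data: str) -> str:
    """Assemble the prompt sections in a fixed order (illustrative only)."""
    sections = [
        f"# Task\n{task.description}",
        f"# Output Format\n{task.output_format}",
        "# CRITICAL CONSTRAINTS\n" + "\n".join(f"- {c}" for c in task.constraints),
        "# Examples\n" + "\n\n".join(task.examples),
        "# Evaluation Criteria\n" + "\n".join(f"- {c}" for c in task.evaluation_criteria),
        f"# Input\n{input_data}",
        "Follow the output format exactly and respect every constraint.",
    ]
    return "\n\n".join(sections)
```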
Purpose: Evaluate model outputs using LLM-as-judge pattern.
Responsibilities:
- Build evaluation prompts for judge model
- Request JSON-formatted scores
- Parse and validate judge responses
- Categorize violations by type
- Generate comparison reports
Evaluation Dimensions:
- Accuracy Score (0-100): Content correctness
- Format Score (0-100): Format compliance
- Compliance Score (0-100): Constraint adherence
- Overall Score (0-100): Weighted combination
Violation Categories:
- under_min: Below minimum requirements
- over_max: Exceeds maximum limits
- format: Format specification violations
- missing_field: Required fields absent
- other: Miscellaneous issues
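One plausible way to bucket judge-reported violations into these categories; the keyword heuristics are illustrative, not the actual implementation:

```python
def categorize_violation(violation: str) -> str:
    """Map a free-text violation string to one of the categories above."""
    text = violation.lower()
    if "below" in text or "under" in text or "too few" in text:
        return "under_min"
    if "exceed" in text or "over" in text or "too many" in text:
        return "over_max"
    if "format" in text or "json" in text or "csv" in text:
        return "format"
    if "missing" in text or "absent" in text:
        return "missing_field"
    return "other"
```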
Purpose: Compare and rank evaluation results.
Responsibilities:
- Combine evaluation results with judge scores
- Sort models by overall score
- Calculate value metrics (score/cost ratio)
- Generate comparison tables
- Identify best overall and best value models
Design Pattern: Strategy pattern for different comparison metrics.
Purpose: Calculate and track evaluation costs.
Responsibilities:
- Load model pricing from YAML configuration
- Calculate costs from token usage
- Track cumulative costs across evaluations
- Provide cost breakdowns by model
- Generate cost statistics
Pricing Model:
- Input tokens: Price per 1M tokens
- Output tokens: Price per 1M tokens (usually higher)
- Total cost = (input_tokens/1M × input_price) + (output_tokens/1M × output_price)
Design Pattern: Repository pattern for pricing data.
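The formula translates directly to code; the prices in the example comment are hypothetical:

```python
def calculate_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Prices are quoted per 1M tokens, so scale token counts accordingly."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Example: 12,000 input tokens at $3/M and 1,500 output tokens at $15/M
# = 0.036 + 0.0225 = $0.0585 for the run
```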
sequenceDiagram
actor User
participant CLI
participant TaskParser
participant Executor
participant APIClient
participant CostTracker
participant Judge
participant Comparison
User->>CLI: taskbench evaluate task.yaml --models model1,model2
CLI->>TaskParser: load_from_yaml()
TaskParser-->>CLI: TaskDefinition
CLI->>TaskParser: validate_task()
TaskParser-->>CLI: (bool, List[errors])
CLI->>CLI: Load input data
loop For each model
CLI->>Executor: build_prompt(task, input)
Executor-->>CLI: prompt string
CLI->>APIClient: complete(prompt)
APIClient-->>CLI: CompletionResponse
CLI->>CostTracker: calculate_cost(tokens)
CostTracker-->>CLI: cost float
CLI->>Executor: create_result()
Executor-->>CLI: EvaluationResult
end
alt --judge enabled
loop For each result
CLI->>Judge: build_judge_prompt()
Judge-->>CLI: judge prompt
CLI->>APIClient: complete_with_json()
APIClient-->>CLI: JSON scores
CLI->>Judge: parse_scores()
Judge-->>CLI: JudgeScore
end
end
CLI->>Comparison: compare_results()
Comparison-->>CLI: comparison_data
CLI->>Comparison: generate_comparison_table()
Comparison-->>CLI: Rich.Table
CLI->>CLI: Save to JSON
CLI-->>User: Display results
flowchart LR
YAML[YAML File] --> Parser[TaskParser]
Parser --> Validation[Validation]
Validation --> TaskDef[TaskDefinition]
TaskDef --> Executor[Executor]
Executor --> Prompt[Prompt]
subgraph PromptComponents["Task components used in prompt"]
Desc[description:<br/>Main instruction]
Format[output_format:<br/>Format requirements]
Constraints[constraints:<br/>CRITICAL CONSTRAINTS]
Examples[examples:<br/>Example outputs]
Criteria[evaluation_criteria:<br/>What to aim for]
JudgeNote[judge_instructions:<br/>NOT used in task prompt<br/>only for judge]
end
TaskDef -.-> PromptComponents
style YAML fill:#e1f5ff
style TaskDef fill:#fff4e1
style Prompt fill:#e1ffe1
style JudgeNote fill:#ffe1e1
flowchart TD
Input["Input:<br/>• TaskDefinition<br/>• EvaluationResult<br/>• Original Input Data"]
Input --> BuildPrompt["Build judge prompt"]
subgraph PromptContents["Prompt includes:"]
PC1[Task description & criteria]
PC2[Constraints to check]
PC3[Original input for context]
PC4[Model output to evaluate]
PC5[Scoring rubric]
PC6[JSON response format]
end
BuildPrompt -.-> PromptContents
BuildPrompt --> SendAPI["Send to judge model<br/>(Claude Sonnet 4.5)<br/>with JSON mode"]
SendAPI --> ParseJSON["Parse JSON response"]
subgraph JSONSchema["JSON Response:"]
JS1[accuracy_score: 0-100]
JS2[format_score: 0-100]
JS3[compliance_score: 0-100]
JS4[overall_score: 0-100]
JS5[violations: list of issues]
JS6[reasoning: explanation]
end
ParseJSON -.-> JSONSchema
ParseJSON --> CreateScore["Create JudgeScore object<br/>with validation"]
CreateScore --> Return["Return score for comparison"]
style Input fill:#e1f5ff
style SendAPI fill:#fff4e1
style Return fill:#e1ffe1
Decision: Use OpenRouter as the unified API gateway.
Rationale:
- Single API for multiple providers (Anthropic, OpenAI, Google, Meta, etc.)
- Consistent interface across different models
- Built-in rate limiting and load balancing
- Cost-effective pricing
- No need to manage multiple API keys
Trade-offs:
- Dependency on third-party service
- Slight latency overhead vs. direct APIs
- Limited to models available on OpenRouter
Decision: Use Claude Sonnet 4.5 as the evaluation judge.
Rationale:
- Scales to custom tasks without manual evaluation
- Provides detailed, explainable scores
- Detects subtle violations humans might miss
- Consistent evaluation criteria across runs
- Faster and cheaper than human evaluation
Trade-offs:
- Judge model adds cost per evaluation (it runs at temperature=0.3 to keep scoring consistent)
- Judge can have biases or errors
- Requires careful prompt engineering for judge instructions
Validation: Research shows LLM-as-judge correlates well with human judgments for many tasks.
Decision: Use Pydantic for all data structures.
Rationale:
- Runtime type checking and validation
- Automatic JSON serialization/deserialization
- Clear documentation through type hints
- IDE autocomplete and type checking
- Validation errors provide clear messages
Benefits:
- Catches errors early (at data ingestion)
- Self-documenting code
- Easy to extend with validators
- Seamless integration with FastAPI (future)
Decision: Use YAML instead of JSON or Python for task definitions.
Rationale:
- Human-readable and editable
- Supports multi-line strings (for judge instructions)
- Comments for documentation
- Less verbose than JSON
- Declarative (non-executable) for security
Alternative Considered: Python classes
- Rejected: Requires Python knowledge, harder to version control, potential security issues
Decision: Use asyncio throughout the application.
Rationale:
- Non-blocking I/O for API calls
- Better performance for multi-model evaluation
- Scales well for concurrent requests
- Modern Python best practice for I/O-bound applications
Complexity Trade-off:
- Slightly more complex than synchronous code
- Worth it for performance gains (can evaluate 10 models in parallel)
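As a sketch of the payoff, all per-model calls can be issued concurrently with asyncio.gather; client.complete here refers to the hypothetical client sketched earlier, not a confirmed API:

```python
import asyncio


async def evaluate_all(models: list[str], prompt: str, client) -> list:
    """Fan out one request per model; exceptions are returned, not raised,
    so one failing model does not abort the whole batch."""
    calls = [client.complete(model, prompt) for model in models]
    return await asyncio.gather(*calls, return_exceptions=True)
```

With synchronous code the same ten requests would run back to back; here total wall-clock time is roughly that of the slowest single request.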
Decision: Split evaluation into two phases: execution and judging.
Rationale:
- Single Responsibility Principle
- Can run without judge (--no-judge flag)
- Easy to swap judge models
- Clear separation of concerns
- Allows for alternative evaluation methods
Decision: Use Rich library for terminal output.
Rationale:
- Professional-looking output
- Progress bars for long operations
- Color-coded results (green=good, red=bad)
- Tables for easy comparison
- Better user experience
Decision: Built-in cost tracking as a first-class feature.
Rationale:
- LLM API costs can add up quickly
- Users need visibility into spending
- Helps compare models on cost-effectiveness
- Enables budget constraints
- Promotes responsible API usage
| Package | Version | Purpose |
|---|---|---|
| pydantic | >=2.0.0 | Data validation and modeling |
| pyyaml | >=6.0 | YAML parsing for task definitions |
| httpx | >=0.25.0 | Async HTTP client for API calls |
| typer | >=0.9.0 | CLI framework |
| rich | >=13.0.0 | Terminal formatting and progress bars |
| python-dotenv | >=1.0.0 | Environment variable management |
| fastapi | >=0.110.0 | REST API for UI/backend |
| uvicorn | >=0.23.0 | ASGI server |
| streamlit | >=1.30.0 | Frontend UI |
| Package | Purpose |
|---|---|
| pytest | Unit and integration testing |
| pytest-asyncio | Async test support |
| pytest-cov | Code coverage reporting |
| pytest-mock | Mocking for tests |
| black | Code formatting |
| isort | Import sorting |
| mypy | Static type checking |
| flake8 | Linting |
- Required: Python 3.11+
- Rationale: Uses modern async features and type hints
- OpenRouter API: LLM completions
- Models Used:
- Task execution: User-specified (Claude, GPT-4, Gemini, Llama, Qwen, etc.)
- Judge evaluation: Claude Sonnet 4.5 (default)
- Task Definitions: YAML
- Model Pricing: YAML
- Results Output: JSON
- Input Data: Plain text (user-provided)
- Async I/O: Non-blocking API calls
- Dependency Injection: Components receive dependencies
- Repository Pattern: CostTracker for pricing data
- Strategy Pattern: Different comparison metrics
- Template Method: Prompt building
- Decorator Pattern: Retry logic
- Data Transfer Objects: Pydantic models
- Command Pattern: CLI commands
- API Errors: Specific exception types (Auth, RateLimit, BadRequest)
- Retry Logic: Exponential backoff for transient errors
- Validation Errors: Pydantic validation with clear messages
- User Errors: Friendly error messages in CLI
- Logging: Structured logging for debugging
The architecture supports future enhancements:
- Custom Judge Models: Easy to swap judge implementation
- Additional Providers: Can add direct API clients
- Web Interface: Pydantic models ready for FastAPI
- Database Storage: Can add persistence layer
- Parallel Evaluation: Already async-ready
- Custom Metrics: Extensible comparison logic
- Streaming Results: Can add WebSocket support
OPENROUTER_API_KEY=your-key-here  # Required
- config/models.yaml: Model pricing database
- tasks/*.yaml: Task definitions
- .env: Environment variables (not committed)
llm-taskbench/
├── src/taskbench/ # Main package
│ ├── core/ # Core models and task parser
│ ├── api/ # API client and retry logic
│ ├── evaluation/ # Executor, judge, cost tracker
│ ├── cli/ # Command-line interface
│ └── utils/ # Utilities (validation, logging)
├── tasks/ # Task definitions
├── config/ # Configuration files
├── tests/ # Test suite
├── docs/ # Documentation
├── examples/ # Example data and results
└── results/ # Evaluation outputs
LLM TaskBench is architected as a modular, extensible framework for task-specific LLM evaluation. The design prioritizes:
- Type Safety: Pydantic models throughout
- Performance: Async I/O for concurrent operations
- Usability: Rich CLI with clear feedback
- Reliability: Retry logic and error handling
- Transparency: Cost tracking and detailed logging
- Extensibility: Clean separation of concerns
The architecture supports the core workflow: define tasks declaratively, execute on multiple models efficiently, evaluate with LLM-as-judge objectively, and compare results comprehensively.
Purpose: Suggest appropriate models for a use-case.
Responsibilities:
- Read use-case/task traits (e.g., long-context, cost priority).
- Filter OpenRouter models (context window, pricing) and drop curated denylists.
- Provide a candidate set to the CLI/UI; supports auto-selection with --models auto.
Design Pattern: Strategy/filtering heuristics with env-driven defaults.
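A rough sketch of the filtering step over a catalog of model records; the field names, price units, and function name are illustrative rather than the exact OpenRouter schema:

```python
def recommend_models(catalog: list[dict], min_context: int,
                     max_prompt_price: float, denylist: set[str]) -> list[str]:
    """Keep models with enough context, acceptable price, and not denylisted."""
    return sorted(
        m["id"] for m in catalog
        if m.get("context_length", 0) >= min_context
        and float(m.get("pricing", {}).get("prompt", "inf")) <= max_prompt_price
        and m["id"] not in denylist
    )
```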
The framework now supports a folder-based use case architecture that replaces YAML-based task definitions with human-readable Markdown and automatic prompt generation.
Purpose: Parse folder-based use cases from USE-CASE.md files.
Responsibilities:
- Parse USE-CASE.md Markdown files
- Extract goal, difficulty, evaluation notes, edge cases
- Scan data/ and ground-truth/ folders
- Match input files to expected outputs by naming patterns
- Return structured ParsedUseCase object
Folder Structure:
sample-usecases/00-lecture-concept-extraction/
├── USE-CASE.md # Human-readable description
├── data/ # Input files
│ ├── lecture-01-python-basics.txt
│ └── lecture-02-ml-fundamentals.txt
├── ground-truth/ # Expected outputs
│ ├── lecture-01-concepts.csv
│ └── lecture-02-concepts.csv
├── generated-prompts.json # Cached prompts (auto-generated)
└── prompts/ # Individual prompt files
├── task-prompt.txt
├── judge-prompt.txt
├── rubric.json
└── analysis.json
Purpose: Generate task prompts, judge prompts, and rubrics using LLM analysis.
Responsibilities:
- Analyze USE-CASE.md content and ground truth samples
- Generate task prompt optimized for the transformation
- Generate judge prompt with evaluation criteria
- Derive rubric with compliance checks and penalties
- Cache generated prompts to avoid regeneration
Generation Flow:
- Send use case content + ground truth sample to LLM
- LLM returns analysis of transformation type and key fields
- Generate task prompt with specific instructions
- Generate judge prompt with scoring rubric
- Save to generated-prompts.json and prompts/ folder
Purpose: Auto-organize evaluation results by use case.
Structure:
results/
├── 00-lecture-concept-extraction/
│ ├── 2025-12-26_233901_lecture-01-python-basics.json
│ └── 2025-12-26_235012_lecture-02-ml-fundamentals.json
├── 01-meeting-action-items/
│ └── 2025-12-26_234802_meeting-01-standup.json
└── _legacy/
└── (old YAML-based results)
Naming Convention: {YYYY-MM-DD}_{HHMMSS}_{data-file-name}.json
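A small sketch of how such a path could be assembled; the helper name is hypothetical:

```python
from datetime import datetime
from pathlib import Path


def results_path(results_dir: Path, usecase: str, data_file: str) -> Path:
    """Build results/{usecase}/{YYYY-MM-DD}_{HHMMSS}_{data-file-name}.json."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    return results_dir / usecase / f"{stamp}_{Path(data_file).stem}.json"
```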
Robust Response Parsing:
- Handles floats converted to ints for scores
- Supports summary as a fallback for reasoning
- Supports final_score as a fallback for overall_score
- Handles scores nested in a metrics dict
- Converts object-based violations to strings
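A sketch of how these fallbacks might be applied when normalizing a judge response; the key handling mirrors the bullets above, and the helper name is hypothetical:

```python
def normalize_judge_response(raw: dict) -> dict:
    """Tolerate the common response-shape variations described above."""
    scores = raw.get("metrics", raw)                      # scores may be nested in "metrics"
    overall = scores.get("overall_score", scores.get("final_score", 0))
    reasoning = raw.get("reasoning", raw.get("summary", ""))
    violations = [
        v if isinstance(v, str) else v.get("description", str(v))
        for v in raw.get("violations", [])                # object violations become strings
    ]
    return {
        "overall_score": int(round(float(overall))),      # float scores coerced to ints
        "reasoning": reasoning,
        "violations": violations,
    }
```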
Timestamp Range Extraction:
- Extracts first and last timestamps from input data
- Provides full timestamp range context to judge
- Shows both beginning and ending of long inputs
- Prevents false "fabrication" accusations for valid timestamps
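A minimal sketch of the idea, assuming timestamps follow an HH:MM[:SS] pattern; the real extractor may accept other formats:

```python
import re

TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")  # e.g. 01:23 or 01:23:45


def timestamp_range(text: str) -> tuple[str, str] | None:
    """Return the first and last timestamp so the judge sees the full input range."""
    matches = TIMESTAMP_RE.findall(text)
    return (matches[0], matches[-1]) if matches else None
```

Passing this range to the judge prompt makes it explicit that late timestamps in a long transcript are legitimate, so the judge does not flag them as fabricated.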