- High-Level Overview
- System Architecture
- Component Descriptions
- Data Flow
- Design Decisions
- Technology Stack
LLM TaskBench is a task-specific LLM evaluation framework that enables developers and researchers to objectively compare language models on custom, domain-specific tasks. The framework combines agentic orchestration with LLM-as-judge evaluation to provide accurate, reproducible performance metrics.
- Folder-based use cases: Human-readable USE-CASE.md files with automatic prompt generation from ground truth analysis
- LLM-driven prompt generation: Framework analyzes use case + ground truth to generate task prompts, judge prompts, and rubrics
- Results organization: Auto-saved to results/{usecase-name}/{timestamp}_{datafile}.json
- Robust judge parsing: Handles varied LLM response formats (floats, nested metrics, object violations)
- Timestamp range extraction: Judge receives full context of input timestamp range for accurate evaluation
- Model recommendations: Orchestrator suggests models from OpenRouter based on use-case traits
- Cost visibility: Billing via inline usage data and generation lookups, with per-run and per-use-case rollups
- Parallel execution by default with retry/backoff and rate limiting hooks
- Model-aware chunking: executor derives chunk sizes from selected models' context windows
- Folder-Based Use Cases: Define use cases in human-readable Markdown with automatic prompt generation
- Task-Specific Evaluation: Custom tasks with YAML or folder-based configuration
- Multi-Model Support: Evaluate multiple LLMs simultaneously via OpenRouter API
- LLM-as-Judge: Automated evaluation using Claude Sonnet 4.5 with robust response parsing
- Auto-Generated Prompts: LLM analyzes use case and ground truth to derive optimal prompts and rubrics
- Cost Tracking: Real-time cost calculation and tracking
- Rich CLI: User-friendly command-line interface with progress indicators
- Results Organization: Auto-organized by use case with timestamped filenames
- Declarative Task Definition: Tasks are defined in YAML, separating configuration from code
- Type Safety: Pydantic models ensure data validation throughout the pipeline
- Async-First: Asynchronous I/O for efficient API interactions
- Separation of Concerns: Clear boundaries between API client, executor, judge, and cost tracking
- Observable: Rich console output and detailed logging for transparency
flowchart TD
CLI["CLI Interface<br/>(taskbench/cli/main.py)<br/><br/>Commands: evaluate, models, validate"]
subgraph Orchestration["Orchestration Layer (taskbench/evaluation/*)"]
Executor["Executor<br/>- Builds prompts<br/>- Runs tasks<br/>- Collects results"]
Judge["Judge<br/>- Evaluates outputs<br/>- Scores models<br/>- Detects violations"]
CostTracker["CostTracker<br/>- Calculates costs<br/>- Tracks usage<br/>- Provides stats"]
Orchestrator["Model Recommender<br/>- Use-case aware<br/>- Cost/context filters"]
end
subgraph APIClient["API Client Layer (taskbench/api/*)"]
OpenRouter["OpenRouterClient<br/>- HTTP requests<br/>- Authentication<br/>- Error handling<br/>- JSON mode"]
Retry["Retry Logic<br/>- Exponential backoff<br/>- Rate limiting<br/>- Transient errors"]
end
subgraph CoreModels["Core Models (taskbench/core/*)"]
TaskDef["TaskDefinition<br/>ModelConfig"]
EvalResult["EvaluationResult<br/>CompletionResp."]
JudgeScore["JudgeScore<br/>TaskParser"]
end
subgraph External["External Services"]
OpenRouterAPI["OpenRouter API<br/>(aggregates multiple LLM providers)<br/><br/>• Anthropic (Claude)<br/>• OpenAI (GPT-4)<br/>• Google (Gemini)<br/>• Meta (Llama)<br/>• Alibaba (Qwen)<br/>• And more..."]
end
CLI --> Orchestration
Orchestration --> APIClient
CLI --> UI["FastAPI/Streamlit UI<br/>Runs & Use-cases"]
UI --> Orchestration
APIClient --> CoreModels
CoreModels --> External
style CLI fill:#e1f5ff
style Orchestration fill:#fff4e1
style APIClient fill:#f0e1ff
style CoreModels fill:#e1ffe1
style External fill:#ffe1e1
Purpose: Provides the user-facing command-line interface.
Responsibilities:
- Parse command-line arguments using Typer
- Load environment variables and configuration
- Coordinate between evaluation components
- Display results with Rich formatting
- Handle errors and provide user feedback
Key Commands:
- run: Run evaluation on a folder-based use case (primary command)
- list-usecases: List available use cases in a folder
- generate-prompts: Generate prompts from a use case without running the evaluation
- evaluate: Run multi-model evaluation on a YAML task (legacy)
- models: List available models and pricing
- validate: Validate task definition YAML files
Design Pattern: Command pattern with async/await for I/O operations.
Purpose: Define type-safe data structures using Pydantic.
Key Models:
- TaskDefinition: Represents a user-defined evaluation task
  - Name, description, input/output types
  - Evaluation criteria and constraints
  - Examples and judge instructions
  - Validation rules for input/output formats
- CompletionResponse: API response from an LLM completion
  - Response content and metadata
  - Token usage (input, output, total)
  - Latency metrics
- EvaluationResult: Single model evaluation result
  - Model output and status
  - Token usage and cost
  - Timestamp and error information
- JudgeScore: LLM-as-judge scoring result
  - Multi-dimensional scores (accuracy, format, compliance)
  - Violations list
  - Detailed reasoning
- ModelConfig: Model pricing and configuration
  - Pricing per million tokens
  - Context window size
  - Provider information
Design Pattern: Data Transfer Objects (DTOs) with built-in validation.
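A minimal sketch of what two of these DTOs might look like in Pydantic; field names and constraints are illustrative, not the exact schema:

```python
from datetime import datetime
from typing import Optional
from pydantic import BaseModel, Field


class JudgeScore(BaseModel):
    """Illustrative shape of an LLM-as-judge score (fields assumed)."""
    accuracy_score: int = Field(ge=0, le=100)
    format_score: int = Field(ge=0, le=100)
    compliance_score: int = Field(ge=0, le=100)
    overall_score: int = Field(ge=0, le=100)
    violations: list[str] = []
    reasoning: str = ""


class EvaluationResult(BaseModel):
    """Single model run: output plus usage, cost, and status metadata (assumed)."""
    model: str
    output: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    status: str = "success"
    error: Optional[str] = None
    timestamp: datetime = Field(default_factory=datetime.now)
```

Because these are plain Pydantic models, any out-of-range score or missing field raises a validation error at construction time rather than surfacing later in the pipeline.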
Purpose: Load, validate, and save task definitions from YAML.
Responsibilities:
- Parse YAML files into TaskDefinition objects
- Validate task structure and constraints
- Check for logical errors (e.g., min >= max)
- Save tasks back to YAML format
Validation Rules:
- Required fields must be present
- Input/output types must be valid
- Constraints must be logically consistent
- Min/max pairs must satisfy min < max
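As an illustration, the min/max consistency check might look like this; the function and field names are assumptions, not the actual parser API:

```python
def validate_constraints(constraints: dict) -> list[str]:
    """Return human-readable errors for logically inconsistent constraints."""
    errors = []
    for name, spec in constraints.items():
        low, high = spec.get("min"), spec.get("max")
        # A min/max pair is only valid when min < max
        if low is not None and high is not None and low >= high:
            errors.append(f"constraint '{name}': min ({low}) must be < max ({high})")
    return errors
```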
Purpose: Interface with OpenRouter API for LLM completions.
Responsibilities:
- Manage HTTP connections with httpx
- Handle authentication and headers
- Parse API responses
- Calculate latency metrics
- Support both standard and JSON mode completions
Error Handling:
- AuthenticationError: Invalid API key (401)
- RateLimitError: Rate limit exceeded (429)
- BadRequestError: Malformed request (400)
- OpenRouterError: Server errors (5xx)
Design Pattern: Async context manager for resource management.
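A minimal sketch of an async-context-manager client along these lines; the real client also maps HTTP errors to the exception types above and supports JSON mode, and the method names here are assumptions:

```python
import httpx


class OpenRouterClient:
    """Sketch only: opens an httpx session on enter, closes it on exit."""

    def __init__(self, api_key: str):
        self._api_key = api_key
        self._http: httpx.AsyncClient | None = None

    async def __aenter__(self):
        self._http = httpx.AsyncClient(
            base_url="https://openrouter.ai/api/v1",
            headers={"Authorization": f"Bearer {self._api_key}"},
            timeout=60.0,
        )
        return self

    async def __aexit__(self, *exc):
        await self._http.aclose()

    async def complete(self, model: str, prompt: str) -> str:
        # OpenAI-compatible chat completion request
        resp = await self._http.post(
            "/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

Usage would then be `async with OpenRouterClient(key) as client: ...`, guaranteeing the connection pool is released even when a request fails.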
Purpose: Handle transient errors and rate limiting.
Components:
- RateLimiter: Token bucket rate limiting
  - Tracks requests per minute
  - Sleeps when the limit would be exceeded
  - Thread-safe with async locks
- retry_with_backoff: Decorator for exponential backoff
  - Retries transient errors (rate limits, timeouts, 5xx)
  - Skips non-retryable errors (auth, bad requests)
  - Exponential backoff with jitter
Design Pattern: Decorator pattern for cross-cutting concerns.
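A condensed sketch of such a decorator, with placeholder exception classes standing in for the ones listed earlier; the actual signature and retry policy may differ:

```python
import asyncio
import functools
import random


class RateLimitError(Exception): ...   # placeholder for the real exception types
class OpenRouterError(Exception): ...


def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry transient failures with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except (RateLimitError, TimeoutError, OpenRouterError):
                    if attempt == max_retries:
                        raise
                    # 1s, 2s, 4s, ... plus a little random jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    await asyncio.sleep(delay)
                # AuthenticationError / BadRequestError are not caught and
                # therefore propagate immediately without retrying
        return wrapper
    return decorator
```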
Purpose: Execute tasks on LLM models and collect results.
Responsibilities:
- Build comprehensive prompts from task definitions
- Make API calls with configured parameters
- Calculate costs using CostTracker
- Handle execution errors gracefully
- Display progress with Rich progress bars
Prompt Building Strategy:
- Task description and context
- Output format requirements (emphasized)
- CRITICAL CONSTRAINTS section (bold)
- Examples of good outputs
- Evaluation criteria
- Input data
- Final instructions
Design Pattern: Template method for prompt building.
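A sketch of the template-method idea, assembling the sections listed above in order; the attribute names on the task object are assumptions about TaskDefinition:

```python
def build_prompt(task, input_data: str) -> str:
    """Assemble the prompt sections in a fixed order (illustrative only)."""
    sections = [
        f"# Task\n{task.description}",
        f"# Output Format\n{task.output_format}",
        "# CRITICAL CONSTRAINTS\n" + "\n".join(f"- {c}" for c in task.constraints),
        "# Examples\n" + "\n\n".join(task.examples),
        "# Evaluation Criteria\n" + "\n".join(f"- {c}" for c in task.evaluation_criteria),
        f"# Input\n{input_data}",
        "Follow the output format exactly and respect every constraint.",
    ]
    return "\n\n".join(sections)
```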
Purpose: Evaluate model outputs using LLM-as-judge pattern.
Responsibilities:
- Build evaluation prompts for judge model
- Request JSON-formatted scores
- Parse and validate judge responses
- Categorize violations by type
- Generate comparison reports
Evaluation Dimensions:
- Accuracy Score (0-100): Content correctness
- Format Score (0-100): Format compliance
- Compliance Score (0-100): Constraint adherence
- Overall Score (0-100): Weighted combination
Violation Categories:
- under_min: Below minimum requirements
- over_max: Exceeds maximum limits
- format: Format specification violations
- missing_field: Required fields absent
- other: Miscellaneous issues
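One plausible way to bucket judge-reported violations into these categories; the keyword heuristics are illustrative, not the actual implementation:

```python
def categorize_violation(violation: str) -> str:
    """Map a free-text violation string to one of the categories above."""
    text = violation.lower()
    if "below" in text or "under" in text or "too few" in text:
        return "under_min"
    if "exceed" in text or "over" in text or "too many" in text:
        return "over_max"
    if "format" in text or "json" in text or "csv" in text:
        return "format"
    if "missing" in text or "absent" in text:
        return "missing_field"
    return "other"
```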
Purpose: Compare and rank evaluation results.
Responsibilities:
- Combine evaluation results with judge scores
- Sort models by overall score
- Calculate value metrics (score/cost ratio)
- Generate comparison tables
- Identify best overall and best value models
Design Pattern: Strategy pattern for different comparison metrics.
Purpose: Calculate and track evaluation costs.
Responsibilities:
- Load model pricing from YAML configuration
- Calculate costs from token usage
- Track cumulative costs across evaluations
- Provide cost breakdowns by model
- Generate cost statistics
Pricing Model:
- Input tokens: Price per 1M tokens
- Output tokens: Price per 1M tokens (usually higher)
- Total cost = (input_tokens/1M × input_price) + (output_tokens/1M × output_price)
Design Pattern: Repository pattern for pricing data.
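The formula translates directly to code; the prices in the example comment are hypothetical:

```python
def calculate_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Prices are quoted per 1M tokens, so scale token counts accordingly."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Example: 12,000 input tokens at $3/M and 1,500 output tokens at $15/M
# = 0.036 + 0.0225 = $0.0585 for the run
```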
sequenceDiagram
actor User
participant CLI
participant TaskParser
participant Executor
participant APIClient
participant CostTracker
participant Judge
participant Comparison
User->>CLI: taskbench evaluate task.yaml --models model1,model2
CLI->>TaskParser: load_from_yaml()
TaskParser-->>CLI: TaskDefinition
CLI->>TaskParser: validate_task()
TaskParser-->>CLI: (bool, List[errors])
CLI->>CLI: Load input data
loop For each model
CLI->>Executor: build_prompt(task, input)
Executor-->>CLI: prompt string
CLI->>APIClient: complete(prompt)
APIClient-->>CLI: CompletionResponse
CLI->>CostTracker: calculate_cost(tokens)
CostTracker-->>CLI: cost float
CLI->>Executor: create_result()
Executor-->>CLI: EvaluationResult
end
alt --judge enabled
loop For each result
CLI->>Judge: build_judge_prompt()
Judge-->>CLI: judge prompt
CLI->>APIClient: complete_with_json()
APIClient-->>CLI: JSON scores
CLI->>Judge: parse_scores()
Judge-->>CLI: JudgeScore
end
end
CLI->>Comparison: compare_results()
Comparison-->>CLI: comparison_data
CLI->>Comparison: generate_comparison_table()
Comparison-->>CLI: Rich.Table
CLI->>CLI: Save to JSON
CLI-->>User: Display results
flowchart LR
YAML[YAML File] --> Parser[TaskParser]
Parser --> Validation[Validation]
Validation --> TaskDef[TaskDefinition]
TaskDef --> Executor[Executor]
Executor --> Prompt[Prompt]
subgraph PromptComponents["Task components used in prompt"]
Desc[description:<br/>Main instruction]
Format[output_format:<br/>Format requirements]
Constraints[constraints:<br/>CRITICAL CONSTRAINTS]
Examples[examples:<br/>Example outputs]
Criteria[evaluation_criteria:<br/>What to aim for]
JudgeNote[judge_instructions:<br/>NOT used in task prompt<br/>only for judge]
end
TaskDef -.-> PromptComponents
style YAML fill:#e1f5ff
style TaskDef fill:#fff4e1
style Prompt fill:#e1ffe1
style JudgeNote fill:#ffe1e1
flowchart TD
Input["Input:<br/>• TaskDefinition<br/>• EvaluationResult<br/>• Original Input Data"]
Input --> BuildPrompt["Build judge prompt"]
subgraph PromptContents["Prompt includes:"]
PC1[Task description & criteria]
PC2[Constraints to check]
PC3[Original input for context]
PC4[Model output to evaluate]
PC5[Scoring rubric]
PC6[JSON response format]
end
BuildPrompt -.-> PromptContents
BuildPrompt --> SendAPI["Send to judge model<br/>(Claude Sonnet 4.5)<br/>with JSON mode"]
SendAPI --> ParseJSON["Parse JSON response"]
subgraph JSONSchema["JSON Response:"]
JS1[accuracy_score: 0-100]
JS2[format_score: 0-100]
JS3[compliance_score: 0-100]
JS4[overall_score: 0-100]
JS5[violations: list of issues]
JS6[reasoning: explanation]
end
ParseJSON -.-> JSONSchema
ParseJSON --> CreateScore["Create JudgeScore object<br/>with validation"]
CreateScore --> Return["Return score for comparison"]
style Input fill:#e1f5ff
style SendAPI fill:#fff4e1
style Return fill:#e1ffe1
Decision: Use OpenRouter as the unified API gateway.
Rationale:
- Single API for multiple providers (Anthropic, OpenAI, Google, Meta, etc.)
- Consistent interface across different models
- Built-in rate limiting and load balancing
- Cost-effective pricing
- No need to manage multiple API keys
Trade-offs:
- Dependency on third-party service
- Slight latency overhead vs. direct APIs
- Limited to models available on OpenRouter
Decision: Use Claude Sonnet 4.5 as the evaluation judge.
Rationale:
- Scales to custom tasks without manual evaluation
- Provides detailed, explainable scores
- Detects subtle violations humans might miss
- Consistent evaluation criteria across runs
- Faster and cheaper than human evaluation
Trade-offs:
- Judge model adds cost per evaluation (it runs at temperature=0.3 to keep scoring consistent)
- Judge can have biases or errors
- Requires careful prompt engineering for judge instructions
Validation: Research shows LLM-as-judge correlates well with human judgments for many tasks.
Decision: Use Pydantic for all data structures.
Rationale:
- Runtime type checking and validation
- Automatic JSON serialization/deserialization
- Clear documentation through type hints
- IDE autocomplete and type checking
- Validation errors provide clear messages
Benefits:
- Catches errors early (at data ingestion)
- Self-documenting code
- Easy to extend with validators
- Seamless integration with FastAPI (future)
Decision: Use YAML instead of JSON or Python for task definitions.
Rationale:
- Human-readable and editable
- Supports multi-line strings (for judge instructions)
- Comments for documentation
- Less verbose than JSON
- Declarative (non-executable) for security
Alternative Considered: Python classes
- Rejected: Requires Python knowledge, harder to version control, potential security issues
Decision: Use asyncio throughout the application.
Rationale:
- Non-blocking I/O for API calls
- Better performance for multi-model evaluation
- Scales well for concurrent requests
- Modern Python best practice for I/O-bound applications
Complexity Trade-off:
- Slightly more complex than synchronous code
- Worth it for performance gains (can evaluate 10 models in parallel)
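As a sketch of the payoff, all per-model calls can be issued concurrently with asyncio.gather; client.complete here refers to the hypothetical client sketched earlier, not a confirmed API:

```python
import asyncio


async def evaluate_all(models: list[str], prompt: str, client) -> list:
    """Fan out one request per model; exceptions are returned, not raised,
    so one failing model does not abort the whole batch."""
    calls = [client.complete(model, prompt) for model in models]
    return await asyncio.gather(*calls, return_exceptions=True)
```

With synchronous code the same ten requests would run back to back; here total wall-clock time is roughly that of the slowest single request.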
Decision: Split evaluation into two phases: execution and judging.
Rationale:
- Single Responsibility Principle
- Can run without judge (--no-judge flag)
- Easy to swap judge models
- Clear separation of concerns
- Allows for alternative evaluation methods
Decision: Use Rich library for terminal output.
Rationale:
- Professional-looking output
- Progress bars for long operations
- Color-coded results (green=good, red=bad)
- Tables for easy comparison
- Better user experience
Decision: Built-in cost tracking as a first-class feature.
Rationale:
- LLM API costs can add up quickly
- Users need visibility into spending
- Helps compare models on cost-effectiveness
- Enables budget constraints
- Promotes responsible API usage
| Package | Version | Purpose |
|---|---|---|
| pydantic | >=2.0.0 | Data validation and modeling |
| pyyaml | >=6.0 | YAML parsing for task definitions |
| httpx | >=0.25.0 | Async HTTP client for API calls |
| typer | >=0.9.0 | CLI framework |
| rich | >=13.0.0 | Terminal formatting and progress bars |
| python-dotenv | >=1.0.0 | Environment variable management |
| fastapi | >=0.110.0 | REST API for UI/backend |
| uvicorn | >=0.23.0 | ASGI server |
| streamlit | >=1.30.0 | Frontend UI |
| Package | Purpose |
|---|---|
| pytest | Unit and integration testing |
| pytest-asyncio | Async test support |
| pytest-cov | Code coverage reporting |
| pytest-mock | Mocking for tests |
| black | Code formatting |
| isort | Import sorting |
| mypy | Static type checking |
| flake8 | Linting |
- Required: Python 3.11+
- Rationale: Uses modern async features and type hints
- OpenRouter API: LLM completions
- Models Used:
- Task execution: User-specified (Claude, GPT-4, Gemini, Llama, Qwen, etc.)
- Judge evaluation: Claude Sonnet 4.5 (default)
- Task Definitions: YAML
- Model Pricing: YAML
- Results Output: JSON
- Input Data: Plain text (user-provided)
- Async I/O: Non-blocking API calls
- Dependency Injection: Components receive dependencies
- Repository Pattern: CostTracker for pricing data
- Strategy Pattern: Different comparison metrics
- Template Method: Prompt building
- Decorator Pattern: Retry logic
- Data Transfer Objects: Pydantic models
- Command Pattern: CLI commands
- API Errors: Specific exception types (Auth, RateLimit, BadRequest)
- Retry Logic: Exponential backoff for transient errors
- Validation Errors: Pydantic validation with clear messages
- User Errors: Friendly error messages in CLI
- Logging: Structured logging for debugging
The architecture supports future enhancements:
- Custom Judge Models: Easy to swap judge implementation
- Additional Providers: Can add direct API clients
- Web Interface: Pydantic models ready for FastAPI
- Database Storage: Can add persistence layer
- Parallel Evaluation: Already async-ready
- Custom Metrics: Extensible comparison logic
- Streaming Results: Can add WebSocket support
OPENROUTER_API_KEY=your-key-here  # Required
- config/models.yaml: Model pricing database
- tasks/*.yaml: Task definitions
- .env: Environment variables (not committed)
llm-taskbench/
├── src/taskbench/ # Main package
│ ├── core/ # Core models and task parser
│ ├── api/ # API client and retry logic
│ ├── evaluation/ # Executor, judge, cost tracker
│ ├── cli/ # Command-line interface
│ └── utils/ # Utilities (validation, logging)
├── tasks/ # Task definitions
├── config/ # Configuration files
├── tests/ # Test suite
├── docs/ # Documentation
├── examples/ # Example data and results
└── results/ # Evaluation outputs
LLM TaskBench is architected as a modular, extensible framework for task-specific LLM evaluation. The design prioritizes:
- Type Safety: Pydantic models throughout
- Performance: Async I/O for concurrent operations
- Usability: Rich CLI with clear feedback
- Reliability: Retry logic and error handling
- Transparency: Cost tracking and detailed logging
- Extensibility: Clean separation of concerns
The architecture supports the core workflow: define tasks declaratively, execute on multiple models efficiently, evaluate with LLM-as-judge objectively, and compare results comprehensively.
Purpose: Suggest appropriate models for a use-case.
Responsibilities:
- Read use-case/task traits (e.g., long-context, cost priority).
- Filter OpenRouter models (context window, pricing) and drop curated denylists.
- Provide a candidate set to the CLI/UI; supports auto-selection with --models auto.
Design Pattern: Strategy/filtering heuristics with env-driven defaults.
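A rough sketch of the filtering step over a catalog of model records; the field names, price units, and function name are illustrative rather than the exact OpenRouter schema:

```python
def recommend_models(catalog: list[dict], min_context: int,
                     max_prompt_price: float, denylist: set[str]) -> list[str]:
    """Keep models with enough context, acceptable price, and not denylisted."""
    return sorted(
        m["id"] for m in catalog
        if m.get("context_length", 0) >= min_context
        and float(m.get("pricing", {}).get("prompt", "inf")) <= max_prompt_price
        and m["id"] not in denylist
    )
```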
The framework now supports a folder-based use case architecture that replaces YAML-based task definitions with human-readable Markdown and automatic prompt generation.
Purpose: Parse folder-based use cases from USE-CASE.md files.
Responsibilities:
- Parse USE-CASE.md Markdown files
- Extract goal, difficulty, evaluation notes, edge cases
- Scan data/ and ground-truth/ folders
- Match input files to expected outputs by naming patterns
- Return structured ParsedUseCase object
Folder Structure:
sample-usecases/00-lecture-concept-extraction/
├── USE-CASE.md # Human-readable description
├── data/ # Input files
│ ├── lecture-01-python-basics.txt
│ └── lecture-02-ml-fundamentals.txt
├── ground-truth/ # Expected outputs
│ ├── lecture-01-concepts.csv
│ └── lecture-02-concepts.csv
├── generated-prompts.json # Cached prompts (auto-generated)
└── prompts/ # Individual prompt files
├── task-prompt.txt
├── judge-prompt.txt
├── rubric.json
└── analysis.json
Purpose: Generate task prompts, judge prompts, and rubrics using LLM analysis.
Responsibilities:
- Analyze USE-CASE.md content and ground truth samples
- Generate task prompt optimized for the transformation
- Generate judge prompt with evaluation criteria
- Derive rubric with compliance checks and penalties
- Cache generated prompts to avoid regeneration
Generation Flow:
- Send use case content + ground truth sample to LLM
- LLM returns analysis of transformation type and key fields
- Generate task prompt with specific instructions
- Generate judge prompt with scoring rubric
- Save to generated-prompts.json and prompts/ folder
Purpose: Auto-organize evaluation results by use case.
Structure:
results/
├── 00-lecture-concept-extraction/
│ ├── 2025-12-26_233901_lecture-01-python-basics.json
│ └── 2025-12-26_235012_lecture-02-ml-fundamentals.json
├── 01-meeting-action-items/
│ └── 2025-12-26_234802_meeting-01-standup.json
└── _legacy/
└── (old YAML-based results)
Naming Convention: {YYYY-MM-DD}_{HHMMSS}_{data-file-name}.json
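A small sketch of how such a path could be assembled; the helper name is hypothetical:

```python
from datetime import datetime
from pathlib import Path


def results_path(results_dir: Path, usecase: str, data_file: str) -> Path:
    """Build results/{usecase}/{YYYY-MM-DD}_{HHMMSS}_{data-file-name}.json."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    return results_dir / usecase / f"{stamp}_{Path(data_file).stem}.json"
```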
Robust Response Parsing:
- Handles floats converted to ints for scores
- Supports summary as a fallback for reasoning
- Supports final_score as a fallback for overall_score
- Handles scores nested in a metrics dict
- Converts object-based violations to strings
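A sketch of how these fallbacks might be applied when normalizing a judge response; the key handling mirrors the bullets above, and the helper name is hypothetical:

```python
def normalize_judge_response(raw: dict) -> dict:
    """Tolerate the common response-shape variations described above."""
    scores = raw.get("metrics", raw)                      # scores may be nested in "metrics"
    overall = scores.get("overall_score", scores.get("final_score", 0))
    reasoning = raw.get("reasoning", raw.get("summary", ""))
    violations = [
        v if isinstance(v, str) else v.get("description", str(v))
        for v in raw.get("violations", [])                # object violations become strings
    ]
    return {
        "overall_score": int(round(float(overall))),      # float scores coerced to ints
        "reasoning": reasoning,
        "violations": violations,
    }
```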
Timestamp Range Extraction:
- Extracts first and last timestamps from input data
- Provides full timestamp range context to judge
- Shows both beginning and ending of long inputs
- Prevents false "fabrication" accusations for valid timestamps
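A minimal sketch of the idea, assuming timestamps follow an HH:MM[:SS] pattern; the real extractor may accept other formats:

```python
import re

TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")  # e.g. 01:23 or 01:23:45


def timestamp_range(text: str) -> tuple[str, str] | None:
    """Return the first and last timestamp so the judge sees the full input range."""
    matches = TIMESTAMP_RE.findall(text)
    return (matches[0], matches[-1]) if matches else None
```

Passing this range to the judge prompt makes it explicit that late timestamps in a long transcript are legitimate, so the judge does not flag them as fabricated.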