---
title: CodeLens Environment
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
---

AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.
CodeLens is a high-fidelity evaluation environment where AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues before providing a final verdict.
Designed for researchers and developers building the next generation of AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.
Progress in AI coding assistants has largely focused on generation (writing code), but evaluation (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- Precision: Identifying exactly where a bug exists.
- Context: Understanding how a local change affects the whole system.
- Security-First Mindset: Spotting non-obvious vulnerabilities like SQL injection or race conditions.
CodeLens transforms these human-centric skills into a measurable benchmark, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.
Get up and running locally in under 2 minutes:
```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```

- Dashboard: http://localhost:7860/dashboard
- API Docs: http://localhost:7860/docs
CodeLens benchmarks agents across three critical engineering domains:
| Task | Difficulty | Scenarios | Max Steps | Focus Area |
|---|---|---|---|---|
| `bug_detection` | Easy | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
| `security_audit` | Medium | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | Hard | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
Each `step()` and `reset()` call returns a typed `Observation` object:

| Field | Type | Description |
|---|---|---|
| `task_id` | `TaskId` (enum) | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash` | `str` | Deterministic identifier for the scenario |
| `pr_title` | `str` | Title of the synthetic pull request |
| `pr_description` | `str` | Description/context for the PR |
| `diff` | `str` | Full unified diff (all files concatenated) |
| `files_changed` | `List[FileChanged]` | Structured file patches with metadata |
| `step_count` | `int` | Current step number (0-indexed) |
| `max_steps` | `int` | Maximum steps allowed for this task |
| `noise_budget` | `int` | Remaining false-positive credits (starts at 5) |
| `issues_flagged` | `int` | Number of correctly matched issues so far |
| `done` | `bool` | Whether the episode has terminated |
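The observation schema above can be mirrored client-side. A minimal sketch using dataclasses — field names come from the table, but the classes themselves (including the `FileChanged` fields) are illustrative, not the server's actual Pydantic models:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileChanged:
    # Illustrative structure; the server's file-patch metadata may differ
    filename: str
    patch: str


@dataclass
class Observation:
    task_id: str              # "bug_detection" | "security_audit" | "architectural_review"
    scenario_hash: str        # deterministic scenario identifier
    pr_title: str
    pr_description: str
    diff: str                 # full unified diff, all files concatenated
    files_changed: List[FileChanged]
    step_count: int           # 0-indexed
    max_steps: int
    noise_budget: int         # starts at 5
    issues_flagged: int
    done: bool


def parse_observation(payload: dict) -> Observation:
    """Turn a /reset or /step JSON payload into a typed Observation."""
    files = [FileChanged(**f) for f in payload.pop("files_changed", [])]
    return Observation(files_changed=files, **payload)
```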
Agents submit typed `Action` objects with the following fields:

| Field | Type | Required For | Description |
|---|---|---|---|
| `action_type` | `ActionType` (enum) | All actions | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body` | `str` | All actions | Description or explanation text |
| `filename` | `str` | `flag_issue` | File containing the issue |
| `line_number` | `int` | `flag_issue` | Approximate line number of the issue |
| `category` | `Category` (enum) | `flag_issue` | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity` | `Severity` (enum) | `flag_issue` | `critical`, `high`, `medium`, `low`, `info` |
| `verdict` | `Verdict` (enum) | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |
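In practice, a `flag_issue` action carries the location and classification fields, while a terminal verdict needs only `verdict` and `body`. A sketch of both as plain dicts ready for JSON submission — the file path and message text are made up for illustration, the field names and enum values come from the table above:

```python
# A flag_issue action: requires filename, line_number, category, severity
flag_action = {
    "action_type": "flag_issue",
    "body": "User input interpolated into a SQL string without parameterization",
    "filename": "api/search.py",   # illustrative path
    "line_number": 14,
    "category": "security",
    "severity": "critical",
}

# A terminal action: requires a verdict and ends the episode
verdict_action = {
    "action_type": "request_changes",
    "body": "Blocking: the SQL injection must be fixed before merge",
    "verdict": "request_changes",
}
```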
Each `step()` returns a typed `Reward` object:

| Field | Type | Description |
|---|---|---|
| `value` | `float` | Normalised score (0.0–1.0) |
| `reason` | `str` | Human-readable explanation of the reward |
| `is_terminal` | `bool` | `True` on the final step of an episode |
Reward shaping: Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur −0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
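The shaping rules above can be sketched as a small function. The severity weights and the −0.05 penalty come from the text; the function itself is an illustration, not the environment's actual grader:

```python
# Severity weights from the reward-shaping rules
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}


def shaped_reward(matched: bool, severity: str, noise_budget: int):
    """Return (reward, remaining_noise_budget) for one flag_issue action."""
    if matched:
        # Correct flags score by severity and leave the budget untouched
        return SEVERITY_WEIGHTS[severity], noise_budget
    # False positives and duplicates: fixed penalty plus one budget credit
    return -0.05, noise_budget - 1


# A hallucinated flag burns one of the five credits; zero ends the episode
reward, budget = shaped_reward(False, "low", 5)
terminated = budget == 0
```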
- Predictable State Management: The `reset()` and `step()` functions are strictly idempotent for a given task/seed pair, ensuring 100% reproducible episodes.
- Dense Reward Signal: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action, from the first issue flagged to the final verdict, produces a typed `Reward` object with a human-readable rationale, accelerating agent learning (process supervision).
- Novelty — The Reviewer Trust Mechanic: The noise budget (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it exhausts the budget and the episode terminates, penalizing high-volume, low-precision behavior.
Bug Detection: `Score = 0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`.
Issues are scored on keyword accuracy (50%) and severity matching (50%).

Security Audit: `Score = avg(per_issue_score)`, where each issue scores `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a CRITICAL issue as LOW incurs a major penalty.

Architectural Review: `Score = 0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.
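A worked example of the three graders' arithmetic — the weights and formulas come from the text above, while the per-component scores (coverage, severity accuracy, and so on) are made-up inputs for illustration:

```python
# Bug detection: 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate
bug = 0.4 * 0.8 + 0.6 * 0.75 - 0.1 * 0.2

# Security audit: mean over flagged issues of
# 0.7 * severity_accuracy + 0.3 * keyword_coverage
issues = [(1.0, 0.6), (0.5, 0.9)]  # (severity_accuracy, keyword_coverage) pairs
sec = sum(0.7 * s + 0.3 * k for s, k in issues) / len(issues)

# Architectural review: 0.6 * detection_rate + 0.2 * verdict_accuracy
#                       + 0.2 * detail_quality
arch = 0.6 * 0.5 + 0.2 * 1.0 + 0.2 * 0.4
```

With these inputs the bug-detection score lands at 0.75, showing how a moderate false-positive rate drags down an otherwise strong review.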
Every episode permits 5 false positive credits. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):
| Task | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
|---|---|---|---|---|
| `bug_detection` | 0.3577 | 0.9167 | 0.0000 | 40% |
| `security_audit` | 0.1850 | 1.0000 | 0.0000 | 20% |
| `architectural_review` | 0.2930 | 0.6640 | 0.0000 | 40% |
| Overall | 0.2786 | — | — | 33% |
Agent: `KeywordAgent` (heuristic, 35+ rules) — see `scripts/baseline.py`

Reproduce: `python scripts/evaluate.py --agent keyword --output results.json`
These scores represent a deterministic lower bound. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform this baseline.
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/reset` | Optional | Start a new evaluation episode |
| POST | `/step/{id}` | Optional | Submit a review action (`flag_issue`, `approve`, …) |
| GET | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
| GET | `/leaderboard` | None | Paginated performance rankings |
| POST | `/submit` | Optional | Persist an episode result to the leaderboard |
| GET | `/stats` | None | Aggregate statistics across all agents |
| GET | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
| GET | `/dashboard` | None | Interactive real-time dashboard |
| GET | `/health` | None | System status and health check |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
```bash
docker compose up -d
# View logs: docker compose logs -f
```

```bash
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
```

```bash
docker compose -f docker-compose.test.yml up
```

```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```

```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json

# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
```

CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
```python
import requests

API = "http://localhost:7860"

# Start a new episode
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
episode_id = resp.json()["episode_id"]

done = False
while not done:
    # Your agent logic analyzes the diff here
    action = {
        "action_type": "flag_issue",
        "body": "Identified a vulnerability at line 14",
        "filename": "api/search.py",
        "line_number": 14,
        "severity": "critical",
        "category": "security",
    }
    result = requests.post(f"{API}/step/{episode_id}", json=action).json()
    done = result["done"]

# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```

```
open-ev-code-handler/
├── app.py                 # FastAPI application (9 endpoints)
├── codelens_env/          # Core evaluation logic
│   ├── database.py        # SQLModel persistence layer
│   ├── env.py             # Episode state machine
│   ├── models.py          # Pydantic v2 data models
│   ├── scenarios.py       # 30 synthetic PR scenarios
│   └── graders/           # Grader implementations (Bug, Sec, Arch)
├── scripts/               # CLI tools (baseline, evaluate, migrate)
├── static/                # Compiled dashboard assets
├── tests/                 # 155+ parametrized tests
├── Dockerfile             # Multi-stage, non-root build
├── docker-compose.yml     # Production orchestration
└── openenv.yaml           # CodeLens v2 specification
```
```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Automated tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env

# Linter check
pylint codelens_env/ app.py

# Scenario sanity check
PYTHONPATH=. python scripts/validate.py
```

CodeLens is authored and maintained by:
Please see CONTRIBUTING.md for details on authoring new scenarios and submission standards.
This project is licensed under the MIT License.