---
title: CodeLens Environment
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
---

AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.
CodeLens is a high-fidelity evaluation environment where AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues before providing a final verdict.
Designed for researchers and developers building the next generation of AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.
Progress in AI coding assistants has largely focused on generation (writing code), but evaluation (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- Precision: Identifying exactly where a bug exists.
- Context: Understanding how a local change affects the whole system.
- Security-First Mindset: Spotting non-obvious vulnerabilities like SQL injection or race conditions.
CodeLens transforms these human-centric skills into a measurable benchmark, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.
Get up and running locally in under 2 minutes:
```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```

- Dashboard: http://localhost:7860/dashboard
- API Docs: http://localhost:7860/docs
CodeLens benchmarks agents across three critical engineering domains:
| Task | Difficulty | Scenarios | Max Steps | Focus Area |
|---|---|---|---|---|
| `bug_detection` | Easy | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
| `security_audit` | Medium | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | Hard | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
Each `step()` and `reset()` call returns a typed `Observation` object:

| Field | Type | Description |
|---|---|---|
| `task_id` | `TaskId` (enum) | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash` | `str` | Deterministic identifier for the scenario |
| `pr_title` | `str` | Title of the synthetic pull request |
| `pr_description` | `str` | Description/context for the PR |
| `diff` | `str` | Full unified diff (all files concatenated) |
| `files_changed` | `List[FileChanged]` | Structured file patches with metadata |
| `step_count` | `int` | Current step number (0-indexed) |
| `max_steps` | `int` | Maximum steps allowed for this task |
| `noise_budget` | `int` | Remaining false-positive credits (starts at 5) |
| `issues_flagged` | `int` | Number of correctly matched issues so far |
| `done` | `bool` | Whether the episode has terminated |
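The observation schema above can be mirrored client-side. A minimal sketch using dataclasses — field names come from the table, but the classes themselves (including the `FileChanged` fields) are illustrative, not the server's actual Pydantic models:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileChanged:
    # Illustrative structure; the server's file-patch metadata may differ
    filename: str
    patch: str


@dataclass
class Observation:
    task_id: str              # "bug_detection" | "security_audit" | "architectural_review"
    scenario_hash: str        # deterministic scenario identifier
    pr_title: str
    pr_description: str
    diff: str                 # full unified diff, all files concatenated
    files_changed: List[FileChanged]
    step_count: int           # 0-indexed
    max_steps: int
    noise_budget: int         # starts at 5
    issues_flagged: int
    done: bool


def parse_observation(payload: dict) -> Observation:
    """Turn a /reset or /step JSON payload into a typed Observation."""
    files = [FileChanged(**f) for f in payload.pop("files_changed", [])]
    return Observation(files_changed=files, **payload)
```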
Agents submit typed `Action` objects with the following fields:

| Field | Type | Required For | Description |
|---|---|---|---|
| `action_type` | `ActionType` (enum) | All actions | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body` | `str` | All actions | Description or explanation text |
| `filename` | `str` | `flag_issue` | File containing the issue |
| `line_number` | `int` | `flag_issue` | Approximate line number of the issue |
| `category` | `Category` (enum) | `flag_issue` | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity` | `Severity` (enum) | `flag_issue` | `critical`, `high`, `medium`, `low`, `info` |
| `verdict` | `Verdict` (enum) | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |
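In practice, a `flag_issue` action carries the location and classification fields, while a terminal verdict needs only `verdict` and `body`. A sketch of both as plain dicts ready for JSON submission — the file path and message text are made up for illustration, the field names and enum values come from the table above:

```python
# A flag_issue action: requires filename, line_number, category, severity
flag_action = {
    "action_type": "flag_issue",
    "body": "User input interpolated into a SQL string without parameterization",
    "filename": "api/search.py",   # illustrative path
    "line_number": 14,
    "category": "security",
    "severity": "critical",
}

# A terminal action: requires a verdict and ends the episode
verdict_action = {
    "action_type": "request_changes",
    "body": "Blocking: the SQL injection must be fixed before merge",
    "verdict": "request_changes",
}
```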
Each `step()` returns a typed `Reward` object:

| Field | Type | Description |
|---|---|---|
| `value` | `float` | Normalised score (0.0–1.0) |
| `reason` | `str` | Human-readable explanation of the reward |
| `is_terminal` | `bool` | `True` on the final step of an episode |
Reward shaping: Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur −0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
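The shaping rules above can be sketched as a small function. The severity weights and the −0.05 penalty come from the text; the function itself is an illustration, not the environment's actual grader:

```python
# Severity weights from the reward-shaping rules
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}


def shaped_reward(matched: bool, severity: str, noise_budget: int):
    """Return (reward, remaining_noise_budget) for one flag_issue action."""
    if matched:
        # Correct flags score by severity and leave the budget untouched
        return SEVERITY_WEIGHTS[severity], noise_budget
    # False positives and duplicates: fixed penalty plus one budget credit
    return -0.05, noise_budget - 1


# A hallucinated flag burns one of the five credits; zero ends the episode
reward, budget = shaped_reward(False, "low", 5)
terminated = budget == 0
```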
- Predictable State Management: The `reset()` and `step()` functions are strictly idempotent for a given task/seed pair, ensuring 100% reproducible episodes.
- Dense Reward Signal: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action, from the first issue flagged to the final verdict, produces a typed `Reward` object with a human-readable rationale, accelerating agent learning (process supervision).
- Novelty — The Reviewer Trust Mechanic: The noise budget (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it exhausts the budget and the episode terminates, penalizing high-volume, low-precision behavior.
Bug Detection: `Score = 0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`.
Issues are scored on keyword accuracy (50%) and severity matching (50%).

Security Audit: `Score = avg(per_issue_score)`, where each issue scores `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a CRITICAL issue as LOW incurs a major penalty.

Architectural Review: `Score = 0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.
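A worked example of the three graders' arithmetic — the weights and formulas come from the text above, while the per-component scores (coverage, severity accuracy, and so on) are made-up inputs for illustration:

```python
# Bug detection: 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate
bug = 0.4 * 0.8 + 0.6 * 0.75 - 0.1 * 0.2

# Security audit: mean over flagged issues of
# 0.7 * severity_accuracy + 0.3 * keyword_coverage
issues = [(1.0, 0.6), (0.5, 0.9)]  # (severity_accuracy, keyword_coverage) pairs
sec = sum(0.7 * s + 0.3 * k for s, k in issues) / len(issues)

# Architectural review: 0.6 * detection_rate + 0.2 * verdict_accuracy
#                       + 0.2 * detail_quality
arch = 0.6 * 0.5 + 0.2 * 1.0 + 0.2 * 0.4
```

With these inputs the bug-detection score lands at 0.75, showing how a moderate false-positive rate drags down an otherwise strong review.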
Every episode permits 5 false positive credits. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):
| Task | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
|---|---|---|---|---|
| `bug_detection` | 0.3577 | 0.9167 | 0.0000 | 40% |
| `security_audit` | 0.1850 | 1.0000 | 0.0000 | 20% |
| `architectural_review` | 0.2930 | 0.6640 | 0.0000 | 40% |
| Overall | 0.2786 | — | — | 33% |
Agent: `KeywordAgent` (heuristic, 35+ rules) — see `scripts/baseline.py`

Reproduce: `python scripts/evaluate.py --agent keyword --output results.json`
These scores represent a deterministic lower bound. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform this baseline.
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | `/reset` | Optional | Start a new evaluation episode |
| POST | `/step/{id}` | Optional | Submit a review action (`flag_issue`, `approve`, …) |
| GET | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
| GET | `/leaderboard` | None | Paginated performance rankings |
| POST | `/submit` | Optional | Persist an episode result to the leaderboard |
| GET | `/stats` | None | Aggregate statistics across all agents |
| GET | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
| GET | `/dashboard` | None | Interactive real-time dashboard |
| GET | `/health` | None | System status and health check |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
```bash
docker compose up -d
# View logs: docker compose logs -f
```

```bash
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
```

```bash
docker compose -f docker-compose.test.yml up
```

```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```

```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json

# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
```

CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
```python
import requests

API = "http://localhost:7860"

# Start a new episode
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
episode_id = resp.json()["episode_id"]

done = False
while not done:
    # Your agent logic analyzes the diff here
    action = {
        "action_type": "flag_issue",
        "body": "Identified a vulnerability at line 14",
        "filename": "api/search.py",
        "line_number": 14,
        "severity": "critical",
        "category": "security",
    }
    result = requests.post(f"{API}/step/{episode_id}", json=action).json()
    done = result["done"]

# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```

```
open-ev-code-handler/
├── app.py                 # FastAPI application (9 endpoints)
├── codelens_env/          # Core evaluation logic
│   ├── database.py        # SQLModel persistence layer
│   ├── env.py             # Episode state machine
│   ├── models.py          # Pydantic v2 data models
│   ├── scenarios.py       # 30 synthetic PR scenarios
│   └── graders/           # Grader implementations (Bug, Sec, Arch)
├── scripts/               # CLI tools (baseline, evaluate, migrate)
├── static/                # Compiled dashboard assets
├── tests/                 # 155+ parametrized tests
├── Dockerfile             # Multi-stage, non-root build
├── docker-compose.yml     # Production orchestration
└── openenv.yaml           # CodeLens v2 specification
```
```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Automated tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env

# Linter check
pylint codelens_env/ app.py

# Scenario sanity check
PYTHONPATH=. python scripts/validate.py
```

CodeLens is authored and maintained by:
Please see CONTRIBUTING.md for details on authoring new scenarios and submission standards.
This project is licensed under the MIT License.