feat: evaluation module (Goal 5) with IoU-based matching and CI #25

Open
bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/eval-module-goal5

bhuvan-somisetty commented May 15, 2026

Every pipeline in this repo detects audio events and outputs captions, but none of them can answer the question that matters most: how accurate are those captions? This PR adds the missing feedback loop.

What changed

src/eval/evaluator.py - a standalone evaluation module with no ML dependencies. You give it two lists of events (what the pipeline detected, and what a human annotator marked as correct), and it tells you how well the pipeline did.

The matching logic is label-aware IoU: a detected [alarm] event only matches a ground truth [alarm] event if their temporal IoU (intersection over union of the two time intervals) is at least 0.5 (configurable). Greedy assignment ensures each event is counted at most once - so a single detection can't inflate the TP count by matching multiple ground truth entries.
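To make the matching rule concrete, here is a minimal sketch of label-aware greedy IoU matching. It is not the code in src/eval/evaluator.py: the function names `iou` and `greedy_match`, the dict-based event shape, and the choice to visit candidate pairs in descending IoU order are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Event = Dict[str, object]  # assumed shape: {"label": str, "start_s": float, "end_s": float}


def iou(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Temporal intersection over union of two intervals (0.0 if disjoint)."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(detected: List[Event], ground_truth: List[Event],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (detected_index, ground_truth_index) pairs, each index used at most once."""
    candidates = []
    for i, d in enumerate(detected):
        for j, g in enumerate(ground_truth):
            if d["label"] != g["label"]:
                continue  # label-aware: different labels never match
            score = iou(d["start_s"], d["end_s"], g["start_s"], g["end_s"])
            if score >= threshold:
                candidates.append((score, i, j))

    # Assumption: take the best-overlapping pairs first.
    candidates.sort(key=lambda c: -c[0])

    used_detected, used_truth, pairs = set(), set(), []
    for _, i, j in candidates:
        if i in used_detected or j in used_truth:
            continue  # one of the two events is already matched
        used_detected.add(i)
        used_truth.add(j)
        pairs.append((i, j))
    return pairs


detected = [
    {"label": "[alarm]", "start_s": 4.0, "end_s": 6.0},
    {"label": "[alarm]", "start_s": 4.5, "end_s": 6.5},  # competes for the same GT event
    {"label": "[music]", "start_s": 4.0, "end_s": 6.0},  # wrong label, never matches
]
ground_truth = [{"label": "[alarm]", "start_s": 4.0, "end_s": 6.0}]
print(greedy_match(detected, ground_truth))  # [(0, 0)]
```

In this example the second [alarm] detection also clears the threshold (IoU 0.6), but the ground truth event is already taken, so it would count as a false positive rather than a second true positive.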

Metrics computed:

| Metric | What it tells you |
| --- | --- |
| Precision | Of all captions the pipeline fired, how many were correct |
| Recall | Of all real events in the clip, how many the pipeline caught |
| F1 | Harmonic mean of precision and recall - the single number to optimise against |
| Overcaption rate | FP / total detected - how often the pipeline fires for nothing |
| Per-label breakdown | All four metrics, isolated per caption label |
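As a rough illustration of how these metrics follow from the TP, FP, and FN counts, the sketch below computes all four. The function name `compute_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions; the PR only states that the zero-case scenarios are tested, not which values they produce.

```python
from typing import Dict


def compute_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1 and overcaption rate from raw match counts."""
    total_detected = tp + fp   # everything the pipeline fired
    total_actual = tp + fn     # everything the annotator marked

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / total_detected if total_detected else 0.0
    recall = tp / total_actual if total_actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    overcaption_rate = fp / total_detected if total_detected else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "overcaption_rate": overcaption_rate,
    }


# 3 correct captions, 1 spurious caption, 2 missed events
print(compute_metrics(tp=3, fp=1, fn=2))
```

With FP / total detected as the definition, the overcaption rate equals 1 - precision whenever the pipeline fired at least once.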

The per-label breakdown matters for this project specifically. A [gunshot] that slips through (low recall) is a very different problem from a [music] that fires constantly (high overcaption). Treating them as one aggregate number hides where the pipeline actually needs work.
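One way the per-label isolation could work is to split the TP/FP/FN counts by label before computing any ratios. The sketch below assumes matched pairs are given as (detected index, ground truth index) tuples; the name `per_label_counts` and the input shapes are hypothetical and not taken from the module.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_label_counts(detected: List[dict], ground_truth: List[dict],
                     pairs: List[Tuple[int, int]]) -> Dict[str, Dict[str, int]]:
    """Split TP/FP/FN by label, given label-aware matched index pairs."""
    tp = Counter(detected[i]["label"] for i, _ in pairs)
    n_detected = Counter(d["label"] for d in detected)
    n_truth = Counter(g["label"] for g in ground_truth)

    counts = {}
    for label in sorted(set(n_detected) | set(n_truth)):
        counts[label] = {
            "tp": tp[label],
            "fp": n_detected[label] - tp[label],  # fired, but unmatched
            "fn": n_truth[label] - tp[label],     # annotated, but missed
        }
    return counts


detected = [{"label": "[gunshot]"}, {"label": "[music]"},
            {"label": "[music]"}, {"label": "[music]"}]
ground_truth = [{"label": "[gunshot]"}, {"label": "[gunshot]"}, {"label": "[music]"}]
pairs = [(0, 0), (1, 2)]  # one gunshot and one music detection matched

print(per_label_counts(detected, ground_truth, pairs))
# {'[gunshot]': {'tp': 1, 'fp': 0, 'fn': 1}, '[music]': {'tp': 1, 'fp': 2, 'fn': 0}}
```

Here [gunshot] has a recall problem (one missed event) while [music] has an overcaption problem (two spurious captions), which an aggregate count of TP=2, FP=2, FN=1 would not distinguish.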

load_ground_truth(path) reads annotation files in a simple JSON format that any annotator can produce without special tooling:

[
  {"label": "[alarm]",    "start_s": 4.32,  "end_s": 6.72},
  {"label": "[gunshot]",  "start_s": 12.00, "end_s": 12.96},
  {"label": "[firecrackers]", "start_s": 31.5,  "end_s": 33.0}
]
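For readers who want to see how such a file might be parsed, here is a minimal loader sketch. It is named `load_ground_truth_sketch` to make clear it is not the PR's load_ground_truth; the only behaviour taken from the PR description is the JSON shape and the coercion of string timestamps to floats.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def load_ground_truth_sketch(path: Union[str, Path]) -> List[Dict[str, object]]:
    """Read a JSON list of annotated events and normalise the field types."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    events = []
    for entry in raw:
        events.append({
            "label": str(entry["label"]),
            # float() accepts both 4.32 and "4.32", so annotators may quote timestamps
            "start_s": float(entry["start_s"]),
            "end_s": float(entry["end_s"]),
        })
    return events
```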

EvalReport.to_dict() serialises results to a plain dict (JSON-safe), so reports can be written to disk alongside the SRT output for each pipeline run.
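A report object with a JSON-safe to_dict() could look roughly like the sketch below. The class name `EvalReportSketch` and its fields are assumptions chosen to mirror the metrics table; the actual EvalReport may be structured differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class EvalReportSketch:
    """Hypothetical stand-in for EvalReport; field names are illustrative."""
    precision: float
    recall: float
    f1: float
    overcaption_rate: float
    per_label: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def to_dict(self) -> dict:
        # asdict() recurses into nested containers, leaving only plain
        # dicts, strings and floats, which json.dumps can handle directly.
        return asdict(self)


report = EvalReportSketch(
    precision=0.75, recall=0.6, f1=2 / 3, overcaption_rate=0.25,
    per_label={"[alarm]": {"precision": 1.0, "recall": 0.5}},
)
print(json.dumps(report.to_dict(), indent=2))
```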

CI

.github/workflows/ci.yml runs pytest on Python 3.10, 3.11, and 3.12 for every push and pull request. Tests install from requirements-dev.txt (just pytest) - no TensorFlow, no OpenCV, no MediaPipe needed in CI. This keeps the feedback loop fast and means any contributor can run pip install -r requirements-dev.txt && pytest to verify their changes locally before pushing.
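A workflow matching that description might look like the following. This is a sketch reconstructed from the description, not the contents of the PR's ci.yml: the workflow name, job name, runner, and action versions are assumptions.

```yaml
# Hypothetical sketch of .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # Only the test dependency is installed; no ML packages are needed.
      - run: pip install -r requirements-dev.txt
      - run: pytest
```

The version strings are quoted so that YAML does not read 3.10 as the number 3.1.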

This is the first CI configuration for this repo. All future PRs will get automatic test feedback.

Tests

31 tests cover IoU calculation edge cases, greedy matching with competing detections, all zero-case scenarios (empty detected, empty GT, both empty), the overcaption rate formula, per-label isolation, threshold sensitivity, JSON serialisation, and the annotation loader including string-coercion of timestamps.

31 passed in 0.11s

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>