feat: evaluation module (Goal 5) with IoU-based matching and CI by bhuvan-somisetty · Pull Request #25 · PlanetRead/Intelligent-cc-generation

bhuvan-somisetty · 2026-05-15T18:50:15Z

Every pipeline in this repo detects audio events and outputs captions, but none of them can answer the question that matters most: how accurate are those captions? This PR adds the missing feedback loop.

What changed

src/eval/evaluator.py - a standalone evaluation module with no ML dependencies. You give it two lists of events (what the pipeline detected, and what a human annotator marked as correct), and it tells you how well the pipeline did.

The matching logic is label-aware IoU: a detected [alarm] event only matches a ground truth [alarm] event if the intersection-over-union (IoU) of their time spans is at least 0.5 (the threshold is configurable). Greedy assignment ensures each event is counted at most once, so a single detection can't inflate the TP count by matching multiple ground truth entries.
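To make the matching rule concrete, here is a minimal sketch of label-aware greedy IoU matching. It is not a copy of src/eval/evaluator.py: the names Event, temporal_iou and greedy_match are hypothetical, and taking the highest-IoU pairs first is an assumption about how the greedy order could work.

```python
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical event type, mirroring the fields of the annotation format.
    label: str
    start_s: float
    end_s: float


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two time intervals (0.0 if disjoint)."""
    inter = max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))
    union = (a.end_s - a.start_s) + (b.end_s - b.start_s) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(detected, ground_truth, iou_threshold=0.5):
    """Return (tp, fp, fn) using label-aware, one-to-one greedy matching."""
    # Collect every same-label pair whose IoU clears the threshold.
    candidates = []
    for di, det in enumerate(detected):
        for gi, gt in enumerate(ground_truth):
            if det.label == gt.label:
                iou = temporal_iou(det, gt)
                if iou >= iou_threshold:
                    candidates.append((iou, di, gi))

    # Assumption: best overlaps are assigned first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_det, used_gt = set(), set()
    for _, di, gi in candidates:
        # Each detection and each ground truth entry is used at most once.
        if di not in used_det and gi not in used_gt:
            used_det.add(di)
            used_gt.add(gi)

    tp = len(used_det)
    return tp, len(detected) - tp, len(ground_truth) - tp


detected = [
    Event("[alarm]", 4.0, 7.0),
    Event("[alarm]", 4.5, 6.5),   # competes for the same ground truth entry
    Event("[music]", 4.3, 6.7),   # overlaps in time, but the label differs
]
ground_truth = [Event("[alarm]", 4.32, 6.72)]
print(greedy_match(detected, ground_truth))  # (1, 2, 0)
```

Only one of the two [alarm] detections is credited as a TP; the other, and the [music] detection, count as false positives.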

Metrics computed:

Metric	What it tells you
Precision	Of all captions the pipeline fired, how many were correct
Recall	Of all real events in the clip, how many the pipeline caught
F1	Harmonic mean of precision and recall; the single number to optimise against
Overcaption rate	FP / total detected: the share of fired captions that match no real event
Per-label breakdown	All four metrics, isolated per caption label
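For reference, the metrics in the table can all be derived from the TP/FP/FN counts. The sketch below is illustrative only: the function name summarise is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not necessarily what the module does.

```python
def summarise(tp: int, fp: int, fn: int) -> dict:
    """Compute the aggregate metrics from match counts."""

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)           # correct captions / fired captions
    recall = ratio(tp, tp + fn)              # caught events / real events
    f1 = ratio(2 * precision * recall, precision + recall)
    overcaption_rate = ratio(fp, tp + fp)    # FP / total detected
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "overcaption_rate": overcaption_rate,
    }


print(summarise(tp=3, fp=1, fn=2))
```

Whenever the pipeline fires at least one caption, the overcaption rate equals 1 - precision, since both share the denominator TP + FP.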

The per-label breakdown matters for this project specifically. A [gunshot] that slips through (low recall) is a very different problem from a [music] that fires constantly (high overcaption). Treating them as one aggregate number hides where the pipeline actually needs work.
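A small worked example of why the aggregate hides this. The sketch assumes match outcomes are already available as (label, kind) pairs; per_label_breakdown and that input shape are hypothetical, chosen only to keep the example short.

```python
from collections import defaultdict


def per_label_breakdown(outcomes):
    """outcomes: iterable of (label, kind) pairs, kind in {"tp", "fp", "fn"}."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for label, kind in outcomes:
        counts[label][kind] += 1

    report = {}
    for label, c in counts.items():
        fired = c["tp"] + c["fp"]    # captions the pipeline emitted for this label
        actual = c["tp"] + c["fn"]   # annotated events carrying this label
        precision = c["tp"] / fired if fired else 0.0
        recall = c["tp"] / actual if actual else 0.0
        denom = precision + recall
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / denom if denom else 0.0,
            "overcaption_rate": c["fp"] / fired if fired else 0.0,
        }
    return report


# Made-up counts: [gunshot] misses 3 of 4 events, [music] fires 6 times for nothing.
outcomes = (
    [("[gunshot]", "tp")] + [("[gunshot]", "fn")] * 3
    + [("[music]", "tp")] * 2 + [("[music]", "fp")] * 6
)
for label, metrics in per_label_breakdown(outcomes).items():
    print(label, metrics)
```

With these illustrative counts, [gunshot] has perfect precision but recall of 0.25, while [music] has perfect recall but an overcaption rate of 0.75. Pooled together (TP=3, FP=6, FN=3), neither failure mode is visible on its own.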

load_ground_truth(path) reads annotation files in a simple JSON format that any annotator can produce without special tooling:

[
  {"label": "[alarm]",        "start_s": 4.32,  "end_s": 6.72},
  {"label": "[gunshot]",      "start_s": 12.00, "end_s": 12.96},
  {"label": "[firecrackers]", "start_s": 31.5,  "end_s": 33.0}
]
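A loader for this format can be very small. The sketch below is an assumption about the behaviour, not the actual load_ground_truth code: read_annotations is a hypothetical name, the float() coercion reflects the string-timestamp handling mentioned under Tests, and the end-before-start check is an extra illustrative guard.

```python
import json
from pathlib import Path


def read_annotations(path):
    """Read a JSON list of {"label", "start_s", "end_s"} objects."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    events = []
    for i, item in enumerate(raw):
        # float() accepts both numbers and numeric strings such as "4.32".
        start = float(item["start_s"])
        end = float(item["end_s"])
        if end < start:
            # Illustrative guard; the real loader may validate differently.
            raise ValueError(f"entry {i}: end_s {end} is before start_s {start}")
        events.append({"label": str(item["label"]), "start_s": start, "end_s": end})
    return events
```

Coercing at load time means an annotator's spreadsheet export that quotes its numbers still evaluates correctly.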

EvalReport.to_dict() serialises results to a plain dict (JSON-safe), so reports can be written to disk alongside the SRT output for each pipeline run.
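As an illustration of that workflow, here is a sketch of a report object with a to_dict() method and a helper that writes the result next to an SRT file. ReportSketch, its fields, write_report and the .eval.json file name are all assumptions made for the example; the real EvalReport may be shaped differently.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ReportSketch:
    # Hypothetical fields; stand-ins for whatever EvalReport actually holds.
    precision: float
    recall: float
    f1: float
    overcaption_rate: float
    per_label: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        # asdict() recursively converts the dataclass into plain dicts.
        return asdict(self)


def write_report(report: ReportSketch, srt_path) -> Path:
    """Write the report as JSON beside the SRT file, e.g. clip.srt -> clip.eval.json."""
    srt_path = Path(srt_path)
    out_path = srt_path.with_name(srt_path.stem + ".eval.json")
    out_path.write_text(json.dumps(report.to_dict(), indent=2), encoding="utf-8")
    return out_path
```

Keeping the report a plain dict of floats and strings is what makes the json.dumps call safe without a custom encoder.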

CI

.github/workflows/ci.yml runs pytest on Python 3.10, 3.11, and 3.12 for every push and pull request. Tests install from requirements-dev.txt (just pytest) - no TensorFlow, no OpenCV, no MediaPipe needed in CI. This keeps the feedback loop fast and means any contributor can run pip install -r requirements-dev.txt && pytest to verify their changes locally before pushing.

This is the first CI configuration for this repo. All future PRs will get automatic test feedback.

Tests

31 tests cover IoU calculation edge cases, greedy matching with competing detections, all zero-case scenarios (empty detected, empty GT, both empty), the overcaption rate formula, per-label isolation, threshold sensitivity, JSON serialisation, and the annotation loader including string-coercion of timestamps.

31 passed in 0.11s

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>

feat: add evaluation module (Goal 5) with IoU matching and CI workflow

cdc8bab

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>

bhuvan-somisetty mentioned this pull request May 16, 2026

feat: syllabic caption localizer for 9 Indian languages with RCI accessibility scoring #29

Open

bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/eval-module-goal5
