feat: evaluation module (Goal 5) with IoU-based matching and CI #25
Open
bhuvan-somisetty wants to merge 1 commit into
Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>
Every pipeline in this repo detects audio events and outputs captions, but none of them can answer the question that matters most: how accurate are those captions? This PR adds the missing feedback loop.
What changed
src/eval/evaluator.py - a standalone evaluation module with no ML dependencies. You give it two lists of events (what the pipeline detected, and what a human annotator marked as correct), and it tells you how well the pipeline did.
The matching logic is label-aware IoU: a detected [alarm] event only matches a ground truth [alarm] event if they overlap in time by at least 50% (configurable). Greedy assignment ensures each event is counted at most once - so a single detection can't inflate the TP count by matching multiple ground truth entries.
Metrics computed:
The per-label breakdown matters for this project specifically. A [gunshot] that slips through (low recall) is a very different problem from a [music] that fires constantly (high overcaption). Treating them as one aggregate number hides where the pipeline actually needs work.
load_ground_truth(path) reads annotation files in a simple JSON format that any annotator can produce without special tooling:
    [
      {"label": "[alarm]", "start_s": 4.32, "end_s": 6.72},
      {"label": "[gunshot]", "start_s": 12.00, "end_s": 12.96},
      {"label": "[firecrackers]", "start_s": 31.5, "end_s": 33.0}
    ]
EvalReport.to_dict() serialises results to a plain dict (JSON-safe), so reports can be written to disk alongside the SRT output for each pipeline run.
CI
.github/workflows/ci.yml runs pytest on Python 3.10, 3.11, and 3.12 for every push and pull request. Tests install from requirements-dev.txt (just pytest) - no TensorFlow, no OpenCV, no MediaPipe needed in CI. This keeps the feedback loop fast and means any contributor can run pip install -r requirements-dev.txt && pytest to verify their changes locally before pushing.
This is the first CI configuration for this repo. All future PRs will get automatic test feedback.
Tests
31 tests cover IoU calculation edge cases, greedy matching with competing detections, all zero-case scenarios (empty detected, empty GT, both empty), the overcaption rate formula, per-label isolation, threshold sensitivity, JSON serialisation, and the annotation loader including string-coercion of timestamps.
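For reviewers who want a concrete picture of the label-aware IoU matching described under What changed, here is a minimal sketch. It is not the code in src/eval/evaluator.py: the names Event, temporal_iou, and match_events are hypothetical, and the highest-IoU-first greedy order is an assumption, since the description only says the assignment is greedy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical container; field names mirror the annotation JSON keys.
    label: str
    start_s: float
    end_s: float


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two time intervals (0.0 if disjoint)."""
    intersection = max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))
    union = (a.end_s - a.start_s) + (b.end_s - b.start_s) - intersection
    return intersection / union if union > 0 else 0.0


def match_events(detected, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of same-label events above the threshold."""
    # Collect every same-label pair whose IoU clears the threshold.
    candidates = []
    for d_idx, det in enumerate(detected):
        for g_idx, gt in enumerate(ground_truth):
            if det.label != gt.label:
                continue
            iou = temporal_iou(det, gt)
            if iou >= iou_threshold:
                candidates.append((iou, d_idx, g_idx))

    # Assumption: best overlaps are assigned first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_det, used_gt, pairs = set(), set(), []
    for _, d_idx, g_idx in candidates:
        # Skip if either side is already matched, so nothing counts twice.
        if d_idx in used_det or g_idx in used_gt:
            continue
        used_det.add(d_idx)
        used_gt.add(g_idx)
        pairs.append((d_idx, g_idx))

    tp = len(pairs)
    return {
        "tp": tp,
        "fp": len(detected) - tp,      # detections with no match
        "fn": len(ground_truth) - tp,  # ground truth events that were missed
        "pairs": pairs,
    }


# Two competing [alarm] detections, one wrong-label detection, one missed event.
detected = [
    Event("[alarm]", 4.0, 6.5),
    Event("[alarm]", 4.4, 6.8),
    Event("[music]", 4.3, 6.7),
]
ground_truth = [
    Event("[alarm]", 4.32, 6.72),
    Event("[gunshot]", 12.00, 12.96),
]
result = match_events(detected, ground_truth)
print(result)
```

In this example only one of the two [alarm] detections can claim the ground truth [alarm]; the other becomes a false positive, as does the [music] detection, which overlaps in time but has the wrong label. The unmatched [gunshot] is a false negative.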