Intelligent Audio-Visual CC Suggestion System (Goals 1 & 2) by Siddharth-732 · Pull Request #22 · PlanetRead/Intelligent-cc-generation

Siddharth-732 · 2026-05-12T17:28:38Z

closes #2

Summary

This PR implements the middle phase of the Intelligent CC Suggestion Tool. The goal is to automatically identify meaningful non-speech audio events and use the accompanying visuals in the video to decide whether each event is important enough to require a caption.

Key Features

Robust Audio Classification (Goal 1)

Semantic Label Grouping: Maps 500+ YAMNet classes into 20+ canonical CC categories (e.g., [Gunshot/Explosion], [Animal Sound]).
Multi-Label Probability Summing: Aggregates scores across similar acoustic classes to provide stable, confident labels.
Ambient Noise Filtering: Automatically suppresses low-value background sounds (e.g., "Inside, small room").
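
The grouping, summing, and filtering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual mapping: the `LABEL_TO_CATEGORY` table, the `AMBIENT_LABELS` set, the `group_scores` helper, and the 0.3 threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping from fine-grained class names to canonical CC labels.
# The PR maps 500+ YAMNet classes; these few entries are illustrative only.
LABEL_TO_CATEGORY = {
    "Gunshot, gunfire": "[Gunshot/Explosion]",
    "Explosion": "[Gunshot/Explosion]",
    "Dog": "[Animal Sound]",
    "Bark": "[Animal Sound]",
}

# Hypothetical set of low-value background classes to suppress.
AMBIENT_LABELS = {"Inside, small room", "Silence"}


def group_scores(class_scores, threshold=0.3):
    """Sum per-class scores into categories; keep categories at or above threshold."""
    totals = defaultdict(float)
    for label, score in class_scores.items():
        if label in AMBIENT_LABELS:
            continue  # ambient noise filtering
        category = LABEL_TO_CATEGORY.get(label)
        if category is not None:
            totals[category] += score  # multi-label probability summing
    return {cat: total for cat, total in totals.items() if total >= threshold}
```

With this sketch, two weak but related scores such as "Dog" at 0.2 and "Bark" at 0.25 combine into a single `[Animal Sound]` score of about 0.45, which clears the assumed threshold even though neither class would on its own.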

Intelligent Reaction Detection (Goal 2)

Hybrid Vision Tracking: Uses MediaPipe Pose for full-body tracking and automatically falls back to FaceMesh for close-up head/shoulder shots.
Scale-Invariant Movement: Calculates movement relative to the person's height, ensuring accuracy regardless of distance from the camera.
Baseline Comparison: Compares visual state against a pre-event baseline to catch subtle flinches, head-turns, or "startle" reactions.
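
As a rough illustration of the scale-invariant baseline comparison described above, the sketch below divides the mean landmark displacement by the person's height. The function names, the landmark layout (one `(x, y)` row per landmark), and the 0.05 threshold are assumptions for the example and are not taken from the PR.

```python
import numpy as np


def normalized_movement(baseline_pts, event_pts, person_height):
    """Mean landmark displacement, expressed as a fraction of the person's height."""
    baseline = np.asarray(baseline_pts, dtype=np.float64)
    event = np.asarray(event_pts, dtype=np.float64)
    if person_height <= 0:
        return 0.0  # no usable height estimate, so report no movement
    displacement = np.linalg.norm(event - baseline, axis=1)
    return float(displacement.mean() / person_height)


def is_reaction(baseline_pts, event_pts, person_height, threshold=0.05):
    """Flag a reaction when normalized movement exceeds an assumed threshold."""
    return normalized_movement(baseline_pts, event_pts, person_height) > threshold
```

Because the displacement is divided by height, the same physical flinch should yield roughly the same value whether the person fills the frame or stands far from the camera.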

How to Verify

To ensure a clean run, please follow these steps to set up a local environment:

System Requirements

Python 3.10 or 3.11
FFmpeg: Must be available in your system path.
  Windows: winget install ffmpeg
  Linux: sudo apt install ffmpeg

Environment Setup

# Clone and enter the directory
git clone [your-repo-link]
cd Intelligent-cc-generation

# Create a fresh virtual environment
python -m venv .venv

# Activate the environment
# Windows:
.\.venv\Scripts\activate

# Linux/Mac:
source .venv/bin/activate

# Install dependencies (includes specific numpy version fix)
pip install -r requirements.txt
Run the Detection Modules

Test Sound Event Detection (Goal 1): python test_goal_1.py "path/to/your/video.mp4"
Test Full Audio-Visual Reaction Sync (Goal 2): python test_goal_2.py "path/to/your/video.mp4"

Note to Maintainer: The test_goal_2.py script automatically utilizes the output from Goal 1 and analyzes the visual frames around those specific timestamps.
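
One way to picture "analyzing the visual frames around those specific timestamps" is to convert each audio event's time span into a frame index range. The sketch below is only an assumption about how that could work: `frame_window`, the one-second padding on each side, and the parameter names are hypothetical and not taken from test_goal_2.py.

```python
def frame_window(event_start, event_end, fps, pre_sec=1.0, post_sec=1.0, total_frames=None):
    """Return (first_frame, last_frame) covering an event plus padding on both sides."""
    first = max(0, int(round((event_start - pre_sec) * fps)))
    last = int(round((event_end + post_sec) * fps))
    if total_frames is not None:
        last = min(last, total_frames - 1)  # clamp to the end of the video
    return first, last
```

The frames before `event_start` would serve as the pre-event baseline, and the frames after it as the candidate reaction.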

Screenshot of Results

Screenshot explanation: after the first test, the pipeline creates an output folder containing the video's extracted audio as sample.wav. That audio is then analyzed and the results are printed to the terminal; in this run, it detects 3 events.

Copilot

Pull request overview

This PR introduces the “Goal 1” sound event detection pipeline for the Intelligent CC Suggestion Tool, adding YAMNet-based audio event detection plus basic audio utilities, extraction, documentation, and initial tests.

Changes:

Added cc_tool.audio package: FFmpeg-based audio extraction, waveform chunking/normalization, YAMNet inference, and simple adjacent-event merging.
Added a manual run script (test_goal_1.py), README setup/usage guidance, and Python dependencies.
Added initial pytest suites for audio utilities/extractor, plus decision/export tests that currently reference missing modules.
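
For readers unfamiliar with adjacent-event merging, a minimal sketch is shown below. It is an assumption about the general technique, not the code in `cc_tool/audio/detector.py`: the `Event` dataclass is a hypothetical stand-in for the PR's `AudioEvent`, and `merge_adjacent` with its `max_gap` parameter is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:  # hypothetical stand-in for the PR's AudioEvent
    label: str
    start: float
    end: float
    score: float


def merge_adjacent(events, max_gap=0.0):
    """Merge same-label events whose windows overlap or lie within max_gap seconds."""
    merged = []
    last_by_label = {}  # most recent merged event for each label
    for ev in sorted(events, key=lambda e: e.start):
        prev = last_by_label.get(ev.label)
        if prev is not None and ev.start - prev.end <= max_gap:
            prev.end = max(prev.end, ev.end)        # extend the existing event
            prev.score = max(prev.score, ev.score)  # keep the strongest score
        else:
            current = Event(ev.label, ev.start, ev.end, ev.score)
            merged.append(current)
            last_by_label[ev.label] = current
    return merged
```

Since the detector uses overlapping windows, a single sustained sound usually fires in several consecutive windows; merging collapses those into one event so that only one caption is suggested.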

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `cc_tool/audio/detector.py` | Implements YAMNet loading, inference over chunks, and event merging. |
| `cc_tool/audio/extractor.py` | Adds FFmpeg subprocess wrapper for extracting 16kHz mono WAV. |
| `cc_tool/audio/models.py` | Adds `AudioEvent` dataclass and serialization helper. |
| `cc_tool/audio/utils.py` | Adds chunking and waveform normalization helpers. |
| `cc_tool/audio/__init__.py` | Exposes audio API via package exports. |
| `cc_tool/__init__.py` | Marks `cc_tool` as a package. |
| `tests/test_audio/test_detector.py` | Adds tests for chunking and normalization utilities. |
| `tests/test_audio/test_extractor.py` | Adds a basic missing-input test for extractor. |
| `tests/test_audio/__init__.py` | Marks audio tests as a package. |
| `tests/test_decision/test_combiner.py` | Adds tests for a decision combiner that is not present in this PR/repo. |
| `tests/test_decision/test_srt_writer.py` | Adds tests for an SRT writer/models that are not present in this PR/repo. |
| `tests/test_decision/__init__.py` | Marks decision tests as a package. |
| `test_goal_1.py` | Adds a manual run script, but is currently named like a pytest test module. |
| `requirements.txt` | Adds TensorFlow/TFHub and other dependencies (includes currently-unused `pydantic`). |
| `README.md` | Documents setup and example usage for Goal 1. |
| `.gitignore` | Ignores Python artifacts, venvs, and `outputs/` directory. |


+def chunk_audio(waveform, sample_rate, window_sec=1.0, stride_sec=0.5):
+    """Split audio into overlapping windows for YAMNet."""
+    window_samples = int(window_sec * sample_rate)
+    stride_samples = int(stride_sec * sample_rate)
+    total_samples = len(waveform)
+
+    chunks = []
+    start = 0
+    while start + window_samples <= total_samples:
+        end = start + window_samples
+        start_sec = start / sample_rate
+        end_sec = end / sample_rate
+        chunks.append((start_sec, end_sec, waveform[start:end]))
+        start += stride_samples
+    return chunks
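
Note that with the loop condition `start + window_samples <= total_samples`, any trailing audio shorter than one window is never classified. For a 2.2 s clip at 16 kHz with the default arguments, windows start at 0.0 s, 0.5 s, and 1.0 s, so the final 0.2 s is skipped, and a clip shorter than 1 s yields no chunks at all. If that matters, one possible variant zero-pads the last partial window. The sketch below is a suggestion only; `chunk_audio_padded` is a hypothetical name and is not part of the PR.

```python
import numpy as np


def chunk_audio_padded(waveform, sample_rate, window_sec=1.0, stride_sec=0.5):
    """Like chunk_audio, but zero-pads a final partial window instead of dropping it."""
    window = int(window_sec * sample_rate)
    stride = int(stride_sec * sample_rate)
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one sample")
    total = len(waveform)

    chunks = []
    start = 0
    while start < total:
        segment = waveform[start:start + window]
        if len(segment) < window:
            # Pad the tail with silence so every chunk has the same length.
            segment = np.pad(segment, (0, window - len(segment)))
        end_sec = min(start + window, total) / sample_rate
        chunks.append((start / sample_rate, end_sec, segment))
        if start + window >= total:
            break  # the audio is fully covered
        start += stride
    return chunks
```

The reported end time of the padded chunk is clamped to the real audio length, so downstream timestamps never extend past the end of the clip.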


+def normalize_waveform(waveform):
+    """Normalize audio to [-1.0, 1.0]."""
+    waveform = waveform.astype(np.float32)
+    max_val = np.max(np.abs(waveform))
+    if max_val > 0:
+        waveform = waveform / max_val
+    return waveform


+    def detect(self, wav_path):
+        self._load_model()
+        waveform, sr = sf.read(wav_path, dtype="float32")
+        if waveform.ndim > 1: waveform = waveform.mean(axis=1)
+        waveform = normalize_waveform(waveform)
+
+        chunks = chunk_audio(waveform, sr)
+        raw_events = []


+
+    if output_wav is None:
+        os.makedirs("outputs/audio", exist_ok=True)
+        output_wav = f"outputs/audio/{video_path.stem}.wav"
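
The excerpt above relies on `video_path` being a `pathlib.Path`, since `.stem` is not available on plain strings. A minimal sketch of how a 16 kHz mono extraction could be wired up is shown below. It is an assumption about the overall shape, not the code in `cc_tool/audio/extractor.py`: `build_ffmpeg_command` and `extract_audio` are hypothetical names, and only standard FFmpeg options (`-y`, `-i`, `-vn`, `-ac`, `-ar`) are used.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, output_wav=None, sample_rate=16000):
    """Build the FFmpeg argument list for a mono WAV at the given sample rate."""
    video_path = Path(video_path)  # accept str or Path
    if output_wav is None:
        output_wav = Path("outputs") / "audio" / f"{video_path.stem}.wav"
    output_wav = Path(output_wav)
    cmd = [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", str(video_path),
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample (16 kHz by default)
        str(output_wav),
    ]
    return cmd, output_wav


def extract_audio(video_path, output_wav=None):
    """Run FFmpeg and return the path of the extracted WAV file."""
    video_path = Path(video_path)
    if not video_path.is_file():
        raise FileNotFoundError(video_path)
    cmd, output_wav = build_ffmpeg_command(video_path, output_wav)
    output_wav.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True, capture_output=True)
    return output_wav
```

Separating command construction from execution keeps the path logic testable without FFmpeg installed.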


implement_sound_event_detection_module

56c45a0

Copilot AI review requested due to automatic review settings May 12, 2026 17:28

Copilot started reviewing on behalf of Siddharth-732 May 12, 2026 17:29

Copilot AI reviewed May 12, 2026


finalize Goal 2 with hybrid vision detection and semantic audio mapping

abb0e04

Siddharth-732 changed the title ~~Implement_sound_event_detection_module (Goal 1)~~ Intelligent Audio-Visual CC Suggestion System (Goals 1 & 2) May 14, 2026

Siddharth-732 wants to merge 2 commits into PlanetRead:main from Siddharth-732:feat/implement-sound-event-detection-module
