diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..5d37a97 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +# Demo videos (too large for GitHub) +demo_videos/*.mp4 diff --git a/README.md b/README.md new file mode 100644 index 0000000..1cac3a9 --- /dev/null +++ b/README.md @@ -0,0 +1,1451 @@ +# Intelligent Closed Caption Suggestion Tool + +An AI-powered Python backend and editor review tool for generating meaningful non-speech closed caption suggestions from raw video. + +The project focuses on detecting moments where a non-speech sound meaningfully affects the scene, speaker, or narrative, then suggesting concise SRT captions such as `[horn honks]`, `[glass breaks]`, or `[crowd cheering]`. The goal is to assist accessibility editors without over-captioning routine, ambient, or low-impact sounds. + +## Project Context + +- **Product name:** Intelligent Closed Caption (CC) Suggestion Tool +- **Organisation:** Planet Read +- **Domain:** Education, accessibility, media tooling +- **Category:** Backend, Machine Learning, AI, Computer Vision +- **Primary users:** Accessibility editors, subtitling teams, content review teams +- **Initial content focus:** Hindi and Indian regional-language videos +- **Primary output:** SRT files containing only non-speech CC annotations + +This project does not generate full dialogue subtitles in the first version. It analyzes raw videos and produces non-speech closed caption suggestions only. + +## Implementation Status + +The first runnable Python implementation has been started under [`main/`](main/). It includes the modular package, CLI, diagnostics, mock audio and vision backends, a CPU DSP audio backend, an OpenCV visual baseline, decision engine, multilingual caption labels, Streamlit UI client, and SRT/JSON/CSV exports. 
+ +Current scaffold commands: + +```bash +cd main +python -m cc_suggester doctor +python -m cc_suggester analyze README.md --lang hi --device auto --out outputs +python -m cc_suggester labels +python -m pytest tests +``` + +The mock backends remain for deterministic tests, while `--audio-backend dsp` and `--vision-backend opencv` provide local real-processing baselines. YAMNet is wired as an optional TensorFlow Hub backend, and MediaPipe is wired as an optional pose-based reaction backend. PANNs, AST, BEATs, and richer MediaPipe face/expression scoring remain documented next steps in [`docs/implementation-plan.md`](docs/implementation-plan.md). + +For environments without system ffmpeg, the sample generator can create an OpenCV video plus sidecar WAV file so the DSP/OpenCV path can still be tested locally. + +## Interface Overview + +### Web UI Editor Review Workspace + +The Web UI is built as a modern editor workspace with **warm dark glassmorphism design** and full light/dark theme support. It features: + +- **Interactive video player** with event markers and draggable timeline +- **Real-time review panel** for editing and accepting/rejecting captions +- **Multilingual support** with live caption label switching +- **Device & backend controls** for audio/vision model selection +- **Comprehensive event table** with all confidence scores and reasoning + +#### Dark Mode (Default) — Hindi + +![Web UI Dark Mode with Hindi captions](mockups/hindi.png) + +The warm dark glassmorphism design features: +- Deep amber/charcoal background with warm gold accents +- Frosted glass panels with subtle warm-tinted borders +- Smooth theme toggle (☀/🌙) for light/dark switching + +#### Multilingual Support + +**Telugu:** +![Web UI Telugu](mockups/telugu.png) + +**Malayalam:** +![Web UI Malayalam](mockups/mallu.png) + +Caption labels update live across all panels when language is changed. 
+ +#### Architecture & System Diagram + +```mermaid +flowchart TB + subgraph Inputs["Video Input"] + VIDEO["Raw Video\n(.mp4, .mov, .mkv)"] + end + + subgraph Audio["Audio Analysis"] + direction TB + EXTRACT["Audio Extraction\n(ffmpeg)"] + DSP["DSP Baseline\n(RMS, STFT, Onsets)"] + A_MODELS["Audio ML Backends\n(YAMNet / PANNs / AST)"] + SMOOTH["Event Smoothing\n(Merge, Filter, Normalize)"] + EXTRACT --> DSP --> A_MODELS --> SMOOTH + end + + subgraph Vision["Visual Reaction"] + direction TB + FRAMES["Frame Sampler\n(before / during / after)"] + FLOW["Optical Flow\n(OpenCV)"] + V_MODELS["Vision ML Backends\n(MediaPipe / MMPose)"] + REACT["Reaction Scoring"] + FRAMES --> FLOW --> REACT + FRAMES --> V_MODELS --> REACT + end + + subgraph Decision["Decision Engine"] + direction TB + SCORER["Scorer\n(audio + reaction + importance\n- ambient penalty)"] + LABELS["Caption Labels\n(Glossary per language)"] + SCORER --> LABELS + end + + subgraph Outputs["Exports"] + direction LR + SRT["SRT\n(accepted captions)"] + JSON["JSON\n(full debug report)"] + CSV["CSV\n(reviewer spreadsheet)"] + end + + subgraph Clients["User Interfaces"] + direction LR + CLI["CLI\n(ccs analyze / doctor / export)"] + WEB["Web UI\n(Streamlit editor workspace)"] + end + + VIDEO --> EXTRACT + SMOOTH --> AudioEvents["Audio Event\nCandidates"] + AudioEvents --> FRAMES + AudioEvents --> SCORER + REACT --> SCORER + LABELS --> SRT + SCORER --> JSON + SCORER --> CSV + + CLI --> VIDEO + CLI --> SCORER + WEB --> VIDEO + WEB --> SCORER + + style Inputs fill:#1e1308,stroke:#f59e0b,color:#f0e4cc + style Audio fill:#1e1308,stroke:#f59e0b,color:#f0e4cc + style Vision fill:#1e1308,stroke:#f59e0b,color:#f0e4cc + style Decision fill:#1e1308,stroke:#f59e0b,color:#f0e4cc + style Outputs fill:#1e1308,stroke:#f59e0b,color:#f0e4cc + style Clients fill:#1e1308,stroke:#f59e0b,color:#f0e4cc +``` + +### Demo Video & Sample Data + +Demo video and sample recordings are available on Google Drive: + +[![Demo Video & Sample 
Data](https://drive.google.com/drive/folders/1Ti5aqztP9VHas_5AbrH7utSn-G27HZXW?usp=sharing)](https://drive.google.com/drive/folders/1Ti5aqztP9VHas_5AbrH7utSn-G27HZXW?usp=sharing) + +> **Sample videos & recordings:** [Google Drive folder](https://drive.google.com/drive/folders/1Ti5aqztP9VHas_5AbrH7utSn-G27HZXW?usp=sharing) — contains demo videos, recordings, and sample SRT outputs. + +## Problem Statement + +Accessibility editors currently add non-speech closed caption annotations by hand. This is time-consuming and requires judgment: not every sound should be captioned. + +For example: + +- A horn that causes a speaker to turn around may need `[horn honks]`. +- Constant background traffic may not need any caption. +- A glass breaking off-screen may need `[glass breaks]` if it affects the scene. +- Background music may not need a caption unless it is narratively important. + +The tool should detect candidate sound events, inspect nearby visual reaction cues, decide whether the event is meaningful enough to caption, and export accepted suggestions into an SRT file. + +## Goals + +### Goal 1: Sound Event Detection Module + +Automatically detect and classify non-speech audio events in a given video file with confidence scores and timestamps. + +Expected behavior: + +- Accept a video file as input. +- Extract the audio track. +- Run audio analysis using a pluggable sound event detection backend. +- Detect events such as honking, explosions, laughter, music, glass breaking, alarms, applause, door slams, phone rings, and crowd reactions. +- Produce timestamped audio event candidates with confidence scores. + +Output example: + +```json +{ + "event_id": "horn_honk", + "label": "Horn honk", + "start_time": 12.4, + "end_time": 13.8, + "audio_confidence": 0.87 +} +``` + +### Goal 2: Speaker Reaction Detection Module + +Detect visible speaker or scene reactions to audio events using visual analysis of video frames. 
+ +Expected behavior: + +- For each detected audio event, sample video frames before, during, and after the event. +- Detect visual reaction cues such as: + - head turn + - sudden posture shift + - startled body movement + - facial expression change + - mouth/eye/brow change + - speech pause or freeze + - scene-level movement spike +- Assign a reaction confidence score per event. +- Store visual analysis results alongside audio event data. + +Output example: + +```json +{ + "event_id": "horn_honk", + "start_time": 12.4, + "end_time": 13.8, + "reaction_confidence": 0.71, + "reaction_signals": { + "head_turn": 0.82, + "optical_flow_spike": 0.64, + "facial_expression_change": 0.55 + } +} +``` + +### Goal 3: CC Decision Engine and SRT Output + +Combine audio and visual signals to decide whether a caption is warranted, then export accepted captions to SRT. + +Expected behavior: + +- Combine audio confidence, visual reaction confidence, event importance, and ambient sound penalties. +- Reject low-impact ambient sounds. +- Generate short, editor-friendly CC labels. +- Export accepted captions to SRT. +- Export full debug results to JSON/CSV for review. + +Example accepted SRT: + +```srt +1 +00:00:12,400 --> 00:00:13,800 +[horn honks] + +2 +00:01:03,100 --> 00:01:04,600 +[glass breaks] +``` + +## Midpoint Milestone + +The midpoint milestone is completion of Goal 1 and Goal 2. + +At midpoint, the project should demonstrate: + +- CLI accepts a raw video input. +- Audio is extracted successfully. +- Non-speech sound events are detected with timestamps and confidence scores. +- Video frames are sampled around detected events. +- Visual reaction scores are computed and attached to each audio event. +- JSON debug output is generated. +- Basic SRT export may exist, but the final decision engine can still be simple. +- The pipeline runs on CPU and can optionally use GPU when available. +- The project is tested on a small sample set of Hindi and regional-language videos. 
+ +## Final Expected Outcome + +The final project should provide: + +- A Python-based backend pipeline. +- A command-line interface. +- A web-based editor review UI. +- Pluggable audio and vision model backends. +- CPU/GPU device selection and diagnostics. +- Multilingual non-speech CC label export. +- SRT export for accepted captions. +- JSON/CSV debug reports for all candidates. +- Documentation for installation, usage, troubleshooting, and contribution. + +## Non-Goals for Version 1 + +The first version will not focus on: + +- Full dialogue transcription. +- Full dialogue translation. +- Dubbing. +- Speaker diarization. +- Live real-time captioning. +- Perfect automatic caption approval without editor review. +- Training a custom large model from scratch. + +These can become future extensions, but the core value is non-speech CC suggestion. + +## Supported Languages + +Version 1 should support caption label export in: + +- English +- Hindi +- Tamil +- Telugu +- Bengali +- Marathi +- Malayalam + +The default language is assumed to be the same language as the video. Since Version 1 only generates non-speech captions, default language handling can be simple: + +- User selects the video language in CLI or UI. +- If language is not selected, default to English. +- Later, add automatic spoken-language detection. + +Caption labels should be generated from a curated glossary first. Machine translation can be used later as a fallback, but editor-approved labels are safer because CC labels must be short, consistent, and natural. + +Example: + +| Event ID | English | Hindi | +| --- | --- | --- | +| `horn_honk` | `[horn honks]` | `[हॉर्न बजता है]` | +| `glass_break` | `[glass breaks]` | `[कांच टूटता है]` | +| `crowd_cheer` | `[crowd cheering]` | `[भीड़ जयकार करती है]` | + +## High-Level Architecture + +The project should be designed as reusable modules, not as logic embedded inside the CLI or Web UI. 
+ +```text +Core pipeline modules + used by CLI + used by Web UI + later used by VLC plugin, API, or desktop app +``` + +The diagrams below use Mermaid, which renders directly in GitHub and many Markdown viewers. + +```mermaid +flowchart TB + subgraph Clients["User-Facing Clients"] + CLI["CLI\nccs analyze / doctor / export"] + WEB["Web UI\neditor review workspace"] + VLC["Future VLC Plugin"] + API["Future Local API"] + end + + subgraph Core["Reusable Core Pipeline"] + PIPE["Pipeline Orchestrator"] + CONFIG["Config + Thresholds"] + DIAG["Diagnostics + Friendly Errors"] + TYPES["Shared Data Models"] + end + + subgraph Audio["Audio Analysis"] + EXTRACT["Audio Extraction"] + DSP["DSP Features\nFFT / STFT / RMS / Onsets"] + A_BACKENDS["Audio Backends\nYAMNet / PANNs / AST / BEATs"] + EVENTS["Event Smoothing\nMerge / Filter / Normalize"] + end + + subgraph Vision["Visual Reaction Analysis"] + FRAMES["Frame Sampler"] + FLOW["Optical Flow"] + V_BACKENDS["Vision Backends\nMediaPipe / MMPose / MMAction2"] + REACT["Reaction Scoring"] + end + + subgraph Decision["Caption Decision"] + SCORE["Decision Scorer"] + RULES["Importance Rules\nAmbient Penalties"] + LABELS["Caption Labels\nGlossary + Translation"] + end + + subgraph Outputs["Exports"] + SRT["SRT"] + JSON["JSON Debug Report"] + CSV["CSV Review Report"] + end + + CLI --> PIPE + WEB --> PIPE + VLC --> API + API --> PIPE + + PIPE --> CONFIG + PIPE --> DIAG + PIPE --> TYPES + PIPE --> EXTRACT + EXTRACT --> DSP + DSP --> A_BACKENDS + A_BACKENDS --> EVENTS + EVENTS --> FRAMES + FRAMES --> FLOW + FRAMES --> V_BACKENDS + FLOW --> REACT + V_BACKENDS --> REACT + EVENTS --> SCORE + REACT --> SCORE + RULES --> SCORE + SCORE --> LABELS + LABELS --> SRT + SCORE --> JSON + SCORE --> CSV +``` + +Recommended repository structure: + +```text +cc-suggester/ + cc_suggester/ + core/ + pipeline.py + config.py + diagnostics.py + errors.py + types.py + + audio/ + extractor.py + dsp.py + vad.py + events.py + backends/ + base.py + yamnet.py + 
panns.py + ast.py + beats.py + + vision/ + frame_sampler.py + optical_flow.py + reactions.py + backends/ + base.py + mediapipe_face.py + mediapipe_pose.py + mmaction.py + + decision/ + scorer.py + rules.py + labels.py + + output/ + srt.py + json_report.py + csv_report.py + + translation/ + glossary.py + indictrans.py + + cli/ + app.py + + ui/ + streamlit_app.py + + configs/ + default.yaml + cpu.yaml + gpu.yaml + + label_maps/ + events.en.json + events.hi.json + events.ta.json + events.te.json + events.bn.json + events.mr.json + events.ml.json + + docs/ + architecture.md + cli.md + web-ui.md + models.md + troubleshooting.md + evaluation.md + vlc-plugin.md + + examples/ + README.md + + tests/ + unit/ + integration/ + + requirements.txt + requirements-ui.txt + requirements-dev.txt + requirements-translate.txt + README.md + CONTRIBUTING.md + LICENSE +``` + +The exact file names can change during implementation, but the separation of responsibilities should remain. + +## Data Model + +The pipeline should pass structured objects between modules. + +### Audio Event Candidate + +Represents a detected sound event before visual analysis. + +Fields: + +- `event_id` +- `label` +- `start_time` +- `end_time` +- `audio_confidence` +- `audio_backend` +- `raw_class_name` +- `debug_info` + +### Reaction Result + +Represents visual reaction evidence for an audio event. + +Fields: + +- `event_id` +- `start_time` +- `end_time` +- `reaction_confidence` +- `reaction_signals` +- `frames_sampled` +- `vision_backend` +- `debug_info` + +### Caption Suggestion + +Represents the final decision. 
+ +Fields: + +- `event_id` +- `start_time` +- `end_time` +- `audio_confidence` +- `reaction_confidence` +- `decision_score` +- `accepted` +- `reason` +- `caption_text` +- `language` +- `requires_review` +- `debug_info` + +This structure allows the same result to be used by: + +- CLI output +- Web UI review panel +- SRT export +- JSON report +- CSV report +- future VLC integration + +```mermaid +classDiagram + class AudioEventCandidate { + string event_id + string label + float start_time + float end_time + float audio_confidence + string audio_backend + string raw_class_name + dict debug_info + } + + class ReactionResult { + string event_id + float start_time + float end_time + float reaction_confidence + dict reaction_signals + int frames_sampled + string vision_backend + dict debug_info + } + + class CaptionSuggestion { + string event_id + float start_time + float end_time + float audio_confidence + float reaction_confidence + float decision_score + bool accepted + string reason + string caption_text + string language + bool requires_review + dict debug_info + } + + AudioEventCandidate --> ReactionResult : analyzed visually at event timestamp + AudioEventCandidate --> CaptionSuggestion : contributes audio evidence + ReactionResult --> CaptionSuggestion : contributes visual evidence +``` + +## Pipeline Flow + +```text +Input video + -> validate input + -> extract metadata + -> extract audio + -> run DSP candidate detection + -> run sound event model backend + -> merge and smooth audio events + -> sample frames around event timestamps + -> run visual reaction analysis + -> combine audio and visual signals + -> generate caption suggestions + -> export SRT, JSON, CSV +``` + +```mermaid +flowchart TD + A["Raw video input"] --> B{"Valid video?"} + B -- "No" --> B_ERR["Friendly error\nsuggest inspect/doctor command"] + B -- "Yes" --> C["Extract metadata\nfps, duration, resolution"] + C --> D["Extract audio with ffmpeg"] + D --> E["Compute DSP features\nRMS, STFT, 
spectral flux"] + E --> F["Run audio backend\nYAMNet first, PANNs/AST later"] + F --> G["Smooth + merge detections"] + G --> H["Audio event candidates"] + H --> I["Sample frames around each event"] + I --> J["Run visual backends\nMediaPipe face/pose + optical flow"] + J --> K["Reaction confidence per event"] + H --> L["Decision engine"] + K --> L + L --> M{"Caption warranted?"} + M -- "No" --> N["Rejected candidate\nkept in JSON/CSV debug report"] + M -- "Yes" --> O["Accepted caption suggestion"] + O --> P["Language label mapping"] + P --> Q["Export SRT"] + L --> R["Export full JSON report"] + L --> S["Export CSV review report"] +``` + +## Audio Module Plan + +The audio module should combine explainable signal processing with model-based classification. + +### DSP Baseline + +Use lightweight mathematical features to find candidate regions and explain event salience: + +- RMS energy +- short-time Fourier transform +- log-mel spectrogram +- spectral flux +- onset strength +- zero-crossing rate +- peak detection +- duration filtering + +This layer is useful because it is: + +- fast +- CPU-friendly +- explainable +- helpful for debugging model outputs + +However, DSP should not be the final classifier. It can identify that something happened, but not reliably classify what happened. + +### Model Backends + +Recommended backend priority: + +1. **YAMNet** as the first baseline. +2. **PANNs** as a stronger optional backend. +3. **AST** for transformer-based audio classification experiments. +4. **BEATs** for advanced audio representation experiments. +5. **CLAP** later for open-vocabulary event matching. 
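
One way to keep the backends in this priority order interchangeable is to hide them behind a single `detect(audio_path, config)` method and look them up by name. The sketch below is illustrative only: `AudioEvent`, `AudioBackend`, `register_backend`, `get_backend`, and `MockBackend` are hypothetical names, not taken from the repository, and the mock returns a fixed event instead of running a model. `AudioEvent` carries only a subset of the Audio Event Candidate fields.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


@dataclass
class AudioEvent:
    # Subset of the Audio Event Candidate fields from the data model.
    event_id: str
    label: str
    start_time: float
    end_time: float
    audio_confidence: float
    audio_backend: str


class AudioBackend(Protocol):
    """Anything with a name and a detect() method counts as a backend."""

    name: str

    def detect(self, audio_path: str, config: dict) -> List[AudioEvent]: ...


# Hypothetical registry: backend name -> factory that builds the backend.
_REGISTRY: Dict[str, Callable[[], AudioBackend]] = {}


def register_backend(name: str, factory: Callable[[], AudioBackend]) -> None:
    _REGISTRY[name] = factory


def get_backend(name: str) -> AudioBackend:
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "none"
        raise ValueError(
            f"Unknown audio backend '{name}'. Available: {known}"
        ) from None


class MockBackend:
    """Deterministic stand-in that returns one fixed event (no real model)."""

    name = "mock"

    def detect(self, audio_path: str, config: dict) -> List[AudioEvent]:
        threshold = config.get("audio_threshold", 0.45)
        events = [AudioEvent("horn_honk", "Horn honk", 12.4, 13.8, 0.87, self.name)]
        # Drop anything below the configured confidence threshold.
        return [e for e in events if e.audio_confidence >= threshold]


register_backend("mock", MockBackend)
```

With this shape, a YAMNet or PANNs backend would be registered the same way, and the pipeline would only ever call `get_backend(name).detect(...)`.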
+ +The backend interface should stay stable: + +```text +detect(audio_path, config) -> list of audio events +``` + +### Event Smoothing + +Raw model outputs should be post-processed: + +- merge adjacent detections of the same event +- remove very short low-confidence events +- suppress speech-like classes unless desired +- suppress constant ambient sounds +- normalize model labels into project event IDs + +Example: + +```text +Raw model labels: + Vehicle horn, car horn, honking + +Normalized event ID: + horn_honk +``` + +## Vision Module Plan + +The vision module should detect whether people or the scene visibly react to an audio event. + +### Frame Sampling + +For each audio event, sample frames from: + +- before the event +- during the event +- after the event + +Example: + +```text +event_start - 1.0s +event_start - 0.5s +event_start +event_midpoint +event_end +event_end + 0.5s +event_end + 1.0s +``` + +### Reaction Signals + +The reaction score can combine: + +- head turn magnitude +- pose shift magnitude +- sudden optical flow spike +- facial expression change +- mouth open or close change +- eye/brow movement +- speaker pause proxy +- scene movement spike + +### First Backend + +Use: + +- OpenCV for frame extraction and optical flow. +- MediaPipe Face Landmarker for facial landmarks and expression blendshapes. +- MediaPipe Pose Landmarker for body and head movement. + +This is suitable for the midpoint because it is interpretable and can run on CPU. + +### Future Backends + +Potential later backends: + +- MMPose for stronger pose estimation. +- MMAction2 for action recognition. +- Video-language models for heavier scene reasoning. + +These should remain optional because they may be GPU-heavy. + +## Decision Engine Plan + +The decision engine decides whether a sound event deserves a caption. 
+ +A simple scoring formula: + +```text +decision_score = + audio_confidence + + reaction_confidence + + event_importance_prior + + speech_pause_bonus + - ambient_penalty +``` + +Example rules: + +- Caption high-impact events even if reaction is weak: + - gunshot + - explosion + - alarm + - siren + - glass breaking +- Require reaction or high confidence for common events: + - horn + - door slam + - phone ring + - applause +- Usually reject ambient continuous sounds: + - fan noise + - traffic hum + - low background music + - crowd murmur + +Every decision should include a human-readable reason. + +Example: + +```text +Accepted because the audio model detected horn_honk with high confidence and the speaker turned their head immediately after the event. +``` + +Example rejection: + +```text +Rejected because traffic noise was continuous, low-confidence, and no visible reaction was detected. +``` + +```mermaid +flowchart LR + A["Audio confidence"] --> E["Decision scorer"] + B["Reaction confidence"] --> E + C["Event importance prior"] --> E + D["Ambient sound penalty"] --> E + P["Speech pause / scene impact bonus"] --> E + + E --> F{"Decision score >= threshold?"} + F -- "Yes" --> G["Accept caption"] + F -- "Borderline" --> H["Needs editor review"] + F -- "No" --> I["Reject candidate"] + + G --> J["Generate caption text"] + H --> J + I --> K["Keep reason in debug output"] + J --> L["SRT / JSON / CSV"] +``` + +## CLI Plan + +The CLI should be useful for developers, batch processing, debugging, and reviewers who prefer terminal workflows. 
+ +Recommended command shape: + +```bash +ccs analyze input.mp4 --lang hi --device auto +ccs analyze input.mp4 --audio-backend yamnet --vision-backend mediapipe --out outputs/ +ccs inspect input.mp4 +ccs doctor +ccs export outputs/result.json --format srt --lang ta +ccs web +``` + +### CLI Commands + +| Command | Purpose | +| --- | --- | +| `ccs analyze` | Run full pipeline on a video | +| `ccs audio` | Run only sound event detection | +| `ccs vision` | Run visual reaction analysis from existing audio events | +| `ccs export` | Convert JSON results to SRT/CSV | +| `ccs inspect` | Show video metadata and input validity | +| `ccs doctor` | Check environment, ffmpeg, models, CPU/GPU | +| `ccs web` | Launch the Web UI | + +### CLI Error Suggestions + +The CLI should explain errors and suggest next steps. + +Wrong command example: + +```text +No such command: analize +Did you mean: analyze? + +Try: + ccs analyze input.mp4 --device auto --lang hi +``` + +Missing video example: + +```text +Input file was not found: + videos/sample.mp4 + +Suggestions: +1. Check the path. +2. Run: + ccs inspect /path/to/video.mp4 +``` + +GPU failure example: + +```text +CUDA was requested, but no usable GPU was detected. + +Detected: +- torch.cuda.is_available(): false +- CUDA runtime: not found +- NVIDIA driver: not found + +Suggestions: +1. Retry on CPU: + ccs analyze input.mp4 --device cpu + +2. Check environment: + ccs doctor + +3. Install a CUDA-compatible PyTorch build if GPU acceleration is required. +``` + +## Device Handling + +The project should support: + +```text +device = auto | cpu | cuda +``` + +Behavior: + +- `auto`: use GPU if available, otherwise CPU. +- `cpu`: force CPU. +- `cuda`: require GPU; fail clearly if unavailable. 
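
The three modes above can be sketched as a small resolver. This is a sketch under assumptions, not the project's implementation: `resolve_device`, `DeviceChoice`, and `DeviceUnavailableError` are hypothetical names, and GPU availability is passed in as a boolean so the example does not depend on PyTorch or TensorFlow. A real implementation might obtain that value from `torch.cuda.is_available()`.

```python
from dataclasses import dataclass
from typing import Optional


class DeviceUnavailableError(RuntimeError):
    """Raised when `cuda` is requested but no usable GPU is present."""


@dataclass
class DeviceChoice:
    requested: str                        # what the user asked for
    actual: str                           # what the run will use
    fallback_reason: Optional[str] = None  # set only when auto fell back to CPU


def resolve_device(requested: str, gpu_available: bool) -> DeviceChoice:
    if requested not in {"auto", "cpu", "cuda"}:
        raise ValueError(f"Unknown device mode: {requested!r}")

    # cpu: always force CPU, even if a GPU exists.
    if requested == "cpu":
        return DeviceChoice("cpu", "cpu")

    # auto and cuda both use the GPU when one is available.
    if gpu_available:
        return DeviceChoice(requested, "cuda")

    # auto: fall back to CPU and record why.
    if requested == "auto":
        return DeviceChoice("auto", "cpu", "no usable GPU detected")

    # cuda: fail clearly instead of silently falling back.
    raise DeviceUnavailableError(
        "CUDA was requested, but no usable GPU was detected. "
        "Retry with --device cpu or run the doctor command."
    )
```

Returning both the requested and the actual device keeps the fallback visible, which is what the per-run device metadata needs.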
+ +Each run should save device metadata: + +- selected device +- actual device used +- model backend +- GPU name if available +- CUDA availability +- runtime +- fallback reason if CPU was used + +The UI should provide: + +- Auto/CPU/GPU toggle +- GPU diagnostics popup +- Retry on CPU button +- Copy diagnostic report button + +```mermaid +flowchart TD + A["User selects device mode"] --> B{"Mode"} + B -- "auto" --> C{"GPU available?"} + C -- "Yes" --> D["Use GPU"] + C -- "No" --> E["Fallback to CPU\nrecord fallback reason"] + B -- "cpu" --> F["Force CPU"] + B -- "cuda" --> G{"GPU available?"} + G -- "Yes" --> D + G -- "No" --> H["Stop with clear diagnostic"] + H --> I["Suggest: retry with --device cpu"] + H --> J["Suggest: run ccs doctor"] + D --> K["Save device metadata"] + E --> K + F --> K +``` + +## Web UI Plan + +The Web UI should be an editor review workspace, not a basic demo. + +Recommended initial framework: + +- Streamlit for the first implementation because it is fast to build and supports video display. +- Later, consider React/FastAPI if the UI needs more advanced timeline editing. 
+ +### UI Layout + +```text +Top Bar + Product name + Device mode selector + Language selector + Audio backend selector + Vision backend selector + Run Doctor button + +Left Panel + Video dropdown/upload + Video metadata + Start Caption button + Export SRT button + Export JSON button + Export CSV button + +Center Panel + Video player + Play/Pause controls + Current timestamp + Draggable timeline + Event markers + Previous/Next event buttons + +Right Panel + Review SRT suggestions + Caption text editor + Accept/Reject toggle + Confidence scores + Decision reason + Warning/error badges + +Bottom Panel + Event table + Start/end timestamps + Event labels + Audio confidence + Reaction confidence + Decision score + Status +``` + +```mermaid +flowchart TB + subgraph Top["Top Bar"] + T1["Device Mode"] + T2["Language"] + T3["Audio Backend"] + T4["Vision Backend"] + T5["Run Doctor"] + end + + subgraph Left["Left Panel"] + L1["Video Dropdown / Upload"] + L2["Video Metadata"] + L3["Start Caption"] + L4["Export SRT / JSON / CSV"] + end + + subgraph Center["Center Panel"] + C1["Video Player"] + C2["Play / Pause"] + C3["Draggable Timeline"] + C4["Event Markers"] + C5["Previous / Next Event"] + end + + subgraph Right["Right Review Panel"] + R1["SRT Suggestions"] + R2["Editable Caption Text"] + R3["Accept / Reject"] + R4["Confidence Scores"] + R5["Decision Reason"] + R6["Error / Warning Badges"] + end + + subgraph Bottom["Bottom Panel"] + B1["Event Table"] + B2["Timestamps"] + B3["Audio + Reaction Scores"] + B4["Status"] + end + + L1 --> L3 + T1 --> L3 + T2 --> L3 + T3 --> L3 + T4 --> L3 + L3 --> C1 + L3 --> R1 + L3 --> B1 + C4 --> R1 + B1 --> R1 + R3 --> L4 + R2 --> L4 +``` + +### Timeline Behavior + +The timeline should show event markers: + +- Green: accepted caption +- Yellow: needs review +- Gray: rejected +- Blue: currently selected event + +Clicking a marker should: + +- seek the video to that timestamp +- open the suggestion in the right review panel +- highlight the 
corresponding event table row + +### Editor Review Flow + +1. User selects or uploads a video. +2. User selects language and device mode. +3. User clicks **Start Caption**. +4. Pipeline generates caption suggestions. +5. User reviews suggestions in the right panel. +6. User edits captions if needed. +7. User accepts or rejects suggestions. +8. User exports final SRT. + +```mermaid +sequenceDiagram + actor Editor + participant UI as Web UI + participant Pipeline as Core Pipeline + participant Review as Review State + participant Export as Exporter + + Editor->>UI: Select video, language, device + Editor->>UI: Click Start Caption + UI->>Pipeline: Run analysis with config + Pipeline-->>UI: Caption suggestions + diagnostics + UI->>Review: Load suggestions into timeline and panel + Editor->>Review: Jump to event marker + Editor->>Review: Edit caption text + Editor->>Review: Accept or reject suggestion + Editor->>UI: Export final SRT + UI->>Export: Export accepted captions + Export-->>Editor: SRT / JSON / CSV files +``` + +### UI Error Handling + +The UI should include: + +- toast notifications for recoverable errors +- modal popups for blocking errors +- expandable debug details +- retry on CPU button when GPU fails +- model download/setup hints +- export success messages + +## Output Files + +Each run should produce a run directory: + +```text +outputs/ + sample-video/ + captions.en.srt + captions.hi.srt + results.json + events.csv + diagnostics.json + config.yaml +``` + +### SRT + +Only accepted caption suggestions. + +### JSON + +Full structured output: + +- accepted suggestions +- rejected candidates +- confidence scores +- reaction signals +- decision reasons +- diagnostics + +### CSV + +Reviewer-friendly table for spreadsheets. 
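
The difference between the SRT and CSV outputs in this section can be sketched as follows: the SRT writer keeps only accepted suggestions, while the CSV writer keeps every candidate. This is a minimal sketch, assuming suggestions are plain dictionaries with the Caption Suggestion field names; `format_timestamp`, `to_srt`, `to_csv`, and `CSV_COLUMNS` are hypothetical names, not the project's API.

```python
import csv
import io
from typing import Dict, List

CSV_COLUMNS = [
    "event_id", "start_time", "end_time", "caption_text",
    "audio_confidence", "reaction_confidence", "decision_score", "accepted",
]


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(suggestions: List[Dict]) -> str:
    """Write accepted suggestions only, numbered in start-time order."""
    accepted = sorted(
        (s for s in suggestions if s["accepted"]),
        key=lambda s: s["start_time"],
    )
    blocks = []
    for index, s in enumerate(accepted, start=1):
        start = format_timestamp(s["start_time"])
        end = format_timestamp(s["end_time"])
        blocks.append(f"{index}\n{start} --> {end}\n{s['caption_text']}\n")
    return "\n".join(blocks)


def to_csv(suggestions: List[Dict]) -> str:
    """Write every candidate, accepted or rejected, as one row each."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=CSV_COLUMNS,
        extrasaction="ignore",  # skip fields such as debug_info
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(suggestions)
    return buffer.getvalue()
```

Rejected candidates never reach the SRT file but stay visible in the CSV, so reviewers can still check what the tool filtered out.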
+ +## Installation Plan + +Start with requirements files: + +```text +requirements.txt +requirements-ui.txt +requirements-dev.txt +requirements-translate.txt +``` + +Recommended split: + +- `requirements.txt`: core CPU pipeline +- `requirements-ui.txt`: Streamlit/Web UI +- `requirements-dev.txt`: test, lint, formatting, docs +- `requirements-translate.txt`: IndicTrans2 or translation extras + +Example install flow: + +```bash +python -m venv .venv +source .venv/bin/activate +pip install -r requirements.txt +pip install -r requirements-ui.txt +ccs doctor +``` + +GPU installation should be documented separately because CUDA-compatible PyTorch/TensorFlow installation depends on the user's system. + +Docker should be added later for reproducibility, but requirements files are easier for first-time contributors. + +## Configuration Plan + +Use YAML config files for reproducible runs. + +Example settings: + +```yaml +device: auto +language: en +audio_backend: yamnet +vision_backend: mediapipe +audio_threshold: 0.45 +reaction_threshold: 0.35 +decision_threshold: 0.65 +min_event_duration: 0.25 +merge_gap: 0.40 +sample_window_before: 1.0 +sample_window_after: 1.0 +``` + +CLI flags should override config values. + +## Evaluation Plan + +The tool should be evaluated on a small sample set of Hindi and regional-language content. 
+ +### Suggested Evaluation Data + +Initial languages: + +- Hindi +- Tamil +- Telugu +- Bengali +- Marathi +- Malayalam + +Video types: + +- educational videos +- conversational scenes +- public-service clips +- classroom-style videos +- documentary-style clips + +### Annotation Process + +Editors should review: + +- whether the suggested caption is needed +- whether the label is correct +- whether the timestamp is correct +- whether any important sound was missed +- whether any unnecessary sound was captioned + +### Metrics + +Track: + +- audio event precision +- audio event recall +- caption decision precision +- over-captioning rate +- missed-important-event rate +- timestamp quality +- editor acceptance rate +- average editor correction time + +### Feedback Fields + +Suggested review CSV columns: + +```text +video_id +event_id +start_time +end_time +caption_text +audio_confidence +reaction_confidence +decision_score +accepted_by_tool +accepted_by_editor +editor_corrected_label +editor_notes +``` + +## VLC Plugin Plan + +VLC integration is a useful extension, but it should not be part of the midpoint. + +Recommended phases: + +1. Generate SRT externally and let users load it in VLC. +2. Add a helper command that analyzes the current video path. +3. Build a VLC Lua extension that calls the local CLI or local API. +4. Let the extension load the generated SRT when analysis finishes. + +The VLC plugin should use the same core modules indirectly through CLI/API, not duplicate analysis logic. + +## Roadmap + +### Phase 1: Project Foundation + +- Define module interfaces. +- Create data models. +- Add config system. +- Add diagnostics and friendly errors. +- Add README and project documentation. + +### Phase 2: Goal 1 Audio Detection + +- Extract audio with ffmpeg. +- Add DSP baseline. +- Add YAMNet backend. +- Add event smoothing and label normalization. +- Export audio event JSON. + +### Phase 3: Goal 2 Visual Reaction Detection + +- Sample event-aligned frames. 
+- Add OpenCV optical flow features. +- Add MediaPipe face/pose backends. +- Compute reaction confidence. +- Attach reaction data to audio events. + +### Phase 4: Goal 3 Decision and Output + +- Add decision scorer. +- Add event importance rules. +- Add ambient rejection logic. +- Add SRT/JSON/CSV export. +- Add multilingual caption label glossary. + +### Phase 5: CLI Productization + +- Add complete CLI commands. +- Add typo suggestions. +- Add error recovery suggestions. +- Add `doctor` diagnostics. +- Add CPU/GPU fallback behavior. + +### Phase 6: Web Editor Review UI + +- Add video selector/upload. +- Add Start Caption button. +- Add review panel. +- Add timeline event markers. +- Add accept/reject/edit flow. +- Add export buttons. +- Add error popups and diagnostics panel. + +### Phase 7: Advanced Backends + +- Add PANNs backend. +- Add AST or BEATs backend. +- Add optional translation backend. +- Add stronger visual backends if needed. + +### Phase 8: Evaluation and Packaging + +- Evaluate on Hindi and regional-language videos. +- Collect editor feedback. +- Improve thresholds and label glossary. +- Add Docker. +- Add VLC integration prototype. + +```mermaid +flowchart LR + P1["Phase 1\nProject Foundation"] --> P2["Phase 2\nAudio Detection"] + P2 --> P3["Phase 3\nVisual Reaction Detection"] + P3 --> P4["Phase 4\nDecision + Output"] + P4 --> P5["Phase 5\nCLI Productization"] + P5 --> P6["Phase 6\nWeb Editor UI"] + P6 --> P7["Phase 7\nAdvanced Backends"] + P7 --> P8["Phase 8\nEvaluation + Packaging"] + + P2 -.->|Midpoint Goal 1| M["Midpoint\nGoals 1 + 2 complete"] + P3 -.->|Midpoint Goal 2| M +``` + +## Open Questions + +- Should caption labels be formal, conversational, or broadcaster-style in each language? +- Should laughter be treated as speech-adjacent or non-speech for this project? +- Should music be captioned only when it begins/stops or changes mood? +- How should overlapping non-speech events be represented in SRT? 
+- What timestamp tolerance is acceptable for editors? +- Will sample videos be provided with editor-approved ground truth? +- Should the Web UI support manual timestamp adjustment in Version 1? +- Should rejected events be shown by default or hidden under debug/review mode? + +## Contribution Guidelines + +Contributors should follow these principles: + +- Keep backend logic independent from UI. +- Add new models through backend interfaces. +- Preserve JSON output compatibility where possible. +- Include tests for decision rules and output formatting. +- Prefer readable, debuggable logic over opaque automation. +- Avoid over-captioning as a core product principle. +- Document new event labels in the multilingual glossary. + +## Proposed Tech Stack + +Core: + +- Python +- ffmpeg +- OpenCV +- NumPy +- SciPy/librosa-style audio features +- PyTorch and/or TensorFlow depending on model backend + +Audio: + +- DSP baseline +- YAMNet +- PANNs +- AST or BEATs as optional advanced backends + +Vision: + +- OpenCV +- MediaPipe Face Landmarker +- MediaPipe Pose Landmarker +- optional MMPose/MMAction2 later + +CLI: + +- Typer +- Rich + +Web UI: + +- Streamlit first +- optional FastAPI/React later + +Translation: + +- curated glossary first +- IndicTrans2 fallback later + +Testing: + +- pytest +- small synthetic fixtures +- sample video integration tests + +## Success Criteria + +The project is successful when: + +- It accepts raw videos without subtitles or transcripts. +- It detects non-speech audio events with timestamps. +- It estimates visible reaction confidence around each event. +- It avoids captioning low-impact ambient sounds. +- It exports clean SRT files. +- It provides useful debug information. +- It runs on CPU and can use GPU when available. +- It lets editors review, edit, accept, reject, and export suggestions. +- It supports English plus initial Indian regional-language caption labels. 
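To make the "exports clean SRT files" criterion concrete, here is a minimal sketch of SRT rendering for bracketed non-speech captions. The helper names `format_srt_timestamp` and `render_srt` are hypothetical and are not taken from `cc_suggester.output`; only the `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue layout is standard SRT.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT ``HH:MM:SS,mmm`` form."""
    # Round to whole milliseconds first so float noise cannot leak into the output.
    total_ms = max(0, round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def render_srt(events: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues sorted by start time."""
    blocks = []
    for index, (start, end, text) in enumerate(sorted(events), start=1):
        timing = f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Each block ends with a newline, so joining with "\n" leaves one blank line between cues.
    return "\n".join(blocks)
```

The sketch keeps overlapping cues as separate entries and does not try to settle the open question of how overlapping non-speech events should be represented.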
+ +## Summary + +This project should be built as a modular open-source tool: + +- The backend pipeline does the real work. +- The CLI provides batch and debug workflows. +- The Web UI provides editor review and export workflows. +- Future VLC or API integrations reuse the same modules. + +The first implementation should prioritize a reliable, explainable pipeline using DSP, YAMNet, OpenCV, MediaPipe, and a rule-based decision engine. Stronger audio/video models can be added later through the pluggable backend system. +# Intelligent-Closed-Caption-CC-Suggestion-Tool diff --git a/demo_videos/drivelink b/demo_videos/drivelink new file mode 100644 index 0000000..8c9a4cc --- /dev/null +++ b/demo_videos/drivelink @@ -0,0 +1,3 @@ +https://drive.google.com/drive/folders/1Ti5aqztP9VHas_5AbrH7utSn-G27HZXW?usp=sharing + +demo videos and recording diff --git a/demo_videos/output/vid1.reviewed.en.srt b/demo_videos/output/vid1.reviewed.en.srt new file mode 100644 index 0000000..6f6b2f5 --- /dev/null +++ b/demo_videos/output/vid1.reviewed.en.srt @@ -0,0 +1,7 @@ +1 +00:00:08,640 --> 00:00:09,600 +[laughter] + +2 +00:00:55,680 --> 00:00:59,040 +[students applauding] diff --git a/demo_videos/output/vid10.reviewed.en.srt b/demo_videos/output/vid10.reviewed.en.srt new file mode 100644 index 0000000..30f1961 --- /dev/null +++ b/demo_videos/output/vid10.reviewed.en.srt @@ -0,0 +1,31 @@ +1 +00:00:18,720 --> 00:00:19,680 +[explosion] + +2 +00:00:19,200 --> 00:00:20,640 +[explosion] + +3 +00:00:31,200 --> 00:00:39,840 +[music] + +4 +00:00:51,360 --> 00:00:55,200 +[music] + +5 +00:00:58,560 --> 00:01:00,000 +[explosion] + +6 +00:01:09,600 --> 00:01:11,520 +[music] + +7 +00:01:36,960 --> 00:01:38,880 +[music] + +8 +00:02:04,800 --> 00:02:06,720 +[explosion] diff --git a/demo_videos/output/vid11.reviewed.en.srt b/demo_videos/output/vid11.reviewed.en.srt new file mode 100644 index 0000000..e4bb91c --- /dev/null +++ b/demo_videos/output/vid11.reviewed.en.srt @@ -0,0 +1,31 @@ +1 +00:00:10,560 
--> 00:00:11,520 +[explosion] + +2 +00:00:10,560 --> 00:00:11,520 +[gunshot] + +3 +00:00:12,960 --> 00:00:13,920 +[gunshot] + +4 +00:00:12,960 --> 00:00:13,920 +[explosion] + +5 +00:00:14,400 --> 00:00:16,320 +[music] + +6 +00:00:23,520 --> 00:00:33,120 +[music] + +7 +00:01:31,200 --> 00:01:36,480 +[music] + +8 +00:01:54,720 --> 00:01:58,080 +[music] diff --git a/demo_videos/output/vid2.reviewed.en.srt b/demo_videos/output/vid2.reviewed.en.srt new file mode 100644 index 0000000..e837348 --- /dev/null +++ b/demo_videos/output/vid2.reviewed.en.srt @@ -0,0 +1,19 @@ +1 +00:00:02,880 --> 00:00:14,880 +[music] + +2 +00:00:15,360 --> 00:00:18,720 +[music] + +3 +00:01:12,960 --> 00:01:13,920 +[explosion] + +4 +00:01:12,960 --> 00:01:13,920 +[gunshot] + +5 +00:01:13,440 --> 00:01:14,400 +[explosion] diff --git a/demo_videos/output/vid3.reviewed.en.srt b/demo_videos/output/vid3.reviewed.en.srt new file mode 100644 index 0000000..ef6c046 --- /dev/null +++ b/demo_videos/output/vid3.reviewed.en.srt @@ -0,0 +1,31 @@ +1 +00:00:23,520 --> 00:00:24,480 +[gunshot] + +2 +00:00:26,400 --> 00:00:27,360 +[glass breaks] + +3 +00:00:47,520 --> 00:00:48,480 +[explosion] + +4 +00:00:50,400 --> 00:00:53,760 +[music] + +5 +00:00:59,040 --> 00:01:08,640 +[music] + +6 +00:01:11,040 --> 00:01:12,480 +[explosion] + +7 +00:01:16,320 --> 00:01:17,280 +[gunshot] + +8 +00:01:16,320 --> 00:01:17,760 +[explosion] diff --git a/demo_videos/output/vid4.reviewed.en.srt b/demo_videos/output/vid4.reviewed.en.srt new file mode 100644 index 0000000..5356448 --- /dev/null +++ b/demo_videos/output/vid4.reviewed.en.srt @@ -0,0 +1,7 @@ +1 +00:00:00,480 --> 00:00:04,320 +[music] + +2 +00:00:11,520 --> 00:00:19,680 +[music] diff --git a/demo_videos/output/vid5.reviewed.en.srt b/demo_videos/output/vid5.reviewed.en.srt new file mode 100644 index 0000000..01bfe5d --- /dev/null +++ b/demo_videos/output/vid5.reviewed.en.srt @@ -0,0 +1,7 @@ +1 +00:00:00,480 --> 00:00:05,280 +[music] + +2 +00:00:05,760 --> 00:00:12,960 
+[music] diff --git a/main/.gitignore b/main/.gitignore new file mode 100644 index 0000000..5c2d003 --- /dev/null +++ b/main/.gitignore @@ -0,0 +1,12 @@ +__pycache__/ +*.py[cod] +.pytest_cache/ +.mypy_cache/ +.ruff_cache/ +.venv/ +outputs/ +*.wav +*.mp4 +*.mkv +*.mov +*.avi diff --git a/main/README.md b/main/README.md new file mode 100644 index 0000000..fc64fbd --- /dev/null +++ b/main/README.md @@ -0,0 +1,235 @@ +# cc-suggester + +Python implementation for the Intelligent Closed Caption Suggestion Tool. + +This package generates meaningful non-speech closed caption suggestions from video. The current implementation is a runnable foundation: it proves the modular pipeline, CLI, diagnostics, decision engine, multilingual labels, and SRT/JSON/CSV export flow before heavy ML backends are added. + +## Current Implementation Status + +Implemented now: + +- `cc_suggester.core`: pipeline orchestration, config, shared data models, diagnostics, media inspection, friendly errors +- `cc_suggester.audio`: audio backend interface, deterministic mock backend, DSP backend, event smoothing, ffmpeg extraction helper, advanced backend placeholders +- `cc_suggester.vision`: vision backend interface, deterministic mock backend, OpenCV backend, optional MediaPipe pose backend, frame-sampling and reaction helpers +- `cc_suggester.decision`: scoring rules, ambient penalties, multilingual caption glossary +- `cc_suggester.output`: SRT, JSON, CSV, and reviewed export helpers +- `cc_suggester.cli`: `analyze`, `audio`, `vision`, `inspect`, `doctor`, `export`, `labels`, and `web` commands +- `cc_suggester.ui`: Streamlit editor review client with edited SRT/CSV/session downloads +- `tests`: tests for SRT output, label lookup, config/CLI behavior, DSP detection, and reviewed exports + +Not implemented yet: + +- PANNs/AST/BEATs semantic audio backends (YAMNet is wired as an optional backend but still needs testing in a real TensorFlow Hub environment) +- MediaPipe face-landmark/expression reaction scoring +- Advanced Streamlit timeline editing and persisted review sessions +- Real evaluation
dataset and editor feedback loop +- Docker and VLC integration + +The full roadmap is documented in [`../docs/implementation-plan.md`](../docs/implementation-plan.md). + +## Setup + +The current scaffold uses only the Python standard library for the core pipeline. + +```bash +cd main +python -m venv .venv +source .venv/bin/activate +pip install -r requirements.txt +``` + +For development tests: + +```bash +pip install -r requirements-dev.txt +``` + +For the Web UI: + +```bash +pip install -r requirements-ui.txt +``` + +For the OpenCV vision backend: + +```bash +pip install -r requirements-vision.txt +``` + +`requirements-vision.txt` also includes MediaPipe for the optional pose-based reaction backend. + +## CLI Usage + +Run diagnostics: + +```bash +python -m cc_suggester doctor +``` + +Inspect a video: + +```bash +python -m cc_suggester inspect path/to/video.mp4 +``` + +Run the current mock pipeline: + +```bash +python -m cc_suggester analyze path/to/video.mp4 --lang hi --device auto --out outputs/ +``` + +Run the CPU DSP audio baseline: + +```bash +python -m cc_suggester analyze path/to/video.mp4 --audio-backend dsp --vision-backend mock --lang en +``` + +Run only audio detection: + +```bash +python -m cc_suggester audio path/to/video.mp4 --audio-backend dsp --out outputs/ +``` + +Run only visual reaction scoring from an audio report: + +```bash +python -m cc_suggester vision path/to/video.mp4 outputs/video/audio_events.json --vision-backend opencv +``` + +Run the optional YAMNet backend after installing audio dependencies: + +```bash +pip install -r requirements-audio.txt +python -m cc_suggester audio path/to/video.mp4 --audio-backend yamnet --out outputs/ +``` + +For offline environments, point YAMNet to a local TensorFlow Hub model directory: + +```bash +python -m cc_suggester audio path/to/video.mp4 \ + --audio-backend yamnet \ + --yamnet-model /path/to/local/yamnet +``` + +Export another language from an existing JSON report: + +```bash +python -m 
cc_suggester export outputs/video/results.json --format srt --lang ml +``` + +Show Web UI guidance: + +```bash +python -m cc_suggester web +``` + +List supported labels: + +```bash +python -m cc_suggester labels +``` + +The installed package will expose the same CLI as `ccs`: + +```bash +ccs analyze path/to/video.mp4 --lang hi --device auto +``` + +## Output Files + +Each analysis run creates a directory under `outputs/`: + +```text +outputs/ + video-name/ + captions..srt + results.json + events.csv + diagnostics.json + config.json +``` + +`captions..srt` contains only accepted captions. `results.json` and `events.csv` include accepted, rejected, and review-needed candidates for debugging and editor review. + +The Streamlit UI can also export reviewed SRT, CSV, and JSON session content from the current editor choices. This means edited caption text and manual accept/reject/review decisions drive the downloaded files. + +## Backend Strategy + +Backends are intentionally pluggable. + +Audio backends implement: + +```text +detect(video_path, metadata, config) -> list[AudioEventCandidate] +``` + +Vision backends implement: + +```text +analyze(video_path, metadata, audio_events, config) -> list[ReactionResult] +``` + +The DSP audio backend and OpenCV vision backend are available as local baselines. YAMNet is implemented as an optional TensorFlow Hub backend and requires `requirements-audio.txt`. MediaPipe is implemented as an optional pose-based reaction backend and requires `requirements-vision.txt`. Mock backends should remain available for tests and demos. 
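The audio interface above can be illustrated with a small custom backend. This is a sketch under stated assumptions: the `AudioEventCandidate` and `AudioBackend` classes below are simplified stand-ins so the example runs on its own (the real ones live in `cc_suggester.core.types` and `cc_suggester.audio.backends.base`), and `FixedScheduleBackend` is a hypothetical backend that does not exist in the package.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class AudioEventCandidate:
    """Stand-in with a subset of the fields the real candidate type carries."""

    event_id: str
    label: str
    start_time: float
    end_time: float
    audio_confidence: float
    audio_backend: str
    debug_info: dict[str, Any] = field(default_factory=dict)


class AudioBackend(ABC):
    """Stand-in for the backend interface: one ``detect`` method per backend."""

    name: str

    @abstractmethod
    def detect(self, video_path: Path, metadata: Any, config: Any) -> list[AudioEventCandidate]:
        """Return non-speech audio event candidates for one video."""


class FixedScheduleBackend(AudioBackend):
    """Hypothetical backend that replays a hand-written (id, start, end, confidence) schedule."""

    name = "fixed"

    def __init__(self, schedule: list[tuple[str, float, float, float]]) -> None:
        self.schedule = schedule

    def detect(self, video_path: Path, metadata: Any, config: Any) -> list[AudioEventCandidate]:
        # Drop candidates below the configured audio threshold (0.45 is an assumed fallback).
        threshold = getattr(config, "audio_threshold", 0.45)
        return [
            AudioEventCandidate(
                event_id=event_id,
                label=event_id.replace("_", " ").title(),
                start_time=start,
                end_time=end,
                audio_confidence=confidence,
                audio_backend=self.name,
                debug_info={"source": str(video_path)},
            )
            for event_id, start, end, confidence in self.schedule
            if confidence >= threshold
        ]
```

A backend written this way would presumably still need to be registered by name alongside the existing ones before the CLI could select it with `--audio-backend`.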
+ +## Verification + +Run syntax checks: + +```bash +python -m compileall cc_suggester +``` + +Run tests: + +```bash +python -m pytest tests +``` + +Run CLI smoke checks: + +```bash +python -m cc_suggester doctor +python -m cc_suggester analize +python -m cc_suggester analyze README.md --lang hi --device auto --out outputs +python -m cc_suggester export outputs/README/results.json --format srt --lang ml --out outputs/README/captions.ml.srt +python -m cc_suggester labels +python -m cc_suggester vision tests/fixtures/sample_classroom.mp4 outputs/sample_classroom/audio_events.json --vision-backend opencv +``` + +The `analize` command is intentionally useful as a smoke check for friendly typo suggestions. + +## Real Sample Video Fixture + +Generate a tiny deterministic MP4 fixture for local integration testing: + +```bash +python scripts/generate_sample_video.py +``` + +Then run: + +```bash +python -m cc_suggester inspect tests/fixtures/sample_classroom.mp4 +python -m cc_suggester analyze tests/fixtures/sample_classroom.mp4 --audio-backend dsp --vision-backend mock --lang hi +``` + +If `ffmpeg` is available, the MP4 includes embedded audio. If `ffmpeg` is unavailable but OpenCV is installed, the script writes a video-only MP4 plus a sidecar WAV file: + +```bash +python -m cc_suggester analyze tests/fixtures/sample_classroom.mp4 \ + --audio-backend dsp \ + --vision-backend opencv \ + --audio-path tests/fixtures/sample_classroom.wav \ + --lang hi +``` + +## Immediate Next Sprint + +1. Test YAMNet with an installed TensorFlow/TensorFlow Hub environment and a cached/local model. +2. Test MediaPipe in an environment with `requirements-vision.txt` installed and tune pose thresholds. +3. Add face-landmark/expression scoring to the MediaPipe backend. +4. Add more decision-rule and backend dependency tests. +5. Add timeline markers and persisted review sessions to the Streamlit editor. +6. Add evaluation scripts for editor feedback. 
+ +After that, package the CPU pipeline with Docker. diff --git a/main/cc_suggester.egg-info/PKG-INFO b/main/cc_suggester.egg-info/PKG-INFO new file mode 100644 index 0000000..276fdb9 --- /dev/null +++ b/main/cc_suggester.egg-info/PKG-INFO @@ -0,0 +1,254 @@ +Metadata-Version: 2.4 +Name: cc-suggester +Version: 0.1.0 +Summary: AI-assisted non-speech closed caption suggestion pipeline. +Author: Planet Read project contributor +Requires-Python: >=3.10 +Description-Content-Type: text/markdown +Provides-Extra: audio +Requires-Dist: numpy>=1.26; extra == "audio" +Requires-Dist: tensorflow>=2.16; extra == "audio" +Requires-Dist: tensorflow-hub>=0.16; extra == "audio" +Provides-Extra: ui +Requires-Dist: streamlit>=1.34; extra == "ui" +Provides-Extra: vision +Requires-Dist: opencv-python>=4.8; extra == "vision" +Requires-Dist: mediapipe>=0.10; extra == "vision" +Provides-Extra: dev +Requires-Dist: pytest>=8.0; extra == "dev" + +# cc-suggester + +Python implementation for the Intelligent Closed Caption Suggestion Tool. + +This package generates meaningful non-speech closed caption suggestions from video. The current implementation is a runnable foundation: it proves the modular pipeline, CLI, diagnostics, decision engine, multilingual labels, and SRT/JSON/CSV export flow before heavy ML backends are added.
+ +## Current Implementation Status + +Implemented now: + +- `cc_suggester.core`: pipeline orchestration, config, shared data models, diagnostics, media inspection, friendly errors +- `cc_suggester.audio`: audio backend interface, deterministic mock backend, DSP backend, event smoothing, ffmpeg extraction helper, advanced backend placeholders +- `cc_suggester.vision`: vision backend interface, deterministic mock backend, OpenCV backend, optional MediaPipe pose backend, frame-sampling and reaction helpers +- `cc_suggester.decision`: scoring rules, ambient penalties, multilingual caption glossary +- `cc_suggester.output`: SRT, JSON, CSV, and reviewed export helpers +- `cc_suggester.cli`: `analyze`, `audio`, `vision`, `inspect`, `doctor`, `export`, `labels`, and `web` commands +- `cc_suggester.ui`: Streamlit editor review client with edited SRT/CSV/session downloads +- `tests`: tests for SRT output, label lookup, config/CLI behavior, DSP detection, and reviewed exports + +Not implemented yet: + +- PANNs/AST/BEATs semantic audio backends (YAMNet is wired as an optional backend but still needs testing in a real TensorFlow Hub environment) +- MediaPipe face-landmark/expression reaction scoring +- Advanced Streamlit timeline editing and persisted review sessions +- Real evaluation dataset and editor feedback loop +- Docker and VLC integration + +The full roadmap is documented in [`../docs/implementation-plan.md`](../docs/implementation-plan.md). + +## Setup + +The current scaffold uses only the Python standard library for the core pipeline. + +```bash +cd main +python -m venv .venv +source .venv/bin/activate +pip install -r requirements.txt +``` + +For development tests: + +```bash +pip install -r requirements-dev.txt +``` + +For the Web UI: + +```bash +pip install -r requirements-ui.txt +``` + +For the OpenCV vision backend: + +```bash +pip install -r requirements-vision.txt +``` + +`requirements-vision.txt` also includes MediaPipe for the optional pose-based reaction backend.
+ +## CLI Usage + +Run diagnostics: + +```bash +python -m cc_suggester doctor +``` + +Inspect a video: + +```bash +python -m cc_suggester inspect path/to/video.mp4 +``` + +Run the current mock pipeline: + +```bash +python -m cc_suggester analyze path/to/video.mp4 --lang hi --device auto --out outputs/ +``` + +Run the CPU DSP audio baseline: + +```bash +python -m cc_suggester analyze path/to/video.mp4 --audio-backend dsp --vision-backend mock --lang en +``` + +Run only audio detection: + +```bash +python -m cc_suggester audio path/to/video.mp4 --audio-backend dsp --out outputs/ +``` + +Run only visual reaction scoring from an audio report: + +```bash +python -m cc_suggester vision path/to/video.mp4 outputs/video/audio_events.json --vision-backend opencv +``` + +Run the optional YAMNet backend after installing audio dependencies: + +```bash +pip install -r requirements-audio.txt +python -m cc_suggester audio path/to/video.mp4 --audio-backend yamnet --out outputs/ +``` + +For offline environments, point YAMNet to a local TensorFlow Hub model directory: + +```bash +python -m cc_suggester audio path/to/video.mp4 \ + --audio-backend yamnet \ + --yamnet-model /path/to/local/yamnet +``` + +Export another language from an existing JSON report: + +```bash +python -m cc_suggester export outputs/video/results.json --format srt --lang ml +``` + +Show Web UI guidance: + +```bash +python -m cc_suggester web +``` + +List supported labels: + +```bash +python -m cc_suggester labels +``` + +The installed package will expose the same CLI as `ccs`: + +```bash +ccs analyze path/to/video.mp4 --lang hi --device auto +``` + +## Output Files + +Each analysis run creates a directory under `outputs/`: + +```text +outputs/ + video-name/ + captions..srt + results.json + events.csv + diagnostics.json + config.json +``` + +`captions..srt` contains only accepted captions. `results.json` and `events.csv` include accepted, rejected, and review-needed candidates for debugging and editor review. 
+ +The Streamlit UI can also export reviewed SRT, CSV, and JSON session content from the current editor choices. This means edited caption text and manual accept/reject/review decisions drive the downloaded files. + +## Backend Strategy + +Backends are intentionally pluggable. + +Audio backends implement: + +```text +detect(video_path, metadata, config) -> list[AudioEventCandidate] +``` + +Vision backends implement: + +```text +analyze(video_path, metadata, audio_events, config) -> list[ReactionResult] +``` + +The DSP audio backend and OpenCV vision backend are available as local baselines. YAMNet is implemented as an optional TensorFlow Hub backend and requires `requirements-audio.txt`. MediaPipe is implemented as an optional pose-based reaction backend and requires `requirements-vision.txt`. Mock backends should remain available for tests and demos. + +## Verification + +Run syntax checks: + +```bash +python -m compileall cc_suggester +``` + +Run tests: + +```bash +python -m pytest tests +``` + +Run CLI smoke checks: + +```bash +python -m cc_suggester doctor +python -m cc_suggester analize +python -m cc_suggester analyze README.md --lang hi --device auto --out outputs +python -m cc_suggester export outputs/README/results.json --format srt --lang ml --out outputs/README/captions.ml.srt +python -m cc_suggester labels +python -m cc_suggester vision tests/fixtures/sample_classroom.mp4 outputs/sample_classroom/audio_events.json --vision-backend opencv +``` + +The `analize` command is intentionally useful as a smoke check for friendly typo suggestions. 
+ +## Real Sample Video Fixture + +Generate a tiny deterministic MP4 fixture for local integration testing: + +```bash +python scripts/generate_sample_video.py +``` + +Then run: + +```bash +python -m cc_suggester inspect tests/fixtures/sample_classroom.mp4 +python -m cc_suggester analyze tests/fixtures/sample_classroom.mp4 --audio-backend dsp --vision-backend mock --lang hi +``` + +If `ffmpeg` is available, the MP4 includes embedded audio. If `ffmpeg` is unavailable but OpenCV is installed, the script writes a video-only MP4 plus a sidecar WAV file: + +```bash +python -m cc_suggester analyze tests/fixtures/sample_classroom.mp4 \ + --audio-backend dsp \ + --vision-backend opencv \ + --audio-path tests/fixtures/sample_classroom.wav \ + --lang hi +``` + +## Immediate Next Sprint + +1. Test YAMNet with an installed TensorFlow/TensorFlow Hub environment and a cached/local model. +2. Test MediaPipe in an environment with `requirements-vision.txt` installed and tune pose thresholds. +3. Add face-landmark/expression scoring to the MediaPipe backend. +4. Add more decision-rule and backend dependency tests. +5. Add timeline markers and persisted review sessions to the Streamlit editor. +6. Add evaluation scripts for editor feedback. + +After that, package the CPU pipeline with Docker.
diff --git a/main/cc_suggester.egg-info/SOURCES.txt b/main/cc_suggester.egg-info/SOURCES.txt new file mode 100644 index 0000000..aefcf8f --- /dev/null +++ b/main/cc_suggester.egg-info/SOURCES.txt @@ -0,0 +1,61 @@ +README.md +pyproject.toml +cc_suggester/__init__.py +cc_suggester/__main__.py +cc_suggester.egg-info/PKG-INFO +cc_suggester.egg-info/SOURCES.txt +cc_suggester.egg-info/dependency_links.txt +cc_suggester.egg-info/entry_points.txt +cc_suggester.egg-info/requires.txt +cc_suggester.egg-info/top_level.txt +cc_suggester/audio/__init__.py +cc_suggester/audio/dsp.py +cc_suggester/audio/events.py +cc_suggester/audio/extractor.py +cc_suggester/audio/label_mapping.py +cc_suggester/audio/vad.py +cc_suggester/audio/wav.py +cc_suggester/audio/backends/__init__.py +cc_suggester/audio/backends/base.py +cc_suggester/audio/backends/dsp.py +cc_suggester/audio/backends/mock.py +cc_suggester/audio/backends/unavailable.py +cc_suggester/audio/backends/yamnet.py +cc_suggester/cli/__init__.py +cc_suggester/cli/app.py +cc_suggester/core/__init__.py +cc_suggester/core/config.py +cc_suggester/core/diagnostics.py +cc_suggester/core/errors.py +cc_suggester/core/media.py +cc_suggester/core/pipeline.py +cc_suggester/core/types.py +cc_suggester/decision/__init__.py +cc_suggester/decision/labels.py +cc_suggester/decision/rules.py +cc_suggester/decision/scorer.py +cc_suggester/output/__init__.py +cc_suggester/output/csv_report.py +cc_suggester/output/json_report.py +cc_suggester/output/review_export.py +cc_suggester/output/srt.py +cc_suggester/translation/__init__.py +cc_suggester/translation/glossary.py +cc_suggester/ui/__init__.py +cc_suggester/ui/streamlit_app.py +cc_suggester/vision/__init__.py +cc_suggester/vision/frame_sampler.py +cc_suggester/vision/optical_flow.py +cc_suggester/vision/reactions.py +cc_suggester/vision/backends/__init__.py +cc_suggester/vision/backends/base.py +cc_suggester/vision/backends/mediapipe.py +cc_suggester/vision/backends/mock.py 
+cc_suggester/vision/backends/opencv.py +tests/test_config_cli.py +tests/test_dsp_backend.py +tests/test_outputs.py +tests/test_real_video_integration.py +tests/test_review_export.py +tests/test_vision_pipeline.py +tests/test_yamnet_backend.py \ No newline at end of file diff --git a/main/cc_suggester.egg-info/dependency_links.txt b/main/cc_suggester.egg-info/dependency_links.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/main/cc_suggester.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/main/cc_suggester.egg-info/entry_points.txt b/main/cc_suggester.egg-info/entry_points.txt new file mode 100644 index 0000000..3b01b60 --- /dev/null +++ b/main/cc_suggester.egg-info/entry_points.txt @@ -0,0 +1,2 @@ +[console_scripts] +ccs = cc_suggester.cli.app:main diff --git a/main/cc_suggester.egg-info/requires.txt b/main/cc_suggester.egg-info/requires.txt new file mode 100644 index 0000000..b339e44 --- /dev/null +++ b/main/cc_suggester.egg-info/requires.txt @@ -0,0 +1,15 @@ + +[audio] +numpy>=1.26 +tensorflow>=2.16 +tensorflow-hub>=0.16 + +[dev] +pytest>=8.0 + +[ui] +streamlit>=1.34 + +[vision] +opencv-python>=4.8 +mediapipe>=0.10 diff --git a/main/cc_suggester.egg-info/top_level.txt b/main/cc_suggester.egg-info/top_level.txt new file mode 100644 index 0000000..4ebf94e --- /dev/null +++ b/main/cc_suggester.egg-info/top_level.txt @@ -0,0 +1 @@ +cc_suggester diff --git a/main/cc_suggester/__init__.py b/main/cc_suggester/__init__.py new file mode 100644 index 0000000..d525093 --- /dev/null +++ b/main/cc_suggester/__init__.py @@ -0,0 +1,3 @@ +"""Intelligent Closed Caption Suggestion Tool.""" + +__version__ = "0.1.0" diff --git a/main/cc_suggester/__main__.py b/main/cc_suggester/__main__.py new file mode 100644 index 0000000..ec49b7e --- /dev/null +++ b/main/cc_suggester/__main__.py @@ -0,0 +1,7 @@ +"""Run the CLI with ``python -m cc_suggester``.""" + +from cc_suggester.cli.app import main + + +if __name__ == "__main__": + raise SystemExit(main()) diff 
--git a/main/cc_suggester/audio/__init__.py b/main/cc_suggester/audio/__init__.py new file mode 100644 index 0000000..08392d6 --- /dev/null +++ b/main/cc_suggester/audio/__init__.py @@ -0,0 +1 @@ +"""Audio event detection modules.""" diff --git a/main/cc_suggester/audio/backends/__init__.py b/main/cc_suggester/audio/backends/__init__.py new file mode 100644 index 0000000..7866ed7 --- /dev/null +++ b/main/cc_suggester/audio/backends/__init__.py @@ -0,0 +1,30 @@ +"""Audio backend registry.""" + +from cc_suggester.audio.backends.base import AudioBackend +from cc_suggester.audio.backends.dsp import DspAudioBackend +from cc_suggester.audio.backends.mock import MockAudioBackend +from cc_suggester.audio.backends.unavailable import UnavailableAudioBackend +from cc_suggester.audio.backends.yamnet import YamnetAudioBackend + + +def get_audio_backend(name: str) -> AudioBackend: + """Return an audio backend by name.""" + + normalized = name.lower().strip() + if normalized in {"mock", "demo"}: + return MockAudioBackend() + if normalized in {"dsp", "energy"}: + return DspAudioBackend() + if normalized == "yamnet": + return YamnetAudioBackend() + if normalized == "panns": + return UnavailableAudioBackend("panns", "Install PyTorch PANNs dependencies and add checkpoint loading.") + if normalized == "ast": + return UnavailableAudioBackend("ast", "Install AST dependencies and add an AudioSet checkpoint.") + if normalized == "beats": + return UnavailableAudioBackend("beats", "Install BEATs dependencies and add model checkpoint loading.") + if normalized == "clap": + return UnavailableAudioBackend("clap", "Install CLAP dependencies for open-vocabulary matching.") + raise ValueError( + f"Unknown audio backend '{name}'. Available: mock, dsp, yamnet, panns, ast, beats, clap." 
+ ) diff --git a/main/cc_suggester/audio/backends/base.py b/main/cc_suggester/audio/backends/base.py new file mode 100644 index 0000000..09db990 --- /dev/null +++ b/main/cc_suggester/audio/backends/base.py @@ -0,0 +1,26 @@ +"""Audio backend interface.""" + +from __future__ import annotations + +from abc import ABC, abstractmethod +from pathlib import Path + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, VideoMetadata + + +class AudioBackend(ABC): + """Interface implemented by sound event detection backends.""" + + name: str + requires_audio_file: bool = False + requires_valid_media: bool = False + + @abstractmethod + def detect( + self, + video_path: Path, + metadata: VideoMetadata, + config: PipelineConfig, + ) -> list[AudioEventCandidate]: + """Detect non-speech audio event candidates.""" diff --git a/main/cc_suggester/audio/backends/dsp.py b/main/cc_suggester/audio/backends/dsp.py new file mode 100644 index 0000000..e1bfa1b --- /dev/null +++ b/main/cc_suggester/audio/backends/dsp.py @@ -0,0 +1,148 @@ +"""CPU-friendly DSP audio backend. + +This backend performs simple energy/onset style detection from a mono WAV file. +It is not a semantic classifier like YAMNet or PANNs, but it is useful as a real +offline baseline and as a candidate-region generator. 
+""" + +from __future__ import annotations + +import math +import statistics +from dataclasses import dataclass +from pathlib import Path + +from cc_suggester.audio.backends.base import AudioBackend +from cc_suggester.audio.extractor import extract_audio +from cc_suggester.audio.wav import load_wav_mono_pcm +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, VideoMetadata + + +@dataclass(slots=True) +class EnergyWindow: + """RMS summary for a short audio window.""" + + start: float + end: float + rms_norm: float + + +class DspAudioBackend(AudioBackend): + """Detect non-speech candidate regions using RMS energy windows.""" + + name = "dsp" + requires_audio_file = True + requires_valid_media = True + + def detect( + self, + video_path: Path, + metadata: VideoMetadata, + config: PipelineConfig, + ) -> list[AudioEventCandidate]: + audio_path = self._audio_path_for(video_path, config) + windows = _read_energy_windows(audio_path) + if not windows: + return [] + + values = [window.rms_norm for window in windows] + median = statistics.median(values) + peak = max(values) + adaptive_threshold = max(0.015, median * 3.0, peak * 0.32) + + active = [window for window in windows if window.rms_norm >= adaptive_threshold] + groups = _group_windows(active, max_gap=0.35) + events: list[AudioEventCandidate] = [] + for index, group in enumerate(groups, start=1): + start = group[0].start + end = group[-1].end + duration = end - start + peak_norm = max(window.rms_norm for window in group) + confidence = _confidence(peak_norm, adaptive_threshold) + if confidence < config.audio_threshold: + continue + event_id = _event_id_for(duration, peak_norm) + events.append( + AudioEventCandidate( + event_id=event_id, + label=event_id.replace("_", " ").title(), + start_time=round(start, 3), + end_time=round(end, 3), + audio_confidence=confidence, + audio_backend=self.name, + raw_class_name="RMS energy candidate", + debug_info={ + "audio_path": 
str(audio_path), + "window_index": index, + "rms_peak": round(peak_norm, 6), + "rms_median": round(median, 6), + "adaptive_threshold": round(adaptive_threshold, 6), + "duration": round(duration, 3), + }, + ) + ) + return events + + def _audio_path_for(self, video_path: Path, config: PipelineConfig) -> Path: + if config.sidecar_audio_path is not None: + return Path(config.sidecar_audio_path) + if video_path.suffix.lower() == ".wav": + return video_path + run_dir = config.run_dir or config.output_dir / video_path.stem + return extract_audio(video_path, run_dir / "artifacts" / "audio.wav") + + +def _read_energy_windows( + audio_path: Path, + *, + window_seconds: float = 0.50, + hop_seconds: float = 0.25, +) -> list[EnergyWindow]: + wav = load_wav_mono_pcm(audio_path) + samples = wav.samples + if not samples: + return [] + + window_samples = max(1, int(wav.sample_rate * window_seconds)) + hop_samples = max(1, int(wav.sample_rate * hop_seconds)) + max_amplitude = float(2 ** (8 * wav.sample_width - 1)) + + windows: list[EnergyWindow] = [] + for start_index in range(0, max(0, len(samples) - window_samples + 1), hop_samples): + chunk = samples[start_index : start_index + window_samples] + if len(chunk) < window_samples: + break + start = start_index / wav.sample_rate + end = start + window_seconds + rms = math.sqrt(sum(sample * sample for sample in chunk) / len(chunk)) + windows.append(EnergyWindow(start=start, end=end, rms_norm=rms / max_amplitude)) + return windows + + +def _group_windows(windows: list[EnergyWindow], max_gap: float) -> list[list[EnergyWindow]]: + if not windows: + return [] + groups: list[list[EnergyWindow]] = [[windows[0]]] + for window in windows[1:]: + previous = groups[-1][-1] + if window.start - previous.end <= max_gap: + groups[-1].append(window) + else: + groups.append([window]) + return groups + + +def _confidence(peak_norm: float, threshold: float) -> float: + if threshold <= 0: + return 0.0 + ratio = peak_norm / threshold + return 
round(max(0.0, min(0.99, 0.35 + ratio * 0.28)), 3) + + +def _event_id_for(duration: float, peak_norm: float) -> str: + if duration <= 0.85 and peak_norm >= 0.08: + return "impact_sound" + if duration >= 3.0: + return "sustained_sound" + return "loud_sound" diff --git a/main/cc_suggester/audio/backends/mock.py b/main/cc_suggester/audio/backends/mock.py new file mode 100644 index 0000000..9a1956c --- /dev/null +++ b/main/cc_suggester/audio/backends/mock.py @@ -0,0 +1,68 @@ +"""Deterministic demo audio backend. + +This backend keeps the first scaffold runnable without large model downloads. It +will be replaced by YAMNet/PANNs/AST/BEATs implementations through the same +interface. +""" + +from __future__ import annotations + +from pathlib import Path + +from cc_suggester.audio.backends.base import AudioBackend +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, VideoMetadata + + +class MockAudioBackend(AudioBackend): + """Return classroom-style non-speech events for pipeline testing.""" + + name = "mock" + + def detect( + self, + video_path: Path, + metadata: VideoMetadata, + config: PipelineConfig, + ) -> list[AudioEventCandidate]: + duration = metadata.duration or 402.0 + anchors = [ + (0.34, "children_cheer", "Children cheering", 0.91), + (0.43, "school_bell", "School bell", 0.86), + (0.54, "applause", "Applause", 0.74), + (0.71, "chair_scrape", "Chair scrape", 0.58), + (0.81, "background_chatter", "Background chatter", 0.52), + ] + events: list[AudioEventCandidate] = [] + for ratio, event_id, label, confidence in anchors: + start = max(0.0, min(duration - 1.0, duration * ratio)) + end = min(duration, start + _duration_for(event_id)) + if confidence < config.audio_threshold: + continue + events.append( + AudioEventCandidate( + event_id=event_id, + label=label, + start_time=round(start, 3), + end_time=round(end, 3), + audio_confidence=confidence, + audio_backend=self.name, + raw_class_name=label, + 
debug_info={ + "source": "deterministic mock backend", + "input_name": video_path.name, + }, + ) + ) + return events + + +def _duration_for(event_id: str) -> float: + durations = { + "children_cheer": 2.1, + "school_bell": 1.6, + "applause": 3.3, + "chair_scrape": 1.1, + "background_chatter": 7.5, + } + return durations.get(event_id, 1.5) diff --git a/main/cc_suggester/audio/backends/unavailable.py b/main/cc_suggester/audio/backends/unavailable.py new file mode 100644 index 0000000..6c92886 --- /dev/null +++ b/main/cc_suggester/audio/backends/unavailable.py @@ -0,0 +1,37 @@ +"""Registered placeholders for planned advanced audio backends.""" + +from __future__ import annotations + +from pathlib import Path + +from cc_suggester.audio.backends.base import AudioBackend +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.errors import BackendUnavailableError +from cc_suggester.core.types import AudioEventCandidate, VideoMetadata + + +class UnavailableAudioBackend(AudioBackend): + """Backend placeholder that explains how to proceed.""" + + requires_audio_file = True + requires_valid_media = True + + def __init__(self, name: str, install_hint: str) -> None: + self.name = name + self.install_hint = install_hint + + def detect( + self, + video_path: Path, + metadata: VideoMetadata, + config: PipelineConfig, + ) -> list[AudioEventCandidate]: + raise BackendUnavailableError( + message=f"The {self.name} backend is registered but not implemented in this environment yet.", + code=f"{self.name}_not_installed", + suggestions=[ + "Use --audio-backend dsp for a real offline CPU baseline.", + "Use --audio-backend mock for deterministic demos/tests.", + self.install_hint, + ], + ) diff --git a/main/cc_suggester/audio/backends/yamnet.py b/main/cc_suggester/audio/backends/yamnet.py new file mode 100644 index 0000000..78e6df8 --- /dev/null +++ b/main/cc_suggester/audio/backends/yamnet.py @@ -0,0 +1,190 @@ +"""Optional YAMNet sound event detection backend.""" + 
+from __future__ import annotations + +import csv +import os +from pathlib import Path +from typing import Any, Sequence + +from cc_suggester.audio.backends.base import AudioBackend +from cc_suggester.audio.extractor import extract_audio +from cc_suggester.audio.label_mapping import normalize_sound_label +from cc_suggester.audio.wav import load_wav_mono_float32 +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.errors import BackendUnavailableError +from cc_suggester.core.types import AudioEventCandidate, VideoMetadata + + +DEFAULT_YAMNET_HANDLE = "https://tfhub.dev/google/yamnet/1" +YAMNET_SAMPLE_RATE = 16000 +YAMNET_FRAME_HOP_SECONDS = 0.48 +YAMNET_FRAME_DURATION_SECONDS = 0.96 + + +class YamnetAudioBackend(AudioBackend): + """Classify non-speech events using TensorFlow Hub YAMNet when installed.""" + + name = "yamnet" + requires_audio_file = True + requires_valid_media = True + + def detect( + self, + video_path: Path, + metadata: VideoMetadata, + config: PipelineConfig, + ) -> list[AudioEventCandidate]: + tf, hub, np = _import_dependencies() + audio_path = _audio_path_for(video_path, config) + waveform = load_wav_mono_float32(audio_path, target_sample_rate=YAMNET_SAMPLE_RATE) + if not waveform: + return [] + + model_handle = config.yamnet_model or os.environ.get("CCS_YAMNET_MODEL") or DEFAULT_YAMNET_HANDLE + try: + model = hub.load(model_handle) + except Exception as exc: + raise BackendUnavailableError( + message=f"YAMNet model could not be loaded from: {model_handle}", + code="yamnet_model_load_failed", + suggestions=[ + "Use --audio-backend dsp for an offline CPU baseline.", + "Set CCS_YAMNET_MODEL to a local TensorFlow Hub YAMNet model directory.", + "Ensure internet/model cache access is available if using the default TF Hub handle.", + ], + details={"model_handle": model_handle, "error": str(exc)}, + ) from exc + + waveform_tensor = tf.convert_to_tensor(waveform, dtype=tf.float32) + try: + scores, _embeddings, _spectrogram = 
model(waveform_tensor) + except Exception as exc: + raise BackendUnavailableError( + message="YAMNet inference failed.", + code="yamnet_inference_failed", + suggestions=[ + "Verify the input audio is mono 16 kHz WAV or extractable video audio.", + "Try --audio-backend dsp to confirm audio extraction works.", + ], + details={"error": str(exc)}, + ) from exc + + class_names = _load_class_names(model, config.yamnet_class_map_path) + scores_array = scores.numpy() if hasattr(scores, "numpy") else np.asarray(scores) + return _events_from_scores( + scores_array=scores_array, + class_names=class_names, + audio_path=audio_path, + config=config, + ) + + +def _audio_path_for(video_path: Path, config: PipelineConfig) -> Path: + if config.sidecar_audio_path is not None: + return Path(config.sidecar_audio_path) + if video_path.suffix.lower() == ".wav": + return video_path + run_dir = config.run_dir or config.output_dir / video_path.stem + return extract_audio(video_path, run_dir / "artifacts" / "audio.wav") + + +def _import_dependencies(): + try: + import numpy as np # type: ignore + import tensorflow as tf # type: ignore + import tensorflow_hub as hub # type: ignore + except Exception as exc: + raise BackendUnavailableError( + message="The YAMNet backend requires TensorFlow, TensorFlow Hub, and NumPy.", + code="yamnet_dependencies_missing", + suggestions=[ + "Install audio dependencies: pip install -r requirements-audio.txt", + "Use --audio-backend dsp for an offline CPU baseline.", + "Use --audio-backend mock for deterministic demos/tests.", + ], + details={"error": str(exc)}, + ) from exc + return tf, hub, np + + +def _load_class_names(model: Any, class_map_path: Path | None) -> list[str]: + path = class_map_path + if path is None and hasattr(model, "class_map_path"): + raw_path = model.class_map_path() + if hasattr(raw_path, "numpy"): + raw_path = raw_path.numpy() + if isinstance(raw_path, bytes): + raw_path = raw_path.decode("utf-8") + path = Path(str(raw_path)) + + if path 
is None: + return [] + + with Path(path).open("r", newline="", encoding="utf-8") as file_obj: + reader = csv.DictReader(file_obj) + class_names: list[str] = [] + for row in reader: + class_names.append(row.get("display_name") or row.get("name") or row.get("label") or "") + return class_names + + +def _events_from_scores( + *, + scores_array: Sequence[Sequence[float]], + class_names: list[str], + audio_path: Path, + config: PipelineConfig, +) -> list[AudioEventCandidate]: + events: list[AudioEventCandidate] = [] + for frame_index, frame_scores in enumerate(scores_array): + scored_classes = _top_scored_classes(frame_scores, class_names, top_k=config.yamnet_top_k) + event_scores: dict[str, tuple[float, str]] = {} + for class_name, score in scored_classes: + if score < config.audio_threshold: + continue + event_id = normalize_sound_label(class_name) + if event_id is None: + continue + existing = event_scores.get(event_id) + if existing is None or score > existing[0]: + event_scores[event_id] = (score, class_name) + + for event_id, (score, class_name) in event_scores.items(): + start = frame_index * YAMNET_FRAME_HOP_SECONDS + end = start + YAMNET_FRAME_DURATION_SECONDS + events.append( + AudioEventCandidate( + event_id=event_id, + label=event_id.replace("_", " ").title(), + start_time=round(start, 3), + end_time=round(end, 3), + audio_confidence=round(float(score), 3), + audio_backend="yamnet", + raw_class_name=class_name, + debug_info={ + "audio_path": str(audio_path), + "frame_index": frame_index, + "yamnet_frame_hop_seconds": YAMNET_FRAME_HOP_SECONDS, + "yamnet_frame_duration_seconds": YAMNET_FRAME_DURATION_SECONDS, + }, + ) + ) + return events + + +def _top_scored_classes( + frame_scores: Sequence[float], + class_names: list[str], + *, + top_k: int, +) -> list[tuple[str, float]]: + indexed = sorted(enumerate(frame_scores), key=lambda item: float(item[1]), reverse=True) + output: list[tuple[str, float]] = [] + for class_index, score in indexed[:top_k]: + if 
class_index < len(class_names): + class_name = class_names[class_index] + else: + class_name = f"class_{class_index}" + output.append((class_name, float(score))) + return output diff --git a/main/cc_suggester/audio/dsp.py b/main/cc_suggester/audio/dsp.py new file mode 100644 index 0000000..75e62b7 --- /dev/null +++ b/main/cc_suggester/audio/dsp.py @@ -0,0 +1,27 @@ +"""Placeholder DSP feature definitions for the first scaffold.""" + +from __future__ import annotations + +from dataclasses import dataclass + + +@dataclass(slots=True) +class DspFeatureSummary: + """Small explainability summary for future DSP extraction.""" + + rms_energy: float + spectral_flux: float + onset_strength: float + + +def describe_planned_features() -> list[str]: + """Return the DSP features planned for the first real audio backend.""" + + return [ + "RMS energy", + "short-time Fourier transform", + "log-mel spectrogram", + "spectral flux", + "onset strength", + "zero-crossing rate", + ] diff --git a/main/cc_suggester/audio/events.py b/main/cc_suggester/audio/events.py new file mode 100644 index 0000000..0e68134 --- /dev/null +++ b/main/cc_suggester/audio/events.py @@ -0,0 +1,33 @@ +"""Post-processing helpers for audio events.""" + +from __future__ import annotations + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate + + +def smooth_events( + events: list[AudioEventCandidate], + config: PipelineConfig, +) -> list[AudioEventCandidate]: + """Merge adjacent same-label events and remove very short events.""" + + filtered = [ + event + for event in sorted(events, key=lambda item: (item.start_time, item.end_time)) + if event.end_time - event.start_time >= config.min_event_duration + ] + if not filtered: + return [] + + merged: list[AudioEventCandidate] = [filtered[0]] + for event in filtered[1:]: + previous = merged[-1] + gap = event.start_time - previous.end_time + if event.event_id == previous.event_id and gap <= config.merge_gap: + 
previous.end_time = max(previous.end_time, event.end_time) + previous.audio_confidence = max(previous.audio_confidence, event.audio_confidence) + previous.debug_info["merged"] = True + else: + merged.append(event) + return merged diff --git a/main/cc_suggester/audio/extractor.py b/main/cc_suggester/audio/extractor.py new file mode 100644 index 0000000..7455428 --- /dev/null +++ b/main/cc_suggester/audio/extractor.py @@ -0,0 +1,53 @@ +"""Audio extraction helpers for real model backends.""" + +from __future__ import annotations + +import shutil +import subprocess +from pathlib import Path + +from cc_suggester.core.errors import AudioExtractionError, BackendUnavailableError + + +def extract_audio(video_path: Path, output_path: Path, sample_rate: int = 16000) -> Path: + """Extract mono WAV audio using ffmpeg.""" + + ffmpeg = shutil.which("ffmpeg") + if ffmpeg is None: + raise BackendUnavailableError( + message="ffmpeg was not found, so audio extraction cannot run.", + code="ffmpeg_missing", + suggestions=[ + "Install ffmpeg and ensure it is on PATH.", + "Run ccs doctor to verify the environment.", + ], + ) + output_path.parent.mkdir(parents=True, exist_ok=True) + command = [ + ffmpeg, + "-y", + "-i", + str(video_path), + "-ac", + "1", + "-ar", + str(sample_rate), + str(output_path), + ] + try: + subprocess.run(command, capture_output=True, check=True, text=True) + except subprocess.CalledProcessError as exc: + raise AudioExtractionError( + message="ffmpeg failed while extracting audio.", + code="audio_extraction_failed", + suggestions=[ + "Run ccs inspect on the input file.", + "Try a different video container or re-encode the video.", + "Run ccs doctor to verify ffmpeg availability.", + ], + details={ + "stderr": exc.stderr[-1000:] if exc.stderr else "", + "returncode": exc.returncode, + }, + ) from exc + return output_path diff --git a/main/cc_suggester/audio/label_mapping.py b/main/cc_suggester/audio/label_mapping.py new file mode 100644 index 0000000..12f28f5 --- 
/dev/null +++ b/main/cc_suggester/audio/label_mapping.py @@ -0,0 +1,45 @@ +"""Normalize model-specific sound labels into project event IDs.""" + +from __future__ import annotations + + +LABEL_RULES: tuple[tuple[str, tuple[str, ...]], ...] = ( + ("horn_honk", ("vehicle horn", "car horn", "honking", "horn")), + ("glass_break", ("glass", "shatter", "breaking")), + ("crowd_cheer", ("cheering", "cheer", "crowd cheering")), + ("applause", ("applause", "clapping")), + ("laughter", ("laughter", "laughing", "giggle")), + ("music", ("music", "song", "singing", "musical")), + ("alarm", ("alarm", "beep", "buzzer")), + ("siren", ("siren", "police car", "ambulance")), + ("explosion", ("explosion", "blast", "boom")), + ("gunshot", ("gunshot", "gunfire", "shooting")), + ("door_slam", ("door", "slam", "knock")), + ("phone_ring", ("telephone", "ringtone", "ringing", "phone")), + ("dog_bark", ("bark", "dog")), +) + + +def normalize_sound_label(label: str) -> str | None: + """Map an AudioSet/YAMNet label to an internal event ID.""" + + normalized = label.lower().replace("_", " ") + for event_id, needles in LABEL_RULES: + required = _required_tokens(event_id) + if required and all(_matches(normalized, token) for token in required): + return event_id + if any(needle in normalized for needle in needles): + return event_id + return None + + +def _matches(label: str, token: str) -> bool: + return token in label + + +def _required_tokens(event_id: str) -> tuple[str, ...]: + if event_id == "glass_break": + return ("glass",) + if event_id == "door_slam": + return ("door",) + return () diff --git a/main/cc_suggester/audio/vad.py b/main/cc_suggester/audio/vad.py new file mode 100644 index 0000000..6cff096 --- /dev/null +++ b/main/cc_suggester/audio/vad.py @@ -0,0 +1,9 @@ +"""Voice activity masking placeholder.""" + +from __future__ import annotations + + +def is_speech_masking_available() -> bool: + """Return whether a real VAD backend has been configured.""" + + return False diff --git 
a/main/cc_suggester/audio/wav.py b/main/cc_suggester/audio/wav.py new file mode 100644 index 0000000..95abd5d --- /dev/null +++ b/main/cc_suggester/audio/wav.py @@ -0,0 +1,89 @@ +"""Small WAV loading helpers shared by audio backends.""" + +from __future__ import annotations + +import struct +import wave +from dataclasses import dataclass +from pathlib import Path + + +@dataclass(slots=True) +class WavPcm: + """Decoded mono PCM WAV samples.""" + + sample_rate: int + sample_width: int + samples: list[int] + + +def load_wav_mono_pcm(path: Path) -> WavPcm: + """Load a WAV file as mono integer PCM samples.""" + + with wave.open(str(path), "rb") as wav: + sample_rate = wav.getframerate() + sample_width = wav.getsampwidth() + channels = wav.getnchannels() + frames = wav.readframes(wav.getnframes()) + samples = _decode_pcm(frames, sample_width, channels) + return WavPcm(sample_rate=sample_rate, sample_width=sample_width, samples=samples) + + +def load_wav_mono_float32(path: Path, target_sample_rate: int = 16000) -> list[float]: + """Load WAV samples normalized to [-1, 1] and resampled if required.""" + + wav = load_wav_mono_pcm(path) + max_amplitude = float(2 ** (8 * wav.sample_width - 1)) + floats = [max(-1.0, min(1.0, sample / max_amplitude)) for sample in wav.samples] + if wav.sample_rate != target_sample_rate: + floats = _resample_linear(floats, wav.sample_rate, target_sample_rate) + return floats + + +def _decode_pcm(frames: bytes, sample_width: int, channels: int) -> list[int]: + if sample_width == 1: + values = [byte - 128 for byte in frames] + elif sample_width == 2: + count = len(frames) // 2 + values = list(struct.unpack(f"<{count}h", frames[: count * 2])) + elif sample_width == 4: + count = len(frames) // 4 + values = list(struct.unpack(f"<{count}i", frames[: count * 4])) + elif sample_width == 3: + values = [_decode_24bit(frames[index : index + 3]) for index in range(0, len(frames) - 2, 3)] + else: + return [] + + if channels <= 1: + return values + + mono: 
list[int] = [] + for index in range(0, len(values) - channels + 1, channels): + mono.append(int(sum(values[index : index + channels]) / channels)) + return mono + + +def _decode_24bit(chunk: bytes) -> int: + padded = chunk + (b"\xff" if chunk[2] & 0x80 else b"\x00") + return struct.unpack("<i", padded)[0] + + +def _resample_linear(samples: list[float], source_rate: int, target_rate: int) -> list[float]: + if not samples or source_rate <= 0 or target_rate <= 0: + return samples + if source_rate == target_rate: + return samples + + target_len = max(1, int(len(samples) * target_rate / source_rate)) + if target_len == 1: + return [samples[0]] + + scale = (len(samples) - 1) / (target_len - 1) + output: list[float] = [] + for index in range(target_len): + source_pos = index * scale + left = int(source_pos) + right = min(left + 1, len(samples) - 1) + fraction = source_pos - left + output.append(samples[left] * (1.0 - fraction) + samples[right] * fraction) + return output diff --git a/main/cc_suggester/cli/__init__.py b/main/cc_suggester/cli/__init__.py new file mode 100644 index 0000000..429eb26 --- /dev/null +++ b/main/cc_suggester/cli/__init__.py @@ -0,0 +1 @@ +"""Command-line interface.""" diff --git a/main/cc_suggester/cli/app.py b/main/cc_suggester/cli/app.py new file mode 100644 index 0000000..044280a --- /dev/null +++ b/main/cc_suggester/cli/app.py @@ -0,0 +1,313 @@ +"""CLI entrypoint for the Intelligent CC Suggestion Tool.""" + +from __future__ import annotations + +import argparse +import difflib +import sys +from pathlib import Path +from typing import Sequence + +from cc_suggester import __version__ +from cc_suggester.core.config import ( + SUPPORTED_DEVICES, + SUPPORTED_LANGUAGES, + PipelineConfig, + load_config, + merge_config, +) +from cc_suggester.core.diagnostics import run_diagnostics +from cc_suggester.core.errors import CCSuggesterError +from cc_suggester.core.media import inspect_video +from cc_suggester.core.pipeline import analyze_video, detect_audio_events, export_from_report, score_visual_reactions +from cc_suggester.translation.glossary import 
supported_event_ids + + +COMMANDS = ("analyze", "audio", "vision", "inspect", "doctor", "export", "labels", "web") + + +class FriendlyParser(argparse.ArgumentParser): + """ArgumentParser that raises instead of exiting mid-flow.""" + + def error(self, message: str) -> None: + raise ValueError(message) + + +def main(argv: Sequence[str] | None = None) -> int: + """Run the CLI.""" + + args = list(sys.argv[1:] if argv is None else argv) + if args and not args[0].startswith("-") and args[0] not in COMMANDS: + return _unknown_command(args[0]) + + parser = _build_parser() + try: + namespace = parser.parse_args(args) + if not hasattr(namespace, "handler"): + parser.print_help() + return 0 + return int(namespace.handler(namespace)) + except CCSuggesterError as exc: + _print_friendly_error(exc) + return 2 + except ValueError as exc: + print(f"Command error: {exc}", file=sys.stderr) + print("\nTry:", file=sys.stderr) + print(" ccs --help", file=sys.stderr) + return 2 + + +def _build_parser() -> FriendlyParser: + parser = FriendlyParser( + prog="ccs", + description="Generate meaningful non-speech closed caption suggestions from video.", + ) + parser.add_argument("--version", action="version", version=f"ccs {__version__}") + subparsers = parser.add_subparsers(dest="command", parser_class=FriendlyParser) + + analyze = subparsers.add_parser("analyze", help="Run the full CC suggestion pipeline.") + analyze.add_argument("input", type=Path, help="Input video path.") + _add_pipeline_args(analyze) + analyze.set_defaults(handler=_handle_analyze) + + audio = subparsers.add_parser("audio", help="Run only audio event detection.") + audio.add_argument("input", type=Path, help="Input video or WAV path.") + audio.add_argument("--config", type=Path, default=None, help="JSON config file.") + audio.add_argument("--device", default=None, choices=SUPPORTED_DEVICES, help="Device mode.") + audio.add_argument("--audio-backend", default=None, help="Audio backend name.") + audio.add_argument("--out", 
default=None, type=Path, help="Output root directory.") + audio.add_argument("--audio-threshold", default=None, type=float, help="Audio event threshold.") + audio.add_argument("--audio-path", default=None, type=Path, help="Optional sidecar WAV audio path.") + audio.add_argument("--yamnet-model", default=None, help="YAMNet TF Hub handle or local model directory.") + audio.add_argument("--yamnet-class-map", default=None, type=Path, help="YAMNet class map CSV path.") + audio.add_argument("--yamnet-top-k", default=None, type=int, help="Top YAMNet classes to inspect per frame.") + audio.add_argument("--allow-demo-input", action="store_true", help="Allow non-video demo files.") + audio.set_defaults(handler=_handle_audio) + + vision = subparsers.add_parser("vision", help="Run visual reaction scoring from audio event JSON.") + vision.add_argument("input", type=Path, help="Input video path.") + vision.add_argument("audio_report", type=Path, help="audio_events.json or results.json path.") + vision.add_argument("--config", type=Path, default=None, help="JSON config file.") + vision.add_argument("--device", default=None, choices=SUPPORTED_DEVICES, help="Device mode.") + vision.add_argument("--vision-backend", default=None, help="Vision backend name.") + vision.add_argument("--out", default=None, type=Path, help="Output root directory.") + vision.add_argument("--allow-demo-input", action="store_true", help="Allow probe fallback for demo media.") + vision.set_defaults(handler=_handle_vision) + + inspect = subparsers.add_parser("inspect", help="Inspect video metadata.") + inspect.add_argument("input", type=Path, help="Input video path.") + inspect.set_defaults(handler=_handle_inspect) + + doctor = subparsers.add_parser("doctor", help="Check ffmpeg, Python, and CPU/GPU status.") + doctor.add_argument("--device", default="auto", choices=SUPPORTED_DEVICES, help="Device mode to validate.") + doctor.set_defaults(handler=_handle_doctor) + + export = subparsers.add_parser("export", 
help="Export SRT from a JSON result report.") + export.add_argument("report", type=Path, help="Pipeline results.json file.") + export.add_argument("--format", default="srt", choices=("srt",), help="Export format.") + export.add_argument("--lang", default="en", choices=SUPPORTED_LANGUAGES, help="Caption label language.") + export.add_argument("--out", type=Path, default=None, help="Output SRT path.") + export.set_defaults(handler=_handle_export) + + labels = subparsers.add_parser("labels", help="List supported languages and event label IDs.") + labels.set_defaults(handler=_handle_labels) + + web = subparsers.add_parser("web", help="Show how to launch the planned Web UI.") + web.set_defaults(handler=_handle_web) + return parser + + +def _handle_analyze(args: argparse.Namespace) -> int: + config = _config_from_args(args) + result = analyze_video(args.input, config) + accepted = sum(1 for item in result.suggestions if item.accepted) + review = sum(1 for item in result.suggestions if item.requires_review) + rejected = len(result.suggestions) - accepted - review + + print("Analysis complete.") + print(f"Input: {result.input_path}") + print(f"Output directory: {result.output_dir}") + print(f"Device used: {result.diagnostics.actual_device}") + print(f"Events: {len(result.audio_events)} detected, {accepted} accepted, {review} review, {rejected} rejected") + for name, path in result.files.items(): + print(f"{name}: {path}") + return 0 + + +def _handle_audio(args: argparse.Namespace) -> int: + base = load_config(args.config) if args.config else PipelineConfig() + config = merge_config( + base, + device=args.device, + audio_backend=args.audio_backend, + output_dir=args.out, + audio_threshold=args.audio_threshold, + sidecar_audio_path=args.audio_path, + yamnet_model=args.yamnet_model, + yamnet_class_map_path=args.yamnet_class_map, + yamnet_top_k=args.yamnet_top_k, + allow_demo_input=args.allow_demo_input or None, + ) + payload = detect_audio_events(args.input, config) + events 
= payload["audio_events"] + files = payload.get("files", {}) + print("Audio detection complete.") + print(f"Input: {payload['input_path']}") + print(f"Events: {len(events)}") + if isinstance(files, dict): + for name, path in files.items(): + print(f"{name}: {path}") + return 0 + + +def _handle_vision(args: argparse.Namespace) -> int: + base = load_config(args.config) if args.config else PipelineConfig() + config = merge_config( + base, + device=args.device, + vision_backend=args.vision_backend, + output_dir=args.out, + allow_demo_input=args.allow_demo_input or None, + ) + payload = score_visual_reactions(args.input, args.audio_report, config) + reactions = payload["reactions"] + files = payload.get("files", {}) + print("Visual reaction scoring complete.") + print(f"Input: {payload['input_path']}") + print(f"Audio report: {payload['audio_report_path']}") + print(f"Reactions: {len(reactions)}") + if isinstance(files, dict): + for name, path in files.items(): + print(f"{name}: {path}") + return 0 + + +def _handle_inspect(args: argparse.Namespace) -> int: + metadata = inspect_video(args.input) + print(f"Path: {metadata.path}") + print(f"Exists: {metadata.exists}") + print(f"Size: {metadata.size_bytes}") + print(f"Container: {metadata.container}") + print(f"Duration: {metadata.duration}") + print(f"FPS: {metadata.fps}") + print(f"Resolution: {_format_resolution(metadata.width, metadata.height)}") + print(f"Has audio: {metadata.has_audio}") + if metadata.probe_error: + print(f"Probe warning: {metadata.probe_error}") + return 0 + + +def _handle_doctor(args: argparse.Namespace) -> int: + config = PipelineConfig(device=args.device) + diagnostics = run_diagnostics(config) + print("Environment diagnostics") + print(f"Python: {diagnostics.python_version}") + print(f"ffmpeg: {diagnostics.ffmpeg_path or 'not found'}") + print(f"ffprobe: {diagnostics.ffprobe_path or 'not found'}") + print(f"Torch available: {diagnostics.torch_available}") + print(f"CUDA available: 
{diagnostics.cuda_available}") + print(f"Selected device: {diagnostics.selected_device}") + print(f"Actual device: {diagnostics.actual_device}") + print(f"GPU: {diagnostics.gpu_name or 'none'}") + if diagnostics.fallback_reason: + print(f"Fallback: {diagnostics.fallback_reason}") + for warning in diagnostics.warnings: + print(f"Warning: {warning}") + return 0 + + +def _handle_export(args: argparse.Namespace) -> int: + output_path = args.out or args.report.with_name(f"captions.{args.lang}.srt") + written = export_from_report(args.report, output_path, args.lang) + print(f"Exported {args.format.upper()}: {written}") + return 0 + + +def _handle_labels(args: argparse.Namespace) -> int: + print("Supported languages:") + print(" " + ", ".join(SUPPORTED_LANGUAGES)) + print("Supported event IDs:") + for event_id in supported_event_ids(): + print(f" {event_id}") + return 0 + + +def _handle_web(args: argparse.Namespace) -> int: + app_path = Path(__file__).resolve().parents[1] / "ui" / "streamlit_app.py" + mockup_path = Path(__file__).resolve().parents[3] / "mockups" / "web-ui.html" + print("The planned Web UI will use the same core pipeline modules as the CLI.") + print("Run:") + print(f" streamlit run {app_path}") + print("\nInteractive HTML mockup:") + print(f" {mockup_path}") + return 0 + + +def _unknown_command(command: str) -> int: + suggestion = difflib.get_close_matches(command, COMMANDS, n=1) + print(f"No such command: {command}", file=sys.stderr) + if suggestion: + print(f"Did you mean: {suggestion[0]}?", file=sys.stderr) + print("\nTry:", file=sys.stderr) + print(" ccs analyze input.mp4 --device auto --lang hi", file=sys.stderr) + print(" ccs doctor", file=sys.stderr) + return 2 + + +def _print_friendly_error(error: CCSuggesterError) -> None: + print(error.message, file=sys.stderr) + if error.suggestions: + print("\nSuggestions:", file=sys.stderr) + for index, suggestion in enumerate(error.suggestions, start=1): + print(f"{index}. 
{suggestion}", file=sys.stderr) + if error.details: + print("\nDetails:", file=sys.stderr) + for key, value in error.details.items(): + print(f"- {key}: {value}", file=sys.stderr) + + +def _format_resolution(width: int | None, height: int | None) -> str: + if width is None or height is None: + return "unknown" + return f"{width} x {height}" + + +def _add_pipeline_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument("--config", type=Path, default=None, help="JSON config file.") + parser.add_argument("--lang", default=None, choices=SUPPORTED_LANGUAGES, help="Caption label language.") + parser.add_argument("--device", default=None, choices=SUPPORTED_DEVICES, help="Device mode.") + parser.add_argument("--audio-backend", default=None, help="Audio backend name.") + parser.add_argument("--vision-backend", default=None, help="Vision backend name.") + parser.add_argument("--out", default=None, type=Path, help="Output root directory.") + parser.add_argument("--audio-threshold", default=None, type=float, help="Audio event threshold.") + parser.add_argument("--audio-path", default=None, type=Path, help="Optional sidecar WAV audio path.") + parser.add_argument("--yamnet-model", default=None, help="YAMNet TF Hub handle or local model directory.") + parser.add_argument("--yamnet-class-map", default=None, type=Path, help="YAMNet class map CSV path.") + parser.add_argument("--yamnet-top-k", default=None, type=int, help="Top YAMNet classes to inspect per frame.") + parser.add_argument("--decision-threshold", default=None, type=float, help="Accept threshold.") + parser.add_argument("--review-threshold", default=None, type=float, help="Review threshold.") + parser.add_argument("--allow-demo-input", action="store_true", help="Allow non-video demo files.") + + +def _config_from_args(args: argparse.Namespace) -> PipelineConfig: + base = load_config(args.config) if args.config else PipelineConfig() + return merge_config( + base, + language=args.lang, + device=args.device, 
+ audio_backend=args.audio_backend, + vision_backend=args.vision_backend, + output_dir=args.out, + audio_threshold=args.audio_threshold, + sidecar_audio_path=args.audio_path, + yamnet_model=args.yamnet_model, + yamnet_class_map_path=args.yamnet_class_map, + yamnet_top_k=args.yamnet_top_k, + decision_threshold=args.decision_threshold, + review_threshold=args.review_threshold, + allow_demo_input=args.allow_demo_input or None, + ) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/main/cc_suggester/core/__init__.py b/main/cc_suggester/core/__init__.py new file mode 100644 index 0000000..ede8db5 --- /dev/null +++ b/main/cc_suggester/core/__init__.py @@ -0,0 +1 @@ +"""Core pipeline contracts and orchestration.""" diff --git a/main/cc_suggester/core/config.py b/main/cc_suggester/core/config.py new file mode 100644 index 0000000..2a908fd --- /dev/null +++ b/main/cc_suggester/core/config.py @@ -0,0 +1,80 @@ +"""Configuration for pipeline runs.""" + +from __future__ import annotations + +from dataclasses import asdict, dataclass +import json +from pathlib import Path +from typing import Any + + +SUPPORTED_LANGUAGES = ("en", "hi", "ta", "te", "bn", "mr", "ml") +SUPPORTED_DEVICES = ("auto", "cpu", "cuda") + + +@dataclass(slots=True) +class PipelineConfig: + """Runtime configuration shared by CLI, UI, and future integrations.""" + + language: str = "en" + device: str = "auto" + audio_backend: str = "mock" + vision_backend: str = "mock" + output_dir: Path = Path("outputs") + sidecar_audio_path: Path | None = None + yamnet_model: str | None = None + yamnet_class_map_path: Path | None = None + yamnet_top_k: int = 5 + audio_threshold: float = 0.45 + reaction_threshold: float = 0.35 + decision_threshold: float = 0.65 + review_threshold: float = 0.50 + min_event_duration: float = 0.25 + merge_gap: float = 0.40 + sample_window_before: float = 1.0 + sample_window_after: float = 1.0 + write_rejected_to_reports: bool = True + allow_demo_input: bool = False + 
run_dir: Path | None = None + + def __post_init__(self) -> None: + if self.language not in SUPPORTED_LANGUAGES: + supported = ", ".join(SUPPORTED_LANGUAGES) + raise ValueError(f"Unsupported language '{self.language}'. Supported: {supported}") + if self.device not in SUPPORTED_DEVICES: + supported = ", ".join(SUPPORTED_DEVICES) + raise ValueError(f"Unsupported device '{self.device}'. Supported: {supported}") + self.output_dir = Path(self.output_dir) + if self.run_dir is not None: + self.run_dir = Path(self.run_dir) + if self.sidecar_audio_path is not None: + self.sidecar_audio_path = Path(self.sidecar_audio_path) + if self.yamnet_class_map_path is not None: + self.yamnet_class_map_path = Path(self.yamnet_class_map_path) + + def to_dict(self) -> dict[str, Any]: + data = asdict(self) + data["output_dir"] = str(self.output_dir) + data["run_dir"] = str(self.run_dir) if self.run_dir else None + data["sidecar_audio_path"] = str(self.sidecar_audio_path) if self.sidecar_audio_path else None + data["yamnet_class_map_path"] = str(self.yamnet_class_map_path) if self.yamnet_class_map_path else None + return data + + +def load_config(path: Path) -> PipelineConfig: + """Load a JSON config file.""" + + path = Path(path) + payload = json.loads(path.read_text(encoding="utf-8")) + return PipelineConfig(**payload) + + +def merge_config(base: PipelineConfig, **overrides: Any) -> PipelineConfig: + """Return a config with non-None overrides applied.""" + + data = base.to_dict() + data.pop("run_dir", None) + for key, value in overrides.items(): + if value is not None: + data[key] = value + return PipelineConfig(**data) diff --git a/main/cc_suggester/core/diagnostics.py b/main/cc_suggester/core/diagnostics.py new file mode 100644 index 0000000..4cd6f1d --- /dev/null +++ b/main/cc_suggester/core/diagnostics.py @@ -0,0 +1,80 @@ +"""Environment and device diagnostics.""" + +from __future__ import annotations + +import platform +import shutil +from typing import Any + +from 
cc_suggester.core.config import PipelineConfig +from cc_suggester.core.errors import DeviceUnavailableError +from cc_suggester.core.types import DiagnosticsReport + + +def _torch_status() -> tuple[bool, bool, str | None]: + try: + import torch # type: ignore + except Exception: + return False, False, None + + cuda_available = bool(torch.cuda.is_available()) + gpu_name = None + if cuda_available: + try: + gpu_name = str(torch.cuda.get_device_name(0)) + except Exception: + gpu_name = "CUDA device" + return True, cuda_available, gpu_name + + +def run_diagnostics(config: PipelineConfig) -> DiagnosticsReport: + """Collect environment details and resolve the actual processing device.""" + + torch_available, cuda_available, gpu_name = _torch_status() + ffmpeg_path = shutil.which("ffmpeg") + ffprobe_path = shutil.which("ffprobe") + warnings: list[str] = [] + + if ffmpeg_path is None: + warnings.append("ffmpeg was not found; real video/audio extraction will fail.") + if ffprobe_path is None: + warnings.append("ffprobe was not found; metadata inspection will be limited.") + + actual_device = "cpu" + fallback_reason = None + if config.device == "cuda": + if not cuda_available: + details: dict[str, Any] = { + "torch_available": torch_available, + "cuda_available": cuda_available, + "gpu_name": gpu_name, + "ffmpeg_path": ffmpeg_path, + } + raise DeviceUnavailableError( + message="CUDA was requested, but no usable GPU was detected.", + code="cuda_unavailable", + suggestions=[ + "Retry with --device cpu.", + "Run ccs doctor to inspect the environment.", + "Install a CUDA-compatible PyTorch build if GPU acceleration is required.", + ], + details=details, + ) + actual_device = "cuda" + elif config.device == "auto" and cuda_available: + actual_device = "cuda" + elif config.device == "auto": + fallback_reason = "CUDA was not detected; using CPU." 
+ + return DiagnosticsReport( + python_version=platform.python_version(), + ffmpeg_path=ffmpeg_path, + ffprobe_path=ffprobe_path, + selected_device=config.device, + actual_device=actual_device, + cuda_available=cuda_available, + gpu_name=gpu_name, + torch_available=torch_available, + fallback_reason=fallback_reason, + warnings=warnings, + ) diff --git a/main/cc_suggester/core/errors.py b/main/cc_suggester/core/errors.py new file mode 100644 index 0000000..c0fb55e --- /dev/null +++ b/main/cc_suggester/core/errors.py @@ -0,0 +1,38 @@ +"""Friendly error types surfaced by CLI and UI clients.""" + +from __future__ import annotations + +from dataclasses import dataclass, field + + +@dataclass(slots=True) +class CCSuggesterError(Exception): + """Base exception with user-facing suggestions.""" + + message: str + code: str = "ccs_error" + suggestions: list[str] = field(default_factory=list) + details: dict[str, object] = field(default_factory=dict) + + def __str__(self) -> str: + return self.message + + +class InputNotFoundError(CCSuggesterError): + """Raised when the requested input video does not exist.""" + + +class InvalidMediaError(CCSuggesterError): + """Raised when a file cannot be processed as required media.""" + + +class AudioExtractionError(CCSuggesterError): + """Raised when ffmpeg audio extraction fails.""" + + +class DeviceUnavailableError(CCSuggesterError): + """Raised when a required device, such as CUDA, is unavailable.""" + + +class BackendUnavailableError(CCSuggesterError): + """Raised when a requested model backend is not installed or registered.""" diff --git a/main/cc_suggester/core/media.py b/main/cc_suggester/core/media.py new file mode 100644 index 0000000..64d3d83 --- /dev/null +++ b/main/cc_suggester/core/media.py @@ -0,0 +1,179 @@ +"""Video metadata inspection utilities.""" + +from __future__ import annotations + +import json +import shutil +import subprocess +from pathlib import Path + +from cc_suggester.core.errors import InvalidMediaError 
+from cc_suggester.core.types import VideoMetadata + + +VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm", ".m4v"} +AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".aac", ".m4a", ".ogg"} + + +def inspect_video(path: Path) -> VideoMetadata: + """Inspect a video using ffprobe when available, with a safe fallback.""" + + path = Path(path) + exists = path.exists() + metadata = VideoMetadata( + path=path, + exists=exists, + size_bytes=path.stat().st_size if exists else None, + container=path.suffix.lstrip(".").lower() or None, + ) + if not exists: + return metadata + + ffprobe = shutil.which("ffprobe") + if ffprobe is None: + metadata.probe_error = "ffprobe not found" + _inspect_with_opencv(metadata) + return metadata + + command = [ + ffprobe, + "-v", + "error", + "-print_format", + "json", + "-show_format", + "-show_streams", + str(path), + ] + try: + completed = subprocess.run(command, capture_output=True, check=True, text=True) + payload = json.loads(completed.stdout or "{}") + except Exception as exc: + metadata.probe_error = str(exc) + return metadata + + fmt = payload.get("format", {}) + try: + metadata.duration = float(fmt["duration"]) if "duration" in fmt else None + except (TypeError, ValueError): + metadata.duration = None + + has_video = False + has_audio = False + for stream in payload.get("streams", []): + if stream.get("codec_type") == "audio": + has_audio = True + metadata.audio_codec = stream.get("codec_name") + sample_rate = stream.get("sample_rate") + try: + metadata.audio_sample_rate = int(sample_rate) if sample_rate else None + except (TypeError, ValueError): + metadata.audio_sample_rate = None + metadata.audio_channels = stream.get("channels") + if stream.get("codec_type") == "video": + has_video = True + metadata.video_codec = stream.get("codec_name") + metadata.width = stream.get("width") + metadata.height = stream.get("height") + rate = stream.get("avg_frame_rate") or stream.get("r_frame_rate") + metadata.fps = _parse_fraction(rate) + 
metadata.has_audio = has_audio + metadata.has_video = has_video + return metadata + + +def _inspect_with_opencv(metadata: VideoMetadata) -> None: + if metadata.path.suffix.lower() not in VIDEO_EXTENSIONS: + return + try: + import cv2 # type: ignore + except Exception: + return + + capture = cv2.VideoCapture(str(metadata.path)) + if not capture.isOpened(): + return + try: + metadata.has_video = True + metadata.width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH)) or None + metadata.height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT)) or None + fps = float(capture.get(cv2.CAP_PROP_FPS) or 0) + frame_count = float(capture.get(cv2.CAP_PROP_FRAME_COUNT) or 0) + metadata.fps = fps or None + metadata.duration = frame_count / fps if fps > 0 and frame_count > 0 else None + metadata.video_codec = "opencv-readable" + finally: + capture.release() + + +def validate_media( + metadata: VideoMetadata, + *, + require_video: bool = True, + require_audio: bool = True, + allow_probe_failure: bool = False, +) -> None: + """Validate input media for real processing backends.""" + + if not metadata.exists: + raise InvalidMediaError( + message=f"Input file was not found: {metadata.path}", + code="input_not_found", + suggestions=["Check the path and run ccs inspect /path/to/video.mp4."], + ) + + suffix = metadata.path.suffix.lower() + if require_video and suffix not in VIDEO_EXTENSIONS: + raise InvalidMediaError( + message=f"Input does not look like a supported video file: {metadata.path}", + code="unsupported_video_type", + suggestions=[ + "Use MP4, MKV, MOV, AVI, WEBM, or M4V video input.", + "For demo-only testing, run with --allow-demo-input and mock backends.", + ], + details={"suffix": suffix or "none"}, + ) + + if metadata.probe_error and not allow_probe_failure and metadata.has_video is not True: + raise InvalidMediaError( + message="Video metadata could not be probed.", + code="probe_failed", + suggestions=[ + "Install ffprobe/ffmpeg and ensure they are on PATH.", + "Run ccs doctor to 
inspect the environment.", + ], + details={"probe_error": metadata.probe_error}, + ) + + if require_video and metadata.has_video is False: + raise InvalidMediaError( + message="No video stream was found in the input file.", + code="missing_video_stream", + suggestions=["Use a video file that contains a valid video stream."], + ) + + if require_audio and metadata.has_audio is False: + raise InvalidMediaError( + message="No audio stream was found in the input file.", + code="missing_audio_stream", + suggestions=[ + "Use a video file that contains audio.", + "Run ccs inspect /path/to/video.mp4 to confirm stream details.", + ], + ) + + +def _parse_fraction(value: str | None) -> float | None: + if not value: + return None + try: + numerator, denominator = value.split("/", maxsplit=1) + denominator_float = float(denominator) + if denominator_float == 0: + return None + return float(numerator) / denominator_float + except Exception: + try: + return float(value) + except Exception: + return None diff --git a/main/cc_suggester/core/pipeline.py b/main/cc_suggester/core/pipeline.py new file mode 100644 index 0000000..748483f --- /dev/null +++ b/main/cc_suggester/core/pipeline.py @@ -0,0 +1,294 @@ +"""Pipeline orchestration.""" + +from __future__ import annotations + +import json +from pathlib import Path + +from cc_suggester.audio.backends import get_audio_backend +from cc_suggester.audio.events import smooth_events +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.diagnostics import run_diagnostics +from cc_suggester.core.errors import BackendUnavailableError, InputNotFoundError +from cc_suggester.core.media import AUDIO_EXTENSIONS, inspect_video, validate_media +from cc_suggester.core.types import AudioEventCandidate, PipelineResult +from cc_suggester.decision.scorer import decide_captions +from cc_suggester.output.csv_report import write_csv_report +from cc_suggester.output.json_report import write_json_report +from cc_suggester.output.srt import 
write_srt +from cc_suggester.vision.backends import get_vision_backend + + +def analyze_video(video_path: Path, config: PipelineConfig) -> PipelineResult: + """Run the full caption suggestion pipeline.""" + + video_path = Path(video_path) + if not video_path.exists(): + raise InputNotFoundError( + message=f"Input file was not found: {video_path}", + code="input_not_found", + suggestions=[ + "Check the path and filename.", + "Run ccs inspect /path/to/video.mp4 to validate a video file.", + ], + details={"input_path": str(video_path)}, + ) + + metadata = inspect_video(video_path) + diagnostics = run_diagnostics(config) + run_dir = _run_dir(config.output_dir, video_path) + config.run_dir = run_dir + + try: + audio_backend = get_audio_backend(config.audio_backend) + vision_backend = get_vision_backend(config.vision_backend) + except ValueError as exc: + raise BackendUnavailableError( + message=str(exc), + code="backend_unavailable", + suggestions=[ + "Use --audio-backend mock and --vision-backend mock for the first scaffold.", + "Install the optional backend dependencies before selecting advanced backends.", + ], + ) from exc + + _validate_sidecar_audio(config) + if audio_backend.requires_valid_media or vision_backend.requires_valid_media: + is_audio_only_input = video_path.suffix.lower() in AUDIO_EXTENSIONS + require_audio = audio_backend.requires_audio_file and config.sidecar_audio_path is None + validate_media( + metadata, + require_video=vision_backend.requires_valid_media or not is_audio_only_input, + require_audio=require_audio, + allow_probe_failure=config.allow_demo_input or is_audio_only_input, + ) + + audio_events = smooth_events(audio_backend.detect(video_path, metadata, config), config) + reactions = vision_backend.analyze(video_path, metadata, audio_events, config) + suggestions = decide_captions(audio_events, reactions, config) + + result = PipelineResult( + input_path=video_path, + output_dir=run_dir, + metadata=metadata, + diagnostics=diagnostics, + 
audio_events=audio_events, + reactions=reactions, + suggestions=suggestions, + artifacts=_collect_artifacts(run_dir, config), + ) + result.files = _write_outputs(result, config) + return result + + +def detect_audio_events(video_path: Path, config: PipelineConfig) -> dict[str, object]: + """Run only the audio detection stage and write an audio JSON report.""" + + video_path = Path(video_path) + if not video_path.exists(): + raise InputNotFoundError( + message=f"Input file was not found: {video_path}", + code="input_not_found", + suggestions=["Check the path and run ccs inspect /path/to/video.mp4."], + ) + + metadata = inspect_video(video_path) + diagnostics = run_diagnostics(config) + run_dir = _run_dir(config.output_dir, video_path) + config.run_dir = run_dir + + try: + audio_backend = get_audio_backend(config.audio_backend) + except ValueError as exc: + raise BackendUnavailableError( + message=str(exc), + code="backend_unavailable", + suggestions=["Use --audio-backend mock or --audio-backend dsp."], + ) from exc + + _validate_sidecar_audio(config) + if audio_backend.requires_valid_media: + is_audio_only_input = video_path.suffix.lower() in AUDIO_EXTENSIONS + require_audio = audio_backend.requires_audio_file and config.sidecar_audio_path is None + validate_media( + metadata, + require_video=not is_audio_only_input, + require_audio=require_audio, + allow_probe_failure=config.allow_demo_input or is_audio_only_input, + ) + + events = smooth_events(audio_backend.detect(video_path, metadata, config), config) + payload: dict[str, object] = { + "input_path": str(video_path), + "output_dir": str(run_dir), + "metadata": metadata.to_dict(), + "diagnostics": diagnostics.to_dict(), + "audio_events": [event.to_dict() for event in events], + "artifacts": {name: str(path) for name, path in _collect_artifacts(run_dir, config).items()}, + } + run_dir.mkdir(parents=True, exist_ok=True) + report_path = write_json_report(payload, run_dir / "audio_events.json") + payload["files"] = 
{"audio_json": str(report_path)} + return payload + + +def score_visual_reactions( + video_path: Path, + audio_report_path: Path, + config: PipelineConfig, +) -> dict[str, object]: + """Run only visual reaction scoring from an existing audio event report.""" + + video_path = Path(video_path) + audio_report_path = Path(audio_report_path) + if not video_path.exists(): + raise InputNotFoundError( + message=f"Input file was not found: {video_path}", + code="input_not_found", + suggestions=["Check the path and run ccs inspect /path/to/video.mp4."], + ) + if not audio_report_path.exists(): + raise InputNotFoundError( + message=f"Audio event report was not found: {audio_report_path}", + code="audio_report_not_found", + suggestions=[ + "Run ccs audio first to generate audio_events.json.", + "Pass a valid path to a pipeline results.json file or audio_events.json file.", + ], + ) + + metadata = inspect_video(video_path) + diagnostics = run_diagnostics(config) + run_dir = _run_dir(config.output_dir, video_path) + config.run_dir = run_dir + + try: + vision_backend = get_vision_backend(config.vision_backend) + except ValueError as exc: + raise BackendUnavailableError( + message=str(exc), + code="backend_unavailable", + suggestions=["Use --vision-backend mock or --vision-backend opencv."], + ) from exc + + if vision_backend.requires_valid_media: + validate_media( + metadata, + require_video=True, + require_audio=False, + allow_probe_failure=config.allow_demo_input, + ) + + audio_events = _load_audio_events(audio_report_path) + reactions = vision_backend.analyze(video_path, metadata, audio_events, config) + payload: dict[str, object] = { + "input_path": str(video_path), + "audio_report_path": str(audio_report_path), + "output_dir": str(run_dir), + "metadata": metadata.to_dict(), + "diagnostics": diagnostics.to_dict(), + "audio_events": [event.to_dict() for event in audio_events], + "reactions": [reaction.to_dict() for reaction in reactions], + } + run_dir.mkdir(parents=True, 
exist_ok=True) + report_path = write_json_report(payload, run_dir / "vision_reactions.json") + payload["files"] = {"vision_json": str(report_path)} + return payload + + +def export_from_report(report_path: Path, output_path: Path, language: str) -> Path: + """Export SRT from a JSON report produced by the pipeline.""" + + import json + + from cc_suggester.core.types import CaptionSuggestion + from cc_suggester.decision.labels import caption_for + + payload = json.loads(Path(report_path).read_text(encoding="utf-8")) + suggestions: list[CaptionSuggestion] = [] + for item in payload.get("suggestions", []): + suggestions.append( + CaptionSuggestion( + event_id=item["event_id"], + start_time=float(item["start_time"]), + end_time=float(item["end_time"]), + audio_confidence=float(item["audio_confidence"]), + reaction_confidence=float(item["reaction_confidence"]), + decision_score=float(item["decision_score"]), + accepted=bool(item["accepted"]), + reason=str(item["reason"]), + caption_text=caption_for(str(item["event_id"]), language), + language=language, + requires_review=bool(item.get("requires_review", False)), + debug_info=item.get("debug_info", {}), + ) + ) + return write_srt(suggestions, output_path) + + +def _load_audio_events(report_path: Path) -> list[AudioEventCandidate]: + payload = json.loads(Path(report_path).read_text(encoding="utf-8")) + raw_events = payload.get("audio_events", []) + events: list[AudioEventCandidate] = [] + for item in raw_events: + events.append( + AudioEventCandidate( + event_id=str(item["event_id"]), + label=str(item.get("label") or item["event_id"]), + start_time=float(item["start_time"]), + end_time=float(item["end_time"]), + audio_confidence=float(item["audio_confidence"]), + audio_backend=str(item.get("audio_backend") or "unknown"), + raw_class_name=item.get("raw_class_name"), + debug_info=item.get("debug_info", {}), + ) + ) + return events + + +def _write_outputs(result: PipelineResult, config: PipelineConfig) -> dict[str, Path]: + 
result.output_dir.mkdir(parents=True, exist_ok=True) + files = { + "srt": result.output_dir / f"captions.{config.language}.srt", + "json": result.output_dir / "results.json", + "csv": result.output_dir / "events.csv", + "diagnostics": result.output_dir / "diagnostics.json", + "config": result.output_dir / "config.json", + } + result.files = files + write_srt(result.suggestions, files["srt"]) + write_csv_report(result.suggestions, files["csv"]) + write_json_report(result.to_dict(), files["json"]) + write_json_report(result.diagnostics.to_dict(), files["diagnostics"]) + write_json_report(config.to_dict(), files["config"]) + return files + + +def _run_dir(output_dir: Path, video_path: Path) -> Path: + stem = video_path.stem or "video" + safe_stem = "".join(char if char.isalnum() or char in {"-", "_"} else "-" for char in stem) + return Path(output_dir) / safe_stem + + +def _collect_artifacts(run_dir: Path, config: PipelineConfig) -> dict[str, Path]: + artifacts = { + "audio_wav": run_dir / "artifacts" / "audio.wav", + } + if config.sidecar_audio_path is not None: + artifacts["audio_wav"] = Path(config.sidecar_audio_path) + return {name: path for name, path in artifacts.items() if path.exists()} + + +def _validate_sidecar_audio(config: PipelineConfig) -> None: + if config.sidecar_audio_path is None: + return + if not Path(config.sidecar_audio_path).exists(): + raise InputNotFoundError( + message=f"Sidecar audio file was not found: {config.sidecar_audio_path}", + code="sidecar_audio_not_found", + suggestions=[ + "Check the --audio-path value.", + "Generate sample media with python scripts/generate_sample_video.py.", + ], + details={"audio_path": str(config.sidecar_audio_path)}, + ) diff --git a/main/cc_suggester/core/types.py b/main/cc_suggester/core/types.py new file mode 100644 index 0000000..35149e7 --- /dev/null +++ b/main/cc_suggester/core/types.py @@ -0,0 +1,138 @@ +"""Shared data models passed between pipeline modules.""" + +from __future__ import annotations + 
+from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any + + +JsonDict = dict[str, Any] + + +@dataclass(slots=True) +class VideoMetadata: + """Basic metadata discovered from an input video.""" + + path: Path + exists: bool + size_bytes: int | None = None + duration: float | None = None + fps: float | None = None + width: int | None = None + height: int | None = None + has_audio: bool | None = None + has_video: bool | None = None + audio_codec: str | None = None + video_codec: str | None = None + audio_sample_rate: int | None = None + audio_channels: int | None = None + container: str | None = None + probe_error: str | None = None + + def to_dict(self) -> JsonDict: + data = asdict(self) + data["path"] = str(self.path) + return data + + +@dataclass(slots=True) +class DiagnosticsReport: + """Environment and device diagnostics for a run.""" + + python_version: str + ffmpeg_path: str | None + ffprobe_path: str | None + selected_device: str + actual_device: str + cuda_available: bool + gpu_name: str | None = None + torch_available: bool = False + fallback_reason: str | None = None + warnings: list[str] = field(default_factory=list) + + def to_dict(self) -> JsonDict: + return asdict(self) + + +@dataclass(slots=True) +class AudioEventCandidate: + """Detected sound event before visual reaction analysis.""" + + event_id: str + label: str + start_time: float + end_time: float + audio_confidence: float + audio_backend: str + raw_class_name: str | None = None + debug_info: JsonDict = field(default_factory=dict) + + def to_dict(self) -> JsonDict: + return asdict(self) + + +@dataclass(slots=True) +class ReactionResult: + """Visual reaction evidence around an audio event.""" + + event_id: str + start_time: float + end_time: float + reaction_confidence: float + reaction_signals: JsonDict = field(default_factory=dict) + frames_sampled: int = 0 + vision_backend: str = "mock" + debug_info: JsonDict = field(default_factory=dict) + + def 
to_dict(self) -> JsonDict: + return asdict(self) + + +@dataclass(slots=True) +class CaptionSuggestion: + """Final caption decision for a candidate event.""" + + event_id: str + start_time: float + end_time: float + audio_confidence: float + reaction_confidence: float + decision_score: float + accepted: bool + reason: str + caption_text: str + language: str + requires_review: bool = False + debug_info: JsonDict = field(default_factory=dict) + + def to_dict(self) -> JsonDict: + return asdict(self) + + +@dataclass(slots=True) +class PipelineResult: + """Complete result returned by the pipeline.""" + + input_path: Path + output_dir: Path + metadata: VideoMetadata + diagnostics: DiagnosticsReport + audio_events: list[AudioEventCandidate] + reactions: list[ReactionResult] + suggestions: list[CaptionSuggestion] + files: dict[str, Path] = field(default_factory=dict) + artifacts: dict[str, Path] = field(default_factory=dict) + + def to_dict(self) -> JsonDict: + return { + "input_path": str(self.input_path), + "output_dir": str(self.output_dir), + "metadata": self.metadata.to_dict(), + "diagnostics": self.diagnostics.to_dict(), + "audio_events": [event.to_dict() for event in self.audio_events], + "reactions": [reaction.to_dict() for reaction in self.reactions], + "suggestions": [suggestion.to_dict() for suggestion in self.suggestions], + "files": {name: str(path) for name, path in self.files.items()}, + "artifacts": {name: str(path) for name, path in self.artifacts.items()}, + } diff --git a/main/cc_suggester/decision/__init__.py b/main/cc_suggester/decision/__init__.py new file mode 100644 index 0000000..9208dab --- /dev/null +++ b/main/cc_suggester/decision/__init__.py @@ -0,0 +1 @@ +"""Caption decision engine.""" diff --git a/main/cc_suggester/decision/labels.py b/main/cc_suggester/decision/labels.py new file mode 100644 index 0000000..8f16ec9 --- /dev/null +++ b/main/cc_suggester/decision/labels.py @@ -0,0 +1,198 @@ +"""Caption label glossary.""" + +from __future__ 
import annotations + + +LABELS: dict[str, dict[str, str]] = { + "children_cheer": { + "en": "[children cheering]", + "hi": "[बच्चे उत्साह से चिल्लाते हैं]", + "ta": "[குழந்தைகள் ஆரவாரம் செய்கின்றனர்]", + "te": "[పిల్లలు ఆనందంగా కేకలు వేస్తున్నారు]", + "bn": "[শিশুরা উল্লাস করছে]", + "mr": "[मुले आनंदाने ओरडत आहेत]", + "ml": "[കുട്ടികൾ ആഹ്ലാദിക്കുന്നു]", + }, + "crowd_cheer": { + "en": "[crowd cheering]", + "hi": "[भीड़ जयकार करती है]", + "ta": "[கூட்டம் ஆரவாரம் செய்கிறது]", + "te": "[జనం ఆనందంగా కేకలు వేస్తున్నారు]", + "bn": "[ভিড় উল্লাস করছে]", + "mr": "[गर्दी जल्लोष करते]", + "ml": "[ജനക്കൂട്ടം ആഹ്ലാദിക്കുന്നു]", + }, + "school_bell": { + "en": "[school bell rings]", + "hi": "[स्कूल की घंटी बजती है]", + "ta": "[பள்ளி மணி ஒலிக்கிறது]", + "te": "[పాఠశాల గంట మోగుతుంది]", + "bn": "[স্কুলের ঘণ্টা বাজছে]", + "mr": "[शाळेची घंटा वाजते]", + "ml": "[സ്കൂൾ മണി മുഴങ്ങുന്നു]", + }, + "applause": { + "en": "[students applauding]", + "hi": "[छात्र तालियां बजाते हैं]", + "ta": "[மாணவர்கள் கைத்தட்டுகின்றனர்]", + "te": "[విద్యార్థులు చప్పట్లు కొడుతున్నారు]", + "bn": "[ছাত্ররা হাততালি দিচ্ছে]", + "mr": "[विद्यार्थी टाळ्या वाजवत आहेत]", + "ml": "[വിദ്യാർത്ഥികൾ കൈയടിക്കുന്നു]", + }, + "chair_scrape": { + "en": "[chair scrapes]", + "hi": "[कुर्सी घिसटती है]", + "ta": "[நாற்காலி இழுக்கும் சத்தம்]", + "te": "[కుర్చీ లాగిన శబ్దం]", + "bn": "[চেয়ার ঘষার শব্দ]", + "mr": "[खुर्ची ओढल्याचा आवाज]", + "ml": "[കസേര വലിക്കുന്ന ശബ്ദം]", + }, + "background_chatter": { + "en": "[background chatter]", + "hi": "[पृष्ठभूमि में बातचीत]", + "ta": "[பின்னணி பேச்சு]", + "te": "[నేపథ్యంలో మాటలు]", + "bn": "[পেছনে কথাবার্তা]", + "mr": "[पार्श्वभूमीत गप्पा]", + "ml": "[പശ്ചാത്തല സംഭാഷണം]", + }, + "horn_honk": { + "en": "[horn honks]", + "hi": "[हॉर्न बजता है]", + "ta": "[ஹார்ன் ஒலிக்கிறது]", + "te": "[హారన్ మోగుతుంది]", + "bn": "[হর্ন বাজছে]", + "mr": "[हॉर्न वाजतो]", + "ml": "[ഹോൺ മുഴങ്ങുന്നു]", + }, + "glass_break": { + "en": "[glass breaks]", + "hi": "[कांच टूटता है]", + "ta": "[கண்ணாடி உடைகிறது]", + 
"te": "[గాజు పగులుతుంది]", + "bn": "[কাচ ভাঙছে]", + "mr": "[काच फुटते]", + "ml": "[ഗ്ലാസ് പൊട്ടുന്നു]", + }, + "laughter": { + "en": "[laughter]", + "hi": "[हंसी]", + "ta": "[சிரிப்பு]", + "te": "[నవ్వు]", + "bn": "[হাসি]", + "mr": "[हशा]", + "ml": "[ചിരി]", + }, + "music": { + "en": "[music]", + "hi": "[संगीत]", + "ta": "[இசை]", + "te": "[సంగీతం]", + "bn": "[সঙ্গীত]", + "mr": "[संगीत]", + "ml": "[സംഗീതം]", + }, + "alarm": { + "en": "[alarm ringing]", + "hi": "[अलार्म बजता है]", + "ta": "[அலாரம் ஒலிக்கிறது]", + "te": "[అలారం మోగుతుంది]", + "bn": "[অ্যালার্ম বাজছে]", + "mr": "[अलार्म वाजतो]", + "ml": "[അലാറം മുഴങ്ങുന്നു]", + }, + "siren": { + "en": "[siren wails]", + "hi": "[सायरन बजता है]", + "ta": "[சைரன் ஒலிக்கிறது]", + "te": "[సైరన్ మోగుతుంది]", + "bn": "[সাইরেন বাজছে]", + "mr": "[सायरेन वाजतो]", + "ml": "[സൈറൺ മുഴങ്ങുന്നു]", + }, + "explosion": { + "en": "[explosion]", + "hi": "[विस्फोट]", + "ta": "[வெடிப்பு]", + "te": "[పేలుడు]", + "bn": "[বিস্ফোরণ]", + "mr": "[स्फोट]", + "ml": "[സ്ഫോടനം]", + }, + "gunshot": { + "en": "[gunshot]", + "hi": "[गोली चलती है]", + "ta": "[துப்பாக்கிச் சத்தம்]", + "te": "[తుపాకీ శబ్దం]", + "bn": "[গুলির শব্দ]", + "mr": "[गोळीबाराचा आवाज]", + "ml": "[വെടിയൊച്ച]", + }, + "door_slam": { + "en": "[door slams]", + "hi": "[दरवाज़ा ज़ोर से बंद होता है]", + "ta": "[கதவு பலமாக மூடப்படுகிறது]", + "te": "[తలుపు బలంగా మూసుకుంటుంది]", + "bn": "[দরজা জোরে বন্ধ হয়]", + "mr": "[दरवाजा जोरात बंद होतो]", + "ml": "[വാതിൽ ശക്തമായി അടയുന്നു]", + }, + "phone_ring": { + "en": "[phone rings]", + "hi": "[फ़ोन बजता है]", + "ta": "[தொலைபேசி ஒலிக்கிறது]", + "te": "[ఫోన్ మోగుతుంది]", + "bn": "[ফোন বাজছে]", + "mr": "[फोन वाजतो]", + "ml": "[ഫോൺ മുഴങ്ങുന്നു]", + }, + "dog_bark": { + "en": "[dog barking]", + "hi": "[कुत्ता भौंकता है]", + "ta": "[நாய் குரைக்கிறது]", + "te": "[కుక్క మొరుగుతుంది]", + "bn": "[কুকুর ডাকছে]", + "mr": "[कुत्रा भुंकतो]", + "ml": "[നായ കുരയ്ക്കുന്നു]", + }, + "impact_sound": { + "en": "[sudden sound]", + "hi": "[अचानक आवाज़]", + "ta": 
"[திடீர் சத்தம்]", + "te": "[అకస్మాత్తుగా శబ్దం]", + "bn": "[হঠাৎ শব্দ]", + "mr": "[अचानक आवाज]", + "ml": "[പെട്ടെന്നുള്ള ശബ്ദം]", + }, + "loud_sound": { + "en": "[loud sound]", + "hi": "[तेज़ आवाज़]", + "ta": "[உரத்த சத்தம்]", + "te": "[పెద్ద శబ్దం]", + "bn": "[জোরে শব্দ]", + "mr": "[मोठा आवाज]", + "ml": "[വലിയ ശബ്ദം]", + }, + "sustained_sound": { + "en": "[continuous sound]", + "hi": "[लगातार आवाज़]", + "ta": "[தொடர்ச்சியான சத்தம்]", + "te": "[నిరంతర శబ్దం]", + "bn": "[অবিরত শব্দ]", + "mr": "[सतत आवाज]", + "ml": "[തുടർച്ചയായ ശബ്ദം]", + }, +} + + +def caption_for(event_id: str, language: str) -> str: + """Return a caption label for an event and language.""" + + values = LABELS.get(event_id, {}) + if language in values: + return values[language] + if "en" in values: + return values["en"] + return f"[{event_id.replace('_', ' ')}]" diff --git a/main/cc_suggester/decision/rules.py b/main/cc_suggester/decision/rules.py new file mode 100644 index 0000000..57add09 --- /dev/null +++ b/main/cc_suggester/decision/rules.py @@ -0,0 +1,64 @@ +"""Decision priors and ambient penalties.""" + +from __future__ import annotations + + +EVENT_IMPORTANCE_PRIOR: dict[str, float] = { + "glass_break": 0.30, + "explosion": 0.35, + "gunshot": 0.35, + "alarm": 0.30, + "siren": 0.28, + "school_bell": 0.20, + "children_cheer": 0.18, + "crowd_cheer": 0.18, + "horn_honk": 0.18, + "applause": 0.12, + "laughter": 0.10, + "music": 0.06, + "door_slam": 0.15, + "phone_ring": 0.14, + "dog_bark": 0.08, + "chair_scrape": 0.04, + "background_chatter": 0.02, + "impact_sound": 0.14, + "loud_sound": 0.10, + "sustained_sound": 0.04, +} + +AMBIENT_PENALTY: dict[str, float] = { + "background_chatter": 0.30, + "traffic_noise": 0.32, + "fan_noise": 0.35, + "background_music": 0.24, + "music": 0.18, + "crowd_murmur": 0.28, + "chair_scrape": 0.10, + "sustained_sound": 0.18, +} + +HIGH_IMPACT_EVENTS = { + "glass_break", + "explosion", + "gunshot", + "alarm", + "siren", +} + + +def importance_prior(event_id: str) -> 
float: + """Return event importance prior.""" + + return EVENT_IMPORTANCE_PRIOR.get(event_id, 0.08) + + +def ambient_penalty(event_id: str) -> float: + """Return event ambient penalty.""" + + return AMBIENT_PENALTY.get(event_id, 0.0) + + +def is_high_impact(event_id: str) -> bool: + """Return whether the event is high-impact by default.""" + + return event_id in HIGH_IMPACT_EVENTS diff --git a/main/cc_suggester/decision/scorer.py b/main/cc_suggester/decision/scorer.py new file mode 100644 index 0000000..69bea03 --- /dev/null +++ b/main/cc_suggester/decision/scorer.py @@ -0,0 +1,99 @@ +"""Combine audio and visual evidence into caption decisions.""" + +from __future__ import annotations + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, CaptionSuggestion, ReactionResult +from cc_suggester.decision.labels import caption_for +from cc_suggester.decision.rules import ambient_penalty, importance_prior, is_high_impact + + +def decide_captions( + audio_events: list[AudioEventCandidate], + reactions: list[ReactionResult], + config: PipelineConfig, +) -> list[CaptionSuggestion]: + """Create final caption suggestions from audio and visual signals.""" + + reaction_by_key = { + (reaction.event_id, reaction.start_time, reaction.end_time): reaction + for reaction in reactions + } + suggestions: list[CaptionSuggestion] = [] + + for event in audio_events: + reaction = reaction_by_key.get((event.event_id, event.start_time, event.end_time)) + reaction_confidence = reaction.reaction_confidence if reaction else 0.0 + prior = importance_prior(event.event_id) + penalty = ambient_penalty(event.event_id) + score = _score( + audio_confidence=event.audio_confidence, + reaction_confidence=reaction_confidence, + prior=prior, + penalty=penalty, + ) + + accepted = score >= config.decision_threshold + requires_review = config.review_threshold <= score < config.decision_threshold + if is_high_impact(event.event_id) and 
event.audio_confidence >= 0.70: + accepted = True + requires_review = False + + reason = _reason_for(event, reaction_confidence, score, accepted, requires_review, penalty) + suggestions.append( + CaptionSuggestion( + event_id=event.event_id, + start_time=event.start_time, + end_time=event.end_time, + audio_confidence=event.audio_confidence, + reaction_confidence=reaction_confidence, + decision_score=round(score, 3), + accepted=accepted, + reason=reason, + caption_text=caption_for(event.event_id, config.language), + language=config.language, + requires_review=requires_review, + debug_info={ + "importance_prior": prior, + "ambient_penalty": penalty, + "high_impact": is_high_impact(event.event_id), + "reaction_signals": reaction.reaction_signals if reaction else {}, + }, + ) + ) + return suggestions + + +def _score( + audio_confidence: float, + reaction_confidence: float, + prior: float, + penalty: float, +) -> float: + raw = (0.52 * audio_confidence) + (0.34 * reaction_confidence) + prior - penalty + return max(0.0, min(1.0, raw)) + + +def _reason_for( + event: AudioEventCandidate, + reaction_confidence: float, + score: float, + accepted: bool, + requires_review: bool, + penalty: float, +) -> str: + if accepted: + if reaction_confidence >= 0.50: + return ( + f"Accepted because {event.event_id} has strong audio confidence " + "and visible scene reaction." + ) + return f"Accepted because {event.event_id} is important and audio confidence is high." + if requires_review: + return ( + f"Needs review because {event.event_id} is plausible but the combined " + f"decision score is borderline ({score:.2f})." + ) + if penalty > 0: + return f"Rejected because {event.event_id} appears ambient or low-impact." + return f"Rejected because combined audio and reaction evidence is weak ({score:.2f})." 
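A worked check of the scoring arithmetic in `scorer.py` above, written as a standalone sketch rather than as the project's own code. The weights (0.52, 0.34) and the priors and penalties are copied from the diff. The thresholds 0.65 and 0.50 are the Streamlit slider defaults and are assumed here, since `PipelineConfig` defaults are not shown. The confidence inputs are invented for illustration, and the high-impact override is deliberately left out.

```python
def score(audio_confidence: float, reaction_confidence: float, prior: float, penalty: float) -> float:
    """Weighted sum of audio and reaction evidence, shifted by prior and penalty, clamped to [0, 1]."""
    raw = (0.52 * audio_confidence) + (0.34 * reaction_confidence) + prior - penalty
    return max(0.0, min(1.0, raw))


def classify(value: float, decision_threshold: float = 0.65, review_threshold: float = 0.50) -> str:
    """Map a score to a review bucket (thresholds are assumed defaults)."""
    if value >= decision_threshold:
        return "accepted"
    if value >= review_threshold:
        return "review"
    return "rejected"


# children_cheer: prior 0.18, no ambient penalty, strong reaction.
cheer = score(0.80, 0.82, prior=0.18, penalty=0.0)
# background_chatter: prior 0.02, ambient penalty 0.30, weak reaction.
chatter = score(0.70, 0.16, prior=0.02, penalty=0.30)
# school_bell: prior 0.20, moderate audio and reaction evidence.
bell = score(0.50, 0.30, prior=0.20, penalty=0.0)

for name, value in [("children_cheer", cheer), ("background_chatter", chatter), ("school_bell", bell)]:
    print(f"{name}: {value:.3f} -> {classify(value)}")
```

With these inputs, the ambient penalty pulls a fairly confident `background_chatter` detection well below the review threshold, while the cheer clears the decision threshold mostly on audio plus reaction evidence.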
diff --git a/main/cc_suggester/output/__init__.py b/main/cc_suggester/output/__init__.py new file mode 100644 index 0000000..e8f7502 --- /dev/null +++ b/main/cc_suggester/output/__init__.py @@ -0,0 +1 @@ +"""Output writers for caption suggestions.""" diff --git a/main/cc_suggester/output/csv_report.py b/main/cc_suggester/output/csv_report.py new file mode 100644 index 0000000..322f1e3 --- /dev/null +++ b/main/cc_suggester/output/csv_report.py @@ -0,0 +1,57 @@ +"""CSV review report export.""" + +from __future__ import annotations + +import csv +import io +from pathlib import Path + +from cc_suggester.core.types import CaptionSuggestion + + +FIELDNAMES = [ + "event_id", + "start_time", + "end_time", + "caption_text", + "language", + "audio_confidence", + "reaction_confidence", + "decision_score", + "accepted", + "requires_review", + "reason", +] + + +def render_csv_report(suggestions: list[CaptionSuggestion]) -> str: + """Render a reviewer-friendly CSV report.""" + + buffer = io.StringIO() + writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES) + writer.writeheader() + for suggestion in suggestions: + writer.writerow( + { + "event_id": suggestion.event_id, + "start_time": suggestion.start_time, + "end_time": suggestion.end_time, + "caption_text": suggestion.caption_text, + "language": suggestion.language, + "audio_confidence": suggestion.audio_confidence, + "reaction_confidence": suggestion.reaction_confidence, + "decision_score": suggestion.decision_score, + "accepted": suggestion.accepted, + "requires_review": suggestion.requires_review, + "reason": suggestion.reason, + } + ) + return buffer.getvalue() + + +def write_csv_report(suggestions: list[CaptionSuggestion], output_path: Path) -> Path: + """Write a reviewer-friendly CSV report.""" + + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(render_csv_report(suggestions), encoding="utf-8") + return output_path diff --git a/main/cc_suggester/output/json_report.py 
b/main/cc_suggester/output/json_report.py new file mode 100644 index 0000000..e2662d2 --- /dev/null +++ b/main/cc_suggester/output/json_report.py @@ -0,0 +1,18 @@ +"""JSON report export.""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Any + + +def write_json_report(payload: dict[str, Any], output_path: Path) -> Path: + """Write a UTF-8 JSON report.""" + + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text( + json.dumps(payload, indent=2, ensure_ascii=False, sort_keys=True), + encoding="utf-8", + ) + return output_path diff --git a/main/cc_suggester/output/review_export.py b/main/cc_suggester/output/review_export.py new file mode 100644 index 0000000..c42ec67 --- /dev/null +++ b/main/cc_suggester/output/review_export.py @@ -0,0 +1,148 @@ +"""Helpers for exporting manually reviewed caption suggestions.""" + +from __future__ import annotations + +from collections.abc import Mapping, Sequence +import json +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +from cc_suggester.core.types import CaptionSuggestion +from cc_suggester.output.csv_report import render_csv_report, write_csv_report +from cc_suggester.output.json_report import write_json_report +from cc_suggester.output.srt import render_srt, write_srt + + +VALID_REVIEW_STATUSES = {"accepted", "review", "rejected"} + + +@dataclass(frozen=True, slots=True) +class ReviewExport: + """In-memory reviewed export payload for UI download buttons.""" + + suggestions: list[CaptionSuggestion] + srt_text: str + csv_text: str + json_text: str + + +def build_review_export(rows: Sequence[Mapping[str, Any]], language: str) -> ReviewExport: + """Convert editable review rows into exportable SRT and CSV content.""" + + suggestions = suggestions_from_review_rows(rows, language) + return ReviewExport( + suggestions=suggestions, + srt_text=render_srt(suggestions), + csv_text=render_csv_report(suggestions), + 
json_text=json.dumps( + review_payload(suggestions, language), + indent=2, + ensure_ascii=False, + sort_keys=True, + ), + ) + + +def write_review_exports(rows: Sequence[Mapping[str, Any]], output_dir: Path, language: str) -> dict[str, Path]: + """Write reviewed SRT, CSV, and JSON files to a directory.""" + + export = build_review_export(rows, language) + output_dir.mkdir(parents=True, exist_ok=True) + files = { + "reviewed_srt": write_srt(export.suggestions, output_dir / f"reviewed_captions.{language}.srt"), + "reviewed_csv": write_csv_report(export.suggestions, output_dir / "reviewed_events.csv"), + } + files["reviewed_json"] = write_json_report( + review_payload(export.suggestions, language), + output_dir / "reviewed_results.json", + ) + return files + + +def review_payload(suggestions: Sequence[CaptionSuggestion], language: str) -> dict[str, Any]: + """Build a JSON-serializable reviewed session payload.""" + + return { + "language": language, + "suggestions": [suggestion.to_dict() for suggestion in suggestions], + "summary": { + "total": len(suggestions), + "accepted": sum(1 for item in suggestions if item.accepted), + "review": sum(1 for item in suggestions if item.requires_review), + "rejected": sum(1 for item in suggestions if not item.accepted and not item.requires_review), + }, + } + + +def suggestions_from_review_rows(rows: Sequence[Mapping[str, Any]], language: str) -> list[CaptionSuggestion]: + """Build caption suggestions from Web UI review rows.""" + + suggestions: list[CaptionSuggestion] = [] + for fallback_index, row in enumerate(rows, start=1): + status = _status_for(row) + caption_text = _string_for(row, ("caption", "caption_text"), default="").strip() + suggestions.append( + CaptionSuggestion( + event_id=_string_for(row, ("event_id",), default=f"event_{fallback_index}"), + start_time=_float_for(row, ("start", "start_time"), default=0.0), + end_time=_float_for(row, ("end", "end_time"), default=0.0), + audio_confidence=_float_for(row, ("audio", 
"audio_confidence"), default=0.0), + reaction_confidence=_float_for(row, ("reaction", "reaction_confidence"), default=0.0), + decision_score=_float_for(row, ("decision", "decision_score"), default=0.0), + accepted=status == "accepted", + requires_review=status == "review", + reason=_reason_for(row, status), + caption_text=caption_text, + language=language, + debug_info={ + "editor_status": status, + "review_index": row.get("index", fallback_index), + "source": "review_export", + }, + ) + ) + return suggestions + + +def _status_for(row: Mapping[str, Any]) -> str: + status = _string_for(row, ("status",), default="").strip().lower() + if not status: + if bool(row.get("accepted", False)): + status = "accepted" + elif bool(row.get("requires_review", False)): + status = "review" + else: + status = "rejected" + if status not in VALID_REVIEW_STATUSES: + valid = ", ".join(sorted(VALID_REVIEW_STATUSES)) + raise ValueError(f"Unknown review status '{status}'. Expected one of: {valid}.") + return status + + +def _reason_for(row: Mapping[str, Any], status: str) -> str: + existing = _string_for(row, ("reason",), default="").strip() + editor_note = f"Editor marked this event as {status}." 
+ if not existing: + return editor_note + if existing.endswith(editor_note): + return existing + return f"{existing} {editor_note}" + + +def _string_for(row: Mapping[str, Any], keys: tuple[str, ...], default: str) -> str: + for key in keys: + if key in row and row[key] is not None: + return str(row[key]) + return default + + +def _float_for(row: Mapping[str, Any], keys: tuple[str, ...], default: float) -> float: + for key in keys: + if key not in row or row[key] is None: + continue + try: + return float(row[key]) + except (TypeError, ValueError): + return default + return default diff --git a/main/cc_suggester/output/srt.py b/main/cc_suggester/output/srt.py new file mode 100644 index 0000000..a51e57f --- /dev/null +++ b/main/cc_suggester/output/srt.py @@ -0,0 +1,46 @@ +"""SRT export.""" + +from __future__ import annotations + +from pathlib import Path + +from cc_suggester.core.types import CaptionSuggestion + + +def render_srt(suggestions: list[CaptionSuggestion]) -> str: + """Render accepted caption suggestions as SRT text.""" + + accepted = [item for item in suggestions if item.accepted] + lines: list[str] = [] + for index, suggestion in enumerate(accepted, start=1): + lines.extend( + [ + str(index), + f"{format_srt_time(suggestion.start_time)} --> {format_srt_time(suggestion.end_time)}", + suggestion.caption_text, + "", + ] + ) + return "\n".join(lines) + + +def write_srt(suggestions: list[CaptionSuggestion], output_path: Path) -> Path: + """Write accepted caption suggestions to an SRT file.""" + + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(render_srt(suggestions), encoding="utf-8") + return output_path + + +def format_srt_time(seconds: float) -> str: + """Format seconds as SRT timestamp.""" + + safe_seconds = max(0.0, seconds) + # Round to whole milliseconds first so a carry (e.g. 59.9996 s) rolls over + # into minutes and hours instead of producing a "60" in the seconds field. 
+ total_ms = int(round(safe_seconds * 1000)) + hours = total_ms // 3_600_000 + minutes = (total_ms % 3_600_000) // 60_000 + whole_seconds = (total_ms % 60_000) // 1000 + milliseconds = total_ms % 1000
+ return f"{hours:02d}:{minutes:02d}:{whole_seconds:02d},{milliseconds:03d}" diff --git a/main/cc_suggester/translation/__init__.py b/main/cc_suggester/translation/__init__.py new file mode 100644 index 0000000..1bc0acb --- /dev/null +++ b/main/cc_suggester/translation/__init__.py @@ -0,0 +1 @@ +"""Translation and multilingual label support.""" diff --git a/main/cc_suggester/translation/glossary.py b/main/cc_suggester/translation/glossary.py new file mode 100644 index 0000000..9e3c8f1 --- /dev/null +++ b/main/cc_suggester/translation/glossary.py @@ -0,0 +1,17 @@ +"""Glossary helpers for non-speech caption labels.""" + +from __future__ import annotations + +from cc_suggester.decision.labels import LABELS, caption_for + + +def supported_event_ids() -> list[str]: + """Return event IDs available in the curated glossary.""" + + return sorted(LABELS) + + +def get_caption(event_id: str, language: str) -> str: + """Return a caption from the curated glossary.""" + + return caption_for(event_id, language) diff --git a/main/cc_suggester/ui/__init__.py b/main/cc_suggester/ui/__init__.py new file mode 100644 index 0000000..4d40324 --- /dev/null +++ b/main/cc_suggester/ui/__init__.py @@ -0,0 +1 @@ +"""Web UI clients.""" diff --git a/main/cc_suggester/ui/streamlit_app.py b/main/cc_suggester/ui/streamlit_app.py new file mode 100644 index 0000000..af7a606 --- /dev/null +++ b/main/cc_suggester/ui/streamlit_app.py @@ -0,0 +1,199 @@ +"""Streamlit editor review UI. 
+ +Run with: + streamlit run cc_suggester/ui/streamlit_app.py +""" + +from __future__ import annotations + +from pathlib import Path +import tempfile + +import streamlit as st + +from cc_suggester.core.config import SUPPORTED_DEVICES, SUPPORTED_LANGUAGES, PipelineConfig +from cc_suggester.core.errors import CCSuggesterError +from cc_suggester.core.pipeline import analyze_video +from cc_suggester.output.review_export import build_review_export + + +def main() -> None: + st.set_page_config( + page_title="Intelligent CC Suggestion Tool", + page_icon="CC", + layout="wide", + ) + st.title("Intelligent Closed Caption Suggestion Tool") + st.caption("Generate and review meaningful non-speech CC suggestions.") + + with st.sidebar: + st.header("Pipeline") + uploaded = st.file_uploader("Video file", type=["mp4", "mkv", "mov", "avi", "webm", "wav"]) + language = st.selectbox("Language", SUPPORTED_LANGUAGES, index=0) + device = st.selectbox("Device", SUPPORTED_DEVICES, index=0) + audio_backend = st.selectbox("Audio backend", ["mock", "dsp", "yamnet"], index=0) + vision_backend = st.selectbox("Vision backend", ["mock", "opencv", "mediapipe"], index=0) + decision_threshold = st.slider("Decision threshold", 0.0, 1.0, 0.65, 0.01) + review_threshold = st.slider("Review threshold", 0.0, 1.0, 0.50, 0.01) + allow_demo_input = st.checkbox("Allow demo/non-video input", value=audio_backend == "mock") + run = st.button("Start Caption", type="primary", use_container_width=True) + + if uploaded is None: + st.info("Upload a video to begin. 
Use mock backends for a fast demo, or DSP/OpenCV for real local processing.") + return + + input_path = _save_upload(uploaded) + left, right = st.columns([1.5, 1.0], gap="large") + with left: + st.subheader("Video Preview") + if input_path.suffix.lower() != ".wav": + st.video(str(input_path)) + else: + st.audio(str(input_path)) + + if run: + config = PipelineConfig( + language=language, + device=device, + audio_backend=audio_backend, + vision_backend=vision_backend, + output_dir=Path("outputs"), + decision_threshold=decision_threshold, + review_threshold=review_threshold, + allow_demo_input=allow_demo_input, + ) + try: + with st.spinner("Analyzing audio and visual reaction signals..."): + st.session_state["result"] = analyze_video(input_path, config) + except CCSuggesterError as exc: + st.error(exc.message) + if exc.suggestions: + st.markdown("**Suggestions**") + for suggestion in exc.suggestions: + st.write(f"- {suggestion}") + if exc.details: + with st.expander("Debug details"): + st.json(exc.details) + return + + result = st.session_state.get("result") + if not result: + return + + with right: + st.subheader("Run Summary") + accepted = [item for item in result.suggestions if item.accepted] + review = [item for item in result.suggestions if item.requires_review] + rejected = [item for item in result.suggestions if not item.accepted and not item.requires_review] + st.metric("Detected events", len(result.audio_events)) + st.metric("Accepted", len(accepted)) + st.metric("Needs review", len(review)) + st.metric("Rejected", len(rejected)) + st.write(f"Device used: `{result.diagnostics.actual_device}`") + if result.diagnostics.warnings: + with st.expander("Diagnostics warnings"): + for warning in result.diagnostics.warnings: + st.warning(warning) + + st.subheader("Review Suggestions") + rows = [] + for index, suggestion in enumerate(result.suggestions, start=1): + status = "accepted" if suggestion.accepted else "review" if suggestion.requires_review else "rejected" + 
row_key = f"{Path(result.input_path).stem}-{index}-{suggestion.event_id}-{suggestion.start_time:.3f}" + with st.expander( + f"{index}. {suggestion.caption_text} | {suggestion.start_time:.2f}s-{suggestion.end_time:.2f}s | {status}", + expanded=index == 1, + ): + edited = st.text_input( + "Caption text", + value=suggestion.caption_text, + key=f"caption-{row_key}", + ) + c1, c2, c3 = st.columns(3) + c1.metric("Audio", f"{suggestion.audio_confidence:.2f}") + c2.metric("Reaction", f"{suggestion.reaction_confidence:.2f}") + c3.metric("Decision", f"{suggestion.decision_score:.2f}") + st.write(suggestion.reason) + status_choice = st.radio( + "Editor decision", + ["accepted", "review", "rejected"], + index=["accepted", "review", "rejected"].index(status), + horizontal=True, + key=f"status-{row_key}", + ) + rows.append( + { + "index": index, + "event_id": suggestion.event_id, + "start": suggestion.start_time, + "end": suggestion.end_time, + "caption": edited, + "status": status_choice, + "audio": suggestion.audio_confidence, + "reaction": suggestion.reaction_confidence, + "decision": suggestion.decision_score, + "reason": suggestion.reason, + } + ) + + st.subheader("Exports") + st.dataframe(rows, use_container_width=True) + export_language = result.suggestions[0].language if result.suggestions else language + review_export = build_review_export(rows, export_language) + reviewed_accepted = sum(1 for item in review_export.suggestions if item.accepted) + reviewed_review = sum(1 for item in review_export.suggestions if item.requires_review) + reviewed_rejected = len(review_export.suggestions) - reviewed_accepted - reviewed_review + srt_name = f"{Path(result.input_path).stem}.reviewed.{export_language}.srt" + csv_name = f"{Path(result.input_path).stem}.reviewed.events.csv" + json_name = f"{Path(result.input_path).stem}.reviewed.session.json" + + c1, c2, c3 = st.columns(3) + c1.metric("Reviewed accepted", reviewed_accepted) + c2.metric("Still needs review", reviewed_review) + 
c3.metric("Rejected", reviewed_rejected) + + export_left, export_middle, export_right = st.columns(3) + export_left.download_button( + label="Download Reviewed SRT", + data=review_export.srt_text.encode("utf-8"), + file_name=srt_name, + mime="application/x-subrip", + type="primary", + use_container_width=True, + ) + export_middle.download_button( + label="Download Reviewed CSV", + data=review_export.csv_text.encode("utf-8"), + file_name=csv_name, + mime="text/csv", + use_container_width=True, + ) + export_right.download_button( + label="Download Review Session", + data=review_export.json_text.encode("utf-8"), + file_name=json_name, + mime="application/json", + use_container_width=True, + ) + + with st.expander("Raw pipeline exports"): + for name, path in result.files.items(): + if path.exists(): + st.download_button( + label=f"Download Original {name.upper()}", + data=path.read_bytes(), + file_name=path.name, + use_container_width=False, + ) + + +def _save_upload(uploaded) -> Path: + temp_dir = Path(tempfile.gettempdir()) / "cc_suggester_uploads" + temp_dir.mkdir(parents=True, exist_ok=True) + path = temp_dir / Path(uploaded.name).name  # keep only the base name so uploads stay inside temp_dir + path.write_bytes(uploaded.getbuffer()) + return path + + +if __name__ == "__main__": + main() diff --git a/main/cc_suggester/vision/__init__.py b/main/cc_suggester/vision/__init__.py new file mode 100644 index 0000000..bcd5db8 --- /dev/null +++ b/main/cc_suggester/vision/__init__.py @@ -0,0 +1 @@ +"""Visual reaction analysis modules.""" diff --git a/main/cc_suggester/vision/backends/__init__.py b/main/cc_suggester/vision/backends/__init__.py new file mode 100644 index 0000000..228cbe2 --- /dev/null +++ b/main/cc_suggester/vision/backends/__init__.py @@ -0,0 +1,22 @@ +"""Vision backend registry.""" + +from cc_suggester.vision.backends.base import VisionBackend +from cc_suggester.vision.backends.mediapipe import MediaPipeVisionBackend +from cc_suggester.vision.backends.mock 
import MockVisionBackend +from cc_suggester.vision.backends.opencv 
import OpenCvVisionBackend + + +def get_vision_backend(name: str) -> VisionBackend: + """Return a visual reaction backend by name.""" + + normalized = name.lower().strip() + if normalized in {"mock", "demo"}: + return MockVisionBackend() + if normalized in {"opencv", "cv2"}: + return OpenCvVisionBackend() + if normalized == "mediapipe": + return MediaPipeVisionBackend() + raise ValueError( + f"Unknown vision backend '{name}'. Available: mock, opencv, mediapipe. " + "Planned advanced backends: mmpose, mmaction." + ) diff --git a/main/cc_suggester/vision/backends/base.py b/main/cc_suggester/vision/backends/base.py new file mode 100644 index 0000000..49122c9 --- /dev/null +++ b/main/cc_suggester/vision/backends/base.py @@ -0,0 +1,26 @@ +"""Vision backend interface.""" + +from __future__ import annotations + +from abc import ABC, abstractmethod +from pathlib import Path + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, ReactionResult, VideoMetadata + + +class VisionBackend(ABC): + """Interface implemented by visual reaction analysis backends.""" + + name: str + requires_valid_media: bool = False + + @abstractmethod + def analyze( + self, + video_path: Path, + metadata: VideoMetadata, + audio_events: list[AudioEventCandidate], + config: PipelineConfig, + ) -> list[ReactionResult]: + """Analyze visible reactions around audio event timestamps.""" diff --git a/main/cc_suggester/vision/backends/mediapipe.py b/main/cc_suggester/vision/backends/mediapipe.py new file mode 100644 index 0000000..e42b885 --- /dev/null +++ b/main/cc_suggester/vision/backends/mediapipe.py @@ -0,0 +1,187 @@ +"""Optional MediaPipe visual reaction backend.""" + +from __future__ import annotations + +import math +from pathlib import Path +from typing import Any + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.errors import BackendUnavailableError +from cc_suggester.core.types import AudioEventCandidate, 
ReactionResult, VideoMetadata +from cc_suggester.vision.backends.base import VisionBackend + +_MODEL_URL = ( + "https://storage.googleapis.com/mediapipe-models/" + "pose_landmarker/pose_landmarker_lite/float16/latest/" + "pose_landmarker_lite.task" +) + + +def _ensure_model() -> Path: + import os + cache_dir = Path(os.path.expanduser("~/.cache/cc_suggester")) + cache_dir.mkdir(parents=True, exist_ok=True) + model_path = cache_dir / "pose_landmarker_lite.task" + if not model_path.exists(): + import urllib.request + urllib.request.urlretrieve(_MODEL_URL, model_path) + return model_path + + +class MediaPipeVisionBackend(VisionBackend): + """Estimate pose-based reaction signals around audio events.""" + + name = "mediapipe" + requires_valid_media = True + + def analyze( + self, + video_path: Path, + metadata: VideoMetadata, + audio_events: list[AudioEventCandidate], + config: PipelineConfig, + ) -> list[ReactionResult]: + cv2, mp, PoseLandmarker, VisionTaskRunningMode, BaseOptions = _import_dependencies() + capture = cv2.VideoCapture(str(video_path)) + if not capture.isOpened(): + raise BackendUnavailableError( + message="OpenCV could not open the input video for MediaPipe analysis.", + code="mediapipe_video_open_failed", + suggestions=[ + "Run ccs inspect on the input file.", + "Try --vision-backend opencv to confirm basic video decoding works.", + ], + ) + + fps = metadata.fps or capture.get(cv2.CAP_PROP_FPS) or 25.0 + results: list[ReactionResult] = [] + + from mediapipe.tasks.python.vision import PoseLandmarkerOptions + model_path = _ensure_model() + options = PoseLandmarkerOptions( + base_options=BaseOptions(model_asset_path=str(model_path)), + running_mode=VisionTaskRunningMode.IMAGE, + num_poses=1, + min_pose_detection_confidence=0.4, + output_segmentation_masks=False, + ) + pose = PoseLandmarker.create_from_options(options) + try: + for event in audio_events: + frames = _sample_frames(cv2, capture, fps, _event_offsets(event, config)) + landmarks = 
[_pose_landmarks(cv2, mp, pose, frame) for frame in frames] + landmarks = [item for item in landmarks if item is not None] + pose_motion = _landmark_motion(landmarks) + head_motion = _head_motion(landmarks) + visibility = len(landmarks) / max(1, len(frames)) + reaction_confidence = round( + max(0.0, min(0.95, (pose_motion * 3.0) + (head_motion * 2.0) + (visibility * 0.08))), + 3, + ) + results.append( + ReactionResult( + event_id=event.event_id, + start_time=event.start_time, + end_time=event.end_time, + reaction_confidence=reaction_confidence, + reaction_signals={ + "pose_motion": round(pose_motion, 4), + "head_motion": round(head_motion, 4), + "pose_visibility": round(visibility, 4), + }, + frames_sampled=len(frames), + vision_backend=self.name, + debug_info={"fps": fps, "landmark_frames": len(landmarks)}, + ) + ) + finally: + pose.close() + capture.release() + return results + + +def _import_dependencies(): + try: + import cv2 # type: ignore + import mediapipe as mp # type: ignore + from mediapipe.tasks.python.vision import PoseLandmarker + from mediapipe.tasks.python.vision.core.vision_task_running_mode import VisionTaskRunningMode + from mediapipe.tasks.python.core.base_options import BaseOptions + except Exception as exc: + raise BackendUnavailableError( + message="The MediaPipe vision backend requires mediapipe and opencv-python.", + code="mediapipe_not_installed", + suggestions=[ + "Install vision dependencies: pip install -r requirements-vision.txt", + "Use --vision-backend opencv for a CPU scene-motion baseline.", + "Use --vision-backend mock for deterministic demos/tests.", + ], + details={"error": str(exc)}, + ) from exc + return cv2, mp, PoseLandmarker, VisionTaskRunningMode, BaseOptions + + +def _event_offsets(event: AudioEventCandidate, config: PipelineConfig) -> list[float]: + midpoint = (event.start_time + event.end_time) / 2.0 + return [ + event.start_time - config.sample_window_before, + event.start_time, + midpoint, + event.end_time, + 
event.end_time + config.sample_window_after, + ] + + +def _sample_frames(cv2, capture, fps: float, offsets: list[float]) -> list[Any]: + frames: list[Any] = [] + for seconds in offsets: + if seconds < 0: + continue + capture.set(cv2.CAP_PROP_POS_FRAMES, int(seconds * fps)) + ok, frame = capture.read() + if ok and frame is not None: + frames.append(frame) + return frames + + +def _pose_landmarks(cv2, mp, pose, frame) -> list[tuple[float, float, float]] | None: + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb) + result = pose.detect(mp_image) + if not result.pose_landmarks: + return None + return [(lm.x, lm.y, lm.visibility) for lm in result.pose_landmarks[0]] + + +def _landmark_motion(frames: list[list[tuple[float, float, float]]]) -> float: + if len(frames) < 2: + return 0.0 + indices = [0, 11, 12, 15, 16, 23, 24] + motions: list[float] = [] + for previous, current in zip(frames, frames[1:]): + distances = [] + for index in indices: + if index >= len(previous) or index >= len(current): + continue + if previous[index][2] < 0.35 or current[index][2] < 0.35: + continue + distances.append(_distance(previous[index], current[index])) + if distances: + motions.append(sum(distances) / len(distances)) + return max(motions) if motions else 0.0 + + +def _head_motion(frames: list[list[tuple[float, float, float]]]) -> float: + if len(frames) < 2: + return 0.0 + motions: list[float] = [] + for previous, current in zip(frames, frames[1:]): + if previous[0][2] < 0.35 or current[0][2] < 0.35: + continue + motions.append(_distance(previous[0], current[0])) + return max(motions) if motions else 0.0 + + +def _distance(first: tuple[float, float, float], second: tuple[float, float, float]) -> float: + return math.sqrt((first[0] - second[0]) ** 2 + (first[1] - second[1]) ** 2) diff --git a/main/cc_suggester/vision/backends/mock.py b/main/cc_suggester/vision/backends/mock.py new file mode 100644 index 0000000..da90e8b --- 
/dev/null +++ b/main/cc_suggester/vision/backends/mock.py @@ -0,0 +1,45 @@ +"""Deterministic demo visual reaction backend.""" + +from __future__ import annotations + +from pathlib import Path + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import AudioEventCandidate, ReactionResult, VideoMetadata +from cc_suggester.vision.backends.base import VisionBackend + + +class MockVisionBackend(VisionBackend): + """Return plausible reaction scores for classroom-style events.""" + + name = "mock" + + def analyze( + self, + video_path: Path, + metadata: VideoMetadata, + audio_events: list[AudioEventCandidate], + config: PipelineConfig, + ) -> list[ReactionResult]: + return [_reaction_for(event) for event in audio_events] + + +def _reaction_for(event: AudioEventCandidate) -> ReactionResult: + reaction_map = { + "children_cheer": (0.82, {"raised_hands": 0.89, "face_change": 0.72, "motion_spike": 0.78}), + "school_bell": (0.61, {"head_turn": 0.67, "posture_shift": 0.52, "motion_spike": 0.64}), + "applause": (0.54, {"hand_motion": 0.79, "face_change": 0.38, "motion_spike": 0.68}), + "chair_scrape": (0.39, {"posture_shift": 0.35, "motion_spike": 0.42}), + "background_chatter": (0.16, {"ambient_scene": 0.73, "head_turn": 0.09}), + } + confidence, signals = reaction_map.get(event.event_id, (0.25, {})) + return ReactionResult( + event_id=event.event_id, + start_time=event.start_time, + end_time=event.end_time, + reaction_confidence=confidence, + reaction_signals=signals, + frames_sampled=7, + vision_backend="mock", + debug_info={"source": "deterministic mock backend"}, + ) diff --git a/main/cc_suggester/vision/backends/opencv.py b/main/cc_suggester/vision/backends/opencv.py new file mode 100644 index 0000000..5d372d7 --- /dev/null +++ b/main/cc_suggester/vision/backends/opencv.py @@ -0,0 +1,115 @@ +"""OpenCV visual reaction backend.""" + +from __future__ import annotations + +from pathlib import Path + +from cc_suggester.core.config import 
PipelineConfig +from cc_suggester.core.errors import BackendUnavailableError +from cc_suggester.core.types import AudioEventCandidate, ReactionResult, VideoMetadata +from cc_suggester.vision.backends.base import VisionBackend + + +class OpenCvVisionBackend(VisionBackend): + """Estimate scene reaction using frame differences around each event.""" + + name = "opencv" + requires_valid_media = True + + def analyze( + self, + video_path: Path, + metadata: VideoMetadata, + audio_events: list[AudioEventCandidate], + config: PipelineConfig, + ) -> list[ReactionResult]: + cv2 = _import_cv2() + capture = cv2.VideoCapture(str(video_path)) + if not capture.isOpened(): + raise BackendUnavailableError( + message="OpenCV could not open the input video.", + code="opencv_open_failed", + suggestions=[ + "Run ccs inspect on the input file.", + "Try re-encoding the video to MP4/H.264.", + ], + ) + + fps = metadata.fps or capture.get(cv2.CAP_PROP_FPS) or 25.0 + results: list[ReactionResult] = [] + try: + for event in audio_events: + frames = _sample_grayscale_frames( + cv2=cv2, + capture=capture, + fps=fps, + offsets=[ + event.start_time - config.sample_window_before, + event.start_time, + (event.start_time + event.end_time) / 2.0, + event.end_time, + event.end_time + config.sample_window_after, + ], + ) + motion = _motion_score(cv2, frames) + reaction_confidence = round(max(0.0, min(0.95, motion * 3.5)), 3) + results.append( + ReactionResult( + event_id=event.event_id, + start_time=event.start_time, + end_time=event.end_time, + reaction_confidence=reaction_confidence, + reaction_signals={ + "scene_motion_delta": round(motion, 4), + "frame_difference": round(motion, 4), + }, + frames_sampled=len(frames), + vision_backend=self.name, + debug_info={"fps": fps, "method": "grayscale frame difference"}, + ) + ) + finally: + capture.release() + return results + + +def _import_cv2(): + try: + import cv2 # type: ignore + except Exception as exc: + raise BackendUnavailableError( + message="The 
OpenCV vision backend requires opencv-python.", + code="opencv_not_installed", + suggestions=[ + "Install vision dependencies: pip install -r requirements-vision.txt", + "Use --vision-backend mock for deterministic demos/tests.", + ], + ) from exc + return cv2 + + +def _sample_grayscale_frames(cv2, capture, fps: float, offsets: list[float]) -> list[object]: + frames: list[object] = [] + for seconds in offsets: + if seconds < 0: + continue + capture.set(cv2.CAP_PROP_POS_FRAMES, int(seconds * fps)) + ok, frame = capture.read() + if not ok or frame is None: + continue + gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) + gray = cv2.resize(gray, (160, 90)) + frames.append(gray) + return frames + + +def _motion_score(cv2, frames: list[object]) -> float: + if len(frames) < 2: + return 0.0 + scores: list[float] = [] + for previous, current in zip(frames, frames[1:]): + diff = cv2.absdiff(previous, current) + mean_score = float(diff.mean()) / 255.0 + changed_fraction = float((diff > 18).sum()) / float(diff.size) + scores.append(max(mean_score * 8.0, changed_fraction * 5.0)) + return max(scores) if scores else 0.0 diff --git a/main/cc_suggester/vision/frame_sampler.py b/main/cc_suggester/vision/frame_sampler.py new file mode 100644 index 0000000..618abd2 --- /dev/null +++ b/main/cc_suggester/vision/frame_sampler.py @@ -0,0 +1,9 @@ +"""Frame sampling policy for event-aligned visual analysis.""" + +from __future__ import annotations + + +def sample_offsets(before: float = 1.0, after: float = 1.0) -> list[float]: + """Return relative frame offsets around an event start/end window.""" + + return [-before, -0.5, 0.0, 0.5, after] diff --git a/main/cc_suggester/vision/optical_flow.py b/main/cc_suggester/vision/optical_flow.py new file mode 100644 index 0000000..da501df --- /dev/null +++ b/main/cc_suggester/vision/optical_flow.py @@ -0,0 +1,14 @@ +"""Placeholder optical flow helpers.""" + +from __future__ import annotations + + +def describe_planned_signals() -> list[str]: + """Return 
visual motion signals planned for OpenCV implementation.""" + + return [ + "global optical-flow magnitude", + "localized motion spike", + "pre/post-event motion delta", + "camera shake suppression", + ] diff --git a/main/cc_suggester/vision/reactions.py b/main/cc_suggester/vision/reactions.py new file mode 100644 index 0000000..24c1b3d --- /dev/null +++ b/main/cc_suggester/vision/reactions.py @@ -0,0 +1,13 @@ +"""Reaction scoring helpers.""" + +from __future__ import annotations + +from cc_suggester.core.types import ReactionResult + + +def strongest_signal(reaction: ReactionResult) -> str | None: + """Return the strongest named reaction signal, if available.""" + + if not reaction.reaction_signals: + return None + return max(reaction.reaction_signals, key=lambda key: reaction.reaction_signals[key]) diff --git a/main/configs/default.json b/main/configs/default.json new file mode 100644 index 0000000..3a364d9 --- /dev/null +++ b/main/configs/default.json @@ -0,0 +1,17 @@ +{ + "language": "en", + "device": "auto", + "audio_backend": "mock", + "vision_backend": "mock", + "yamnet_model": null, + "yamnet_class_map_path": null, + "yamnet_top_k": 5, + "audio_threshold": 0.45, + "reaction_threshold": 0.35, + "decision_threshold": 0.65, + "review_threshold": 0.5, + "min_event_duration": 0.25, + "merge_gap": 0.4, + "sample_window_before": 1.0, + "sample_window_after": 1.0 +} diff --git a/main/pyproject.toml b/main/pyproject.toml new file mode 100644 index 0000000..b8dcaa0 --- /dev/null +++ b/main/pyproject.toml @@ -0,0 +1,27 @@ +[build-system] +requires = ["setuptools>=68"] +build-backend = "setuptools.build_meta" + +[project] +name = "cc-suggester" +version = "0.1.0" +description = "AI-assisted non-speech closed caption suggestion pipeline." 
+readme = "README.md" +requires-python = ">=3.10" +authors = [ + { name = "Planet Read project contributor" } +] +dependencies = [] + +[project.optional-dependencies] +audio = ["numpy>=1.26", "tensorflow>=2.16", "tensorflow-hub>=0.16"] +ui = ["streamlit>=1.34"] +vision = ["opencv-python>=4.8", "mediapipe>=0.10"] +dev = ["pytest>=8.0"] + +[project.scripts] +ccs = "cc_suggester.cli.app:main" + +[tool.setuptools.packages.find] +where = ["."] +include = ["cc_suggester*"] diff --git a/main/requirements-audio.txt b/main/requirements-audio.txt new file mode 100644 index 0000000..ab78471 --- /dev/null +++ b/main/requirements-audio.txt @@ -0,0 +1,6 @@ +# CPU DSP backend uses only the Python standard library plus ffmpeg. +# +# Optional YAMNet semantic backend: +numpy>=1.26 +tensorflow>=2.16 +tensorflow-hub>=0.16 diff --git a/main/requirements-dev.txt b/main/requirements-dev.txt new file mode 100644 index 0000000..039d26e --- /dev/null +++ b/main/requirements-dev.txt @@ -0,0 +1 @@ +pytest>=8.0 diff --git a/main/requirements-translate.txt b/main/requirements-translate.txt new file mode 100644 index 0000000..5dea570 --- /dev/null +++ b/main/requirements-translate.txt @@ -0,0 +1 @@ +# Placeholder for IndicTrans2 or other translation backend dependencies. diff --git a/main/requirements-ui.txt b/main/requirements-ui.txt new file mode 100644 index 0000000..15743b7 --- /dev/null +++ b/main/requirements-ui.txt @@ -0,0 +1 @@ +streamlit>=1.34 diff --git a/main/requirements-vision.txt b/main/requirements-vision.txt new file mode 100644 index 0000000..2e625e8 --- /dev/null +++ b/main/requirements-vision.txt @@ -0,0 +1,2 @@ +opencv-python>=4.8 +mediapipe>=0.10 diff --git a/main/requirements.txt b/main/requirements.txt new file mode 100644 index 0000000..cc67f4f --- /dev/null +++ b/main/requirements.txt @@ -0,0 +1,2 @@ +# Core scaffold intentionally uses only the Python standard library. +# Real model backends will add optional dependencies as they are implemented. 
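The thresholds in `main/configs/default.json` above are listed without showing how they are applied. The following is a minimal sketch of one plausible reading, assuming a decision score at or above `decision_threshold` is accepted, a score at or above `review_threshold` is flagged for editor review, and anything lower is rejected. The `triage` helper is hypothetical and is not part of this diff; the threshold values are copied from the config, and the sample scores match the accepted, borderline, and rejected fixtures in `tests/test_outputs.py`.

```python
import json

# Threshold values copied from main/configs/default.json in this diff.
DEFAULT_CONFIG = json.loads(
    """
    {
      "audio_threshold": 0.45,
      "reaction_threshold": 0.35,
      "decision_threshold": 0.65,
      "review_threshold": 0.5
    }
    """
)


def triage(decision_score: float, config: dict = DEFAULT_CONFIG) -> str:
    """Hypothetical helper: map a decision score to an editor status.

    Assumption (not confirmed by the diff): scores at or above
    decision_threshold are accepted, scores at or above review_threshold
    are sent to review, and everything else is rejected.
    """
    if decision_score >= config["decision_threshold"]:
        return "accepted"
    if decision_score >= config["review_threshold"]:
        return "review"
    return "rejected"


if __name__ == "__main__":
    # Scores taken from the CaptionSuggestion fixtures in tests/test_outputs.py.
    for score in (0.8, 0.6, 0.2):
        print(score, triage(score))
```

Under this reading, the `school_bell` fixture with a decision score of 0.6 falls between the two thresholds, which would agree with its `requires_review=True` flag in the tests.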
diff --git a/main/scripts/generate_sample_video.py b/main/scripts/generate_sample_video.py new file mode 100644 index 0000000..1fec783 --- /dev/null +++ b/main/scripts/generate_sample_video.py @@ -0,0 +1,130 @@ +"""Generate tiny deterministic sample media for integration testing. + +The script uses Python's standard library to create a synthetic WAV file and +ffmpeg to combine it with a generated test video pattern when ffmpeg is +available. If ffmpeg is missing but OpenCV is installed, it writes a video-only +MP4 plus a sidecar WAV file. The sidecar path can be passed to the CLI with +``--audio-path``. + +Usage: + python scripts/generate_sample_video.py + python scripts/generate_sample_video.py --out tests/fixtures/sample.mp4 +""" + +from __future__ import annotations + +import argparse +import math +import shutil +import subprocess +import wave +from pathlib import Path + + +def main() -> int: + parser = argparse.ArgumentParser(description="Generate a sample MP4 with audible non-speech events.") + parser.add_argument("--out", type=Path, default=Path("tests/fixtures/sample_classroom.mp4")) + parser.add_argument("--duration", type=float, default=6.0) + args = parser.parse_args() + + args.out.parent.mkdir(parents=True, exist_ok=True) + wav_path = args.out.with_suffix(".wav") + _write_synthetic_wav(wav_path, duration=args.duration) + ffmpeg = shutil.which("ffmpeg") + if ffmpeg is not None: + _write_video_with_ffmpeg(ffmpeg, wav_path, args.out, duration=args.duration) + print(f"Generated embedded-audio sample video: {args.out}") + else: + _write_video_with_opencv(args.out, duration=args.duration) + print(f"Generated video-only sample: {args.out}") + print("ffmpeg was not found, so use the sidecar WAV with --audio-path.") + print(f"Generated source audio: {wav_path}") + return 0 + + +def _write_synthetic_wav(path: Path, *, duration: float, sample_rate: int = 16000) -> None: + samples = [] + total_samples = int(sample_rate * duration) + for index in 
range(total_samples): + seconds = index / sample_rate + base = math.sin(2 * math.pi * 180 * seconds) * 450 + event_one = math.sin(2 * math.pi * 880 * seconds) * 19000 if 1.15 <= seconds <= 1.55 else 0 + event_two = math.sin(2 * math.pi * 1240 * seconds) * 17000 if 3.25 <= seconds <= 3.70 else 0 + value = int(max(-32000, min(32000, base + event_one + event_two))) + samples.append(value) + + with wave.open(str(path), "wb") as wav: + wav.setnchannels(1) + wav.setsampwidth(2) + wav.setframerate(sample_rate) + wav.writeframes(b"".join(sample.to_bytes(2, "little", signed=True) for sample in samples)) + + +def _write_video_with_ffmpeg(ffmpeg: str, wav_path: Path, out_path: Path, *, duration: float) -> None: + command = [ + ffmpeg, + "-y", + "-f", + "lavfi", + "-i", + f"testsrc=size=640x360:rate=25:duration={duration}", + "-i", + str(wav_path), + "-shortest", + "-c:v", + "mpeg4", + "-q:v", + "5", + "-c:a", + "aac", + "-pix_fmt", + "yuv420p", + str(out_path), + ] + completed = subprocess.run(command, capture_output=True, text=True) + if completed.returncode != 0: + raise SystemExit(f"ffmpeg failed:\n{completed.stderr}") + + +def _write_video_with_opencv(out_path: Path, *, duration: float) -> None: + try: + import cv2 # type: ignore + import numpy as np # type: ignore + except Exception as exc: + raise SystemExit( + "Neither ffmpeg nor OpenCV video writing is available. " + "Install ffmpeg, or install opencv-python for sidecar fixture generation." 
+ ) from exc + + width, height, fps = 640, 360, 25 + writer = cv2.VideoWriter( + str(out_path), + cv2.VideoWriter_fourcc(*"mp4v"), + fps, + (width, height), + ) + if not writer.isOpened(): + raise SystemExit("OpenCV could not open a VideoWriter for the requested output path.") + + total_frames = int(duration * fps) + for frame_index in range(total_frames): + t = frame_index / fps + frame = np.zeros((height, width, 3), dtype=np.uint8) + frame[:, :] = (235, 238, 230) + cv2.rectangle(frame, (0, 0), (width, 84), (92, 130, 178), -1) + cv2.putText(frame, "Demo classroom scene", (28, 52), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2) + cv2.rectangle(frame, (50, 120), (590, 300), (245, 245, 245), -1) + cv2.rectangle(frame, (72, 145), (210, 278), (88, 120, 150), -1) + cv2.rectangle(frame, (252, 145), (390, 278), (90, 150, 120), -1) + cv2.rectangle(frame, (432, 145), (570, 278), (150, 120, 90), -1) + if 1.15 <= t <= 1.55 or 3.25 <= t <= 3.70: + cv2.circle(frame, (320, 212), 44, (0, 215, 255), -1) + cv2.putText(frame, "SOUND EVENT", (230, 218), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (20, 45, 60), 2) + else: + cv2.circle(frame, (320, 212), 28, (190, 205, 220), -1) + writer.write(frame) + writer.release() + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/main/tests/test_config_cli.py b/main/tests/test_config_cli.py new file mode 100644 index 0000000..828bdac --- /dev/null +++ b/main/tests/test_config_cli.py @@ -0,0 +1,36 @@ +import json + +from cc_suggester.cli.app import main +from cc_suggester.core.config import PipelineConfig, load_config, merge_config + + +def test_load_and_merge_config(tmp_path): + config_path = tmp_path / "config.json" + config_path.write_text( + json.dumps({"language": "hi", "audio_backend": "mock", "vision_backend": "mock"}), + encoding="utf-8", + ) + + loaded = load_config(config_path) + merged = merge_config(loaded, language="ml", device="cpu") + + assert loaded.language == "hi" + assert merged.language == "ml" + assert 
merged.device == "cpu" + + +def test_cli_labels_command(capsys): + exit_code = main(["labels"]) + captured = capsys.readouterr() + + assert exit_code == 0 + assert "Supported languages" in captured.out + assert "horn_honk" in captured.out + + +def test_cli_unknown_command_suggests_analyze(capsys): + exit_code = main(["analize"]) + captured = capsys.readouterr() + + assert exit_code == 2 + assert "Did you mean: analyze?" in captured.err diff --git a/main/tests/test_dsp_backend.py b/main/tests/test_dsp_backend.py new file mode 100644 index 0000000..74dd95d --- /dev/null +++ b/main/tests/test_dsp_backend.py @@ -0,0 +1,41 @@ +import math +import wave +from pathlib import Path + +from cc_suggester.audio.backends.dsp import DspAudioBackend +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.types import VideoMetadata + + +def test_dsp_backend_detects_synthetic_loud_region(tmp_path: Path): + wav_path = tmp_path / "synthetic.wav" + _write_synthetic_wav(wav_path) + backend = DspAudioBackend() + config = PipelineConfig(audio_backend="dsp", vision_backend="mock", audio_threshold=0.40, run_dir=tmp_path) + metadata = VideoMetadata(path=wav_path, exists=True, has_audio=True, has_video=False, duration=3.0) + + events = backend.detect(wav_path, metadata, config) + + assert events + assert events[0].audio_backend == "dsp" + assert events[0].audio_confidence >= 0.40 + assert events[0].start_time < 1.5 + assert events[0].end_time > 1.0 + + +def _write_synthetic_wav(path: Path) -> None: + sample_rate = 16000 + samples = [] + for index in range(sample_rate * 3): + seconds = index / sample_rate + if 1.0 <= seconds <= 1.45: + value = int(math.sin(2 * math.pi * 880 * seconds) * 18000) + else: + value = int(math.sin(2 * math.pi * 220 * seconds) * 500) + samples.append(value) + + with wave.open(str(path), "wb") as wav: + wav.setnchannels(1) + wav.setsampwidth(2) + wav.setframerate(sample_rate) + wav.writeframes(b"".join(sample.to_bytes(2, "little", signed=True) for 
sample in samples)) diff --git a/main/tests/test_outputs.py b/main/tests/test_outputs.py new file mode 100644 index 0000000..6449649 --- /dev/null +++ b/main/tests/test_outputs.py @@ -0,0 +1,73 @@ +from cc_suggester.core.types import CaptionSuggestion +from cc_suggester.decision.labels import caption_for +from cc_suggester.output.csv_report import render_csv_report +from cc_suggester.output.srt import format_srt_time, write_srt + + +def test_format_srt_time(): + assert format_srt_time(0) == "00:00:00,000" + assert format_srt_time(62.345) == "00:01:02,345" + + +def test_caption_for_known_language(): + assert caption_for("horn_honk", "hi") == "[हॉर्न बजता है]" + assert caption_for("impact_sound", "ml") == "[പെട്ടെന്നുള്ള ശബ്ദം]" + assert caption_for("siren", "ta") == "[சைரன் ஒலிக்கிறது]" + + +def test_write_srt_only_accepts_accepted(tmp_path): + suggestions = [ + CaptionSuggestion( + event_id="horn_honk", + start_time=1.0, + end_time=2.0, + audio_confidence=0.9, + reaction_confidence=0.8, + decision_score=0.8, + accepted=True, + reason="accepted", + caption_text="[horn honks]", + language="en", + ), + CaptionSuggestion( + event_id="background_chatter", + start_time=3.0, + end_time=4.0, + audio_confidence=0.5, + reaction_confidence=0.1, + decision_score=0.2, + accepted=False, + reason="rejected", + caption_text="[background chatter]", + language="en", + ), + ] + output = tmp_path / "captions.srt" + write_srt(suggestions, output) + text = output.read_text(encoding="utf-8") + assert "[horn honks]" in text + assert "[background chatter]" not in text + + +def test_render_csv_report_includes_review_flags(): + suggestions = [ + CaptionSuggestion( + event_id="school_bell", + start_time=10.0, + end_time=11.0, + audio_confidence=0.7, + reaction_confidence=0.4, + decision_score=0.6, + accepted=False, + requires_review=True, + reason="borderline", + caption_text="[school bell rings]", + language="en", + ) + ] + + text = render_csv_report(suggestions) + + assert "school_bell" in 
text + assert "requires_review" in text + assert "True" in text diff --git a/main/tests/test_real_video_integration.py b/main/tests/test_real_video_integration.py new file mode 100644 index 0000000..2a0ab90 --- /dev/null +++ b/main/tests/test_real_video_integration.py @@ -0,0 +1,43 @@ +import subprocess +import sys +from pathlib import Path + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.media import inspect_video +from cc_suggester.core.pipeline import analyze_video + + +def test_real_sample_video_inspect_and_analyze(tmp_path: Path): + sample_path = tmp_path / "sample_classroom.mp4" + sidecar_path = sample_path.with_suffix(".wav") + generator = Path(__file__).resolve().parents[1] / "scripts" / "generate_sample_video.py" + + subprocess.run( + [sys.executable, str(generator), "--out", str(sample_path)], + check=True, + capture_output=True, + text=True, + ) + + metadata = inspect_video(sample_path) + assert metadata.exists + assert metadata.has_video is True + assert metadata.duration is not None + + result = analyze_video( + sample_path, + PipelineConfig( + language="en", + audio_backend="dsp", + vision_backend="opencv", + output_dir=tmp_path / "outputs", + sidecar_audio_path=sidecar_path, + audio_threshold=0.40, + ), + ) + + assert result.files["srt"].exists() + assert result.files["json"].exists() + assert result.artifacts["audio_wav"].exists() + assert result.audio_events + assert any(suggestion.accepted for suggestion in result.suggestions) diff --git a/main/tests/test_review_export.py b/main/tests/test_review_export.py new file mode 100644 index 0000000..3bea189 --- /dev/null +++ b/main/tests/test_review_export.py @@ -0,0 +1,72 @@ +import json + +import pytest + +from cc_suggester.output.review_export import build_review_export, suggestions_from_review_rows + + +def test_review_export_uses_edited_statuses_and_caption_text(): + rows = [ + { + "index": 1, + "event_id": "horn_honk", + "start": 1.2, + "end": 2.4, + "caption": "[edited 
horn]", + "status": "accepted", + "audio": 0.9, + "reaction": 0.8, + "decision": 0.85, + "reason": "Pipeline accepted this event.", + }, + { + "index": 2, + "event_id": "traffic_noise", + "start": 5.0, + "end": 7.0, + "caption": "[traffic]", + "status": "rejected", + "audio": 0.5, + "reaction": 0.1, + "decision": 0.2, + "reason": "Ambient background noise.", + }, + ] + + export = build_review_export(rows, "en") + + assert len(export.suggestions) == 2 + assert export.suggestions[0].accepted is True + assert export.suggestions[0].caption_text == "[edited horn]" + assert export.suggestions[1].accepted is False + assert export.suggestions[1].requires_review is False + assert "[edited horn]" in export.srt_text + assert "[traffic]" not in export.srt_text + assert "traffic_noise" in export.csv_text + assert json.loads(export.json_text)["summary"]["accepted"] == 1 + + +def test_review_export_preserves_review_state(): + rows = [ + { + "event_id": "school_bell", + "start_time": 10, + "end_time": 11, + "caption_text": "[school bell]", + "status": "review", + } + ] + + suggestions = suggestions_from_review_rows(rows, "hi") + + assert suggestions[0].accepted is False + assert suggestions[0].requires_review is True + assert suggestions[0].language == "hi" + assert suggestions[0].debug_info["editor_status"] == "review" + + +def test_review_export_rejects_unknown_status(): + rows = [{"event_id": "horn_honk", "status": "maybe"}] + + with pytest.raises(ValueError, match="Unknown review status"): + suggestions_from_review_rows(rows, "en") diff --git a/main/tests/test_vision_pipeline.py b/main/tests/test_vision_pipeline.py new file mode 100644 index 0000000..7ec8472 --- /dev/null +++ b/main/tests/test_vision_pipeline.py @@ -0,0 +1,41 @@ +import subprocess +import sys +from pathlib import Path + +from cc_suggester.core.config import PipelineConfig +from cc_suggester.core.pipeline import detect_audio_events, score_visual_reactions + + +def 
test_score_visual_reactions_from_audio_report(tmp_path: Path): + sample_path = tmp_path / "sample_classroom.mp4" + sidecar_path = sample_path.with_suffix(".wav") + generator = Path(__file__).resolve().parents[1] / "scripts" / "generate_sample_video.py" + subprocess.run( + [sys.executable, str(generator), "--out", str(sample_path)], + check=True, + capture_output=True, + text=True, + ) + + audio_payload = detect_audio_events( + sample_path, + PipelineConfig( + audio_backend="dsp", + sidecar_audio_path=sidecar_path, + output_dir=tmp_path / "outputs", + audio_threshold=0.40, + ), + ) + + vision_payload = score_visual_reactions( + sample_path, + Path(audio_payload["files"]["audio_json"]), + PipelineConfig( + vision_backend="opencv", + output_dir=tmp_path / "outputs", + ), + ) + + assert vision_payload["reactions"] + assert Path(vision_payload["files"]["vision_json"]).exists() + assert vision_payload["reactions"][0]["vision_backend"] == "opencv" diff --git a/main/tests/test_yamnet_backend.py b/main/tests/test_yamnet_backend.py new file mode 100644 index 0000000..e249b64 --- /dev/null +++ b/main/tests/test_yamnet_backend.py @@ -0,0 +1,38 @@ +from pathlib import Path + +from cc_suggester.audio.backends.yamnet import _events_from_scores +from cc_suggester.audio.label_mapping import normalize_sound_label +from cc_suggester.core.config import PipelineConfig + + +def test_normalize_sound_label_common_yamnet_classes(): + assert normalize_sound_label("Vehicle horn, car horn, honking") == "horn_honk" + assert normalize_sound_label("Glass") == "glass_break" + assert normalize_sound_label("Applause") == "applause" + assert normalize_sound_label("Siren") == "siren" + assert normalize_sound_label("Speech") is None + + +def test_events_from_yamnet_scores_maps_classes_to_events(tmp_path: Path): + scores = [ + [0.91, 0.10, 0.05], + [0.05, 0.82, 0.02], + [0.02, 0.06, 0.78], + ] + class_names = [ + "Vehicle horn, car horn, honking", + "Glass", + "Applause", + ] + + events = 
_events_from_scores( + scores_array=scores, + class_names=class_names, + audio_path=tmp_path / "audio.wav", + config=PipelineConfig(audio_backend="yamnet", audio_threshold=0.40, yamnet_top_k=2), + ) + + assert [event.event_id for event in events] == ["horn_honk", "glass_break", "applause"] + assert events[0].audio_backend == "yamnet" + assert events[1].start_time == 0.48 + assert events[2].raw_class_name == "Applause" diff --git a/mockups/hindi.png b/mockups/hindi.png new file mode 100644 index 0000000..85f310d Binary files /dev/null and b/mockups/hindi.png differ diff --git a/mockups/mallu.png b/mockups/mallu.png new file mode 100644 index 0000000..cdf1c9a Binary files /dev/null and b/mockups/mallu.png differ diff --git a/mockups/telugu.png b/mockups/telugu.png new file mode 100644 index 0000000..fb30d99 Binary files /dev/null and b/mockups/telugu.png differ diff --git a/mockups/web-ui.html b/mockups/web-ui.html new file mode 100644 index 0000000..527ba75 --- /dev/null +++ b/mockups/web-ui.html @@ -0,0 +1,1435 @@ + + + + + + Intelligent CC Suggestion Tool - Interactive Web UI Mockup + + + +
+ CC
+ Intelligent Closed Caption Suggestion Tool
+ Non-speech sound detection, visual reaction scoring, and editor-reviewed SRT export
+ Video Review
+ 00:02:18 / 00:06:42
+ Current event: children_cheer
+ Paused
+ [children cheering]
+ Live signal explanation
+ Audio 0.91
+ Reaction 0.82
+ Decision 0.88
+ Status | Start | End | Event | Audio | Reaction | Decision | Reason
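The mockup's review table feeds an editor-reviewed SRT export, and `tests/test_outputs.py` pins the timestamp format (`0` maps to `00:00:00,000`, `62.345` maps to `00:01:02,345`). The sketch below shows one way that formatting and an accepted-rows-only export could work. It is an illustration under stated assumptions, not the project's `cc_suggester.output.srt` implementation: `render_srt` and the row keys are hypothetical, and the sample rows (including the end time of 140.5 seconds) are made up around the mockup's `children_cheer` event.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    # Round to whole milliseconds first to avoid float truncation errors.
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def render_srt(rows: list[dict]) -> str:
    """Hypothetical helper: render only rows whose status is 'accepted'."""
    accepted = [row for row in rows if row["status"] == "accepted"]
    blocks = []
    for number, row in enumerate(accepted, start=1):
        start = format_srt_time(row["start"])
        end = format_srt_time(row["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{row['caption']}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Illustrative rows only; the timings are not taken from real pipeline output.
    sample_rows = [
        {"status": "accepted", "start": 138.0, "end": 140.5, "caption": "[children cheering]"},
        {"status": "rejected", "start": 200.0, "end": 203.0, "caption": "[background chatter]"},
    ]
    print(render_srt(sample_rows))
```

Numbering the cues after filtering keeps the SRT indices contiguous even when the editor rejects rows in the middle of the table.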