charles1018/NemoScribe

🎬 NemoScribe

AI Subtitle Generator: Convert Video to SRT with NVIDIA NeMo Speech-to-Text

Fast, local, GPU-accelerated automatic transcription with word-level timestamps.
A free, offline alternative to cloud captioning services.

NemoScribe is a command-line speech-to-text subtitle generator that converts video files (MP4, MKV, AVI, MOV, WebM) into accurately timed SRT subtitles, entirely on your own machine. Built on the NVIDIA NeMo ASR framework with Parakeet-TDT as the default model, it delivers state-of-the-art English automatic speech recognition (ASR) with word-level timestamps, and handles long audio (up to 3 hours) through chunked inference.

video.mp4 ─► FFmpeg ─► VAD speech detection ─► NeMo ASR (Parakeet-TDT) ─► ITN / LLM correction ─► video.srt

💡 Why NemoScribe?

  • 🏆 State-of-the-art accuracy: Parakeet-TDT-0.6B-v2 is a top-ranked English model on the HuggingFace Open ASR Leaderboard, ahead of much larger models including Whisper-large-v3
  • ⚡ Blazing fast: Up to ~240× realtime on a consumer GPU, enough to transcribe a full TV episode in well under a minute
  • 🔒 100% local & private: Your audio never leaves your machine. No cloud upload, no subscription, no per-minute fees
  • 🎯 Accurate timestamps: Word-level and segment-level timestamps straight from the model, with no forced-alignment hacks
  • 🎭 Tuned for real content: VAD presets and punctuation-based segmentation optimized for movies, TV drama, and dialogue-dense audio
  • 🤖 Optional AI cleanup: LLM post-processing (OpenAI / Anthropic) fixes character names, proper nouns, and homophones

Use cases: subtitling movies and TV shows, transcribing lectures and tutorials, captioning YouTube videos, interview and podcast transcription, accessibility captions (SDH/CC).

✨ Features

  • Accurate Timestamps: Word-level and segment-level timestamps from NeMo ASR models
  • Long Audio Support: Process videos up to 3 hours with automatic chunking
  • Voice Activity Detection (VAD): Filter non-speech content to reduce hallucinations
  • Smart Segmentation: Split audio at silence boundaries, not mid-speech
  • Inverse Text Normalization (ITN): Convert spoken forms to written forms ("twenty five" → "25")
  • LLM Post-processing: Fix character names and transcription errors using AI (OpenAI/Anthropic)
  • CUDA Optimized: CUDA graphs enabled by default for faster inference
  • Batch Processing: Process entire directories of videos

📋 Requirements

| Requirement | Details |
|---|---|
| OS | Windows 10/11, Linux |
| Python | 3.10+ (3.12 recommended, avoid 3.13) |
| Package Manager | uv (recommended) |
| CUDA Toolkit | Default cu130 (13.0). PyTorch also supports 12.6/12.8. |
| FFmpeg | Required for audio extraction |
| Hardware | NVIDIA GPU with CUDA (recommended) |

FFmpeg Installation

  • Windows: Download from gyan.dev, extract, add bin folder to PATH
  • Linux: sudo apt install ffmpeg

📦 Installation

1. Install uv

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone the Repository

git clone https://github.com/charles1018/NemoScribe.git
cd NemoScribe

3. Install Dependencies

uv sync --python 3.12

The lockfile currently resolves NeMo ASR to nemo-toolkit[asr] 2.7.3 with PyTorch 2.11 CUDA 13.0 wheels. The version constraints stay on the NeMo 2.7.x line (>=2.7.3,<2.8) for patch-level compatibility.

4. Configure CUDA (Strongly Recommended)

Without a CUDA wheel index configured, uv sync may install CPU-only PyTorch. This project's pyproject.toml is pre-configured for CUDA 13.0 wheels, so GPU users only need to run uv sync. GPU acceleration is strongly recommended for reasonable transcription speed.

Note: PyTorch officially supports CUDA 12.6, 12.8, and 13.0. See PyTorch Get Started for details.

If you need a different CUDA version, modify pyproject.toml:

CUDA 13.0 (Default, Recommended):

[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch" }
torchvision = { index = "pytorch" }
torchaudio = { index = "pytorch" }

CUDA 12.8:

[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

CUDA 12.6:

[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

Then re-sync:

uv sync

Optional: LLM Post-processing

To enable AI-powered subtitle correction (fixes character names, proper nouns):

uv sync --extra llm

Then create a .env file with your API key:

cp .env.example .env
# Edit .env: OPENAI_API_KEY=sk-... or ANTHROPIC_API_KEY=sk-ant-...

5. Verify Setup

uv run python scripts/check_cuda.py
# Expected output: CUDA available: True

🚀 Quick Start

# Basic usage
uv run nemoscribe video_path="video.mp4"

# With VAD (useful for noisy audio, but not always best)
uv run nemoscribe video_path="video.mp4" vad.enabled=true

# Generate both VAD and no-VAD candidates (recommended when unsure)
uv run nemoscribe video_path="video.mp4" ab_test.vad=true

# Batch processing
uv run nemoscribe video_dir=/path/to/videos/ output_dir=/path/to/subtitles/
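
The commands above pass settings as dotted key=value overrides (for example vad.enabled=true). As a rough illustration of how such overrides map onto a nested configuration, here is a minimal sketch. The functions parse_value and parse_overrides are hypothetical and are not NemoScribe's actual CLI parser; the value conventions (null, true/false, numbers) are assumptions based on the examples in this README.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Convert a raw override string into a Python value (assumed conventions)."""
    lowered = raw.lower()
    if lowered in ("null", "none"):
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    # Anything else is treated as a string; strip optional surrounding quotes.
    return raw.strip('"')


def parse_overrides(args: list[str]) -> dict[str, Any]:
    """Turn ["vad.enabled=true", ...] into a nested dictionary."""
    config: dict[str, Any] = {}
    for arg in args:
        key, _, raw = arg.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


overrides = parse_overrides([
    'video_path="video.mp4"',
    "vad.enabled=true",
    "subtitle.max_segment_duration=3.0",
    "subtitle.word_gap_threshold=null",
    "cuda=-1",
])
print(overrides)
```

The nesting mirrors the option groups listed in the Configuration Reference (subtitle.*, vad.*, and so on).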

📖 Advanced Tuning: For optimal parameter configurations for different scenarios (drama, news, technical tutorials), see TUNING_GUIDE.md.

🎯 Usage Examples

Subtitle Formatting

uv run nemoscribe video_path=video.mp4 \
  subtitle.max_chars_per_line=32 \
  subtitle.max_segment_duration=3.0 \
  subtitle.word_gap_threshold=0.5

# Disable word gap splitting
uv run nemoscribe video_path=video.mp4 subtitle.word_gap_threshold=null

Device and Precision

# Force CPU
uv run nemoscribe video_path=video.mp4 cuda=-1

# Specific GPU
uv run nemoscribe video_path=video.mp4 cuda=0

# Force float32 precision
uv run nemoscribe video_path=video.mp4 compute_dtype=float32

VAD Configuration

# Enable VAD with smart segmentation
uv run nemoscribe video_path=video.mp4 \
  vad.enabled=true \
  audio.smart_segmentation=true

# Adjust VAD sensitivity (optimized for drama/movie)
uv run nemoscribe video_path=video.mp4 \
  compute_dtype=float32 \
  vad.enabled=true \
  vad.onset=0.2 \
  vad.offset=0.1 \
  vad.min_duration_off=0.05 \
  vad.pad_onset=0.1 \
  vad.pad_offset=0.1 \
  decoding.rnnt_fused_batch_size=0 \
  decoding.segment_gap_threshold=20

Chicago Fire S12E01 validation on 2026-05-05 with NeMo 2.7.3 showed that compute_dtype=float32 and decoding.rnnt_fused_batch_size=0 are still the stable CUDA settings on an RTX 3070 Laptop GPU. In that sample, no-VAD produced slightly more complete output than VAD, so use ab_test.vad=true when you want the safer choice without manually running two commands.

VAD A/B Test

uv run nemoscribe video_path=video.mp4 \
  output_path=tmp_outputs/video.srt \
  compute_dtype=float32 \
  decoding.rnnt_fused_batch_size=0 \
  ab_test.vad=true

This writes both video.vad.srt and video.no_vad.srt using the same ASR settings. Use this when you want two candidate subtitles without manually running the command twice. VAD can reduce hallucinations on noisy audio, while no-VAD may preserve more dialogue on clean drama/movie audio.
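
To make the naming concrete, the following sketch derives the two candidate paths from a single output_path. The helper name ab_test_output_paths is hypothetical; only the resulting file names (video.vad.srt and video.no_vad.srt) come from the description above.

```python
from pathlib import Path


def ab_test_output_paths(output_path: str) -> tuple[Path, Path]:
    """Derive the two candidate paths from one requested output path."""
    base = Path(output_path)
    stem = base.stem  # "video" for "video.srt"
    return (
        base.with_name(f"{stem}.vad.srt"),
        base.with_name(f"{stem}.no_vad.srt"),
    )


vad_path, no_vad_path = ab_test_output_paths("tmp_outputs/video.srt")
print(vad_path.as_posix())     # tmp_outputs/video.vad.srt
print(no_vad_path.as_posix())  # tmp_outputs/video.no_vad.srt
```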

ITN (Inverse Text Normalization)

# Enable ITN (requires nemo_text_processing)
uv run nemoscribe video_path=video.mp4 postprocessing.enable_itn=true

# For models with auto-capitalization
uv run nemoscribe video_path=video.mp4 \
  postprocessing.enable_itn=true \
  postprocessing.itn_input_case=cased

# Install ITN dependency
uv add nemo_text_processing

ITN Examples:

  • "twenty five dollars" → "$25"
  • "january first twenty twenty five" → "January 1, 2025"
  • "three point one four" → "3.14"
  • "the meeting is at ten thirty am" → "the meeting is at 10:30 a.m."

LLM Post-processing

Fix transcription errors (character names, proper nouns) using an LLM:

# Using OpenAI GPT-4o-mini (recommended: best cost/quality ratio, ~$0.06/episode)
uv run nemoscribe video_path=video.mp4 \
  vad.enabled=true \
  llm_postprocess.enabled=true \
  llm_postprocess.provider=openai \
  llm_postprocess.model=gpt-4o-mini

# Using Anthropic Claude 3.5 Sonnet (higher quality, ~$0.24/episode)
uv run nemoscribe video_path=video.mp4 \
  vad.enabled=true \
  llm_postprocess.enabled=true \
  llm_postprocess.provider=anthropic \
  llm_postprocess.model=claude-3-5-sonnet-20241022

What it fixes:

  • Character names: "Alias of us" → "Kylie Estevez", "Herman" → "Herrmann"
  • Proper nouns and technical terms
  • Homophones: their/there, to/too

Known limitations:

  • May over-correct ~10% of segments (mostly minor changes)
  • Semantic errors remain challenging
  • Requires API key and internet connection

Performance Measurement

uv run nemoscribe video_path=video.mp4 performance.calculate_rtfx=true
# Example output: RTFx=15.2x realtime (transcribed 600s in 39.5s)
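
RTFx is the ratio of audio duration to processing time, so higher is faster. The sketch below reproduces the figure in the example output; the function name rtfx is illustrative and not part of NemoScribe's API.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


value = rtfx(600.0, 39.5)
print(f"RTFx={value:.1f}x realtime (transcribed 600s in 39.5s)")
```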

🔧 Configuration Reference

Main Options

| Option | Default | Description |
|---|---|---|
| video_path | - | Path to input video file |
| video_dir | - | Path to directory containing videos |
| output_path | auto | Output SRT file path |
| output_dir | auto | Output directory for batch processing |
| pretrained_name | nvidia/parakeet-tdt-0.6b-v2 | Pretrained ASR model |
| model_path | - | Path to local .nemo checkpoint |
| cuda | auto | CUDA device ID (None=auto, negative=CPU) |
| compute_dtype | auto | float32, bfloat16, or float16 |
| overwrite | true | Overwrite existing SRT files |

Subtitle Formatting (subtitle.*)

| Option | Default | Description |
|---|---|---|
| max_chars_per_line | 42 | Maximum characters per subtitle line |
| max_segment_duration | 5.0 | Maximum seconds per subtitle segment |
| word_gap_threshold | 0.8 | New segment if word gap >= this (seconds) |
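
The three subtitle options interact when word-level timestamps are grouped into subtitle segments. The following is a simplified sketch of that idea, not the code in srt.py: the Word class and group_words function are hypothetical, and the exact order in which NemoScribe applies the limits is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def group_words(words, max_chars=42, max_duration=5.0, gap_threshold=0.8):
    """Group words into (start, end, text) segments using the three limits."""
    segments, current = [], []
    for word in words:
        if current:
            gap = word.start - current[-1].end
            text_len = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            duration = word.end - current[0].start
            big_gap = gap_threshold is not None and gap >= gap_threshold
            # Start a new segment when adding this word would break any limit.
            if text_len > max_chars or duration > max_duration or big_gap:
                segments.append(current)
                current = []
        current.append(word)
    if current:
        segments.append(current)
    return [(s[0].start, s[-1].end, " ".join(w.text for w in s)) for s in segments]


words = [
    Word("Welcome", 0.12, 0.50), Word("to", 0.55, 0.65), Word("our", 0.70, 0.85),
    Word("show", 0.90, 1.20), Word("today.", 2.50, 2.90),
]
print(group_words(words))                      # 1.3 s pause splits off "today."
print(group_words(words, gap_threshold=None))  # gap splitting disabled
```

Passing gap_threshold=None corresponds to subtitle.word_gap_threshold=null in the usage examples.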

Audio Processing (audio.*)

| Option | Default | Description |
|---|---|---|
| sample_rate | 16000 | Audio sample rate for ASR |
| max_chunk_duration | 300.0 | Max chunk size (5 min, safe for 8GB GPU) |
| chunk_overlap | 2.0 | Overlap between chunks (seconds) |
| smart_segmentation | true | Use VAD-based optimal split points |
| min_silence_for_split | 0.3 | Minimum silence duration for split point |
| prefer_longer_silence | true | Prefer splitting at longer silences |

VAD Configuration (vad.*)

| Option | Default | Description |
|---|---|---|
| enabled | false | Enable Voice Activity Detection |
| model | vad_multilingual_frame_marblenet | VAD model name |
| onset | 0.3 | Speech detection onset threshold (0-1) |
| offset | 0.3 | Speech detection offset threshold (0-1) |
| pad_onset | 0.2 | Padding before speech segments (seconds) |
| pad_offset | 0.2 | Padding after speech segments (seconds) |
| min_duration_on | 0.2 | Minimum speech segment duration |
| min_duration_off | 0.2 | Minimum non-speech gap to merge |
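
To show how these thresholds shape the detected speech regions, here is a simplified sketch that turns per-frame speech probabilities into segments. It is not NeMo's VAD post-processing: the function name, the 20 ms frame length, and the order of the steps (threshold, pad, merge, filter) are all assumptions made for illustration.

```python
def probs_to_segments(probs, frame_sec=0.02, onset=0.3, offset=0.3,
                      pad_onset=0.2, pad_offset=0.2,
                      min_duration_on=0.2, min_duration_off=0.2):
    """Convert frame-level speech probabilities into (start, end) segments."""
    # 1. Hysteresis: speech starts at p >= onset and ends when p < offset.
    raw, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i
        elif start is not None and p < offset:
            raw.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        raw.append((start * frame_sec, len(probs) * frame_sec))
    # 2. Pad each segment so word edges are not clipped.
    padded = [(max(0.0, s - pad_onset), e + pad_offset) for s, e in raw]
    # 3. Merge segments separated by a gap shorter than min_duration_off.
    merged = []
    for s, e in padded:
        if merged and s - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    # 4. Drop segments shorter than min_duration_on.
    return [(round(s, 3), round(e, 3)) for s, e in merged if e - s >= min_duration_on]


probs = [0.1] * 10 + [0.9] * 20 + [0.1] * 3 + [0.8] * 10 + [0.1] * 10
segments = probs_to_segments(probs)
print(segments)  # the two bursts merge into one padded segment
```

Lowering onset and offset (as in the drama/movie example) makes the detector keep quieter speech, at the cost of admitting more non-speech.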

Decoding Optimization (decoding.*)

| Option | Default | Description |
|---|---|---|
| rnnt_fused_batch_size | -1 | CUDA graphs: -1=enabled, 0=disabled |
| rnnt_timestamp_type | "all" | Timestamp type: "char", "word", "segment", "all" |
| ctc_timestamp_type | "all" | CTC timestamp type |
| segment_separators | [".", "?", "!"] | Split segments at punctuation marks |
| segment_gap_threshold | None | Positive integer in frames; splits on large inter-word gaps and remains compatible with segment_separators |

Internally, NemoScribe maps segment_separators to whichever NeMo decoding config field is available (segment_separators or the historical segment_seperators spelling).
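
One way to picture this alias handling is a helper that writes to whichever field name the decoding config exposes. The sketch below uses SimpleNamespace objects as stand-ins for a NeMo decoding config; set_segment_separators is a hypothetical name, and the real NemoScribe logic may differ.

```python
from types import SimpleNamespace

# Current spelling first, then the historical misspelling.
SEPARATOR_FIELD_NAMES = ("segment_separators", "segment_seperators")


def set_segment_separators(decoding_cfg, separators):
    """Set separators on the first matching field and return its name."""
    for name in SEPARATOR_FIELD_NAMES:
        if hasattr(decoding_cfg, name):
            setattr(decoding_cfg, name, list(separators))
            return name
    raise AttributeError("decoding config has no segment separator field")


new_cfg = SimpleNamespace(segment_separators=None)
old_cfg = SimpleNamespace(segment_seperators=None)
print(set_segment_separators(new_cfg, [".", "?", "!"]))  # segment_separators
print(set_segment_separators(old_cfg, [".", "?", "!"]))  # segment_seperators
```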

Post-processing (postprocessing.*)

| Option | Default | Description |
|---|---|---|
| enable_itn | false | Enable Inverse Text Normalization |
| itn_lang | "en" | Language for ITN |
| itn_input_case | "lower_cased" | Input case: "lower_cased" or "cased" |

A/B Test (ab_test.*)

| Option | Default | Description |
|---|---|---|
| vad | false | Generate both .vad.srt and .no_vad.srt candidates |

LLM Post-processing (llm_postprocess.*)

| Option | Default | Description |
|---|---|---|
| enabled | false | Enable LLM-based subtitle correction |
| provider | "anthropic" | LLM provider: "anthropic" or "openai" |
| model | "claude-3-5-sonnet-20241022" | Model name (provider-specific) |
| api_key | None | API key (None = read from environment) |
| batch_size | 20 | Segments per LLM request |
| max_retries | 3 | Max validation/retry attempts per batch |
| timeout | 30 | API request timeout (seconds) |
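
The batch_size and validation settings suggest a flow in which segments are sent in groups and a corrected batch is accepted only if it stays close to the original text. The sketch below illustrates that idea with the standard library. The function names and the 0.6 similarity threshold are assumptions, not values taken from llm_postprocess.py, and no LLM API is called.

```python
from difflib import SequenceMatcher


def batches(items, batch_size=20):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def is_plausible(original: str, corrected: str, min_similarity: float = 0.6) -> bool:
    """Accept a correction only if it is textually close to the original."""
    ratio = SequenceMatcher(None, original.lower(), corrected.lower()).ratio()
    return ratio >= min_similarity


def validate_batch(originals, corrected, min_similarity=0.6):
    """Return the corrected batch, or fall back to the originals if invalid."""
    same_length = len(originals) == len(corrected)
    if same_length and all(is_plausible(o, c, min_similarity)
                           for o, c in zip(originals, corrected)):
        return list(corrected)
    return list(originals)


originals = ["Herman is here", "Call the chief"]
good = validate_batch(originals, ["Herrmann is here", "Call the chief"])
bad = validate_batch(originals, ["Completely different sentence about cats", "Call the chief"])
print(good)  # corrected batch accepted
print(bad)   # falls back to the original batch
```

Falling back at the batch level keeps subtitle counts and timings aligned even when a single correction looks suspicious.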

Performance (performance.*)

| Option | Default | Description |
|---|---|---|
| calculate_rtfx | false | Calculate Real-Time Factor (RTFx) |
| warmup_steps | 1 | Warmup iterations before timing |

Logging (logging.*)

| Option | Default | Description |
|---|---|---|
| verbose | false | Show all NeMo internal logs (useful for debugging) |
| suppress_repetitive_logs | true | Suppress repetitive NeMo logs during chunk processing |

🤖 Recommended Models

| Model | Speed | Accuracy | Features |
|---|---|---|---|
| nvidia/parakeet-tdt-0.6b-v2 | Fast | Best (EN) | Default. 1.69% WER, auto-punctuation |
| nvidia/parakeet-tdt-0.6b-v3 | Fast | Excellent | Multilingual (25 languages), auto language detection |
| nvidia/parakeet-tdt-1.1b | Medium | Best | Highest accuracy, no auto-punctuation |
| nvidia/parakeet-ctc-1.1b | Fastest | Good | Fastest inference |
| nvidia/canary-1b-v2 | Medium | Good | Multilingual, supports translation |

Model Selection Guide

  • English subtitles: parakeet-tdt-0.6b-v2 (default, best out-of-box experience)
  • Multilingual: parakeet-tdt-0.6b-v3 (25 languages, auto-detection)
  • Highest accuracy: parakeet-tdt-1.1b (lowest WER, but no punctuation)
  • Fastest speed: parakeet-ctc-1.1b
  • Translation: canary-1b-v2 (25 languages, transcription + translation)

Note: parakeet-tdt-1.1b produces lowercase output without punctuation. The script automatically uses word-level timestamps to generate fine-grained subtitles.

Known model limitation: parakeet-tdt-0.6b-v2 may drop repeated words and false starts (disfluencies), a known regression vs the 1.1b model (discussion #8). If verbatim disfluencies matter (e.g. stuttered or repeated lines), consider parakeet-tdt-1.1b.

Tip: You can try the default model without a GPU via the free hosted API on build.nvidia.com.

🎡 Long Audio Support

The script uses audio chunking to handle videos of any length:

  • Automatically splits long audio into smaller chunks (default: 5 minutes)
  • Chunks overlap (default: 2 seconds) to ensure accurate boundaries
  • Merges subtitles from all chunks, handling duplicates automatically
  • Long-audio attention tweaks are gated by audio.long_audio_threshold (disabled by default; lower the threshold to enable them)
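
The chunking arithmetic described in this list can be sketched as follows. This is a fixed-size illustration only: plan_chunks is a hypothetical helper, and it ignores smart segmentation, which moves split points to silences.

```python
def plan_chunks(total_sec: float, max_chunk: float = 300.0, overlap: float = 2.0):
    """Return (start, end) windows that cover the audio with a fixed overlap."""
    if max_chunk <= 0 or total_sec <= max_chunk:
        return [(0.0, total_sec)]  # 0 means "no chunking"
    if overlap >= max_chunk:
        raise ValueError("overlap must be smaller than max_chunk")
    chunks, start = [], 0.0
    step = max_chunk - overlap
    while True:
        end = min(start + max_chunk, total_sec)
        chunks.append((start, end))
        if end >= total_sec:
            return chunks
        start += step


print(plan_chunks(700.0))
# [(0.0, 300.0), (298.0, 598.0), (596.0, 700.0)]
```

Each window starts 2 seconds before the previous one ends, which is why duplicate subtitles in the overlap must be merged afterwards.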

GPU Memory Recommendations:

| GPU VRAM | max_chunk_duration (seconds) |
|---|---|
| 8GB | 120–300 (default 300) |
| 16GB | 600 |
| 24GB+ | 0 (no chunking) |

Note: On 8GB GPUs with compute_dtype=float32, dialogue-dense content can OOM at the default 300s because smart segmentation may produce longer continuous-speech chunks (verified on Yellowstone S03E01 with an RTX 3070 8GB). If you hit CUDA out of memory, set audio.max_chunk_duration=120.

🕐 Timestamp Priority

The script obtains timestamps in this priority order:

  1. Segment-level: Direct segment timestamps from model (most accurate)
  2. Word-level: Word timestamps grouped by line length/duration/gaps
  3. Fallback: Estimated by speech rate (~150 words/min) when no timestamps available
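
For the last item, a speech-rate estimate can be as simple as assigning each line a duration proportional to its word count. The sketch below assumes 150 words per minute as stated above; estimate_segments is a hypothetical name and the real fallback may distribute time differently.

```python
def estimate_segments(lines, words_per_minute: float = 150.0):
    """Assign consecutive (start, end, text) times from word counts alone."""
    seconds_per_word = 60.0 / words_per_minute
    segments, cursor = [], 0.0
    for line in lines:
        duration = len(line.split()) * seconds_per_word
        segments.append((round(cursor, 3), round(cursor + duration, 3), line))
        cursor += duration
    return segments


estimated = estimate_segments([
    "Welcome to our show today.",
    "We have an exciting episode planned for you.",
])
print(estimated)
```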

Auto Fallback: If average segment length exceeds max_segment_duration * 2 (e.g., models without punctuation), the script automatically switches to word-level timestamps.
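
The auto-fallback rule can be expressed as a small check. This sketch interprets "average segment length" as average duration in seconds and treats an empty segment list as a reason to fall back; both are assumptions, and should_use_word_timestamps is a hypothetical name.

```python
def should_use_word_timestamps(segments, max_segment_duration: float = 5.0) -> bool:
    """Return True when segment-level timestamps look too coarse for subtitles."""
    if not segments:
        return True  # assumption: nothing usable at segment level
    average = sum(end - start for start, end in segments) / len(segments)
    return average > max_segment_duration * 2


print(should_use_word_timestamps([(0.0, 30.0), (30.0, 55.0)]))  # True: average is 27.5 s
print(should_use_word_timestamps([(0.0, 3.0), (3.0, 7.0)]))     # False: average is 3.5 s
```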

📁 Project Structure

nemoscribe/
├── __init__.py        # Package entry, version info
├── __main__.py        # python -m nemoscribe support
├── cli.py             # CLI parsing and entry point
├── config.py          # All dataclass configurations
├── audio.py           # Audio processing with ffmpeg
├── vad.py             # Voice Activity Detection
├── transcriber.py     # ASR model and transcription
├── srt.py             # SRT formatting and output
├── postprocess.py     # ITN, segment merging
├── llm_postprocess.py # LLM-based subtitle correction
└── log_utils.py       # Log filtering

📹 Supported Video Formats

.mp4, .mkv, .avi, .mov, .webm, .m4v

📝 Example Output

1
00:00:00,120 --> 00:00:03,450
Welcome to our show today.

2
00:00:03,680 --> 00:00:07,200
We have an exciting episode planned for you.

3
00:00:07,450 --> 00:00:11,800
Let's get started with our first topic.
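
For readers unfamiliar with the SRT layout shown above (index, HH:MM:SS,mmm time range, text, blank line), here is a minimal formatter. It is an independent sketch and not the code in srt.py; format_timestamp and to_srt are illustrative names.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    return "\n".join(blocks)


srt_text = to_srt([
    (0.12, 3.45, "Welcome to our show today."),
    (3.68, 7.2, "We have an exciting episode planned for you."),
])
print(srt_text)
```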

🧪 Testing

# Run all tests
uv run python tests/test_improvements.py

# Run specific test
uv run python tests/test_improvements.py --test vad
uv run python tests/test_improvements.py --test itn
uv run python tests/test_improvements.py --test segmentation
uv run python tests/test_improvements.py --test metrics

# Available tests: baseline, vad, itn, decoding, nemo_api, segmentation, merging, performance, ab_test, metrics, srt, srt_edge, path, cli, cli_list, llm, llm_cli, llm_validation, llm_parsing, llm_fallback, llm_validation_fallback, full

Test Coverage

  • baseline_config: Default configuration backward compatibility
  • vad_config: VAD configuration correctness
  • itn_functions: ITN normalization functionality
  • decoding_config: Decoding configuration (CUDA graphs)
  • nemo_api_compatibility: NeMo decoding config alias compatibility
  • smart_segmentation: Smart segmentation logic
  • segment_merging: Overlapping segment merging
  • performance_config: Performance configuration
  • ab_test_config: VAD A/B test configuration and output path helpers
  • quality_metrics: WER/CER calculation
  • srt_formatting: SRT formatting
  • srt_edge_cases: SRT edge case handling (empty segments, special characters)
  • path_validation: Path validation and security checks
  • cli_config_override: CLI configuration override functionality
  • llm_config: LLM post-processing configuration defaults
  • llm_cli_override: LLM CLI parameter overrides
  • llm_validation: Batch result similarity validation
  • llm_parsing: JSON response parsing and prompt building
  • llm_fallback: Graceful fallback when disabled or no API key
  • llm_validation_fallback: Invalid LLM corrections fall back to the original batch
  • full_config: Complete configuration combination

📊 Quality Metrics

Calculate transcription quality using NeMo's official tools:

from tests.test_improvements import calculate_transcription_quality

result = calculate_transcription_quality(
    hypothesis="transcribed text",
    reference="ground truth text"
)
print(f"WER: {result['wer']:.2%}")
print(f"CER: {result['cer']:.2%}")

Output includes: wer, cer, insertion_rate, deletion_rate, substitution_rate
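
As background on what wer and cer measure, the following standalone sketch computes both with a plain Levenshtein distance. It is not the NeMo tooling that calculate_transcription_quality relies on, and it does no text normalization; the function names are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
print(f"CER: {cer('abc', 'abd'):.2%}")
```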

🆘 Troubleshooting

CUDA Out of Memory

Reduce chunk size:

uv run nemoscribe video_path=video.mp4 audio.max_chunk_duration=180.0

Timestamps Not Accurate

Use a model with timestamp support (parakeet-tdt-* recommended) and adjust segmentation parameters:

uv run nemoscribe video_path=video.mp4 \
  subtitle.max_segment_duration=3.0 \
  subtitle.word_gap_threshold=0.5

Model Download Slow

Models are automatically downloaded from HuggingFace/NGC on first use. For slow connections:

# Use HuggingFace mirror (China mainland)
export HF_ENDPOINT=https://hf-mirror.com

❓ FAQ

How is NemoScribe different from OpenAI Whisper? NemoScribe uses NVIDIA NeMo's Parakeet-TDT models, which rank above Whisper-large-v3 in English word error rate on the HuggingFace Open ASR Leaderboard while being far smaller and faster (up to ~240× realtime on GPU). Whisper supports more languages out of the box; for English video subtitling, Parakeet-TDT typically gives better accuracy per second of compute, plus native word-level timestamps.

Does it work offline? Yes. After the first model download, transcription runs fully offline on your machine. Only the optional LLM post-processing step requires an internet connection.

Can it generate subtitles for languages other than English? Yes. Switch to nvidia/parakeet-tdt-0.6b-v3 (25 languages with auto-detection) or nvidia/canary-1b-v2 (transcription + translation) via pretrained_name=....

Do I need a GPU? A CUDA-capable NVIDIA GPU is strongly recommended (8GB VRAM is enough). CPU mode works (cuda=-1) but is much slower.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository at github.com/charles1018/NemoScribe
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

For bug reports and feature requests, please open an issue.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

NemoScribe is built upon the following open-source projects:

  • NVIDIA NeMo - Neural Modules toolkit for conversational AI (Apache 2.0 License)
  • Parakeet-TDT - NVIDIA's state-of-the-art ASR model (CC-BY-4.0 License)

We thank NVIDIA for making these excellent tools and models available to the community.

📚 References

Model Resources

| Resource | Description |
|---|---|
| nvidia/parakeet-tdt-0.6b-v2 | Default model, architecture and best practices |
| nvidia/parakeet-tdt-0.6b-v3 | Multilingual version, 25 languages |
| nvidia/canary-1b-v2 | Multilingual with translation support |
| HuggingFace Space Demo | Official demo with long audio handling |

NeMo Framework References

| File Path | Description |
|---|---|
| examples/asr/transcribe_speech.py | Main architecture reference |
| nemo/collections/asr/parts/utils/transcribe_utils.py | Core utilities: get_inference_device(), get_inference_dtype() |
| nemo/collections/asr/parts/utils/rnnt_utils.py | Hypothesis class, timestamp data structure |

Key Implementation Details

Long Audio Optimization (from HuggingFace Space):

# Switch to local attention for memory efficiency on audio >8 minutes
model.change_attention_model("rel_pos_local_attn", [256, 256])
model.change_subsampling_conv_chunking_factor(1)  # 1 = auto select

Timestamp Data Structure (from Hypothesis):

{
    'segment': [{'start': float, 'end': float, 'segment': str}, ...],
    'word': [{'start': float, 'end': float, 'word': str}, ...],
    'char': [...]  # character-level timestamps
}
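
Given a dictionary shaped like the one above, a consumer can prefer segment-level entries and fall back to word-level ones, in line with the Timestamp Priority section. This is a sketch over plain dictionaries: timestamps_to_segments is a hypothetical helper and the sample data is made up.

```python
def timestamps_to_segments(timestamp: dict):
    """Return (start, end, text) tuples, preferring segment-level entries."""
    for level in ("segment", "word"):
        entries = timestamp.get(level) or []
        if entries:
            # The text key has the same name as the level ("segment" or "word").
            return [(e["start"], e["end"], e[level]) for e in entries]
    return []


sample = {
    "segment": [],
    "word": [
        {"start": 0.12, "end": 0.50, "word": "Welcome"},
        {"start": 0.55, "end": 0.65, "word": "to"},
    ],
    "char": [],
}
print(timestamps_to_segments(sample))
# [(0.12, 0.5, 'Welcome'), (0.55, 0.65, 'to')]
```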

If NemoScribe saves you time, consider giving it a ⭐; it helps others find the project!

NemoScribe: automatic subtitle generation, video transcription, and speech-to-text for everyone.

About

🎬 AI subtitle generator: convert video to SRT subtitles locally with NVIDIA NeMo Parakeet-TDT speech-to-text. GPU-accelerated, word-level timestamps, VAD, LLM correction. A fast offline Whisper alternative.
