Fast, local, GPU-accelerated automatic transcription with word-level timestamps.
A free, offline alternative to cloud captioning services.
English | 繁體中文 (Traditional Chinese)
Quick Start • Installation • Configuration • Models • Tuning Guide
NemoScribe is a command-line speech-to-text subtitle generator that converts video files (MP4, MKV, AVI, MOV, WebM) into accurately timed SRT subtitles, entirely on your own machine. Built on the NVIDIA NeMo ASR framework with Parakeet-TDT as the default model, it delivers state-of-the-art English automatic speech recognition (ASR) with word-level timestamps, and handles long audio (up to 3 hours) through chunked inference.
video.mp4 ─► FFmpeg ─► VAD speech detection ─► NeMo ASR (Parakeet-TDT) ─► ITN / LLM correction ─► video.srt
| Highlight | Details |
|---|---|
| State-of-the-art accuracy | Parakeet-TDT-0.6B-v2 is a top-ranked English model on the HuggingFace Open ASR Leaderboard, ahead of much larger models including Whisper-large-v3 |
| Blazing fast | Up to ~240× realtime on a consumer GPU: transcribe a full TV episode in well under a minute |
| 100% local & private | Your audio never leaves your machine. No cloud upload, no subscription, no per-minute fees |
| Accurate timestamps | Word-level and segment-level timestamps straight from the model, with no forced alignment hacks |
| Tuned for real content | VAD presets and punctuation-based segmentation optimized for movies, TV drama, and dialogue-dense audio |
| Optional AI cleanup | LLM post-processing (OpenAI / Anthropic) fixes character names, proper nouns, and homophones |
Use cases: subtitling movies and TV shows, transcribing lectures and tutorials, captioning YouTube videos, interview and podcast transcription, accessibility captions (SDH/CC).
- Features
- Requirements
- Installation
- Quick Start
- Usage Examples
- Configuration Reference
- Recommended Models
- Long Audio Support
- Troubleshooting
- Contributing
- License
- Acknowledgments
- References
- Accurate Timestamps: Word-level and segment-level timestamps from NeMo ASR models
- Long Audio Support: Process videos up to 3 hours with automatic chunking
- Voice Activity Detection (VAD): Filter non-speech content to reduce hallucinations
- Smart Segmentation: Split audio at silence boundaries, not mid-speech
- Inverse Text Normalization (ITN): Convert spoken forms to written forms ("twenty five" → "25")
- LLM Post-processing: Fix character names and transcription errors using AI (OpenAI/Anthropic)
- CUDA Optimized: CUDA graphs enabled by default for faster inference
- Batch Processing: Process entire directories of videos
| Requirement | Details |
|---|---|
| OS | Windows 10/11, Linux |
| Python | 3.10+ (3.12 recommended, avoid 3.13) |
| Package Manager | uv (recommended) |
| CUDA Toolkit | Default cu130 (13.0). PyTorch also supports 12.6/12.8. |
| FFmpeg | Required for audio extraction |
| Hardware | NVIDIA GPU with CUDA (recommended) |
- Windows: Download from gyan.dev, extract, and add the `bin` folder to PATH
- Linux: `sudo apt install ffmpeg`
```bash
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
git clone https://github.com/charles1018/NemoScribe.git
cd NemoScribe
```

```bash
uv sync --python 3.12
```

The lockfile currently resolves NeMo ASR to `nemo-toolkit[asr]` 2.7.3 with PyTorch 2.11 CUDA 13.0 wheels. The version constraints stay on the NeMo 2.7.x line (`>=2.7.3,<2.8`) for patch-level compatibility.
By default, uv sync may install CPU-only PyTorch. GPU acceleration is strongly recommended for reasonable transcription speed. The project is pre-configured to use CUDA 13.0, so GPU users only need to run uv sync.
Note: PyTorch officially supports CUDA 12.6, 12.8, and 13.0. See PyTorch Get Started for details.
If you need a different CUDA version, modify pyproject.toml:
CUDA 13.0 (Default, Recommended):

```toml
[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch" }
torchvision = { index = "pytorch" }
torchaudio = { index = "pytorch" }
```

CUDA 12.8:

```toml
[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```

CUDA 12.6:

```toml
[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu126"
explicit = true
```

Then re-sync:

```bash
uv sync
```

To enable AI-powered subtitle correction (fixes character names, proper nouns):

```bash
uv sync --extra llm
```

Then create a `.env` file with your API key:

```bash
cp .env.example .env
# Edit .env: OPENAI_API_KEY=sk-... or ANTHROPIC_API_KEY=sk-ant-...
```

```bash
uv run python scripts/check_cuda.py
# Expected output: CUDA available: True
```

```bash
# Basic usage
uv run nemoscribe video_path="video.mp4"

# With VAD (useful for noisy audio, but not always best)
uv run nemoscribe video_path="video.mp4" vad.enabled=true

# Generate both VAD and no-VAD candidates (recommended when unsure)
uv run nemoscribe video_path="video.mp4" ab_test.vad=true

# Batch processing
uv run nemoscribe video_dir=/path/to/videos/ output_dir=/path/to/subtitles/
```

Advanced Tuning: For optimal parameter configurations for different scenarios (drama, news, technical tutorials), see TUNING_GUIDE.md.
```bash
uv run nemoscribe video_path=video.mp4 \
    subtitle.max_chars_per_line=32 \
    subtitle.max_segment_duration=3.0 \
    subtitle.word_gap_threshold=0.5

# Disable word gap splitting
uv run nemoscribe video_path=video.mp4 subtitle.word_gap_threshold=null
```

```bash
# Force CPU
uv run nemoscribe video_path=video.mp4 cuda=-1

# Specific GPU
uv run nemoscribe video_path=video.mp4 cuda=0

# Force float32 precision
uv run nemoscribe video_path=video.mp4 compute_dtype=float32
```

```bash
# Enable VAD with smart segmentation
uv run nemoscribe video_path=video.mp4 \
    vad.enabled=true \
    audio.smart_segmentation=true

# Adjust VAD sensitivity (optimized for drama/movie)
uv run nemoscribe video_path=video.mp4 \
    compute_dtype=float32 \
    vad.enabled=true \
    vad.onset=0.2 \
    vad.offset=0.1 \
    vad.min_duration_off=0.05 \
    vad.pad_onset=0.1 \
    vad.pad_offset=0.1 \
    decoding.rnnt_fused_batch_size=0 \
    decoding.segment_gap_threshold=20
```

A validation run on Chicago Fire S12E01 (2026-05-05, NeMo 2.7.3) showed that `compute_dtype=float32` and `decoding.rnnt_fused_batch_size=0` are still the stable CUDA settings on an RTX 3070 Laptop GPU. In that sample, no-VAD produced slightly more complete output than VAD, so use `ab_test.vad=true` when you want the safer choice without manually running two commands.
```bash
uv run nemoscribe video_path=video.mp4 \
    output_path=tmp_outputs/video.srt \
    compute_dtype=float32 \
    decoding.rnnt_fused_batch_size=0 \
    ab_test.vad=true
```

This writes both `video.vad.srt` and `video.no_vad.srt` using the same ASR settings. Use this when you want two candidate subtitles without manually running the command twice. VAD can reduce hallucinations on noisy audio, while no-VAD may preserve more dialogue on clean drama/movie audio.
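The naming of the two candidates can be sketched in a few lines. This is only an illustration of the `.vad.srt` / `.no_vad.srt` pattern described above; the helper name `ab_test_output_paths` is hypothetical and is not taken from NemoScribe's code.

```python
from pathlib import Path


def ab_test_output_paths(output_path: str) -> tuple[Path, Path]:
    """Derive the VAD and no-VAD candidate paths from one base SRT path.

    Hypothetical helper: it only mirrors the naming pattern in the README.
    """
    base = Path(output_path)
    # base.stem drops the final ".srt", so "video.srt" becomes "video.vad.srt"
    vad = base.with_name(f"{base.stem}.vad.srt")
    no_vad = base.with_name(f"{base.stem}.no_vad.srt")
    return vad, no_vad


vad_path, no_vad_path = ab_test_output_paths("tmp_outputs/video.srt")
print(vad_path.name, no_vad_path.name)
```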
```bash
# Enable ITN (requires nemo_text_processing)
uv run nemoscribe video_path=video.mp4 postprocessing.enable_itn=true

# For models with auto-capitalization
uv run nemoscribe video_path=video.mp4 \
    postprocessing.enable_itn=true \
    postprocessing.itn_input_case=cased

# Install ITN dependency
uv add nemo_text_processing
```

ITN Examples:

- "twenty five dollars" → "$25"
- "january first twenty twenty five" → "January 1, 2025"
- "three point one four" → "3.14"
- "the meeting is at ten thirty am" → "the meeting is at 10:30 a.m."
Fix transcription errors (character names, proper nouns) using an LLM:
```bash
# Using OpenAI GPT-4o-mini (recommended: best cost/quality ratio, ~$0.06/episode)
uv run nemoscribe video_path=video.mp4 \
    vad.enabled=true \
    llm_postprocess.enabled=true \
    llm_postprocess.provider=openai \
    llm_postprocess.model=gpt-4o-mini

# Using Anthropic Claude 3.5 Sonnet (higher quality, ~$0.24/episode)
uv run nemoscribe video_path=video.mp4 \
    vad.enabled=true \
    llm_postprocess.enabled=true \
    llm_postprocess.provider=anthropic \
    llm_postprocess.model=claude-3-5-sonnet-20241022
```

What it fixes:

- Character names: "Alias of us" → "Kylie Estevez", "Herman" → "Herrmann"
- Proper nouns and technical terms
- Homophones: their/there, to/too
Known limitations:
- May over-correct ~10% of segments (mostly minor changes)
- Semantic errors remain challenging
- Requires API key and internet connection
```bash
uv run nemoscribe video_path=video.mp4 performance.calculate_rtfx=true
# Example output: RTFx=15.2x realtime (transcribed 600s in 39.5s)
```

| Option | Default | Description |
|---|---|---|
| `video_path` | - | Path to input video file |
| `video_dir` | - | Path to directory containing videos |
| `output_path` | auto | Output SRT file path |
| `output_dir` | auto | Output directory for batch processing |
| `pretrained_name` | `nvidia/parakeet-tdt-0.6b-v2` | Pretrained ASR model |
| `model_path` | - | Path to local `.nemo` checkpoint |
| `cuda` | auto | CUDA device ID (None=auto, negative=CPU) |
| `compute_dtype` | auto | `float32`, `bfloat16`, or `float16` |
| `overwrite` | true | Overwrite existing SRT files |
`subtitle.*` options:

| Option | Default | Description |
|---|---|---|
| `max_chars_per_line` | 42 | Maximum characters per subtitle line |
| `max_segment_duration` | 5.0 | Maximum seconds per subtitle segment |
| `word_gap_threshold` | 0.8 | New segment if word gap >= this (seconds); `null` disables gap splitting |
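To make the interaction of these three options concrete, here is a minimal sketch of how timed words could be grouped into subtitle segments. It is an illustration under assumptions, not NemoScribe's actual implementation: the `Word` class and `group_words` function are hypothetical names, and the real grouping logic may order or combine the checks differently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def group_words(words, max_chars=42, max_duration=5.0, gap_threshold=0.8):
    """Group timed words into (start, end, text) subtitle segments."""
    segments, current = [], []

    def flush():
        text = " ".join(w.text for w in current)
        segments.append((current[0].start, current[-1].end, text))

    for word in words:
        if current:
            joined_len = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            too_long = joined_len > max_chars
            too_slow = word.end - current[0].start > max_duration
            # A threshold of None disables gap-based splitting.
            big_gap = (gap_threshold is not None
                       and word.start - current[-1].end >= gap_threshold)
            if too_long or too_slow or big_gap:
                flush()
                current = []
        current.append(word)
    if current:
        flush()
    return segments


words = [
    Word("Hello", 0.0, 0.4), Word("there.", 0.5, 0.9),
    Word("Welcome", 2.0, 2.5), Word("back.", 2.6, 3.0),
]
for start, end, text in group_words(words):
    print(f"{start:.2f} -> {end:.2f}: {text}")
```

In this sample the 1.1 s pause before "Welcome" exceeds the default 0.8 s gap threshold, so the words fall into two segments; with `gap_threshold=None` they stay in one.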
`audio.*` options:

| Option | Default | Description |
|---|---|---|
| `sample_rate` | 16000 | Audio sample rate for ASR (Hz) |
| `max_chunk_duration` | 300.0 | Max chunk size in seconds (5 min, safe for 8GB GPU) |
| `chunk_overlap` | 2.0 | Overlap between chunks (seconds) |
| `smart_segmentation` | true | Use VAD-based optimal split points |
| `min_silence_for_split` | 0.3 | Minimum silence duration for a split point (seconds) |
| `prefer_longer_silence` | true | Prefer splitting at longer silences |
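As a rough illustration of what "split at silence, not mid-speech" can mean, the sketch below picks a split time from a list of silence intervals. The function name `choose_split_point`, the midpoint choice, and the tie-breaking rule are all assumptions made for this example; NemoScribe's own selection logic may differ.

```python
def choose_split_point(silences, chunk_start, max_chunk_duration=300.0,
                       min_silence=0.3, prefer_longer=True):
    """Pick a split time inside a silence gap before the chunk limit.

    silences: list of (start, end) silence intervals in seconds.
    Returns the midpoint of the chosen silence, or the hard limit
    when no silence qualifies (hypothetical fallback).
    """
    limit = chunk_start + max_chunk_duration
    candidates = [
        (s, e) for s, e in silences
        if e - s >= min_silence and s > chunk_start and e <= limit
    ]
    if not candidates:
        return limit  # no usable silence: fall back to a fixed-length cut
    if prefer_longer:
        # Longest silence wins; ties go to the later one (longer chunk).
        best = max(candidates, key=lambda se: (se[1] - se[0], se[0]))
    else:
        # Otherwise take the latest qualifying silence.
        best = max(candidates, key=lambda se: se[0])
    return (best[0] + best[1]) / 2


silences = [(100.0, 100.2), (250.0, 251.0), (290.0, 290.4)]
print(choose_split_point(silences, chunk_start=0.0))
print(choose_split_point(silences, chunk_start=0.0, prefer_longer=False))
```

The 0.2 s silence at 100 s is ignored because it is shorter than `min_silence`; the remaining two candidates show the trade-off between a cleaner cut (longer silence) and a longer chunk (later silence).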
`vad.*` options:

| Option | Default | Description |
|---|---|---|
| `enabled` | false | Enable Voice Activity Detection |
| `model` | `vad_multilingual_frame_marblenet` | VAD model name |
| `onset` | 0.3 | Speech detection onset threshold (0-1) |
| `offset` | 0.3 | Speech detection offset threshold (0-1) |
| `pad_onset` | 0.2 | Padding before speech segments (seconds) |
| `pad_offset` | 0.2 | Padding after speech segments (seconds) |
| `min_duration_on` | 0.2 | Minimum speech segment duration (seconds) |
| `min_duration_off` | 0.2 | Non-speech gaps shorter than this are merged (seconds) |
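The following sketch shows one common way such parameters turn per-frame speech probabilities into speech segments: hysteresis thresholding, padding, gap merging, then short-segment removal. It is a simplified illustration, not the NeMo or NemoScribe implementation; the function name, the `frame_sec` parameter, and the order of the post-processing steps are assumptions.

```python
def probs_to_segments(probs, frame_sec=0.02, onset=0.3, offset=0.3,
                      pad_onset=0.2, pad_offset=0.2,
                      min_duration_on=0.2, min_duration_off=0.2):
    """Turn per-frame speech probabilities into padded (start, end) segments."""
    # 1. Hysteresis: speech starts when prob >= onset, ends when prob < offset.
    raw, start, active = [], 0.0, False
    for i, p in enumerate(probs):
        t = i * frame_sec
        if not active and p >= onset:
            active, start = True, t
        elif active and p < offset:
            active = False
            raw.append((start, t))
    if active:
        raw.append((start, len(probs) * frame_sec))

    # 2. Pad each segment so word onsets and tails are not clipped.
    padded = [(max(0.0, s - pad_onset), e + pad_offset) for s, e in raw]

    # 3. Merge segments whose gap is shorter than min_duration_off.
    merged = []
    for s, e in padded:
        if merged and s - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))

    # 4. Drop segments shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Two bursts of speech separated by a single low-probability frame.
probs = [0.1] * 5 + [0.9] * 10 + [0.1] + [0.9] * 5 + [0.1] * 20
print(probs_to_segments(probs, frame_sec=0.1))
```

Lowering `onset` and `offset` (as in the drama/movie preset above) makes the detector keep quieter speech, at the cost of letting more non-speech through.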
`decoding.*` options:

| Option | Default | Description |
|---|---|---|
| `rnnt_fused_batch_size` | -1 | CUDA graphs: -1=enabled, 0=disabled |
| `rnnt_timestamp_type` | "all" | Timestamp type: "char", "word", "segment", "all" |
| `ctc_timestamp_type` | "all" | CTC timestamp type |
| `segment_separators` | `[".", "?", "!"]` | Split segments at punctuation marks |
| `segment_gap_threshold` | None | Positive integer in frames; splits on large inter-word gaps and remains compatible with `segment_separators` |
Internally, NemoScribe maps segment_separators to whichever NeMo decoding config field is available (segment_separators or the historical segment_seperators spelling).
`postprocessing.*` options:

| Option | Default | Description |
|---|---|---|
| `enable_itn` | false | Enable Inverse Text Normalization |
| `itn_lang` | "en" | Language for ITN |
| `itn_input_case` | "lower_cased" | Input case: "lower_cased" or "cased" |
`ab_test.*` options:

| Option | Default | Description |
|---|---|---|
| `vad` | false | Generate both `.vad.srt` and `.no_vad.srt` candidates |
`llm_postprocess.*` options:

| Option | Default | Description |
|---|---|---|
| `enabled` | false | Enable LLM-based subtitle correction |
| `provider` | "anthropic" | LLM provider: "anthropic" or "openai" |
| `model` | "claude-3-5-sonnet-20241022" | Model name (provider-specific) |
| `api_key` | None | API key (None = read from environment) |
| `batch_size` | 20 | Segments per LLM request |
| `max_retries` | 3 | Max validation/retry attempts per batch |
| `timeout` | 30 | API request timeout (seconds) |
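The batching and validation ideas behind `batch_size` and `max_retries` can be sketched as follows. This is a hedged illustration only: the function names, the use of `difflib`, and the `min_similarity=0.6` threshold are assumptions for the example and do not describe NemoScribe's actual validation rules.

```python
from difflib import SequenceMatcher


def batches(items, batch_size=20):
    """Split a list of subtitle lines into fixed-size batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def validate_batch(original, corrected, min_similarity=0.6):
    """Return the corrected lines if they look like light edits, else the originals.

    min_similarity is a hypothetical threshold chosen for this sketch.
    """
    if len(corrected) != len(original):
        return list(original)  # the segment count must be preserved
    for before, after in zip(original, corrected):
        ratio = SequenceMatcher(None, before.lower(), after.lower()).ratio()
        if ratio < min_similarity:
            return list(original)  # one heavy rewrite rejects the whole batch
    return list(corrected)


original = ["Herman, grab the hose.", "Their on the second floor."]
good = ["Herrmann, grab the hose.", "They're on the second floor."]
bad = ["xyz", "They're on the second floor."]

print(validate_batch(original, good))  # accepted: small, local edits
print(validate_batch(original, bad))   # rejected: falls back to the originals
```

Rejecting a whole batch keeps subtitle timing aligned with the text, since every line still maps one-to-one onto its original segment.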
`performance.*` options:

| Option | Default | Description |
|---|---|---|
| `calculate_rtfx` | false | Calculate Real-Time Factor (RTFx) |
| `warmup_steps` | 1 | Warmup iterations before timing |
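RTFx is the ratio of audio duration to processing time, so higher is faster. The short sketch below reproduces the example output shown earlier (600 s of audio in 39.5 s); the function name `rtfx` is just for illustration.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


print(f"RTFx={rtfx(600, 39.5):.1f}x realtime")
```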
| Option | Default | Description |
|---|---|---|
| `verbose` | false | Show all NeMo internal logs (useful for debugging) |
| `suppress_repetitive_logs` | true | Suppress repetitive NeMo logs during chunk processing |
| Model | Speed | Accuracy | Features |
|---|---|---|---|
| `nvidia/parakeet-tdt-0.6b-v2` | Fast | Best (EN) | Default. 1.69% WER (LibriSpeech test-clean), auto-punctuation |
| `nvidia/parakeet-tdt-0.6b-v3` | Fast | Excellent | Multilingual (25 languages), auto language detection |
| `nvidia/parakeet-tdt-1.1b` | Medium | Best | Highest accuracy, no auto-punctuation |
| `nvidia/parakeet-ctc-1.1b` | Fastest | Good | Fastest inference |
| `nvidia/canary-1b-v2` | Medium | Good | Multilingual, supports translation |
- English subtitles: `parakeet-tdt-0.6b-v2` (default, best out-of-box experience)
- Multilingual: `parakeet-tdt-0.6b-v3` (25 languages, auto-detection)
- Highest accuracy: `parakeet-tdt-1.1b` (lowest WER, but no punctuation)
- Fastest speed: `parakeet-ctc-1.1b`
- Translation: `canary-1b-v2` (25 languages, transcription + translation)
Note: `parakeet-tdt-1.1b` produces lowercase output without punctuation. The script automatically uses word-level timestamps to generate fine-grained subtitles.

Known model limitation: `parakeet-tdt-0.6b-v2` may drop repeated words and false starts (disfluencies), a known regression vs the 1.1b model (discussion #8). If verbatim disfluencies matter (e.g. stuttered or repeated lines), consider `parakeet-tdt-1.1b`.
Tip: You can try the default model without a GPU via the free hosted API on build.nvidia.com.
The script uses audio chunking to handle long videos (up to 3 hours):
- Automatically splits long audio into smaller chunks (default: 5 minutes)
- Chunks overlap (default: 2 seconds) to ensure accurate boundaries
- Merges subtitles from all chunks, handling duplicates automatically
- Long-audio attention tweaks are gated by `audio.long_audio_threshold` (disabled by default; lower the threshold to enable them)
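The fixed-length part of this scheme can be sketched as below. It only computes overlapping windows from the documented defaults (300 s chunks, 2 s overlap, `0` meaning no chunking); the function name `chunk_bounds` is hypothetical, and the real pipeline additionally moves boundaries to silences and de-duplicates the overlapping subtitles.

```python
def chunk_bounds(total_sec, max_chunk=300.0, overlap=2.0):
    """Return (start, end) windows covering [0, total_sec] with overlap."""
    total_sec = float(total_sec)
    if max_chunk <= 0 or total_sec <= max_chunk:
        return [(0.0, total_sec)]  # 0 means "no chunking"
    if overlap >= max_chunk:
        raise ValueError("overlap must be smaller than max_chunk")
    bounds, start = [], 0.0
    while start < total_sec:
        end = min(start + max_chunk, total_sec)
        bounds.append((start, end))
        if end >= total_sec:
            break
        start = end - overlap  # the next chunk re-reads the last `overlap` seconds
    return bounds


for start, end in chunk_bounds(700):
    print(f"{start:7.1f} -> {end:7.1f}")
```

Because each chunk re-reads the tail of the previous one, words near a boundary appear in both chunks, which is why the merge step has to handle duplicates.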
GPU Memory Recommendations:
| GPU VRAM | max_chunk_duration |
|---|---|
| 8GB | 120-300 (default 300) |
| 16GB | 600 |
| 24GB+ | 0 (no chunking) |
Note: On 8GB GPUs with `compute_dtype=float32`, dialogue-dense content can OOM at the default 300s because smart segmentation may produce longer continuous-speech chunks (verified on Yellowstone S03E01 with an RTX 3070 8GB). If you hit `CUDA out of memory`, set `audio.max_chunk_duration=120`.
The script obtains timestamps in this priority order:
- Segment-level: Direct segment timestamps from model (most accurate)
- Word-level: Word timestamps grouped by line length/duration/gaps
- Fallback: Estimated by speech rate (~150 words/min) when no timestamps available
Auto Fallback: If the average segment duration exceeds `max_segment_duration * 2` (e.g., with models that produce no punctuation), the script automatically switches to word-level timestamps.
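The last-resort estimate and the auto-fallback check can be sketched as follows. Both functions are illustrative assumptions: the names are hypothetical, the estimate assumes a constant 150 words/min rate with no pauses, and the fallback check interprets "segment length" as duration in seconds.

```python
def estimate_segments(sentences, words_per_minute=150.0, start=0.0):
    """Assign (start, end, text) to sentences assuming a constant speech rate."""
    sec_per_word = 60.0 / words_per_minute
    out, t = [], start
    for text in sentences:
        duration = len(text.split()) * sec_per_word
        out.append((t, t + duration, text))
        t += duration
    return out


def should_use_word_level(segments, max_segment_duration=5.0):
    """True when the average segment is longer than twice the allowed duration."""
    if not segments:
        return False
    avg = sum(end - start for start, end, _ in segments) / len(segments)
    return avg > max_segment_duration * 2


print(estimate_segments(["Welcome to our show today."]))
print(should_use_word_level([(0.0, 30.0, "a long unpunctuated run of words")]))
```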
```
nemoscribe/
├── __init__.py          # Package entry, version info
├── __main__.py          # python -m nemoscribe support
├── cli.py               # CLI parsing and entry point
├── config.py            # All dataclass configurations
├── audio.py             # Audio processing with ffmpeg
├── vad.py               # Voice Activity Detection
├── transcriber.py       # ASR model and transcription
├── srt.py               # SRT formatting and output
├── postprocess.py       # ITN, segment merging
├── llm_postprocess.py   # LLM-based subtitle correction
└── log_utils.py         # Log filtering
```
.mp4, .mkv, .avi, .mov, .webm, .m4v
```srt
1
00:00:00,120 --> 00:00:03,450
Welcome to our show today.

2
00:00:03,680 --> 00:00:07,200
We have an exciting episode planned for you.

3
00:00:07,450 --> 00:00:11,800
Let's get started with our first topic.
```

```bash
# Run all tests
uv run python tests/test_improvements.py

# Run specific test
uv run python tests/test_improvements.py --test vad
uv run python tests/test_improvements.py --test itn
uv run python tests/test_improvements.py --test segmentation
uv run python tests/test_improvements.py --test metrics

# Available tests: baseline, vad, itn, decoding, nemo_api, segmentation, merging, performance, ab_test, metrics, srt, srt_edge, path, cli, cli_list, llm, llm_cli, llm_validation, llm_parsing, llm_fallback, llm_validation_fallback, full
```

- baseline_config: Default configuration backward compatibility
- vad_config: VAD configuration correctness
- itn_functions: ITN normalization functionality
- decoding_config: Decoding configuration (CUDA graphs)
- nemo_api_compatibility: NeMo decoding config alias compatibility
- smart_segmentation: Smart segmentation logic
- segment_merging: Overlapping segment merging
- performance_config: Performance configuration
- ab_test_config: VAD A/B test configuration and output path helpers
- quality_metrics: WER/CER calculation
- srt_formatting: SRT formatting
- srt_edge_cases: SRT edge case handling (empty segments, special characters)
- path_validation: Path validation and security checks
- cli_config_override: CLI configuration override functionality
- llm_config: LLM post-processing configuration defaults
- llm_cli_override: LLM CLI parameter overrides
- llm_validation: Batch result similarity validation
- llm_parsing: JSON response parsing and prompt building
- llm_fallback: Graceful fallback when disabled or no API key
- llm_validation_fallback: Invalid LLM corrections fall back to the original batch
- full_config: Complete configuration combination
Calculate transcription quality using NeMo's official tools:
```python
from tests.test_improvements import calculate_transcription_quality

result = calculate_transcription_quality(
    hypothesis="transcribed text",
    reference="ground truth text",
)
print(f"WER: {result['wer']:.2%}")
print(f"CER: {result['cer']:.2%}")
```

Output includes: `wer`, `cer`, `insertion_rate`, `deletion_rate`, `substitution_rate`
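For readers who want to see what WER and CER measure without installing NeMo, here is a dependency-free sketch based on Levenshtein edit distance. It is not the helper used by the test suite, and it applies no text normalization (case, punctuation), so its numbers may differ from the NeMo-based results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
print(f"CER: {cer('abc', 'abd'):.2%}")
```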
If you hit `CUDA out of memory`, reduce the chunk size:

```bash
uv run nemoscribe video_path=video.mp4 audio.max_chunk_duration=180.0
```

Use a model with timestamp support (`parakeet-tdt-*` recommended) and adjust segmentation parameters:

```bash
uv run nemoscribe video_path=video.mp4 \
    subtitle.max_segment_duration=3.0 \
    subtitle.word_gap_threshold=0.5
```

Models are automatically downloaded from HuggingFace/NGC on first use. For slow connections:

```bash
# Use HuggingFace mirror (China mainland)
export HF_ENDPOINT=https://hf-mirror.com
```

How is NemoScribe different from OpenAI Whisper?
NemoScribe uses NVIDIA NeMo's Parakeet-TDT models, which rank above Whisper-large-v3 in English word error rate on the HuggingFace Open ASR Leaderboard while being far smaller and faster (up to ~240× realtime on GPU). Whisper supports more languages out of the box; for English video subtitling, Parakeet-TDT typically gives better accuracy per second of compute, plus native word-level timestamps.

Does it work offline?
Yes. After the first model download, transcription runs fully offline on your machine. Only the optional LLM post-processing step requires an internet connection.

Can it generate subtitles for languages other than English?
Yes. Switch to `nvidia/parakeet-tdt-0.6b-v3` (25 languages with auto-detection) or `nvidia/canary-1b-v2` (transcription + translation) via `pretrained_name=...`.
Do I need a GPU?
A CUDA-capable NVIDIA GPU is strongly recommended (8GB VRAM is enough). CPU mode works (cuda=-1) but is much slower.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository at github.com/charles1018/NemoScribe
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
For bug reports and feature requests, please open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.
NemoScribe is built upon the following open-source projects:
- NVIDIA NeMo - Neural Modules toolkit for conversational AI (Apache 2.0 License)
- Parakeet-TDT - NVIDIA's state-of-the-art ASR model (CC-BY-4.0 License)
We thank NVIDIA for making these excellent tools and models available to the community.
| Resource | Description |
|---|---|
| nvidia/parakeet-tdt-0.6b-v2 | Default model, architecture and best practices |
| nvidia/parakeet-tdt-0.6b-v3 | Multilingual version, 25 languages |
| nvidia/canary-1b-v2 | Multilingual with translation support |
| HuggingFace Space Demo | Official demo with long audio handling |
| File Path | Description |
|---|---|
| `examples/asr/transcribe_speech.py` | Main architecture reference |
| `nemo/collections/asr/parts/utils/transcribe_utils.py` | Core utilities: `get_inference_device()`, `get_inference_dtype()` |
| `nemo/collections/asr/parts/utils/rnnt_utils.py` | `Hypothesis` class, timestamp data structure |
Long Audio Optimization (from HuggingFace Space):

```python
# `model` is an already-loaded NeMo ASR model.
# Switch to local attention for memory efficiency on audio >8 minutes
model.change_attention_model("rel_pos_local_attn", [256, 256])
model.change_subsampling_conv_chunking_factor(1)  # 1 = auto select
```

Timestamp Data Structure (from Hypothesis):

```python
{
    'segment': [{'start': float, 'end': float, 'segment': str}, ...],
    'word': [{'start': float, 'end': float, 'word': str}, ...],
    'char': [...]  # character-level timestamps
}
```

If NemoScribe saves you time, consider giving it a ⭐. It helps others find the project!
NemoScribe: automatic subtitle generation, video transcription, and speech-to-text for everyone.