Download, transcribe, and summarize podcast episodes. Fetches transcripts from RSS feeds (Podcasting 2.0), generates them when missing, detects speakers, creates summaries, and optionally extracts structured insights (GIL) and knowledge graphs (KG). Use local models (Whisper, transformers) or cloud APIs (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama) — your choice.
Learning project: This is a personal project where I'm exploring AI-assisted coding and hands-on work with edge and cloud AI/ML technologies.
Personal use only. Downloaded content must remain local and not be redistributed. See Legal Notice.
- Transcript Downloads: Automatic detection and download from RSS feeds
- Episode selection: Order (`newest`/`oldest`), optional publish-date window (`--since`/`--until`), offset, and `max_episodes` for large back-catalogs (CONFIGURATION.md, GitHub #521)
- Multi-feed corpus: One config or CLI invocation for multiple shows: `feeds` / `rss_urls` in YAML, `--feeds-spec` for structured `feeds.spec.yaml` / JSON, or repeatable `--rss` / legacy `--rss-file`; isolated output under `<output_dir>/feeds/<stable_id>/` per feed. With `vector_search` + FAISS, a single parent index is built under `<output_dir>/search` after all feeds finish; `corpus_manifest.json`, `corpus_run_summary.json`, and structured log lines record batch status (RFC-063, CONFIGURATION.md). Inspect offline: `python -m podcast_scraper.cli corpus-status --output-dir <corpus_parent>`.
- Transcription: Generate transcripts with Whisper, OpenAI API, or Google Gemini API
- Audio Preprocessing: Optimize audio files before transcription (reduce size, remove silence, normalize loudness)
- Speaker Detection: Identify speakers using spaCy NER, OpenAI, Google Gemini, Grok (real-time info), or other providers
- Summarization: Episode summaries via local transformers (BART/LED), hybrid MAP-REDUCE (`hybrid_ml`), or any of 7 LLM providers
- GIL Extraction: Structured insights and verbatim quotes with evidence grounding (`gi.json` per episode, PRD-017)
- KG Extraction: Topic-entity knowledge graphs from transcripts (`kg.json` per episode, RFC-055)
- Metadata Generation: JSON/YAML metadata per episode
- Resumable: Skip existing files, handle interruptions gracefully
- Provider System: Swap between local and cloud providers via config
- MPS Exclusive Mode: Serialize GPU work on Apple Silicon to prevent memory contention and improve reliability (enabled by default)
- Reproducibility: Deterministic runs with seed control, pinned model revisions, and comprehensive run manifests (Issue #379)
- Operational Hardening: Configurable HTTP retries (media vs RSS), application-level episode retries on transient errors, timeout enforcement, `--fail-fast` / `--max-failures`, structured JSON logging, and `failure_summary` in `run.json` when episodes fail (CONFIGURATION.md: Download resilience, Issue #379)
- Security: Path validation, model allowlist validation, safetensors format preference, and `trust_remote_code=False` enforcement (Issue #379)
- Diagnostics: `doctor` command for environment validation and dependency checks (Issue #379)
- Semantic corpus search: Optional FAISS index (`vector_search` in config), `search` / `index` CLIs, and semantic `gi explore --topic` when an index exists (guide, RFC-061)
- Run Tracking: Per-episode stage timings, run summaries, episode index files, and `metrics.json` fields for download retries (`http_urllib3_retry_events`, `episode_download_retries`) (Issue #379)
- Live pipeline monitor: Optional `--monitor`: separate process + `rich` dashboard (or `.monitor.log` if stderr is not a TTY) and `.pipeline_status.json`. Optional `pip install -e ".[monitor]"`: `--memray` for heap profiling; with monitor + TTY, `f` runs py-spy to `debug/flamegraph_*.svg` (RFC-065, guide, #512)
- GI / KG Viewer (v2): Optional browser UI: graph visualization, dashboard, semantic search, explore, Corpus Library (feed/episode browser, RFC-067), and Corpus Digest (time-windowed highlights, RFC-068), all against a pipeline output folder (RFC-062; see below)
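
The run-tracking and hardening bullets above name a few fields that are handy to check after a batch. A minimal sketch, assuming `failure_summary` sits at the top level of `run.json` and the two retry counters sit at the top level of `metrics.json`. The source names these fields but not their nesting, so treat the lookups as assumptions and adjust them to the real schema. `summarize_run` and `load_json` are hypothetical helper names, not part of the package.

```python
import json
from pathlib import Path


def load_json(path: Path) -> dict:
    """Return the parsed JSON object, or an empty dict if the file is missing."""
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return data if isinstance(data, dict) else {}


def summarize_run(output_dir: str) -> dict:
    """Collect failure and retry information from a pipeline output folder."""
    root = Path(output_dir)
    run = load_json(root / "run.json")
    metrics = load_json(root / "metrics.json")
    return {
        # Assumption: top-level key, present only when episodes failed.
        "failure_summary": run.get("failure_summary"),
        # Assumption: top-level counters; reported as 0 when absent.
        "http_urllib3_retry_events": metrics.get("http_urllib3_retry_events", 0),
        "episode_download_retries": metrics.get("episode_download_retries", 0),
    }
```

Calling `summarize_run("./output")` after a run returns a small dict you can print or feed into a scheduler's alerting step.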
The Python package (CLI, pipeline, FastAPI) lives at the repo root (pyproject.toml,
Makefile, tests/). The GI/KG viewer is a Node app under web/gi-kg-viewer/
(package.json, Vite, Vitest, Playwright). They share one repository but use separate
install steps and different .env.example files (root vs web/gi-kg-viewer/) by design.
Onboarding in one place: Polyglot repository guide.
Browse Grounded Insight and Knowledge Graph artifacts, use semantic search and explore on your corpus, and view a dashboard—all against the same --output-dir you use for the pipeline.
| Goal | Install | Notes |
|---|---|---|
| API + built UI in one process | `pip install -e ".[dev]"` and, once, `cd web/gi-kg-viewer && npm install && npm run build` | `make init` installs `[dev,ml,llm]`; FastAPI ships with `[dev]`. |
| Graph only, no Python API | Just open the UI (e.g. Vite dev) and use Choose .gi.json / .kg.json files | Works when `/api/health` fails; no list/search/index from the server. |
| Semantic search / index stats | `[dev]` + `[ml]` (FAISS, embeddings) as for podcast search | Index lives under `<output_dir>/search/`. See Semantic Search Guide. |
From the repository root, with your virtualenv active:
pip install -e ".[dev]"
cd web/gi-kg-viewer && npm install && npm run build && cd ../..
python -m podcast_scraper.cli serve --output-dir /path/to/your/run

Then open http://127.0.0.1:8000 (default port). In the sidebar, set Corpus root folder to that same directory, click List files, select .gi.json / .kg.json, and Load selected into graph.
FastAPI / HTTP API: Full /api/* reference, architecture, and tests — Server Guide. With the server running, interactive OpenAPI (Swagger) is at /docs and /openapi.json.
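
Before opening the browser it can help to confirm that the server is answering. A minimal sketch using only the standard library: it probes `/api/health` (the path mentioned in the table above) on the default port and treats any HTTP 200 as "up". The shape of the health response is not documented here, so the body is deliberately ignored. `api_url` and `is_reachable` are hypothetical helper names.

```python
import http.client
import urllib.error
import urllib.request


def api_url(base: str, path: str) -> str:
    """Join a base URL and a path without doubling or dropping the slash."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def is_reachable(base: str, path: str = "/api/health", timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(api_url(base, path), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx response all count as "down".
        return False
```

For example, `is_reachable("http://127.0.0.1:8000")` checks the health endpoint, and `is_reachable("http://127.0.0.1:8000", "/openapi.json")` checks that the OpenAPI document is served.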
Development (API + hot-reload UI): make serve SERVE_OUTPUT_DIR=/path/to/your/run runs the FastAPI app and the Vite dev server together (UI proxied to the API). See web/gi-kg-viewer/README.md and the Server Guide above.
Testing the viewer: make test-ui (Vitest unit tests, ~1s) and make test-ui-e2e (Playwright browser E2E, Firefox).
The fastest way to try the platform end-to-end — no Python venv, no Node install, no provider keys required. You get the viewer, the API, and an on-demand pipeline runner all wired together; you drive the platform from your browser.
make stack-test-build # build api + viewer + pipeline images (~5–15 min first time)
make stack-test-up # bring up viewer (8090) + api + bundled mock RSS fixtures

Then open http://127.0.0.1:8090, open Configuration → Feeds, add an RSS URL (or use the bundled fixture http://mock-feeds/p01_fast_with_transcript.xml), pick a profile under Job Profile (airgapped_thin runs locally with Whisper + transformers, no cloud calls), click Save, then Dashboard → Pipeline → Run pipeline job. Artifacts populate the Library, Digest, Search, and Graph tabs as the job runs.
To use cloud providers (OpenAI Whisper API + Gemini for everything else — faster + better quality), put OPENAI_API_KEY and GEMINI_API_KEY in a repo-root .env file, then:
make stack-test-build-cloud # adds the cloud-thin pipeline-llm image
# in the UI, choose the "cloud_thin" profile under Job Profile and Save

Tear down (preserves the corpus): make stack-test-down. Tear down + drop everything: make stack-test-down STACK_TEST_DOWN_VOLUMES=1.
Full guide with architecture, troubleshooting, and production hints: Docker Compose guide.
For development, scripted CLI runs, or when you'd rather not run the whole stack, install the package in a venv and use the CLI directly. The Docker Compose path above remains the recommended way to operate the platform.
- Python 3.10+: Check with `python3 --version`
- ffmpeg (only needed for local Whisper transcription):
  - macOS: `brew install ffmpeg`
  - Linux: `apt install ffmpeg` or `yum install ffmpeg`
  - Windows: Download from ffmpeg.org
  - Note: Not required if using OpenAI providers only
- Graphviz (only needed to regenerate architecture diagrams locally):
  - macOS: `brew install graphviz`
  - Linux: `apt install graphviz` or `yum install graphviz`
  - Windows: Download from graphviz.org
  - Note: Required for `make visualize`. Diagrams must be committed; `make ci` / `make ci-fast` and CI run `visualize` and fail if they are stale.
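
The prerequisites above can be checked in one pass before installing anything. A minimal sketch using only the standard library; it is not the project's `doctor` command, and `check_prerequisites` is a hypothetical helper name. It assumes Graphviz is detected through its `dot` executable.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)


def check_prerequisites(need_whisper: bool = True, need_diagrams: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append("Python 3.10+ required, found " + found)
    # ffmpeg is only needed for local Whisper transcription.
    if need_whisper and shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Graphviz is only needed to regenerate architecture diagrams.
    if need_diagrams and shutil.which("dot") is None:
        problems.append("Graphviz 'dot' not found on PATH")
    return problems
```

Running `print(check_prerequisites(need_whisper=True, need_diagrams=True))` lists anything missing for a full local setup.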
Choose the installation method based on your use case:
| Use Case | Installation Command | What You Get | Disk Space |
|---|---|---|---|
| LLM APIs only | `pip install -e ".[llm]"` | Core + OpenAI, Gemini, Anthropic, Mistral SDKs | ~50MB |
| Local ML only | `pip install -e ".[ml]"` | Core + Whisper, spaCy, torch, transformers, FAISS, llama-cpp-python (GGUF), etc. | ~1-3GB |
| Both (recommended) | `pip install -e ".[ml,llm]"` | Local ML + all LLM API SDKs | ~1-3GB |
| Run comparison UI (eval runs) | `pip install -e ".[compare]"` | Streamlit compare tool (RFC-047; `make run-compare`) | moderate |
| GI/KG viewer API | `pip install -e ".[dev]"` | FastAPI + uvicorn for `python -m podcast_scraper.cli serve` | small |
| Monitor profiling extras (optional) | `pip install -e ".[monitor]"` | py-spy + memray integrated with `--monitor` / `--memray` (RFC-065); not required for `--monitor` alone | small |
Quick decision guide:
- Using LLM APIs (OpenAI, Gemini, etc.)? → `pip install -e ".[llm]"`
- Want to run models locally (Whisper, spaCy, Transformers)? → `pip install -e ".[ml]"`
- Want both? → `pip install -e ".[ml,llm]"` (recommended)
- Contributing / developing? → `pip install -e ".[dev,ml,llm]"`
Note: LLM provider SDKs (openai, google-genai, anthropic, mistralai, httpx) are not in core — they require the [llm] extra. Core (pip install -e .) gives you the pipeline framework only.
Corpus indexing (vector_search with the default FAISS backend) calls the embedder with allow_download=False, so sentence-transformers weights must already be on disk in the Hugging Face hub cache. Before offline or sandbox runs, run make preload-ml-models and omit SKIP_GIL=1 so evidence embeddings (including the model named by vector_embedding_model) are pre-cached; GIL’s gi_embedding_model is checked separately when GIL uses transformers. If you set HF_HUB_CACHE, use the same value when preloading and when running the pipeline. See Semantic Search Guide.
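
Because indexing does not download weights, it is worth confirming that a model is already in the hub cache before an offline run. A minimal sketch using only the standard library: it resolves the cache directory from `HF_HUB_CACHE` (falling back to `HF_HOME/hub`, then `~/.cache/huggingface/hub`) and looks for the hub's `models--<org>--<name>` folder. It only checks that a snapshot folder exists, not that every file is complete, and it ignores `XDG_CACHE_HOME`. The helper names are hypothetical, and the repo id in the usage note is an example, not necessarily the project's default.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the Hugging Face hub cache directory from the environment."""
    explicit = os.environ.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def is_model_cached(repo_id: str) -> bool:
    """Check for a models--<org>--<name> folder with at least one snapshot."""
    folder = hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))
    snapshots = folder / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

For example, `is_model_cached("sentence-transformers/all-MiniLM-L6-v2")` reports whether that (illustrative) embedding model is on disk; substitute the value of `vector_embedding_model` from your config.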
Tip: For end users who just want to run the tool, consider using pipx or uv for easier installation. See the Installation Guide for all methods.
Use the latest released version for normal usage.
For LLM API users (OpenAI, Gemini, etc. — no local ML):
# Clone the repository
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper
git checkout <latest-release-tag> # e.g. v2.6.0
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Verify Python version (must be 3.10+)
python --version # Should show Python 3.10.x or higher
# CRITICAL: Upgrade pip and setuptools before installing
pip install --upgrade pip setuptools wheel
# Install with LLM provider SDKs (OpenAI, Gemini, Anthropic, Mistral, etc.)
pip install -e ".[llm]"

For local ML users (Whisper, spaCy, Transformers):
# Same setup as above, then:
# Install package with ML dependencies (includes Whisper, spaCy, Transformers)
pip install -e ".[ml]"

Note: LLM provider SDKs require the [llm] extra. For both local ML and LLM APIs: pip install -e ".[ml,llm]".
For isolated installation without managing virtual environments:
For LLM-only users:
# Install pipx (if not already installed)
# macOS: brew install pipx
# Linux: pip install --user pipx && pipx ensurepath
# Clone and install with LLM SDKs
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper
pipx install -e ".[llm]"
# Verify
python -m podcast_scraper.cli --help

For local ML users:
# Same as above, but install with ML dependencies:
pipx install -e ".[ml]"

See Installation Guide for details.
For faster installation using uv:
For LLM API users:
# Install uv (if not already installed)
# macOS/Linux: curl -LsSf https://astral.sh/uv/install.sh | sh
# macOS: brew install uv
# Clone and install with LLM SDKs
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper
uv venv
source .venv/bin/activate
uv pip install -e ".[llm]"

For local ML users:
# Same as above, but install with ML dependencies:
uv pip install -e ".[ml]"

See Installation Guide for details.
For most users, the Docker Compose quick start above is the recommended way to run the platform: it brings up the viewer, the API, and on-demand pipeline jobs as a single stack.
The single-container docker run style below is appropriate when you want the pipeline as a long-running service triggered by an external scheduler (cron, systemd, supervisord), not via the operator UI.
LLM-only variant (small, fast, ~500 MB):
# Build LLM-only image
docker build --build-arg INSTALL_EXTRAS=llm -f docker/pipeline/Dockerfile -t podcast-scraper:llm .
# Run with config file
docker run -v ./config.yaml:/app/config.yaml \
-v ./output:/app/output \
-e OPENAI_API_KEY=sk-your-key \
podcast-scraper:llm

ML-enabled variant (full features, ~3-5 GB):
# Build ML-enabled image (default)
docker build --build-arg INSTALL_EXTRAS=ml -f docker/pipeline/Dockerfile -t podcast-scraper:ml .
# Run with config file
docker run -v ./config.yaml:/app/config.yaml \
-v ./output:/app/output \
podcast-scraper:ml

See Docker Compose guide for the recommended end-to-end stack flow, Docker Service guide for the single-container deployment style, and Docker Variants guide for image-tier (ml vs llm) trade-offs.
Use this if you are contributing or experimenting. Cloning without checking out a release tag leaves you on the development branch, which may contain unreleased changes.
# Clone the repository
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Verify Python version (must be 3.10+)
python --version # Should show Python 3.10.x or higher
# CRITICAL: Upgrade pip and setuptools before installing
# This is required for editable installs with pyproject.toml
pip install --upgrade pip setuptools wheel
# Install package with ML dependencies
pip install -e ".[ml]"

Important Notes:
- Python 3.10+ is REQUIRED: The project uses features that require Python 3.10 or higher. Always verify with `python --version` after activating the venv.
- Installation is required: You must run `pip install -e .` (or `pip install -e ".[ml]"` for ML) before running CLI commands. Without it, you'll get `ModuleNotFoundError: No module named 'podcast_scraper'`.
- LLM API users: If you're using LLM-based providers (OpenAI, Gemini, Anthropic, etc.), install with `pip install -e ".[llm]"`. LLM provider SDKs are not in core; the `[llm]` extra is required.
- Local ML users: If you want to use local Whisper, spaCy, or Transformers, install with `pip install -e ".[ml]"` to get ML dependencies.
- Upgrade pip/setuptools first: If you see the `"editable mode currently requires a setuptools-based build"` error, run `pip install --upgrade pip setuptools wheel` and try again.
- Always activate the venv: Remember to activate your virtual environment (`source .venv/bin/activate`) before running any commands.
If you plan to use LLM-based providers (OpenAI, Gemini, etc.) for transcription, speaker detection, or summarization, you must supply the matching API key, either exported in your shell or in a .env file at the repo root:
# Copy the template
cp config/examples/.env.example .env
# Edit .env and add your LLM API key (REQUIRED for LLM providers)
# For OpenAI: OPENAI_API_KEY=sk-your-actual-api-key-here
# For Gemini: GEMINI_API_KEY=your-gemini-api-key-here

Important variables:
- `OPENAI_API_KEY`: Required if using OpenAI providers (transcription, speaker detection, or summarization)
- `GEMINI_API_KEY`: Required if using Gemini providers (transcription, speaker detection, or summarization)
- `LOG_LEVEL`: Controls logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `OUTPUT_DIR`: Custom output directory (default: `./output/`)
- `CACHE_DIR`: ML model cache location (only needed for local ML providers)
- Performance tuning variables (`WORKERS`, `TIMEOUT`, etc.)
See config/examples/.env.example for all available options and detailed documentation.
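
A quick way to catch a missing key before a long run is to check both the environment and the `.env` file. A minimal sketch using only the standard library: the key names come from the list above, while the provider labels, the helper names, and the simplified `KEY=VALUE` parsing (no multi-line values or `export` prefixes) are assumptions for illustration, not the project's own loader.

```python
import os
from pathlib import Path

# Key names are from the variable list; provider labels are illustrative.
REQUIRED_KEYS = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(providers: list, env_file: str = ".env") -> list:
    """Return key names set neither in the environment nor in the file."""
    file_values = read_env_file(env_file)
    missing = []
    for provider in providers:
        key = REQUIRED_KEYS[provider]
        if not (os.environ.get(key) or file_values.get(key)):
            missing.append(key)
    return missing
```

For example, `missing_keys(["openai", "gemini"])` returns an empty list when both keys are available from either source.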
Before running, verify the installation worked:
# Make sure venv is activated
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Test that the package is installed
python -c "import podcast_scraper; print('[ok] Installation successful')"
# Check CLI is available
python -m podcast_scraper.cli --help

Prerequisite: Make sure you've completed the installation steps above and activated your virtual environment.
After install, most people want packaged defaults (profile), their own knobs (operator YAML), and where the RSS URLs live (feeds file or spec). The CLI wires all three together; you do not need to merge YAML by hand.
# From the repo root, venv active, API keys in .env or the environment:
python -m podcast_scraper.cli \
--profile cloud_balanced \
--config config/manual/operator_defaults.yaml \
--feeds-spec config/manual/feeds.spec.registry_10.yaml

| Flag | Role |
|---|---|
| `--profile NAME` | Named preset under `config/profiles/<NAME>.yaml` (e.g. `cloud_balanced`, `local`, `cloud_quality`). Merged first as defaults. |
| `--config PATH` | Operator YAML: `output_dir`, `max_episodes`, `workers`, booleans, etc. Explicit keys override the profile. |
| Feeds (choose one) | `--feeds-spec PATH`: structured file whose root has a `feeds` array (URLs as strings or `{ url: … }` objects); same shape as corpus `feeds.spec.yaml`. `--rss-file PATH`: legacy, one RSS URL per line. Positional URL or repeatable `--rss URL`: single-feed or multi-feed without a file. |
Do not combine --feeds-spec with --rss-file or with explicit RSS URL arguments in the same invocation (the CLI rejects that mix).
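
The feeds-spec shape and the "choose one" rule can be illustrated in a few lines. A minimal sketch operating on an already-parsed spec (a plain dict); the function names are hypothetical and the CLI's real validation and error messages may differ.

```python
def normalize_feeds(spec: dict) -> list:
    """Flatten a spec whose root has a `feeds` array of strings or {url: ...} objects."""
    urls = []
    for entry in spec.get("feeds", []):
        if isinstance(entry, str):
            urls.append(entry)
        elif isinstance(entry, dict) and "url" in entry:
            urls.append(entry["url"])
        else:
            raise ValueError("Unsupported feed entry: %r" % (entry,))
    return urls


def check_feed_sources(feeds_spec=None, rss_file=None, rss_urls=()):
    """Reject --feeds-spec combined with --rss-file or explicit RSS URLs."""
    if feeds_spec and (rss_file or rss_urls):
        raise ValueError(
            "--feeds-spec cannot be combined with --rss-file or RSS URL arguments"
        )
```

For example, `normalize_feeds({"feeds": ["https://example.com/a.xml", {"url": "https://example.com/b.xml"}]})` yields both URLs as a flat list.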
Alternative: --config config/profiles/cloud_balanced.yaml alone is valid when the feed URL(s) live inside that YAML (rss / rss_urls / feeds) or you pass a URL on the command line. Use --profile + --config when your operator file is small (paths, limits, flags) and the profile stays the packaged preset.
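
The precedence between `--profile` and `--config` comes down to "profile first, operator keys win". A minimal sketch of that rule on plain dicts; it assumes a shallow, key-by-key merge, which the text above does not confirm for nested sections, and `merge_config` is a hypothetical name.

```python
def merge_config(profile: dict, operator: dict) -> dict:
    """Apply profile values as defaults, then let explicit operator keys override."""
    merged = dict(profile)   # profile is merged first as defaults
    merged.update(operator)  # explicit operator keys override the profile
    return merged


# Illustrative values only; key names are taken from the flag table.
profile = {"max_episodes": 10, "workers": 4}
operator = {"max_episodes": 50, "output_dir": "./output"}
effective = merge_config(profile, operator)
# effective == {"max_episodes": 50, "workers": 4, "output_dir": "./output"}
```

Keys the operator file leaves out (here `workers`) keep the profile's value, which is why a small operator file is enough.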
In-repo references: config/examples/feeds.spec.example.yaml, config/manual/operator_defaults.yaml, config/profiles/cloud_balanced.yaml. Deeper detail: CONFIGURATION.md (RSS / multi-feed), RSS_GUIDE.md.
The easiest way to get started is using an example config file:
# Make sure venv is activated
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Set your API key (if using LLM providers)
export OPENAI_API_KEY=sk-your-actual-api-key-here
# Run with an example config file
python -m podcast_scraper.cli --config config/examples/config.example.yaml

Available example config:
- `config/examples/config.example.yaml`: Example configuration with local ML providers (Whisper, spaCy, Transformers)
To customize: Copy the example config and edit the rss field with your podcast feed URL:
# Copy the example config
cp config/examples/config.example.yaml my-config.yaml
# Edit my-config.yaml and change the RSS feed URL
# Then run:
python -m podcast_scraper.cli --config my-config.yaml

Replace https://example.com/feed.xml with your podcast's RSS feed URL.
For LLM-only users (no ML dependencies):
# Make sure venv is activated
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Set LLM API key (if not in .env file)
# For OpenAI:
export OPENAI_API_KEY=sk-your-actual-api-key-here
# For Gemini:
export GEMINI_API_KEY=your-gemini-api-key-here
# Download transcripts (automatically generates missing ones with LLM API)
# Using OpenAI:
python -m podcast_scraper.cli https://example.com/feed.xml \
--transcription-provider openai
# Using Gemini:
python -m podcast_scraper.cli https://example.com/feed.xml \
--transcription-provider gemini
# Only download existing transcripts (skip transcription)
python -m podcast_scraper.cli https://example.com/feed.xml \
--no-transcribe-missing
# Full processing with OpenAI providers: transcripts + speaker detection + summaries + metadata
python -m podcast_scraper.cli https://example.com/feed.xml \
--transcription-provider openai \
--speaker-detector-provider openai \
--summary-provider openai \
--generate-metadata \
--generate-summaries
# Full processing with Gemini providers:
python -m podcast_scraper.cli https://example.com/feed.xml \
--transcription-provider gemini \
--speaker-detector-provider gemini \
--summary-provider gemini \
--generate-metadata \
--generate-summaries

For local ML users (with ML dependencies installed):
# Make sure venv is activated
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Download transcripts (automatically generates missing ones with local Whisper)
python -m podcast_scraper.cli https://example.com/feed.xml
# Full processing with local ML: transcripts + speaker detection + summaries + metadata
python -m podcast_scraper.cli https://example.com/feed.xml \
--generate-metadata \
--generate-summaries

Using a config file:
See the Basic Usage with Example Config section above for the recommended approach. You can also use any config file from the config/examples/ directory or create your own.
Output: Files are organized in `output/` with subdirectories:
- `transcripts/`: Transcript files (RSS may serve `.vtt` / `.srt`; the pipeline normalizes those to `.txt` plus `*.segments.json` for GI quote audio timing when cues parse cleanly; plain `.txt` / `.html` feeds have no segment timing unless you add a sidecar)
- `metadata/`: JSON/YAML metadata, plus `gi.json` (GIL) and `kg.json` (KG) when enabled
- `search/`: FAISS vector index (when `vector_search` is enabled)
- `run.json`, `index.json`, `metrics.json`: Run tracking artifacts
Multi-feed corpora add feeds/<stable_id>/ per feed, plus corpus_manifest.json and corpus_run_summary.json at the corpus root.
Use --output-dir to customize the location (default: ./output/).
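
The layout above is easy to verify after a run. A minimal sketch using only the standard library: it reports which of the listed directories and files exist and lists any per-feed folders under `feeds/`. `describe_output` is a hypothetical helper; a missing `search/` is expected when `vector_search` is disabled.

```python
from pathlib import Path

EXPECTED_DIRS = ("transcripts", "metadata", "search")
EXPECTED_FILES = ("run.json", "index.json", "metrics.json")


def describe_output(output_dir: str) -> dict:
    """Report which expected artifacts exist under a pipeline output folder."""
    root = Path(output_dir)
    report = {name: (root / name).is_dir() for name in EXPECTED_DIRS}
    report.update({name: (root / name).is_file() for name in EXPECTED_FILES})
    feeds_dir = root / "feeds"
    # Multi-feed corpora keep one folder per feed under feeds/<stable_id>/.
    if feeds_dir.is_dir():
        report["feeds"] = sorted(p.name for p in feeds_dir.iterdir() if p.is_dir())
    else:
        report["feeds"] = []
    return report
```

Calling `describe_output("./output")` returns a dict such as `{"transcripts": True, ..., "feeds": []}` that you can print or assert on in a smoke test.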
Problem: `ModuleNotFoundError: No module named 'podcast_scraper'`
Solution: Make sure you:
- Activated the virtual environment: `source .venv/bin/activate`
- Installed the package: `pip install -e ".[ml]"`
- Are using the venv's Python: `python -m podcast_scraper.cli` (not `python3` if system Python is different)
Problem: "editable mode currently requires a setuptools-based build"
Solution: Upgrade pip and setuptools first:
pip install --upgrade pip setuptools wheel
pip install -e ".[ml]"

Problem: Python version is < 3.10
Solution: Create venv with a newer Python:
# Find available Python versions
python3.11 --version # or python3.12, etc.
# Create venv with specific version
python3.11 -m venv .venv
source .venv/bin/activate

For more help, see Troubleshooting Guide.
| Resource | Description |
|---|---|
| Roadmap | Project roadmap with prioritized PRDs and RFCs |
| Architecture | System design and module responsibilities |
| Testing Strategy | Testing approach and test pyramid |
| CLI Reference | All command-line options |
| Configuration | Config files and environment variables |
| Server Guide | FastAPI viewer API, endpoints, development |
| Semantic Search Guide | FAISS indexing, search CLI, configuration |
| AI Provider Comparison | Compare all providers: cost, quality, speed, privacy |
| Provider Deep Dives | Per-provider reference cards, benchmarks, and magic quadrant |
| Experiment Guide | Eval datasets, baselines, experiments (data/eval/) |
| Performance Profiles | Per-release stage timing snapshots (data/profiles/) |
| Guides | Development, testing, and usage guides |
| Troubleshooting | Common issues and solutions |
| Full Documentation | Complete docs site |
Contributing? See CONTRIBUTING.md.
MIT License — see LICENSE.
Note: The license applies to source code only, not to podcast content downloaded using this software. See Legal Notice.