
BF667-IDLE/Hyper-RVC


Hyper RVC WebUI


An automated pipeline that creates covers with any RVC v2 trained AI voice from YouTube videos or local audio files. It is aimed at developers who want to add singing to an AI assistant, chatbot, or VTuber, and at anyone who wants to hear a favourite character sing a favourite song.


Project Structure

Hyper-RVC/
├── app.py                      # Main WebUI entry point (Gradio)
├── main.py                     # Legacy entry point (redirects to app.py)
├── core.py                     # Backward-compatible shim (re-exports from main)
├── cli.py                      # CLI interface
│
├── main/                       # Core processing modules
│   ├── __init__.py             # Package init + audioop 3.13 shim
│   ├── core.py                 # Pipeline orchestrator (full_inference_program)
│   │
│   ├── uvr/                    # Audio separation (vocal, karaoke, dereverb, deecho, denoise)
│   │   ├── __init__.py
│   │   ├── separator.py        # High-level separation functions
│   │   └── models/             # Separation model architectures
│   │       ├── bs_roformer/    # BS-Roformer & Mel-Band-Roformer
│   │       ├── bandit/         # Band-Split RNN v1
│   │       ├── bandit_v2/      # Band-Split RNN v2
│   │       ├── scnet/          # SCNet
│   │       ├── scnet_unofficial/ # SCNet unofficial variant
│   │       ├── demucs4ht.py    # Demucs v4 hybrid transformer
│   │       ├── mdx23c_tfc_tdf_v3.py # MDX23C TFC-TDF
│   │       ├── segm_models.py  # Segmentation models
│   │       ├── torchseg_models.py # Torch segmentation models
│   │       ├── upernet_swin_transformers.py # UperNet Swin
│   │       ├── ensemble.py     # Model ensembling
│   │       ├── inference.py    # Inference engine
│   │       └── utils.py        # Audio processing utilities
│   │
│   ├── rvc/                    # RVC voice conversion
│   │   ├── __init__.py
│   │   ├── converter.py        # High-level RVC conversion wrapper
│   │   └── engine/             # Applio RVC inference engine
│   │       ├── configs/config.py
│   │       ├── infer/infer.py
│   │       ├── infer/pipeline.py
│   │       └── lib/
│   │           ├── algorithm/  # Neural network architectures (11 modules)
│   │           ├── predictors/ # F0 extractors (CREPE, FCPE, RMVPE)
│   │           ├── utils.py
│   │           └── tools/      # Model download, TTS, audio split
│   │
│   ├── tts/                    # Text-to-Speech (Edge TTS + RVC)
│   │   ├── __init__.py
│   │   └── synthesis.py        # TTS generation + optional RVC conversion
│   │
│   ├── whisper/                # Whisper transcription
│   │   ├── __init__.py
│   │   ├── transcriber.py      # High-level transcription wrapper
│   │   └── diarization/        # Speaker diarization engine
│   │       ├── whisper.py      # Whisper model wrapper
│   │       ├── speechbrain.py  # SpeechBrain integration
│   │       ├── ECAPA_TDNN.py   # Speaker embedding model
│   │       ├── encoder.py      # Speaker encoder
│   │       ├── features.py     # Audio feature extraction
│   │       ├── segment.py      # Voice activity segmentation
│   │       ├── embedding.py    # Speaker embeddings
│   │       ├── audio.py        # Audio preprocessing
│   │       └── parameter_transfer.py # Model weight transfer
│   │
│   └── tools/                  # Shared utilities
│       ├── __init__.py
│       ├── variables.py        # Model definitions, FP16 config
│       ├── config.py           # Application configuration management
│       ├── file_utils.py       # File search, model lookup, downloads
│       ├── audio_utils.py      # Audio effects (Pedalboard), merging (pydub)
│       ├── downloader.py       # Model & music download orchestration
│       ├── gdown.py            # Google Drive download handler
│       ├── hf.py               # HuggingFace download handler
│       ├── mediafire.py        # MediaFire download handler
│       └── logger.py           # Logging utilities
│
├── tabs/                       # Gradio UI tabs
│   ├── full_inference.py       # Voice Conversion tab
│   ├── tts_inference.py        # TTS Generation tab
│   ├── whisper_transcription.py # Transcription tab
│   ├── download_music.py       # Download Music tab
│   ├── download_model.py       # Download Model tab
│   └── settings.py             # Settings tab
│
├── assets/                     # Static assets
│   ├── themes/                 # Gradio themes
│   ├── i18n/                   # Internationalization (8 languages)
│   ├── config.json             # User settings
│   ├── logo.ico                # Favicon
│   └── colab.ipynb             # Google Colab notebook
│
├── docs/                       # Documentation
├── tests/                      # Test suite
├── requirements.txt
├── run.sh / run.bat            # Launch scripts
└── update.sh / update.bat      # Update scripts

Quick Start

# Install dependencies
pip install -r requirements.txt

# Start the WebUI
python app.py

# With custom options
python app.py --port 8080 --share --open

CLI Usage

List available models

python cli.py list-models

Download a model

python cli.py download-model --link https://huggingface.co/username/model

Download music from YouTube

python cli.py download-music --link https://youtube.com/watch?v=...

Basic audio conversion

python cli.py convert --model-path /path/to/model.pth --input-audio song.mp3

Full conversion with all options

python cli.py convert --model-path model.pth --index-path index.pth \
  --input-audio song.mp3 --pitch 12 --reverb --denoise \
  --vocal-model "Mel-Roformer by KimberleyJSN" \
  --export-format-final mp3
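
To run the same conversion over several songs, the command can be assembled from Python. The following is a minimal sketch, not part of the project: `build_convert_command` and `build_batch` are hypothetical helper names, and only flags that appear in the commands above are used.

```python
import sys
from typing import List, Optional


def build_convert_command(
    model_path: str,
    input_audio: str,
    index_path: Optional[str] = None,
    pitch: Optional[int] = None,
    reverb: bool = False,
    denoise: bool = False,
    export_format: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `cli.py convert` (hypothetical helper).

    Only flags shown in this README are emitted; optional ones are
    skipped when left at their defaults.
    """
    cmd = [
        sys.executable, "cli.py", "convert",
        "--model-path", model_path,
        "--input-audio", input_audio,
    ]
    if index_path is not None:
        cmd += ["--index-path", index_path]
    if pitch is not None:
        cmd += ["--pitch", str(pitch)]
    if reverb:
        cmd.append("--reverb")
    if denoise:
        cmd.append("--denoise")
    if export_format is not None:
        cmd += ["--export-format-final", export_format]
    return cmd


def build_batch(model_path: str, songs: List[str], **options) -> List[List[str]]:
    """Return one command per input file, sharing the same options."""
    return [build_convert_command(model_path, song, **options) for song in songs]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.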

Add effects

python cli.py add-effects input.wav --room-size 0.8 --wet 0.4 --output-path output.wav

Merge audio files

python cli.py merge \
  --vocals vocals.flac \
  --instrumental instrumental.flac \
  --backing-vocals backing.flac \
  --format mp3

Module Overview

main/uvr/ — Audio Separation

Handles all audio source separation tasks using deep learning models, including the Mel-Roformer, BS-Roformer, MDX23C, Demucs v4, Band-Split RNN (Bandit), and SCNet architectures:

  • Vocal/instrumental separation
  • Karaoke (lead + backing vocal) separation
  • Dereverb processing
  • Deecho processing
  • Denoise processing
  • Model ensembling for improved separation quality
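
As an illustration of the ensembling idea, the sketch below averages the same stem as produced by several separation models. It is a simplified assumption, not the code in `ensemble.py`, which may combine outputs differently; `ensemble_average` is a hypothetical name and the waveforms are plain lists of mono samples.

```python
from typing import List, Optional, Sequence


def ensemble_average(
    stems: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted sample-wise average of equally long mono waveforms.

    Each entry of `stems` is the same stem (for example, vocals) as
    predicted by a different model. Hypothetical helper for illustration.
    """
    if not stems:
        raise ValueError("need at least one stem")
    length = len(stems[0])
    if any(len(s) != length for s in stems):
        raise ValueError("all stems must have the same number of samples")
    if weights is None:
        weights = [1.0] * len(stems)  # plain mean when no weights are given
    if len(weights) != len(stems):
        raise ValueError("one weight per stem is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * s[i] for w, s in zip(weights, stems)) / total
        for i in range(length)
    ]
```

Averaging tends to cancel errors that the models do not share, which is why a weaker model can still contribute with a small weight.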

main/rvc/ — Voice Conversion

Wraps the Applio RVC inference engine for high-quality voice conversion with support for multiple pitch extractors (CREPE, FCPE, RMVPE), embedder models, and various export formats. The engine includes a full pipeline architecture with attention-based generators, discriminators, and synthesizer modules.

main/tts/ — Text-to-Speech

Microsoft Edge TTS integration with 400+ voices across 11 languages, with optional RVC voice conversion on the generated audio for creating AI covers from text input alone.

main/whisper/ — Transcription & Diarization

OpenAI Whisper-based speech-to-text with word-level timestamps, multi-language support, and speaker diarization powered by SpeechBrain and ECAPA-TDNN speaker embeddings. Supports SRT, VTT, and JSON export formats.
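
For reference, SRT export amounts to numbering each segment and writing its time range as `HH:MM:SS,mmm`. The sketch below assumes Whisper-style segment dictionaries with `start`, `end`, and `text` keys; the optional `speaker` key and both function names are hypothetical and do not describe the project's own exporter.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as SRT cues separated by blank lines."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        # Prefix the speaker label when diarization provided one (assumed key).
        if seg.get("speaker"):
            text = f"[{seg['speaker']}] {text}"
        blocks.append(
            f"{number}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)
```

VTT output differs mainly in using a `WEBVTT` header line and a period instead of a comma before the milliseconds.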

main/tools/ — Utilities

Shared helpers used across all modules:

  • variables: Model definitions, FP16 hardware detection
  • config: Application configuration management
  • file_utils: File search, model metadata lookup, file downloads
  • audio_utils: Reverb effects (Pedalboard), audio merging (pydub), FP16 config patching
  • downloader: RVC model download and YouTube music download orchestration
  • gdown / hf / mediafire: Platform-specific download handlers
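
One way to route a link to the matching platform handler is to inspect its hostname. This is a sketch under assumptions: the `HANDLERS` mapping and `pick_handler` are hypothetical, the handler names simply mirror the module names listed above, and the project's actual dispatch logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host domain to the handler module named above.
HANDLERS = {
    "drive.google.com": "gdown",
    "huggingface.co": "hf",
    "mediafire.com": "mediafire",
}


def pick_handler(link: str) -> str:
    """Return the handler name for a download link, based on its host."""
    host = (urlparse(link).hostname or "").lower()
    for domain, name in HANDLERS.items():
        # Accept the bare domain and any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return name
    raise ValueError(f"no download handler for host: {host!r}")
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives when a platform name appears in the path or query string.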

main/core.py — Pipeline Orchestrator

The full_inference_program() function coordinates the complete audio processing pipeline by calling into the specialized sub-modules in sequence: vocal separation → karaoke separation → dereverb → deecho → denoise → RVC conversion → backing vocals → reverb → pitch adjust → merge.
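
The sequencing can be pictured as an ordered list of stages, each of which may be switched off. The sketch below is a simplification rather than the actual `full_inference_program()`: it treats every stage as a function from one audio path to another, whereas the real merge step combines several stems. `run_pipeline`, `placeholder`, and the stage names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[str], str]]

# Stage order as listed in the description above.
STAGE_ORDER = [
    "vocal_separation", "karaoke_separation", "dereverb", "deecho",
    "denoise", "rvc_conversion", "backing_vocals", "reverb",
    "pitch_adjust", "merge",
]


def placeholder(name: str) -> Callable[[str], str]:
    """Stand-in for a real stage: tags the path with the stage name."""
    def step(path: str) -> str:
        return f"{path}|{name}"
    return step


def run_pipeline(
    audio: str, stages: List[Stage], enabled: Dict[str, bool]
) -> Tuple[str, List[str]]:
    """Run enabled stages in order, feeding each output to the next."""
    executed = []
    for name, step in stages:
        if not enabled.get(name, True):  # stages default to enabled
            continue
        audio = step(audio)
        executed.append(name)
    return audio, executed
```

For example, `run_pipeline("song.wav", [(n, placeholder(n)) for n in STAGE_ORDER], {"dereverb": False})` skips dereverb and runs the remaining stages in the listed order.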

Credits

👑 Project Team

Role | Member | Description
👑 Base Project Owner | ShiromiyaG | Owner of RVC-AI-Cover-Maker-UI, which this project is based on
🔧 Base Project Contributor | Eddycrack864 | Contributor to RVC-AI-Cover-Maker-UI
🧩 Fork Owner | BF667-IDLE | Hyper RVC fork owner & maintainer
🧪 Colab UI | Nick088 | Start UI cells in Colab & Kaggle, local setup guide
🧪 QA Testing | FullmatheusBallZ | Google Colab testing & quality assurance

🏗️ Core Projects & Libraries

Project | Author | Role
RVC-AI-Cover-Maker-UI | ShiromiyaG | Original UI framework & cover pipeline design
Applio | IAHispano | RVC inference engine, pitch extraction & model management
Audio Separator | beveradb | Python audio source separation wrapping UVR models
UVR GUI | Anjok07 | Vocal removal with pretrained model weights
ZFTurbo | ZFTurbo | BS-Roformer, Mel-Band-Roformer, SCNet, MDX23C, Bandit, Demucs
AICoverGen | SociallyIneptWeeb | AI cover generation pipeline & processing concepts
Vietnamese-RVC | PhamHuynhAnh16 | Base RVC library code, additional F0 predictors & method fixes

🧠 AI Models & Frameworks

Library | Author | Purpose
Whisper | OpenAI | Speech recognition & transcription
SpeechBrain | SpeechBrain Team | Speaker diarization & ECAPA-TDNN embeddings
PyTorch | Meta AI | Deep learning framework for all neural networks
Transformers | HuggingFace | Model loading & pretrained model utilities
NumPy | NumPy Team | Numerical computing & array operations
ONNX Runtime | Microsoft | High-performance model inference

🎵 Voice & Pitch Extraction

Library | Author | Purpose
Edge TTS | rany2 | 400+ voices in 11 languages via Microsoft Edge
CREPE | Max Morrison | Neural pitch estimation (F0 extraction)
RMVPE | OpenVPI | Robust vocal pitch estimation
FCPE | SCToolsystem | Fundamental frequency contour extraction
Faiss | Meta Research | Voice embedding similarity search & retrieval
TorchCREPE | Max Morrison | PyTorch-native CREPE implementation

🎧 Audio Processing

Library | Author | Purpose
Pedalboard | Spotify | Studio-quality reverb, EQ & audio effects
pydub | James Robert | Audio manipulation, format conversion & merging
librosa | librosa Team | Music & audio analysis, feature extraction
ffmpeg | FFmpeg Project | Audio/video encoding, decoding & processing
SoundFile | Bastian Bechtold | Audio file I/O via libsndfile
SciPy | SciPy Team | Signal processing & scientific computing
📥 Download & Network

Library | Author | Purpose
yt-dlp | yt-dlp contributors | YouTube & 1000+ site audio/video downloader
HuggingFace Hub | HuggingFace | Model & dataset hosting for pretrained RVC models
gdown | Kentaro Wada | Google Drive file downloader
Requests | Kenneth Reitz | HTTP library for Python
tqdm | Casper da Costa-Luis | Progress bars for downloads & processing

🖼️ UI & Design

Library | Author | Purpose
Gradio | HuggingFace | Web UI framework with tabs, sliders & file uploads
Python | Python Software Foundation | Core language runtime
Freepik | Freepik | Cyber-themed cover image for the WebUI

Built with ❤️ by the Hyper-RVC community · Open Source under MIT License
