PyVoice

A Python-based speech processing tool with ASR (Speech-to-Text) and TTS (Text-to-Speech) capabilities, powered by ONNX Runtime.

Features

🎙 Speech-to-Text (ASR) - Paraformer model for Chinese/English recognition
🔊 Text-to-Speech (TTS) - MeloTTS for natural Chinese/English synthesis
🚀 FastAPI Server - RESTful API with Swagger documentation
📦 Python SDK - Easy integration into other projects
🎵 Multi-format Audio - Supports WAV, MP3, FLAC, OGG (via FFmpeg)
⚡ Optimized Performance - NumPy-vectorized audio preprocessing, with speedups of roughly 10-100x (see Performance)
🖥 CLI Tool - Simple command-line interface

Quick Start

Installation

```bash
# Clone and setup
git clone <repo-url>
cd PyVoice

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Install ffmpeg (for MP3/FLAC/OGG support)
brew install ffmpeg  # macOS
# or: sudo apt install ffmpeg  # Debian/Ubuntu
```

Download Models

```bash
git lfs install
git clone https://huggingface.co/getcharzp/go-speech ./temp_models
mv ./temp_models/melo_weights ./melo_weights
mv ./temp_models/paraformer_weights ./paraformer_weights
rm -rf ./temp_models
```
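
Before starting the CLI or the server, it can help to confirm that the two model directories from the steps above are in place and that FFmpeg is on the PATH. The following is a minimal sketch: the helper `check_setup` is hypothetical and not part of PyVoice, and it only checks that the directories exist, not that their contents are complete.

```python
import shutil
from pathlib import Path

# Directory names taken from the download steps above.
MODEL_DIRS = ("melo_weights", "paraformer_weights")


def check_setup(root="."):
    """Return a list of problems found; an empty list means nothing obvious is missing.

    Hypothetical helper for illustration, not part of PyVoice.
    """
    problems = []
    for name in MODEL_DIRS:
        if not (Path(root) / name).is_dir():
            problems.append(f"missing model directory: {name}")
    # FFmpeg is only needed for non-WAV input such as MP3 or FLAC.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (needed for MP3/FLAC)")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

Run from the repository root, the script prints one line per problem and nothing when both directories and FFmpeg are found.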

Usage

1. CLI Tool

```bash
# Text-to-Speech
python3 main.py tts "Hello, world!" --out output.wav

# Speech-to-Text
python3 main.py asr audio.wav
```

2. FastAPI Server

```bash
python3 api.py
```

Then visit:

Web UI: http://localhost:8000
API Docs: http://localhost:8000/docs

API Endpoints:

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/asr` | Upload audio → Get text |
| POST | `/tts` | Send text → Get audio (WAV/MP3) |
| GET | `/records` | Get history records |
| GET | `/audio/{id}` | Download audio file |
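
As a rough illustration of calling the `/tts` endpoint from a script, the sketch below uses only the Python standard library. The request and response shapes are assumptions, not taken from the PyVoice source: it assumes the endpoint accepts a JSON body with a `text` field and returns raw audio bytes. The Swagger page at `/docs` shows the actual schema; the function names here are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address listed above


def build_tts_request(text, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /tts endpoint.

    ASSUMPTION: the body is JSON with a "text" field. Check /docs for the real schema.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_tts_audio(text, out_path, base_url=BASE_URL):
    """Send the request and write the response body to out_path.

    ASSUMPTION: the response body is the audio file itself.
    """
    with urllib.request.urlopen(build_tts_request(text, base_url)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `save_tts_audio("Hello, world!", "output.wav")` would write the returned bytes to disk, provided the assumptions above match the real API.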

3. Python SDK

```python
from pyvoice import ASR, TTS

# Speech Recognition
asr = ASR()
text = asr.recognize("audio.mp3")  # Supports WAV, MP3, FLAC, etc.

# Text-to-Speech
tts = TTS()
tts.synthesize_to_file("Hello world", "output.wav")
```
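
Building on the SDK calls above, a common next step is transcribing a whole folder. The sketch below is one possible way to do that; `transcribe_folder` is a hypothetical helper, not part of PyVoice. It takes the recognizer as a plain callable, so it makes no assumptions about the SDK beyond `recognize(path)` returning text, as in the example above.

```python
from pathlib import Path

# Extensions listed under Features; adjust as needed.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg"}


def transcribe_folder(recognize, folder):
    """Apply recognize(path) to every audio file in folder.

    `recognize` is any callable that takes a file path and returns text,
    for example ASR().recognize. Hypothetical helper, not part of PyVoice.
    Returns a dict mapping file name to transcript.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in AUDIO_EXTS:
            results[path.name] = recognize(str(path))
    return results
```

A call such as `transcribe_folder(ASR().recognize, "recordings/")` reuses a single `ASR` instance for all files, which presumably avoids reloading the model for each one.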

Project Structure

```
PyVoice/
├── api.py              # FastAPI server
├── pyvoice.py          # SDK interface
├── main.py             # CLI tool
├── asr/                # ASR engine (Paraformer)
├── tts/                # TTS engine (MeloTTS)
├── internal/           # Audio processing utilities
├── storage/            # API uploads/outputs (auto-created)
├── paraformer_weights/ # ASR model files
└── melo_weights/       # TTS model files
```

Tech Stack

Runtime: Python 3.8+, ONNX Runtime
API: FastAPI, Uvicorn
Audio: NumPy, SciPy, FFmpeg
NLP: jieba (Chinese segmentation)
Storage: SQLite

Performance

Audio processing is optimized with NumPy vectorization. Approximate speedups per operation:

| Operation | Optimization | Speedup |
|-----------|--------------|---------|
| Pre-emphasis | Vectorized array ops | ~50x |
| FFT | Batch `np.fft.fft` | ~20x |
| Mel filterbank | Matrix multiplication | ~100x |
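
To make the table concrete, here is a minimal sketch of what two of these optimizations can look like. It is not the PyVoice implementation: the function names are hypothetical, and the pre-emphasis coefficient 0.97 is a commonly used value assumed for illustration. The loop version does one Python-level iteration per sample, while the vectorized version expresses the same filter as a single array operation. The filterbank step turns a per-frame, per-filter loop into one matrix product.

```python
import numpy as np


def preemphasis_loop(x, coeff=0.97):
    """Reference version: one Python-level iteration per sample."""
    y = np.empty(len(x), dtype=np.float64)
    y[0] = x[0]
    for n in range(1, len(x)):
        y[n] = x[n] - coeff * x[n - 1]
    return y


def preemphasis_vectorized(x, coeff=0.97):
    """Same filter, y[n] = x[n] - coeff * x[n-1], as one array operation."""
    x = np.asarray(x, dtype=np.float64)
    return np.concatenate(([x[0]], x[1:] - coeff * x[:-1]))


def apply_filterbank(power_spec, filterbank):
    """Compute mel energies for all frames at once.

    power_spec: (n_frames, n_bins); filterbank: (n_mels, n_bins).
    Returns an array of shape (n_frames, n_mels).
    """
    return power_spec @ filterbank.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16000)
    # Both versions should agree up to floating-point error.
    assert np.allclose(preemphasis_loop(signal), preemphasis_vectorized(signal))
```

The actual speedup depends on signal length and hardware, so the figures in the table should be read as approximate.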

License

MIT License

Acknowledgments

Based on getcharzp/go-speech, refactored into a Python SDK with FastAPI integration.
