Speech-to-text transcription. Ambient sound is recorded and streamed to an Automatic Speech Recognition (ASR) model that transcribes the audio to text.
- Works completely offline and runs locally.
- No APIs or other services needed.
- Capable of transcribing both pre-recorded audio files (e.g., a .flac file) and running a live stream of audio to perform real-time ASR (see the sketch after this list).
- Multiple Hugging Face ASR model options available.
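A minimal sketch of the file-transcription path, assuming the Hugging Face transformers package and a hypothetical local file "sample.flac" (illustrative only, not necessarily how this repo's scripts are invoked):

```python
# Sketch: transcribe a pre-recorded audio file with a Hugging Face pipeline.
# "sample.flac" is a hypothetical file; decoding it requires ffmpeg.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
result = asr("sample.flac")
print(result["text"])
```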
On Windows, no additional setup is required beyond connecting a recording device. On Linux (Raspberry Pi, etc.), some additional setup is required to connect and specify the recording HAT. For details on setting up recording devices on the RPi, see this wiki.
Other versions of Python may work but are not guaranteed.
Create a virtual environment and install from "requirements.txt".
pip install -r requirements.txt
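For example, a typical setup (the activation command shown is for Linux; on Windows use venv\Scripts\activate):

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt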
- Linux (Raspberry Pi OS Bookworm)
- Performance on other Linux distributions has not been tested.
- Windows 11 (see note below about minor differences)
Two packages need to be installed separately for full functionality on Windows; however, most of the tools will work without them. These libraries haven't been fully tested with this repo and may or may not work after installation.
- cudnn_ops_infer64_8.dll
- Required only to run the "Systran/faster-whisper-tiny.en" model.
- ffmpeg
- This only affects transcription of recorded audio files (e.g., .flac, .wav). It doesn't affect the continuous ASR ability, which works by streaming raw audio data.
The table below lists the combinations of hardware and models that have been tested. These tests were done using "continuous_asr.py" to determine whether the model works. A model that was too large and caused freezing or other problems has no Transcription Time listed.
See Hugging Face for details on the models.
| Hardware | OS | GPU | Model | Transcription Time (single spoken word) |
|---|---|---|---|---|
| Raspberry Pi 4B (2 GB) | Raspberry Pi OS Bookworm | - | facebook/wav2vec2-base-960h | - |
| Raspberry Pi 4B (2 GB) | Raspberry Pi OS Bookworm | - | openai/whisper-tiny.en | ~4 sec |
| Raspberry Pi 4B (2 GB) | Raspberry Pi OS Bookworm | - | Systran/faster-whisper-tiny.en | ~3 sec |
| Desktop PC | Windows 10 | 3060 Ti | facebook/wav2vec2-base-960h | ~0.15 sec |
| Desktop PC | Windows 10 | 3060 Ti | openai/whisper-tiny.en | ~0.55 sec |
All performance benchmarks were done using the recording devices below.
| Hardware | Recording Device |
|---|---|
| Raspberry Pi 4B (2 GB) | Seeed Studio 2-mic HAT |
| Desktop PC | Bluetooth headphones with built-in mic |
If you are confident that your system's default recording device is the one you want to use, skip this step. If you're unsure which device is the default, list all current recording devices by index and make a note of the index of the device you want to use. This index is then passed to "continuous_asr" as "input_device_index".
python list_recording_devices.py
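A minimal sketch of what a device-listing script might look like, assuming PyAudio for audio capture (the repo's actual implementation may differ):

```python
# Sketch: print the index and name of every recording-capable device.
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info["maxInputChannels"] > 0:  # input (recording) devices only
        print(f"{i}: {info['name']}")
pa.terminate()
```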
Continually streams live captured audio to the model and transcribes it in real time.
python continuous_asr.py --input_device_index <i> --model_name <name of model>
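For example, with a hypothetical device index of 1 and one of the tested models:

python continuous_asr.py --input_device_index 1 --model_name openai/whisper-tiny.en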
A decent run-through of the capabilities of a couple of different models and the main audio manipulation class.
python test_all.py
- Record very short samples of sound (~0.1 seconds in length).
- Check whether each sample exceeds an adjustable loudness threshold.
- Continuously build a buffer of sound samples that contain sound above the threshold.
- Once a sample does NOT exceed the threshold, assume this is the end of the phrase or word and stop building the buffer.
- Send the whole buffer to the model directly to be transcribed into text.
- Go back to listening for sound.
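A minimal sketch of this listen/buffer/transcribe loop, assuming PyAudio for capture and a Hugging Face pipeline for transcription (the threshold value and chunk size below are illustrative, not the repo's actual parameters):

```python
# Sketch: record ~0.1 s chunks, buffer while they exceed a loudness
# threshold, then transcribe the whole buffer once the sound stops.
import numpy as np
import pyaudio
from transformers import pipeline

RATE = 16000             # sample rate expected by most ASR models
CHUNK = int(RATE * 0.1)  # ~0.1 second samples
THRESHOLD = 500          # adjustable loudness limit (RMS of int16 audio)

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

buffer = []
while True:
    sample = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    rms = np.sqrt(np.mean(sample.astype(np.float64) ** 2))
    if rms > THRESHOLD:
        buffer.append(sample)  # sound continues: keep building the buffer
    elif buffer:
        # Threshold not exceeded: treat it as the end of the phrase,
        # send the buffered audio to the model, then reset.
        audio = np.concatenate(buffer).astype(np.float32) / 32768.0
        print(asr({"raw": audio, "sampling_rate": RATE})["text"])
        buffer = []
```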