Paralinguistic feature extraction. A stethoscope, not a classifier. Emits dimensional, human-legible features per utterance so you can reason about how someone is speaking — not a diagnosis of who they are.
Given an audio file, hearing returns a per-utterance list of acoustic features: fundamental frequency, jitter, shimmer, HNR (harmonics-to-noise ratio), pause-before, duration, intensity, pitch range. Each utterance is segmented via Silero-VAD, then measured via Parselmouth (Praat bindings).
{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, "f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, "shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}
{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, "f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, "shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}
Two utterances. The first was higher-pitched and rougher, with lower HNR: the speaker was nervous. The second was lower and cleaner: they found their footing. hearing doesn't tell you that. It gives you the numbers so your reasoning layer can.
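As a rough illustration of what a consuming layer might do with records like the two above, the sketch below parses them and reports how each feature changed from one utterance to the next. The `feature_deltas` helper is hypothetical and not part of hearing; only the field names and values come from the sample output.

```python
import json

# The two sample records from above, as they would appear in a features JSONL file.
SAMPLE_JSONL = """\
{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, "f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, "shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}
{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, "f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, "shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}
"""

# Fields worth comparing between utterances (timing fields are left out).
COMPARED = (
    "f0_mean_hz", "f0_range_hz", "jitter_local",
    "shimmer_local", "hnr_db", "intensity_mean_db",
)


def feature_deltas(jsonl_text, fields=COMPARED):
    """Return one dict per consecutive utterance pair: field -> (later - earlier)."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    deltas = []
    for prev, curr in zip(records, records[1:]):
        deltas.append({name: round(curr[name] - prev[name], 3) for name in fields})
    return deltas


if __name__ == "__main__":
    for i, delta in enumerate(feature_deltas(SAMPLE_JSONL), start=1):
        print(f"utterance {i - 1} -> {i}: {delta}")
```

The output is still just numbers (pitch down, jitter down, HNR up); deciding what that means stays with the reasoning layer.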
Most voice-emotion tools ship with a classifier: "87% sad." Those classifiers are trained on narrow data, degrade badly on individual variation, and encode assumptions that aren't transparent. When used inside an AI companion, they can quietly shape behavior in ways neither the developer nor the user can audit.
hearing makes a different bet: emit clean numbers, let the consuming layer reason about them. If your reasoning layer wants to say "voice sounded strained", it can do so by reading the features and citing them ("jitter 0.084 on utterance 14"). The chain is legible.
This also makes per-user baselines trivial — which is the only way paralinguistic signals become trustworthy across real people, given how much individual variation there is in voice.
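A minimal sketch of what "reading the features and citing them" could look like in a consuming layer. The `cite_extreme` helper and its output format are assumptions for illustration; hearing itself only supplies the per-utterance records.

```python
def cite_extreme(utterances, feature):
    """Return a citation string for the utterance with the highest `feature` value.

    `utterances` is a list of dicts shaped like hearing's JSONL records.
    Utterance indices are zero-based positions in that list.
    """
    index, record = max(enumerate(utterances), key=lambda pair: pair[1][feature])
    return f"{feature} {record[feature]:.3f} on utterance {index}"


if __name__ == "__main__":
    utterances = [
        {"jitter_local": 0.031, "hnr_db": 5.6},
        {"jitter_local": 0.023, "hnr_db": 11.2},
    ]
    # Any claim built on this string can be checked against the raw record.
    print(cite_extreme(utterances, "jitter_local"))
```

Because the claim carries the feature name, value, and utterance index, anyone auditing the output can look up the same record and confirm it.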
Single file (plus the optional observe.py interpreter) and a handful of pip dependencies:
curl -o hearing.py https://raw.githubusercontent.com/kithfoss/hearing/main/hearing.py
curl -o observe.py https://raw.githubusercontent.com/kithfoss/hearing/main/observe.py
pip install praat-parselmouth silero-vad torch soundfile numpy
Torch is the heaviest dep; you can use CPU-only torch on modest hardware.
python3 hearing.py recording.wav --out features.jsonl
# writes one JSON object per detected utterance

from hearing import extract_hearing
features = extract_hearing("clip.wav")
for utt in features:
    print(f"{utt.start_sec:.2f}s f0={utt.f0_mean_hz:.0f}Hz jit={utt.jitter_local:.3f}")
observe.py is a thin, rule-based interpreter that emits descriptive observations from a feature JSONL file. It contains no classifiers, only claims like "voice came out steadier on that one, less edge, cleaner tone" that are tethered to specific feature values.
from observe import observe
obs = observe("features.jsonl")
print(obs.sentence) # "Your voice moved around on that one — broke up on a word partway through."
print(obs.feature, obs.utterance_index, obs.value)  # citation for verification
Rules are template-driven so you can tune or extend them. They produce one-line descriptions like "a longer pause than the others" or "the pitch moved less than usual." Every observation cites the specific feature and utterance that triggered it, so the chain is auditable end to end.
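To make "template-driven" concrete, here is one way such a rule could be structured. This is a sketch under assumptions, not the actual contents of observe.py: the `Rule` and `Observation` classes, the `apply_rules` function, and the 1.5x threshold are all invented for illustration. Only the idea comes from the description above: a sentence template tied to one feature, with the triggering utterance and value carried along as a citation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Observation:
    sentence: str          # human-readable description
    feature: str           # which feature triggered it
    utterance_index: int   # zero-based position in the clip
    value: float           # the value that triggered it


@dataclass
class Rule:
    feature: str
    template: str
    # fires(value, clip_mean) -> True when the template applies
    fires: Callable[[float, float], bool]


# Hypothetical rule: a pause well above the clip's average pause.
RULES = [
    Rule(
        feature="pause_before_sec",
        template="a longer pause than the others",
        fires=lambda value, clip_mean: clip_mean > 0 and value > 1.5 * clip_mean,
    ),
]


def apply_rules(utterances, rules=RULES):
    """Check every rule against every utterance, relative to the clip mean."""
    observations = []
    for rule in rules:
        values = [utt[rule.feature] for utt in utterances]
        clip_mean = mean(values)
        for index, value in enumerate(values):
            if rule.fires(value, clip_mean):
                observations.append(
                    Observation(rule.template, rule.feature, index, value)
                )
    return observations


if __name__ == "__main__":
    clip = [{"pause_before_sec": p} for p in (0.0, 1.0, 0.4, 0.3)]
    for obs in apply_rules(clip):
        print(obs.sentence, "|", obs.feature, obs.utterance_index, obs.value)
```

Tuning a rule then means editing a threshold or a template string, and adding one means appending another `Rule` entry.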
observe.update_baseline(user_id, features_jsonl) accumulates Welford running-statistics (mean + variance) for the user's own speech over time. Once a user has 3+ clips, observations become baseline-relative ("higher than your usual") instead of clip-relative ("higher than the rest of this clip"). This is where the real grip comes from — individual variation dwarfs clip-to-clip variation.
Baselines stored at users/<id>/hearing_baseline.json. Tiny, human-readable.
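Welford's algorithm updates a running mean and variance one sample at a time, so a baseline can grow without keeping past clips around. The sketch below shows the standard update step and a JSON round-trip. The `RunningStat` class, the `fold_into_baseline` helper, and the JSON layout are assumptions for illustration, not the format observe.py actually writes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunningStat:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Welford's single-pass update for mean and variance."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until there are two samples.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def fold_into_baseline(baseline, utterances, features):
    """Fold each utterance's feature values into the per-feature running stats."""
    for name in features:
        stat = baseline.setdefault(name, RunningStat())
        for utt in utterances:
            stat.update(utt[name])
    return baseline


def to_json(baseline):
    """Serialize only the stored fields (count, mean, m2) per feature."""
    return json.dumps({name: asdict(stat) for name, stat in baseline.items()}, indent=2)


def from_json(text):
    return {name: RunningStat(**fields) for name, fields in json.loads(text).items()}


if __name__ == "__main__":
    clip = [{"f0_mean_hz": 123.0}, {"f0_mean_hz": 108.0}]
    baseline = fold_into_baseline({}, clip, ("f0_mean_hz",))
    print(to_json(baseline))
```

Storing `count`, `mean`, and `m2` is enough to resume updating later, which keeps the stored file small and readable.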
Good for:
- Voice-interactive AI companions that want to shape their response style to the speaker's state without labeling them.
- Real-time voice coaching where you need to show numbers to the speaker (breath pauses, pitch stability) as feedback.
- Longitudinal analysis where you care about an individual's trajectory (e.g., clinical speech research) and off-the-shelf classifiers throw out the signal you want.
Not for:
- Emotion recognition / affective classification. By design.
- Speaker ID. Use a speaker-identification or diarization tool.
- Transcription. Use Whisper.
- Anything where you need a single scalar "mood" output. Go use one of the many tools that do that.
Built as the measurement layer under a companion AI's response-pacing system (where the signal is routed to prompt-level style bias, not to a classifier). The "dimensional, human-legible, no classifier" stance came from working with vulnerable users where a mis-labeled "sad" output would be more harmful than useful.
The matching reasoning-layer approach is written up at: (pacing module — coming soon)
MIT.