hearing

Paralinguistic feature extraction. A stethoscope, not a classifier. Emits dimensional, human-legible features per utterance so you can reason about how someone is speaking — not a diagnosis of who they are.

What it does

Given an audio file, hearing returns a per-utterance list of acoustic features: fundamental frequency, jitter, shimmer, HNR (harmonics-to-noise ratio), pause-before, duration, intensity, pitch range. The audio is segmented into utterances with Silero-VAD, then each utterance is measured with Parselmouth (Praat bindings).

{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, "f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, "shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}
{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, "f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, "shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}

Two utterances. The first was higher-pitched and rougher (more jitter and shimmer), with lower HNR; a reasoning layer might read that as nervousness. The second was lower and cleaner, as if the speaker had found their footing. hearing doesn't tell you that. It gives you the numbers so your reasoning layer can.
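As a rough illustration of what "giving you the numbers" can look like downstream, here is a minimal sketch that parses the two sample lines above and prints how each feature moved between them. It relies only on the field names shown in the sample output; the `delta` helper is hypothetical and not part of hearing.

```python
import json

# The two sample utterances from above, as they would appear in a JSONL file.
lines = [
    '{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, '
    '"f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, '
    '"shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}',
    '{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, '
    '"f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, '
    '"shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}',
]
first, second = (json.loads(line) for line in lines)


def delta(a, b, keys):
    """Return b[key] - a[key] for each key, rounded to 3 decimals for display."""
    return {key: round(b[key] - a[key], 3) for key in keys}


changes = delta(
    first, second, ["f0_mean_hz", "jitter_local", "shimmer_local", "hnr_db"]
)
for name, diff in changes.items():
    # A signed number per feature; interpretation is left to the caller.
    print(f"{name}: {diff:+}")
```

The output is just signed differences. Whether a drop in jitter alongside a rise in HNR means "steadier" is a judgment for the consuming layer to make.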

Why "no classifier"?

Most voice-emotion tools ship with a classifier: "87% sad." Those classifiers are trained on narrow data, degrade badly on individual variation, and encode assumptions that aren't transparent. When used inside an AI companion, they can quietly shape behavior in ways neither the developer nor the user can audit.

hearing makes a different bet: emit clean numbers, let the consuming layer reason about them. If your reasoning layer wants to say "voice sounded strained", it can do so by reading the features and citing them ("jitter 0.084 on utterance 14"). The chain is legible.

This also makes per-user baselines trivial — which is the only way paralinguistic signals become trustworthy across real people, given how much individual variation there is in voice.

Install

A single file (two if you want the optional observation layer), plus a handful of dependencies:

curl -o hearing.py https://raw.githubusercontent.com/kithfoss/hearing/main/hearing.py
curl -o observe.py https://raw.githubusercontent.com/kithfoss/hearing/main/observe.py

pip install praat-parselmouth silero-vad torch soundfile numpy

Torch is the heaviest dep; you can use CPU-only torch on modest hardware.

Usage

CLI

python3 hearing.py recording.wav --out features.jsonl
# writes one JSON object per detected utterance
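
Since the output is one JSON object per line, it can be read back with the standard library alone. The sketch below is one way to do that; `parse_jsonl` is a hypothetical helper, not something hearing.py is known to export.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Usage, assuming features.jsonl was produced by the CLI call above:
#
#     with open("features.jsonl", encoding="utf-8") as fh:
#         features = parse_jsonl(fh)
#     print(len(features), "utterances")
```

Taking an iterable of lines rather than a path means the same function works on an open file, a list of strings, or a stream.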

Library

from hearing import extract_hearing

features = extract_hearing("clip.wav")
for utt in features:
    print(f"{utt.start_sec:.2f}s  f0={utt.f0_mean_hz:.0f}Hz  jit={utt.jitter_local:.3f}")

Observation layer (optional)

observe.py is a thin, rule-based interpreter that emits descriptive observations from a feature jsonl. Not classifiers; claims like "voice came out steadier on that one — less edge, cleaner tone" tethered to specific feature values.

from observe import observe
obs = observe("features.jsonl")
print(obs.sentence)             # "Your voice moved around on that one — broke up on a word partway through."
print(obs.feature, obs.utterance_index, obs.value)   # citation for verification

Rules are template-driven so you can tune or extend them. They produce one-line descriptions like "a longer pause than the others" or "the pitch moved less than usual." Every observation cites the specific feature and utterance that triggered it, so the chain is auditable end-to-end.
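
To make the idea of a template rule concrete, here is a minimal sketch of what one might look like. It is not the code in observe.py: the `Observation` dataclass only mirrors the four attributes used in the snippet above, and the rule name, the `ratio` threshold, and the three-utterance minimum are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Observation:
    sentence: str          # human-readable description
    feature: str           # feature that triggered the rule
    utterance_index: int   # which utterance triggered it
    value: float           # the measured value being cited


def longer_pause_rule(features, ratio=2.0):
    """Flag the longest pause if it is at least `ratio` times the median of the rest.

    Returns None when there are fewer than three utterances, when the other
    pauses have a median of zero, or when the longest pause does not stand out.
    """
    pauses = [utt["pause_before_sec"] for utt in features]
    if len(pauses) < 3:
        return None
    idx = max(range(len(pauses)), key=pauses.__getitem__)
    others = pauses[:idx] + pauses[idx + 1:]
    typical = median(others)
    if typical > 0 and pauses[idx] >= ratio * typical:
        return Observation(
            sentence="a longer pause than the others",
            feature="pause_before_sec",
            utterance_index=idx,
            value=pauses[idx],
        )
    return None
```

Because the rule returns the feature name, index, and value along with the sentence, a reader can check the claim against the JSONL by hand.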

Per-user baselines

observe.update_baseline(user_id, features_jsonl) accumulates Welford running-statistics (mean + variance) for the user's own speech over time. Once a user has 3+ clips, observations become baseline-relative ("higher than your usual") instead of clip-relative ("higher than the rest of this clip"). This is where the real grip comes from — individual variation dwarfs clip-to-clip variation.
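
For readers unfamiliar with Welford's method, the sketch below shows the running mean and variance update for a single feature, plus a z-score that expresses a new value relative to the accumulated baseline. The class and method names are hypothetical, and the use of sample variance (dividing by n - 1) is an assumption; the source does not say which convention observe.py follows.

```python
import math


class RunningStat:
    """Welford's online algorithm for the mean and variance of a stream of values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (assumption); undefined with fewer than two values.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    def z_score(self, x):
        """How many standard deviations x sits from the baseline mean.

        Returns 0.0 when the spread is zero or not yet defined.
        """
        std = math.sqrt(self.variance)
        return (x - self.mean) / std if std > 0 else 0.0


# Example: a baseline of mean f0 across three earlier clips, then a new clip.
baseline = RunningStat()
for f0 in [110.0, 120.0, 130.0]:
    baseline.update(f0)
print(baseline.mean, baseline.variance, baseline.z_score(140.0))
```

Only three numbers (count, mean, m2) need to be persisted per feature, which is consistent with a baseline file that stays small and readable.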

Baselines are stored at users/<id>/hearing_baseline.json. Tiny, human-readable.

What this is good for

Voice-interactive AI companions that want to shape their response style to the speaker's state without labeling them.
Real-time voice coaching where you need to show numbers to the speaker (breath pauses, pitch stability) as feedback.
Longitudinal analysis where you care about an individual's trajectory (e.g., clinical speech research) and off-the-shelf classifiers throw out the signal you want.

What this is NOT good for

Emotion recognition / affective classification. By design.
Speaker ID or diarization. Use a dedicated speaker-recognition or diarization tool.
Transcription. Use Whisper.
Anything where you need a single scalar "mood" output. Go use one of the many tools that do that.

Design origin

Built as the measurement layer under a companion AI's response-pacing system (where the signal is routed to prompt-level style bias, not to a classifier). The "dimensional, human-legible, no classifier" stance came from working with vulnerable users where a mis-labeled "sad" output would be more harmful than useful.

The matching reasoning-layer approach is written up at: (pacing module — coming soon)

License

MIT.
