
hearing

Paralinguistic feature extraction. A stethoscope, not a classifier. Emits dimensional, human-legible features per utterance so you can reason about how someone is speaking — not a diagnosis of who they are.

What it does

Given an audio file, hearing returns a per-utterance list of acoustic features: fundamental frequency, jitter, shimmer, HNR (harmonics-to-noise ratio), pause-before, duration, intensity, pitch range. The audio is first segmented into utterances with Silero-VAD, and each utterance is then measured with Parselmouth (Praat bindings).

{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, "f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, "shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}
{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, "f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, "shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}

Two utterances. The first was higher-pitched, rougher, lower HNR — the speaker was nervous. The second was lower, cleaner — they found their footing. hearing doesn't tell you that. It gives you the numbers so your reasoning layer can.
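As a rough sketch of what a consuming layer might do with these records, the snippet below parses the two sample lines and reports the signed change in a few features between them. The `load_utterances` and `feature_deltas` helpers are hypothetical and not part of hearing; only the field names and values come from the sample output above.

```python
import json

# The two sample records from above, as they would appear in a features.jsonl file.
SAMPLE_JSONL = """\
{"start_sec": 8.87, "duration_sec": 3.26, "pause_before_sec": 0.0, "f0_mean_hz": 123, "f0_range_hz": 45, "jitter_local": 0.031, "shimmer_local": 0.140, "hnr_db": 5.6, "intensity_mean_db": 58.1}
{"start_sec": 13.12, "duration_sec": 30.88, "pause_before_sec": 1.00, "f0_mean_hz": 108, "f0_range_hz": 28, "jitter_local": 0.023, "shimmer_local": 0.106, "hnr_db": 11.2, "intensity_mean_db": 62.3}
"""


def load_utterances(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def feature_deltas(first, second, keys):
    """Hypothetical helper: signed change in each feature from first to second."""
    return {key: round(second[key] - first[key], 3) for key in keys}


utterances = load_utterances(SAMPLE_JSONL)
deltas = feature_deltas(
    utterances[0],
    utterances[1],
    ["f0_mean_hz", "jitter_local", "shimmer_local", "hnr_db"],
)
for name, change in deltas.items():
    print(f"{name}: {change:+.3f}")
```

The deltas are still just numbers. Deciding what a 15 Hz drop in pitch or a rise in HNR means for a given speaker is left to the reasoning layer.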

Why "no classifier"?

Most voice-emotion tools ship with a classifier: "87% sad." Those classifiers are trained on narrow data, degrade badly on individual variation, and encode assumptions that aren't transparent. When used inside an AI companion, they can quietly shape behavior in ways neither the developer nor the user can audit.

hearing makes a different bet: emit clean numbers, let the consuming layer reason about them. If your reasoning layer wants to say "voice sounded strained", it can do so by reading the features and citing them ("jitter 0.084 on utterance 14"). The chain is legible.

This also makes per-user baselines trivial — which is the only way paralinguistic signals become trustworthy across real people, given how much individual variation there is in voice.

Install

Two files (observe.py is optional), plus a few pip dependencies:

curl -o hearing.py https://raw.githubusercontent.com/kithfoss/hearing/main/hearing.py
curl -o observe.py https://raw.githubusercontent.com/kithfoss/hearing/main/observe.py

pip install praat-parselmouth silero-vad torch soundfile numpy

Torch is the heaviest dep; you can use CPU-only torch on modest hardware.

Usage

CLI

python3 hearing.py recording.wav --out features.jsonl
# writes one JSON object per detected utterance

Library

from hearing import extract_hearing

features = extract_hearing("clip.wav")
for utt in features:
    print(f"{utt.start_sec:.2f}s  f0={utt.f0_mean_hz:.0f}Hz  jit={utt.jitter_local:.3f}")

Observation layer (optional)

observe.py is a thin, rule-based interpreter that emits descriptive observations from a feature jsonl. Not classifiers; claims like "voice came out steadier on that one — less edge, cleaner tone" tethered to specific feature values.

from observe import observe
obs = observe("features.jsonl")
print(obs.sentence)             # "Your voice moved around on that one — broke up on a word partway through."
print(obs.feature, obs.utterance_index, obs.value)   # citation for verification

Rules are template-driven so you can tune or extend them. One-liner descriptions like "a longer pause than the others" / "the pitch moved less than usual." Every observation cites the specific feature and utterance that triggered it — the chain is auditable end-to-end.
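To make the idea of a cited, rule-based observation concrete, here is a minimal sketch of one such rule. It is not the code in observe.py: the `Observation` dataclass, the `longest_pause_rule` function, and the `ratio` threshold are all assumptions made for illustration. Only the attribute names (`sentence`, `feature`, `utterance_index`, `value`) and the template text mirror what is shown above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Observation:
    sentence: str          # human-readable description
    feature: str           # name of the feature that triggered the rule
    utterance_index: int   # which utterance it came from
    value: float           # the measured value being cited


def longest_pause_rule(utterances, ratio=2.0) -> Optional[Observation]:
    """Hypothetical rule: fire when the longest pause is at least
    `ratio` times the mean of the other pauses in the clip."""
    if len(utterances) < 3:
        return None  # too few utterances for "the others" to mean much
    pauses = [u["pause_before_sec"] for u in utterances]
    idx = max(range(len(pauses)), key=pauses.__getitem__)
    others = pauses[:idx] + pauses[idx + 1:]
    if pauses[idx] > 0 and pauses[idx] >= ratio * mean(others):
        return Observation(
            sentence="a longer pause than the others",
            feature="pause_before_sec",
            utterance_index=idx,
            value=pauses[idx],
        )
    return None


clip = [
    {"pause_before_sec": 0.0},
    {"pause_before_sec": 0.4},
    {"pause_before_sec": 2.1},
    {"pause_before_sec": 0.5},
]
obs = longest_pause_rule(clip)
even_clip = [{"pause_before_sec": 0.5}] * 4
no_obs = longest_pause_rule(even_clip)
if obs is not None:
    print(obs.sentence, "|", obs.feature, obs.utterance_index, obs.value)
```

Because the observation carries the feature name, utterance index, and value, a reader can check the claim against the feature file directly.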

Per-user baselines

observe.update_baseline(user_id, features_jsonl) accumulates Welford running-statistics (mean + variance) for the user's own speech over time. Once a user has 3+ clips, observations become baseline-relative ("higher than your usual") instead of clip-relative ("higher than the rest of this clip"). This is where the real grip comes from — individual variation dwarfs clip-to-clip variation.

Baselines are stored at users/<id>/hearing_baseline.json. Tiny, human-readable.
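For readers unfamiliar with Welford's method, the sketch below shows the running mean and variance update for a single feature, plus a baseline-relative score. It illustrates the technique only. The `RunningStat` class, its field names, and the `zscore` helper are assumptions, and the JSON it prints is not necessarily the layout of hearing_baseline.json.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass
class RunningStat:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Welford's online update: one pass, no stored history."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until there are at least two values."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def zscore(stat: RunningStat, x: float) -> float:
    """How far x sits from this user's usual value, in standard deviations."""
    std = math.sqrt(stat.variance)
    return (x - stat.mean) / std if std > 0 else 0.0


# Made-up f0_mean_hz values from one user's earlier utterances.
baseline = RunningStat()
for f0 in [110.0, 123.0, 108.0, 117.0]:
    baseline.update(f0)

print(json.dumps(asdict(baseline)))  # small enough to store as JSON
print(f"z = {zscore(baseline, 135.0):+.2f}")
```

Only three numbers per feature need to be kept, which is why a baseline file can stay small and readable while still covering everything the user has said so far.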

What this is good for

  • Voice-interactive AI companions that want to shape their response style to the speaker's state without labeling them.
  • Real-time voice coaching where you need to show numbers to the speaker (breath pauses, pitch stability) as feedback.
  • Longitudinal analysis where you care about an individual's trajectory (e.g., clinical speech research) and off-the-shelf classifiers throw out the signal you want.

What this is NOT good for

  • Emotion recognition / affective classification. By design.
  • Speaker ID. Use a speaker-recognition or diarization tool.
  • Transcription. Use Whisper.
  • Anything where you need a single scalar "mood" output. Go use one of the many tools that do that.

Design origin

Built as the measurement layer under a companion AI's response-pacing system (where the signal is routed to prompt-level style bias, not to a classifier). The "dimensional, human-legible, no classifier" stance came from working with vulnerable users where a mis-labeled "sad" output would be more harmful than useful.

The matching reasoning-layer approach is written up at: (pacing module — coming soon)

License

MIT.
