Skip to content

Continue - Fix Out-Of-Sync Audio at high load#12

Open
kingster wants to merge 7 commits intomainfrom
fix
Open

Continue - Fix Out-Of-Sync Audio at high load#12
kingster wants to merge 7 commits intomainfrom
fix

Conversation

@kingster
Copy link
Copy Markdown
Member

@kingster kingster commented Mar 15, 2026

Continues from #10
Solves issues in #11

Summary by CodeRabbit

  • Bug Fixes

    • Improved RTP packet handling to avoid false packet-loss concealment for late or reordered packets and coordinated transcription/recording writes with concealment logic.
    • Fixed playback issues from unfinalized WAV files so remaining audio is correctly read and remains playable.
  • New Features

    • Added DTX-aware packet-loss concealment and a detector to replace artifact-prone decoded frames with silence for more stable audio.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 15, 2026

📝 Walkthrough

Walkthrough

Adds DTX-aware PLC and payload-type skipping to RTP processing, timestamp-based PLC decisions, WAV recording finalization on goroutine exit, a G.729 oscillation detector that silences unstable frames, and handling for unfinalized WAV headers so remaining bytes are treated as data.

Changes

Cohort / File(s) Summary
RTP processing & PLC
pkg/media/rtp.go
Introduces timestamp-based PLC using lastTimestamp/hasLastTimestamp/lastDecodedPCMSize, DTX-aware small-gap PLC insertion, skip logic for non-audio payloads (emits DTMF AcousticEvent), deferred WAV finalization on goroutine exit, and reordered-packet handling that preserves sequencing state.
Decoder robustness (G.729 / Opus / G722)
pkg/media/audio_conversion.go
Adds isG729Oscillation detector and applies silence replacement when G.729 decoding errors or oscillation detected; minor formatting/spacing cleanups in G722 and Opus paths; updates to ensure lastDecodedPCMSize tracking.
WAV reading resilience
pkg/media/wav_reader.go
Handles unfinalized WAV headers where data chunk size is zero by using remaining bytes from dataOffset to EOF as data, enabling ReadSamples to return PCM for such files.

Sequence Diagram(s)

sequenceDiagram
  participant RTP as RTP Goroutine
  participant PT as PayloadTypeChecker
  participant DEC as Decoder
  participant PLC as PLC Inserter
  participant REC as Recorder (WAV)
  participant STT as STT/Transcription

  RTP->>PT: receive packet (seq, ts, payloadType)
  PT-->>RTP: audio? / non-audio (DTMF)
  alt non-audio
    RTP->>REC: emit DTMF AcousticEvent
    RTP-->>STT: skip decoding
  else audio
    RTP->>DEC: decode payload -> PCM
    DEC-->>RTP: decoded PCM (update lastDecodedPCMSize)
    RTP->>PLC: check seq & timestamp gaps, DTX window
    alt small timestamp gap (DTX)
      PLC-->>RTP: generate limited silence based on lastDecodedPCMSize
    else real loss
      PLC-->>RTP: generate concealment
    end
    RTP->>REC: write PCM / finalize on exit (update WAV header)
    RTP->>STT: forward decoded PCM
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 I hopped through RTP and timestamps bright,
Plugged little gaps with silences light.
I silenced odd hums from a G.729 tune,
Finalized WAVs beneath the moon.
Hooray — the stream sleeps snug and right! 🎶

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main objective of fixing out-of-sync audio at high load, which is reflected in the substantial changes to RTP packet handling, PLC logic, and timestamp synchronization across the modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/media/rtp.go`:
- Around line 608-611: Reorder detection incorrectly excludes wraparound cases
due to the extra `seq <= *lastSeq` check; update the logic in the
reorder-detection block that sets `isReordered` (using variables `seq`,
`lastSeq`, and `isReordered`) to remove the `seq <= *lastSeq` condition and rely
solely on the modular arithmetic test `uint16(*lastSeq - seq) < 32768` to detect
late/reordered packets (this preserves correct behavior across 16-bit sequence
wraparound).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d0fe0a6-66dc-4772-ae5f-bace50f67743

📥 Commits

Reviewing files that changed from the base of the PR and between 5830c04 and cafbb1c.

📒 Files selected for processing (1)
  • pkg/media/rtp.go

Comment thread pkg/media/rtp.go
kingster and others added 6 commits March 15, 2026 17:15
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
## Root Cause

Analysis of all 36 WAV files in `/Users/kinshuk.bairagi/Desktop/rec/` confirms the pattern:

- **7 legs are broken** (89-98.5% zeros, inflated to 1000-3600 seconds)
- **All broken legs** belong to the "mostly-listening" party in a call
- **Counterpart legs** for the same call are small and healthy (0-5% zeros)
- **Combined WAVs** inherit the problem (e.g. `0287...wav` is 86.1% zeros)

**What's happening:** G.729 Annex B uses DTX (Discontinuous Transmission) -- when a party is silent, the transmitter stops sending RTP packets entirely. When the next packet arrives (comfort noise or brief speech), the RTP sequence number has jumped by hundreds. The PLC code at [rtp.go](siprec/pkg/media/rtp.go) line 614-648 interprets every such jump as packet loss and inserts `min(gap, maxPLC=100)` frames of silence (up to 2 seconds each time). For a party who is mostly listening, this happens every few seconds, inflating a 15-second file to 1000+ seconds.

**Evidence from `0bbf...10.wav`:**

- 50,487 frames total, 49,248 are exact zeros (97.5%)
- Non-zero frames occur every ~100 frames (exactly `maxPLC`)
- Silence blocks are consistently ~2000ms = 100 x 20ms

## Fix: Distinguish DTX Gaps from Real Packet Loss

The key insight: **DTX gaps have large wall-clock inter-arrival times** (the transmitter literally stopped sending). **Real packet loss during active speech** has normal arrival rates (~20ms apart) with sequence gaps.

### Changes in [rtp.go](siprec/pkg/media/rtp.go)

**A. Track last packet arrival time** alongside `lastSeq`:

- Add `var lastArrivalTime time.Time` next to `lastSeq` (line 499)
- Update it when `lastSeq` is updated (line 678-681)

**B. Gate PLC on inter-arrival time:**

- Before computing PLC silence (line 614), check if `arrival.Sub(lastArrivalTime) > dtxThreshold`
- If the gap between consecutive packet arrivals exceeds a threshold (e.g. **60ms** -- 3x the normal 20ms packet interval), treat it as a DTX gap and skip PLC entirely
- Only apply PLC when the inter-arrival gap is small (real packet loss during active speech)

**C. Reduce `maxPLC` from 100 to 10:**

- Even for genuine packet loss, 2 seconds of silence is far too much
- Cap at 10 packets (200ms) which is more than enough for transient network loss and won't be audible as a major disruption
- Line 631: `const maxPLC = 10`

### Summary of the logic

```
if lastSeq != nil && seq != expectedNext && !isReordered {
    arrivalGap := arrival.Sub(lastArrivalTime)
    if arrivalGap <= dtxThreshold {
        // Packets arriving at normal rate but with seq gap = real packet loss
        // Insert PLC silence (capped at maxPLC=10)
    }
    // else: large arrival gap = DTX silence suppression, skip PLC
}
```
Ensure WAV files are finalized before the RTP goroutine cleanup by deferring a call to WAVWriter.Finalize() (with warning log on error) in StartRTPForwarding. Also make WAVReader.parseHeader tolerant of unfinalized WAVs: if the data chunk size is zero, infer the data size from the remaining file length so CombineWAVRecordings can still produce playable output. This avoids corrupt/empty recordings when goroutines exit or headers weren't updated.
Root cause: The G.729 decoder's synthesis filter occasionally goes unstable (typically after DTX gaps or corrupted frames), producing a 2kHz square wave oscillating between +32767 and -32768. This sounds like harsh buzzing/distortion.
Fix: Added isG729Oscillation() that detects the unstable pattern per decoded frame:
>50% of samples at |value| >= 30000 (railed to max)
>25% sign changes (high-frequency alternation near Nyquist)
When both conditions are met, the frame is replaced with silence. Normal speech clipping (loud voice) is never flagged because it has far fewer railed samples (2-20 out of 80) and normal sign-change rates.
The check is integrated directly into DecodeG729WithSSRC -- after each decoder.Decode(), if the output is oscillation, it's zeroed:
Problem: The arrival-time-based DTX detection (arrivalGap > 60ms) failed for calls where the PBX buffers and releases packets in bursts. Packets would arrive with near-zero wall-clock gaps but large sequence gaps. The PLC code treated these as real packet loss and inserted 200ms of silence per gap -- 652 times in this file alone.
Fix: Replaced arrival-time heuristic with RTP timestamp gap detection. The RTP timestamp is set by the source and reflects the actual audio timeline, regardless of network jitter or PBX buffering:
During DTX silence: the RTP timestamp jumps by hundreds of milliseconds worth of samples (e.g., 2 seconds = 16000 at 8kHz). This exceeds dtxTimestampThreshold (60ms worth = 480 samples) → PLC skipped.
During real packet loss: consecutive packets have normal timestamp deltas (~160 per 20ms at 8kHz). This is within the threshold → PLC correctly inserts up to maxPLC=10 frames of silence.
The arrival-time variable (lastArrivalTime) was replaced with lastTimestamp (uint32) and hasLastTimestamp (bool) tracking the previous packet's RTP timestamp.
Avoid decoding non-audio RTP payload types (e.g. event/Dtmf PTs) once the audio codec PT is known: such packets are early-returned and, if dtmfCh is present, a brief AcousticEvent with the payload_type is emitted. Add lastDecodedPCMSize to track the actual PCM bytes produced by the last decoded packet and prefer it when computing PLC silence length (falling back to PCMBytesPerPacket if unknown). This prevents decode errors/garbage from event payloads, avoids spurious PLC insertions, and ensures inserted silence has the correct byte length.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/media/audio_conversion.go (1)

499-523: Keep the oscillation guard on both G.729 entry points.

This protection only applies to DecodeG729WithSSRC; callers that still reach G.729 through DecodeAudioPayload("G729") / decodeG729Packet() keep the old failure mode. Reusing isG729Oscillation(decoded) there would keep the API behavior consistent.

Possible follow-up
-		err := decoder.Decode(frameData, decoded)
-		if err != nil {
+		err := decoder.Decode(frameData, decoded)
+		if err != nil || isG729Oscillation(decoded) {
 			// On error, generate silence for this frame
 			for i := 0; i < 80; i++ {
 				decoded[i] = 0
 			}
 		}

Also applies to: 562-566

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/media/audio_conversion.go` around lines 499 - 523, The oscillation guard
only used in DecodeG729WithSSRC must also be applied to the other G.729 entry
path; update DecodeAudioPayload (the branch handling "G729") and/or
decodeG729Packet to call isG729Oscillation(decoded) after decoding and before
returning the audio, and return the same failure/skip behavior that
DecodeG729WithSSRC uses when isG729Oscillation returns true so both entry points
share consistent handling; ensure you reference the same decoded []int16 buffer
and mirror the exact early-exit/logging behavior used in DecodeG729WithSSRC.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/media/rtp.go`:
- Around line 568-588: The branch uses currentPayloadType != 0 to mean "audio PT
established", which breaks real PT 0 (PCMU) streams; replace this logic by
introducing an explicit flag/identifier for the negotiated audio PT (e.g.
audioPtEstablished bool or currentAudioPayloadType int and/or
currentCodecName/currentSampleRate) and a separate negotiatedTelephoneEventPT
variable; change the condition that skips non-audio packets to compare
rtpPacket.PayloadType against the actual negotiated audio PT (not against 0) and
only emit the dtmf AcousticEvent on dtmfCh when rtpPacket.PayloadType matches
the negotiated telephone-event PT (not merely "different from
currentPayloadType"); update references to currentPayloadType,
rtpPacket.PayloadType, dtmfCh and AcousticEvent accordingly so PT 0 is treated
as a valid audio PT.
- Around line 642-658: Replace the hard-coded 60ms DTX cutoff used to compute
dtxTimestampThreshold with a derived packet-duration-based threshold: compute
the expected packet timestamp increment (in samples) from recent RTP timestamp
deltas when hasLastTimestamp is true (e.g., use the lastTimestamp delta between
consecutive packets or a short moving average), or fall back to converting
lastDecodedPCMSize into samples (using known bytes-per-sample and channel count)
to get the packet's sample count; then set dtxTimestampThreshold to that sample
count (or that value multiplied by a small tolerance factor like 1–2 to allow
for jitter) instead of uint32(sampleRate * 60 / 1000). Update the logic around
dtxTimestampThreshold (used where lastTimestamp, lastSeq, rtpPacket,
recordingWriter are referenced) so DTX detection is based on actual packet
duration rather than a fixed 60ms.

---

Nitpick comments:
In `@pkg/media/audio_conversion.go`:
- Around line 499-523: The oscillation guard only used in DecodeG729WithSSRC
must also be applied to the other G.729 entry path; update DecodeAudioPayload
(the branch handling "G729") and/or decodeG729Packet to call
isG729Oscillation(decoded) after decoding and before returning the audio, and
return the same failure/skip behavior that DecodeG729WithSSRC uses when
isG729Oscillation returns true so both entry points share consistent handling;
ensure you reference the same decoded []int16 buffer and mirror the exact
early-exit/logging behavior used in DecodeG729WithSSRC.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c639457-03d3-4180-ae38-4018268071e6

📥 Commits

Reviewing files that changed from the base of the PR and between cafbb1c and 2b6ac4c.

📒 Files selected for processing (3)
  • pkg/media/audio_conversion.go
  • pkg/media/rtp.go
  • pkg/media/wav_reader.go

Comment thread pkg/media/rtp.go
Comment on lines +568 to +588
// Skip non-audio payload types: once the audio codec PT is
// established, packets with a different PT are event payloads
// (e.g. RFC 2833 DTMF on a dynamic PT) — not audio. Decoding
// them with the audio codec produces errors or garbage and
// creates sequence-number gaps that trigger spurious PLC.
if currentPayloadType != 0 && byte(rtpPacket.PayloadType) != currentPayloadType {
if dtmfCh != nil {
select {
case dtmfCh <- AcousticEvent{
Type: "dtmf",
Confidence: 0.9,
Timestamp: time.Now(),
Details: map[string]interface{}{
"payload_type": rtpPacket.PayloadType,
},
}:
default:
}
}
return
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

This payload-type shortcut breaks common PT 0 audio streams.

0 is a real RTP audio payload type (PCMU), so currentPayloadType != 0 is not a safe "audio PT established" check. On a PCMU call, one PT 101/13 packet can still poison the tracked PT, and later real PT 0 audio starts getting dropped by this branch. Please track establishment with a separate bool (or currentCodecName/currentSampleRate), and only emit dtmf when the packet is actually the negotiated telephone-event PT.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/media/rtp.go` around lines 568 - 588, The branch uses currentPayloadType
!= 0 to mean "audio PT established", which breaks real PT 0 (PCMU) streams;
replace this logic by introducing an explicit flag/identifier for the negotiated
audio PT (e.g. audioPtEstablished bool or currentAudioPayloadType int and/or
currentCodecName/currentSampleRate) and a separate negotiatedTelephoneEventPT
variable; change the condition that skips non-audio packets to compare
rtpPacket.PayloadType against the actual negotiated audio PT (not against 0) and
only emit the dtmf AcousticEvent on dtmfCh when rtpPacket.PayloadType matches
the negotiated telephone-event PT (not merely "different from
currentPayloadType"); update references to currentPayloadType,
rtpPacket.PayloadType, dtmfCh and AcousticEvent accordingly so PT 0 is treated
as a valid audio PT.

Comment thread pkg/media/rtp.go
Comment on lines +642 to +658
// 60ms of audio samples at the current rate; gaps larger than this are DTX
dtxTimestampThreshold := uint32(sampleRate * 60 / 1000)
if lastSeq != nil {
expectedNext := uint16(*lastSeq + 1)
seq := rtpPacket.SequenceNumber
if seq != expectedNext {
// Reordered (late) packet: seq is earlier than lastSeq in the 16-bit window.
// uint16(*lastSeq-seq) < 32768 means the packet is in the recent half, not wraparound.
if seq <= *lastSeq && uint16(*lastSeq-seq) < 32768 {
if uint16(*lastSeq-seq) < 32768 {
isReordered = true
// Out-of-order arrival: skip PLC, do not insert silence.
} else {
var lost int
if seq > expectedNext {
lost = int(seq - expectedNext)
} else {
// Wraparound
lost = int(seq) + (65536 - int(expectedNext))
}
const maxPLC = 100 // cap at ~2s at 20ms/packet to avoid runaway on bad sequence
if lost > maxPLC {
lost = maxPLC
}
if lost > 0 && recordingWriter != nil {
bytesPerPacket := PCMBytesPerPacket(codecName, sampleRate)
silenceLen := lost * bytesPerPacket
if silenceLen > 0 {
silence := make([]byte, silenceLen)
if _, writeErr := recordingWriter.Write(silence); writeErr != nil {
forwarder.Logger.WithError(writeErr).WithField("call_uuid", callUUID).Debug("PLC silence write failed")
} else if metrics.IsMetricsEnabled() {
metrics.RecordRTPDroppedPackets("plc_concealed", float64(lost))
} else if hasLastTimestamp {
// Use RTP timestamp gap to distinguish real packet loss from DTX.
// During DTX the timestamp jumps by many packets' worth; during real
// loss consecutive packets arrive with a normal timestamp delta.
tsGap := rtpPacket.Timestamp - lastTimestamp
if tsGap <= dtxTimestampThreshold && recordingWriter != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The fixed 60 ms DTX cutoff will miss real packet loss on longer packetization.

sampleRate * 60 / 1000 classifies one lost 60 ms packet—or two lost 30 ms packets—as DTX, so no PLC silence is inserted and the recording becomes short again. Derive this threshold from the stream's actual packet duration, e.g. recent RTP timestamp deltas or lastDecodedPCMSize, instead of hard-coding 60 ms.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/media/rtp.go` around lines 642 - 658, Replace the hard-coded 60ms DTX
cutoff used to compute dtxTimestampThreshold with a derived
packet-duration-based threshold: compute the expected packet timestamp increment
(in samples) from recent RTP timestamp deltas when hasLastTimestamp is true
(e.g., use the lastTimestamp delta between consecutive packets or a short moving
average), or fall back to converting lastDecodedPCMSize into samples (using
known bytes-per-sample and channel count) to get the packet's sample count; then
set dtxTimestampThreshold to that sample count (or that value multiplied by a
small tolerance factor like 1–2 to allow for jitter) instead of
uint32(sampleRate * 60 / 1000). Update the logic around dtxTimestampThreshold
(used where lastTimestamp, lastSeq, rtpPacket, recordingWriter are referenced)
so DTX detection is based on actual packet duration rather than a fixed 60ms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant