Conversation
📝 WalkthroughWalkthroughAdds DTX-aware PLC and payload-type skipping to RTP processing, timestamp-based PLC decisions, WAV recording finalization on goroutine exit, a G.729 oscillation detector that silences unstable frames, and handling for unfinalized WAV headers so remaining bytes are treated as data. Changes
Sequence Diagram(s)sequenceDiagram
participant RTP as RTP Goroutine
participant PT as PayloadTypeChecker
participant DEC as Decoder
participant PLC as PLC Inserter
participant REC as Recorder (WAV)
participant STT as STT/Transcription
RTP->>PT: receive packet (seq, ts, payloadType)
PT-->>RTP: audio? / non-audio (DTMF)
alt non-audio
RTP->>REC: emit DTMF AcousticEvent
RTP-->>STT: skip decoding
else audio
RTP->>DEC: decode payload -> PCM
DEC-->>RTP: decoded PCM (update lastDecodedPCMSize)
RTP->>PLC: check seq & timestamp gaps, DTX window
alt small timestamp gap (DTX)
PLC-->>RTP: generate limited silence based on lastDecodedPCMSize
else real loss
PLC-->>RTP: generate concealment
end
RTP->>REC: write PCM / finalize on exit (update WAV header)
RTP->>STT: forward decoded PCM
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/media/rtp.go`:
- Around line 608-611: Reorder detection incorrectly excludes wraparound cases
due to the extra `seq <= *lastSeq` check; update the logic in the
reorder-detection block that sets `isReordered` (using variables `seq`,
`lastSeq`, and `isReordered`) to remove the `seq <= *lastSeq` condition and rely
solely on the modular arithmetic test `uint16(*lastSeq - seq) < 32768` to detect
late/reordered packets (this preserves correct behavior across 16-bit sequence
wraparound).
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
## Root Cause
Analysis of all 36 WAV files in `/Users/kinshuk.bairagi/Desktop/rec/` confirms the pattern:
- **7 legs are broken** (89-98.5% zeros, inflated to 1000-3600 seconds)
- **All broken legs** belong to the "mostly-listening" party in a call
- **Counterpart legs** for the same call are small and healthy (0-5% zeros)
- **Combined WAVs** inherit the problem (e.g. `0287...wav` is 86.1% zeros)
**What's happening:** G.729 Annex B uses DTX (Discontinuous Transmission) -- when a party is silent, the transmitter stops sending RTP packets entirely. When the next packet arrives (comfort noise or brief speech), the RTP sequence number has jumped by hundreds. The PLC code at [rtp.go](siprec/pkg/media/rtp.go) line 614-648 interprets every such jump as packet loss and inserts `min(gap, maxPLC=100)` frames of silence (up to 2 seconds each time). For a party who is mostly listening, this happens every few seconds, inflating a 15-second file to 1000+ seconds.
**Evidence from `0bbf...10.wav`:**
- 50,487 frames total, 49,248 are exact zeros (97.5%)
- Non-zero frames occur every ~100 frames (exactly `maxPLC`)
- Silence blocks are consistently ~2000ms = 100 x 20ms
## Fix: Distinguish DTX Gaps from Real Packet Loss
The key insight: **DTX gaps have large wall-clock inter-arrival times** (the transmitter literally stopped sending). **Real packet loss during active speech** has normal arrival rates (~20ms apart) with sequence gaps.
### Changes in [rtp.go](siprec/pkg/media/rtp.go)
**A. Track last packet arrival time** alongside `lastSeq`:
- Add `var lastArrivalTime time.Time` next to `lastSeq` (line 499)
- Update it when `lastSeq` is updated (line 678-681)
**B. Gate PLC on inter-arrival time:**
- Before computing PLC silence (line 614), check if `arrival.Sub(lastArrivalTime) > dtxThreshold`
- If the gap between consecutive packet arrivals exceeds a threshold (e.g. **60ms** -- 3x the normal 20ms packet interval), treat it as a DTX gap and skip PLC entirely
- Only apply PLC when the inter-arrival gap is small (real packet loss during active speech)
**C. Reduce `maxPLC` from 100 to 10:**
- Even for genuine packet loss, 2 seconds of silence is far too much
- Cap at 10 packets (200ms) which is more than enough for transient network loss and won't be audible as a major disruption
- Line 631: `const maxPLC = 10`
### Summary of the logic
```
if lastSeq != nil && seq != expectedNext && !isReordered {
arrivalGap := arrival.Sub(lastArrivalTime)
if arrivalGap <= dtxThreshold {
// Packets arriving at normal rate but with seq gap = real packet loss
// Insert PLC silence (capped at maxPLC=10)
}
// else: large arrival gap = DTX silence suppression, skip PLC
}
```
Ensure WAV files are finalized before the RTP goroutine cleanup by deferring a call to WAVWriter.Finalize() (with warning log on error) in StartRTPForwarding. Also make WAVReader.parseHeader tolerant of unfinalized WAVs: if the data chunk size is zero, infer the data size from the remaining file length so CombineWAVRecordings can still produce playable output. This avoids corrupt/empty recordings when goroutines exit or headers weren't updated.
Root cause: The G.729 decoder's synthesis filter occasionally goes unstable (typically after DTX gaps or corrupted frames), producing a 2kHz square wave oscillating between +32767 and -32768. This sounds like harsh buzzing/distortion. Fix: Added isG729Oscillation() that detects the unstable pattern per decoded frame: >50% of samples at |value| >= 30000 (railed to max) >25% sign changes (high-frequency alternation near Nyquist) When both conditions are met, the frame is replaced with silence. Normal speech clipping (loud voice) is never flagged because it has far fewer railed samples (2-20 out of 80) and normal sign-change rates. The check is integrated directly into DecodeG729WithSSRC -- after each decoder.Decode(), if the output is oscillation, it's zeroed:
Problem: The arrival-time-based DTX detection (arrivalGap > 60ms) failed for calls where the PBX buffers and releases packets in bursts. Packets would arrive with near-zero wall-clock gaps but large sequence gaps. The PLC code treated these as real packet loss and inserted 200ms of silence per gap -- 652 times in this file alone. Fix: Replaced arrival-time heuristic with RTP timestamp gap detection. The RTP timestamp is set by the source and reflects the actual audio timeline, regardless of network jitter or PBX buffering: During DTX silence: the RTP timestamp jumps by hundreds of milliseconds worth of samples (e.g., 2 seconds = 16000 at 8kHz). This exceeds dtxTimestampThreshold (60ms worth = 480 samples) → PLC skipped. During real packet loss: consecutive packets have normal timestamp deltas (~160 per 20ms at 8kHz). This is within the threshold → PLC correctly inserts up to maxPLC=10 frames of silence. The arrival-time variable (lastArrivalTime) was replaced with lastTimestamp (uint32) and hasLastTimestamp (bool) tracking the previous packet's RTP timestamp.
Avoid decoding non-audio RTP payload types (e.g. event/Dtmf PTs) once the audio codec PT is known: such packets are early-returned and, if dtmfCh is present, a brief AcousticEvent with the payload_type is emitted. Add lastDecodedPCMSize to track the actual PCM bytes produced by the last decoded packet and prefer it when computing PLC silence length (falling back to PCMBytesPerPacket if unknown). This prevents decode errors/garbage from event payloads, avoids spurious PLC insertions, and ensures inserted silence has the correct byte length.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/media/audio_conversion.go (1)
499-523: Keep the oscillation guard on both G.729 entry points.This protection only applies to
DecodeG729WithSSRC; callers that still reach G.729 throughDecodeAudioPayload("G729")/decodeG729Packet()keep the old failure mode. ReusingisG729Oscillation(decoded)there would keep the API behavior consistent.Possible follow-up
- err := decoder.Decode(frameData, decoded) - if err != nil { + err := decoder.Decode(frameData, decoded) + if err != nil || isG729Oscillation(decoded) { // On error, generate silence for this frame for i := 0; i < 80; i++ { decoded[i] = 0 } }Also applies to: 562-566
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/media/audio_conversion.go` around lines 499 - 523, The oscillation guard only used in DecodeG729WithSSRC must also be applied to the other G.729 entry path; update DecodeAudioPayload (the branch handling "G729") and/or decodeG729Packet to call isG729Oscillation(decoded) after decoding and before returning the audio, and return the same failure/skip behavior that DecodeG729WithSSRC uses when isG729Oscillation returns true so both entry points share consistent handling; ensure you reference the same decoded []int16 buffer and mirror the exact early-exit/logging behavior used in DecodeG729WithSSRC.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/media/rtp.go`:
- Around line 568-588: The branch uses currentPayloadType != 0 to mean "audio PT
established", which breaks real PT 0 (PCMU) streams; replace this logic by
introducing an explicit flag/identifier for the negotiated audio PT (e.g.
audioPtEstablished bool or currentAudioPayloadType int and/or
currentCodecName/currentSampleRate) and a separate negotiatedTelephoneEventPT
variable; change the condition that skips non-audio packets to compare
rtpPacket.PayloadType against the actual negotiated audio PT (not against 0) and
only emit the dtmf AcousticEvent on dtmfCh when rtpPacket.PayloadType matches
the negotiated telephone-event PT (not merely "different from
currentPayloadType"); update references to currentPayloadType,
rtpPacket.PayloadType, dtmfCh and AcousticEvent accordingly so PT 0 is treated
as a valid audio PT.
- Around line 642-658: Replace the hard-coded 60ms DTX cutoff used to compute
dtxTimestampThreshold with a derived packet-duration-based threshold: compute
the expected packet timestamp increment (in samples) from recent RTP timestamp
deltas when hasLastTimestamp is true (e.g., use the lastTimestamp delta between
consecutive packets or a short moving average), or fall back to converting
lastDecodedPCMSize into samples (using known bytes-per-sample and channel count)
to get the packet's sample count; then set dtxTimestampThreshold to that sample
count (or that value multiplied by a small tolerance factor like 1–2 to allow
for jitter) instead of uint32(sampleRate * 60 / 1000). Update the logic around
dtxTimestampThreshold (used where lastTimestamp, lastSeq, rtpPacket,
recordingWriter are referenced) so DTX detection is based on actual packet
duration rather than a fixed 60ms.
---
Nitpick comments:
In `@pkg/media/audio_conversion.go`:
- Around line 499-523: The oscillation guard only used in DecodeG729WithSSRC
must also be applied to the other G.729 entry path; update DecodeAudioPayload
(the branch handling "G729") and/or decodeG729Packet to call
isG729Oscillation(decoded) after decoding and before returning the audio, and
return the same failure/skip behavior that DecodeG729WithSSRC uses when
isG729Oscillation returns true so both entry points share consistent handling;
ensure you reference the same decoded []int16 buffer and mirror the exact
early-exit/logging behavior used in DecodeG729WithSSRC.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5c639457-03d3-4180-ae38-4018268071e6
📒 Files selected for processing (3)
pkg/media/audio_conversion.gopkg/media/rtp.gopkg/media/wav_reader.go
| // Skip non-audio payload types: once the audio codec PT is | ||
| // established, packets with a different PT are event payloads | ||
| // (e.g. RFC 2833 DTMF on a dynamic PT) — not audio. Decoding | ||
| // them with the audio codec produces errors or garbage and | ||
| // creates sequence-number gaps that trigger spurious PLC. | ||
| if currentPayloadType != 0 && byte(rtpPacket.PayloadType) != currentPayloadType { | ||
| if dtmfCh != nil { | ||
| select { | ||
| case dtmfCh <- AcousticEvent{ | ||
| Type: "dtmf", | ||
| Confidence: 0.9, | ||
| Timestamp: time.Now(), | ||
| Details: map[string]interface{}{ | ||
| "payload_type": rtpPacket.PayloadType, | ||
| }, | ||
| }: | ||
| default: | ||
| } | ||
| } | ||
| return | ||
| } |
There was a problem hiding this comment.
This payload-type shortcut breaks common PT 0 audio streams.
0 is a real RTP audio payload type (PCMU), so currentPayloadType != 0 is not a safe "audio PT established" check. On a PCMU call, one PT 101/13 packet can still poison the tracked PT, and later real PT 0 audio starts getting dropped by this branch. Please track establishment with a separate bool (or currentCodecName/currentSampleRate), and only emit dtmf when the packet is actually the negotiated telephone-event PT.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/media/rtp.go` around lines 568 - 588, The branch uses currentPayloadType
!= 0 to mean "audio PT established", which breaks real PT 0 (PCMU) streams;
replace this logic by introducing an explicit flag/identifier for the negotiated
audio PT (e.g. audioPtEstablished bool or currentAudioPayloadType int and/or
currentCodecName/currentSampleRate) and a separate negotiatedTelephoneEventPT
variable; change the condition that skips non-audio packets to compare
rtpPacket.PayloadType against the actual negotiated audio PT (not against 0) and
only emit the dtmf AcousticEvent on dtmfCh when rtpPacket.PayloadType matches
the negotiated telephone-event PT (not merely "different from
currentPayloadType"); update references to currentPayloadType,
rtpPacket.PayloadType, dtmfCh and AcousticEvent accordingly so PT 0 is treated
as a valid audio PT.
| // 60ms of audio samples at the current rate; gaps larger than this are DTX | ||
| dtxTimestampThreshold := uint32(sampleRate * 60 / 1000) | ||
| if lastSeq != nil { | ||
| expectedNext := uint16(*lastSeq + 1) | ||
| seq := rtpPacket.SequenceNumber | ||
| if seq != expectedNext { | ||
| // Reordered (late) packet: seq is earlier than lastSeq in the 16-bit window. | ||
| // uint16(*lastSeq-seq) < 32768 means the packet is in the recent half, not wraparound. | ||
| if seq <= *lastSeq && uint16(*lastSeq-seq) < 32768 { | ||
| if uint16(*lastSeq-seq) < 32768 { | ||
| isReordered = true | ||
| // Out-of-order arrival: skip PLC, do not insert silence. | ||
| } else { | ||
| var lost int | ||
| if seq > expectedNext { | ||
| lost = int(seq - expectedNext) | ||
| } else { | ||
| // Wraparound | ||
| lost = int(seq) + (65536 - int(expectedNext)) | ||
| } | ||
| const maxPLC = 100 // cap at ~2s at 20ms/packet to avoid runaway on bad sequence | ||
| if lost > maxPLC { | ||
| lost = maxPLC | ||
| } | ||
| if lost > 0 && recordingWriter != nil { | ||
| bytesPerPacket := PCMBytesPerPacket(codecName, sampleRate) | ||
| silenceLen := lost * bytesPerPacket | ||
| if silenceLen > 0 { | ||
| silence := make([]byte, silenceLen) | ||
| if _, writeErr := recordingWriter.Write(silence); writeErr != nil { | ||
| forwarder.Logger.WithError(writeErr).WithField("call_uuid", callUUID).Debug("PLC silence write failed") | ||
| } else if metrics.IsMetricsEnabled() { | ||
| metrics.RecordRTPDroppedPackets("plc_concealed", float64(lost)) | ||
| } else if hasLastTimestamp { | ||
| // Use RTP timestamp gap to distinguish real packet loss from DTX. | ||
| // During DTX the timestamp jumps by many packets' worth; during real | ||
| // loss consecutive packets arrive with a normal timestamp delta. | ||
| tsGap := rtpPacket.Timestamp - lastTimestamp | ||
| if tsGap <= dtxTimestampThreshold && recordingWriter != nil { |
There was a problem hiding this comment.
The fixed 60 ms DTX cutoff will miss real packet loss on longer packetization.
sampleRate * 60 / 1000 classifies one lost 60 ms packet—or two lost 30 ms packets—as DTX, so no PLC silence is inserted and the recording becomes short again. Derive this threshold from the stream's actual packet duration, e.g. recent RTP timestamp deltas or lastDecodedPCMSize, instead of hard-coding 60 ms.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/media/rtp.go` around lines 642 - 658, Replace the hard-coded 60ms DTX
cutoff used to compute dtxTimestampThreshold with a derived
packet-duration-based threshold: compute the expected packet timestamp increment
(in samples) from recent RTP timestamp deltas when hasLastTimestamp is true
(e.g., use the lastTimestamp delta between consecutive packets or a short moving
average), or fall back to converting lastDecodedPCMSize into samples (using
known bytes-per-sample and channel count) to get the packet's sample count; then
set dtxTimestampThreshold to that sample count (or that value multiplied by a
small tolerance factor like 1–2 to allow for jitter) instead of
uint32(sampleRate * 60 / 1000). Update the logic around dtxTimestampThreshold
(used where lastTimestamp, lastSeq, rtpPacket, recordingWriter are referenced)
so DTX detection is based on actual packet duration rather than a fixed 60ms.
Continues from #10
Solves issues in #11
Summary by CodeRabbit
Bug Fixes
New Features