Skip to content

SPI reliability: fix PE03 frame drops (RX-drain) + SPI_DIAG build; retire panel_master#5

Open
mbreiser wants to merge 7 commits into
mainfrom
spi-bringup-step0
Open

SPI reliability: fix PE03 frame drops (RX-drain) + SPI_DIAG build; retire panel_master#5
mbreiser wants to merge 7 commits into
mainfrom
spi-bringup-step0

Conversation

@mbreiser
Copy link
Copy Markdown

Summary

Fixes the flickery-display / occasional PE03 seen during arena-master streaming bring-up, adds a silent on-demand SPI diagnostic build that found it, and retires the panel_master bench harness now that the real arena master is available. All at the arena's actual ~5 MHz operating point — no SPI rewrite; the PIO+DMA work remains future-proofing for >10 MHz.

The fix (headline): PE03 rejected frames — last-byte drop

custom_spi_read_blocking() broke out of the receive loop the instant gpio_get(cs_pin) read high. The final data byte's last SCK edge can precede the CS rising edge by less than the RP2350 RX-path latency (input synchronizer + SSP RX pipeline), so that byte wasn't spi_is_readable() yet when CS was sampled high → the read returned one byte shortcheck_length() failed → PE03.

Fix: after CS-high, drain any straggler bytes still arriving in the RX FIFO before returning, bounded by RX_DRAIN_SETTLE consecutive empty polls (reset on each read) so a finished/truncated transaction still exits promptly and can never hang. CS is high during the drain, so no next-transaction bytes can enter the FIFO.

Diagnostic key finding: parity never failed — only length, almost all got = expected − 1. So it was framing, not signal integrity.

What's in this PR (5 commits)

  1. panel: make DEVEL serial heartbeat non-blocking — the per-1000-message heartbeat could stall core 0 between transactions; now it builds one line and writes only when the USB-CDC FIFO has room.
  2. Retire panel_master harness; add high-speed SPI bench protocolpanel_master (≈250 kHz bitbang stand-in) removed; bench docs point at the real arena master; new panel/bench/handoff-spi-highspeed-bench.md.
  3. panel: fix PE03 rejected frames — drain RX FIFO after CS-high — the fix above.
  4. panel: add SPI_DIAG diagnostic build + ignore sub-minimum runts — new pico_v021_spidiag / pico_v031_spidiag envs (-DSPI_DIAG=1): behaves like production but silently tallies per-check fail counts, a per-command histogram, and a ring of the last 32 failures with got-vs-expected byte counts; d dumps in one burst, z zeros. Also: transactions below MESSAGE_MINIMUM_SIZE (1–2 byte CS glitch / aborted clock) no longer raise a glyph that would blank a streaming display (still counted by SPI_DIAG).
  5. bench: fold canonical-spec facts into high-speed SPI protocol doc — controller clocks ~5 MHz today (g6_07); 30 MHz is aspirational/unmeasured; shared-CIPO + column-buffer topology; ≥0.5 ms CS-gap target.

Validation (v0.3.1 panel + arena master, via SPI_DIAG)

config messages rejects
GS2 / 200 Hz — before fix 24,130 10 (5× got=exp−1, 5× got=1)
GS2 / 200 Hz — after fix 24,101 0
GS16 — after fix 24,117 0
cumulative (GS2→GS16) — after fix 229,726 0

Production builds (pico_v021, pico_v031) and the *_spidiag builds all compile; the fix lives in panel_spi_custom.cpp, so production carries it too.

Notes / follow-ups

  • The got=1 runts are a real ~1-byte CS glitch or master short-transaction (the drain actually recovered them). Source unconfirmed — a Saleae peek or a note to Frank would close it; low priority since they no longer affect the display.
  • The PIO+DMA SPI-slave rewrite stays gated as future-proofing for the >10 MHz aspiration (see panel/bench/handoff-spi-highspeed-bench.md).

🤖 Generated with Claude Code

mbreiser and others added 5 commits May 29, 2026 09:47
…> PE03)

The per-1000-message heartbeat in Messenger::update() did 11 blocking
`Serial <<` writes (~280 B). This runs on core 0 between SPI transactions;
if the USB host isn't draining, a full TX buffer could stall core 0 for
milliseconds, so the panel wasn't back in custom_spi_read_blocking() when
the master started the next transaction -> missed leading bytes -> num_rx
undercount -> PE03 length error.

Build one compact line and emit it only when Serial.availableForWrite() has
room for the whole line, so write() can never block. The heartbeat is
diagnostic and safely droppable. Builds clean for pico_v021 and pico_v031.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
panel_master was a ~250 kHz GPIO-bitbang stand-in for a real arena master;
it cannot validate the 10-30 MHz SPI operation that matters now. With the
real arena master available it is superseded, so remove the subproject and
point the bench docs at the arena master.

- Remove panel_master/ (README, platformio.ini, src/main.cpp).
- handoff-v1-bench-testing.md: panel_master -> arena master; add a
  retirement note.
- Add handoff-spi-highspeed-bench.md: AD3/Saleae protocol to measure the
  arena master's real SCK clock + CS-high gap and sweep brightness vs SPI
  error rate, to decide signal-integrity vs PL022-shifter as the >10 MHz
  limit. Notes the canonical spec facts (controller ~5 MHz today; 30 MHz
  aspirational/unmeasured; shared CIPO; column-buffer topology).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
custom_spi_read_blocking() broke out of the receive loop the instant
gpio_get(cs_pin) read high. The final data byte's last SCK edge can precede
the CS rising edge by less than the RP2350 RX-path latency (input synchronizer
+ SSP RX pipeline), so that byte was not yet spi_is_readable() when CS was
sampled high -> the loop returned one byte short -> check_length() failed ->
PE03 (visible as occasional flicker / error glyph during streaming).

After detecting CS-high, drain any straggler bytes still arriving in the RX
FIFO before returning, bounded by RX_DRAIN_SETTLE consecutive empty polls
(reset on each read) so a finished or truncated transaction still exits
promptly and can never hang. CS is high during the drain, so no bytes from the
next transaction can enter the FIFO.

Validated on v0.3.1 + arena master via SPI_DIAG: GS2/200Hz rejects 10/24130
-> 0/24101; GS16 0/24117; 0 rejects across 229,726 messages (GS2+GS16).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SPI_DIAG (-DSPI_DIAG=1; new pico_v021_spidiag / pico_v031_spidiag envs):
behaves identically to production during streaming but silently tallies
reception stats in RAM — per-check fail counts, per-command histogram, and a
ring of the last 32 failures with got-vs-expected byte counts. No serial
output during streaming; 'd' dumps everything in one burst, 'z' zeros the
window. This localized the PE03 cause (one-byte-short last-byte drop) without
perturbing timing, and is kept for the >10 MHz SPI work.

Also ignore sub-minimum runts: a transaction shorter than MESSAGE_MINIMUM_SIZE
(a 1-2 byte CS glitch / aborted clock, not a message attempt) no longer raises
a PE03/PE04 glyph that would blank a streaming display. SPI_DIAG still counts
them, so visibility is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per Modular-LED-Display/docs/development: the slim G4.1 controller clocks SPI
at ~5 MHz today (g6_07 line 178) and the G6 SPI clock is an unmeasured bring-up
item (the 30 MHz in g6_01 is aspirational). Put 5 MHz first in the sweep (the
deployed operating point), note the shared-CIPO + SN74HCS08 column-buffer
topology, and target a master-guaranteed >=0.5 ms CS-high gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mbreiser and others added 2 commits May 30, 2026 11:32
…d-bearing

Add an idle timeout to the SPI_DIAG diagnostic build so the panel can answer
'd'/'z' host commands while the arena master is idle between bursts:
custom_spi_read_blocking() returns 0 after 50 ms of CS-high idle, and
Messenger::update() early-returns + services commands on a 0-byte read
(factored into diag_service_commands()). This is what makes the contained-burst
reliability benchmark choreography work (z before the burst, d after).

The production path is untouched: the timeout is guarded by SPI_DIAG and never
fires during streaming (frames arrive every few ms << 50 ms), so streaming
timing is identical to production.

Also document, in panel_spi_read(), that the per-valid-frame DOUBLE SSE toggle
(clear in panel_spi_read + arm in messenger) is NOT redundant: an A/B test that
collapsed it to a single reload regressed 15 MHz from 0% to 2.4% byte-drops. The
extra SSP disable/enable keeps the marginal PL022 slave RX aligned. Do not
"optimize" it away.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- CLAUDE.md: repo orientation (dual-core, PL022 slave, two HW revs), build/flash
  including AUTONOMOUS BOOTSEL flashing via the 1200-baud-touch reset (no need to
  prompt the user to press BOOTSEL), the SPI_DIAG benchmark workflow, and
  do-not-break notes (load-bearing double toggle, RX-drain PE03 fix, 18 MHz
  reliable / 20 MHz cliff).
- panel/bench/SPI-BRINGUP-SUMMARY.md: full SPI bring-up wrap-up -- merged PRs
  #2/#3, flicker/PE03 root-cause (last-byte drop) + fix, the clock/rate/rev
  reliability sweep, the rev-delta=0 finding (the >18 MHz cliff is PL022
  sampling, not board SI/pin-routing), the abandoned turnaround optimization,
  and the gated future PL022+DMA -> PIO+DMA path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant