SPI reliability: fix PE03 frame drops (RX-drain) + SPI_DIAG build; retire panel_master#5
Open
mbreiser wants to merge 7 commits into
Open
SPI reliability: fix PE03 frame drops (RX-drain) + SPI_DIAG build; retire panel_master#5mbreiser wants to merge 7 commits into
mbreiser wants to merge 7 commits into
Conversation
…> PE03) The per-1000-message heartbeat in Messenger::update() did 11 blocking `Serial <<` writes (~280 B). This runs on core 0 between SPI transactions; if the USB host isn't draining, a full TX buffer could stall core 0 for milliseconds, so the panel wasn't back in custom_spi_read_blocking() when the master started the next transaction -> missed leading bytes -> num_rx undercount -> PE03 length error. Build one compact line and emit it only when Serial.availableForWrite() has room for the whole line, so write() can never block. The heartbeat is diagnostic and safely droppable. Builds clean for pico_v021 and pico_v031. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
panel_master was a ~250 kHz GPIO-bitbang stand-in for a real arena master; it cannot validate the 10-30 MHz SPI operation that matters now. With the real arena master available it is superseded, so remove the subproject and point the bench docs at the arena master. - Remove panel_master/ (README, platformio.ini, src/main.cpp). - handoff-v1-bench-testing.md: panel_master -> arena master; add a retirement note. - Add handoff-spi-highspeed-bench.md: AD3/Saleae protocol to measure the arena master's real SCK clock + CS-high gap and sweep brightness vs SPI error rate, to decide signal-integrity vs PL022-shifter as the >10 MHz limit. Notes the canonical spec facts (controller ~5 MHz today; 30 MHz aspirational/unmeasured; shared CIPO; column-buffer topology). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
custom_spi_read_blocking() broke out of the receive loop the instant gpio_get(cs_pin) read high. The final data byte's last SCK edge can precede the CS rising edge by less than the RP2350 RX-path latency (input synchronizer + SSP RX pipeline), so that byte was not yet spi_is_readable() when CS was sampled high -> the loop returned one byte short -> check_length() failed -> PE03 (visible as occasional flicker / error glyph during streaming). After detecting CS-high, drain any straggler bytes still arriving in the RX FIFO before returning, bounded by RX_DRAIN_SETTLE consecutive empty polls (reset on each read) so a finished or truncated transaction still exits promptly and can never hang. CS is high during the drain, so no bytes from the next transaction can enter the FIFO. Validated on v0.3.1 + arena master via SPI_DIAG: GS2/200Hz rejects 10/24130 -> 0/24101; GS16 0/24117; 0 rejects across 229,726 messages (GS2+GS16). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SPI_DIAG (-DSPI_DIAG=1; new pico_v021_spidiag / pico_v031_spidiag envs): behaves identically to production during streaming but silently tallies reception stats in RAM — per-check fail counts, per-command histogram, and a ring of the last 32 failures with got-vs-expected byte counts. No serial output during streaming; 'd' dumps everything in one burst, 'z' zeros the window. This localized the PE03 cause (one-byte-short last-byte drop) without perturbing timing, and is kept for the >10 MHz SPI work. Also ignore sub-minimum runts: a transaction shorter than MESSAGE_MINIMUM_SIZE (a 1-2 byte CS glitch / aborted clock, not a message attempt) no longer raises a PE03/PE04 glyph that would blank a streaming display. SPI_DIAG still counts them, so visibility is retained. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per Modular-LED-Display/docs/development: the slim G4.1 controller clocks SPI at ~5 MHz today (g6_07 line 178) and the G6 SPI clock is an unmeasured bring-up item (the 30 MHz in g6_01 is aspirational). Put 5 MHz first in the sweep (the deployed operating point), note the shared-CIPO + SN74HCS08 column-buffer topology, and target a master-guaranteed >=0.5 ms CS-high gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-bearing Add an idle timeout to the SPI_DIAG diagnostic build so the panel can answer 'd'/'z' host commands while the arena master is idle between bursts: custom_spi_read_blocking() returns 0 after 50 ms of CS-high idle, and Messenger::update() early-returns + services commands on a 0-byte read (factored into diag_service_commands()). This is what makes the contained-burst reliability benchmark choreography work (z before the burst, d after). The production path is untouched: the timeout is guarded by SPI_DIAG and never fires during streaming (frames arrive every few ms << 50 ms), so streaming timing is identical to production. Also document, in panel_spi_read(), that the per-valid-frame DOUBLE SSE toggle (clear in panel_spi_read + arm in messenger) is NOT redundant: an A/B test that collapsed it to a single reload regressed 15 MHz from 0% to 2.4% byte-drops. The extra SSP disable/enable keeps the marginal PL022 slave RX aligned. Do not "optimize" it away. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- CLAUDE.md: repo orientation (dual-core, PL022 slave, two HW revs), build/flash including AUTONOMOUS BOOTSEL flashing via the 1200-baud-touch reset (no need to prompt the user to press BOOTSEL), the SPI_DIAG benchmark workflow, and do-not-break notes (load-bearing double toggle, RX-drain PE03 fix, 18 MHz reliable / 20 MHz cliff). - panel/bench/SPI-BRINGUP-SUMMARY.md: full SPI bring-up wrap-up -- merged PRs #2/#3, flicker/PE03 root-cause (last-byte drop) + fix, the clock/rate/rev reliability sweep, the rev-delta=0 finding (the >18 MHz cliff is PL022 sampling, not board SI/pin-routing), the abandoned turnaround optimization, and the gated future PL022+DMA -> PIO+DMA path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the flickery-display / occasional PE03 seen during arena-master streaming bring-up, adds a silent on-demand SPI diagnostic build that found it, and retires the
panel_masterbench harness now that the real arena master is available. All at the arena's actual ~5 MHz operating point — no SPI rewrite; the PIO+DMA work remains future-proofing for >10 MHz.The fix (headline):
PE03rejected frames — last-byte dropcustom_spi_read_blocking()broke out of the receive loop the instantgpio_get(cs_pin)read high. The final data byte's last SCK edge can precede the CS rising edge by less than the RP2350 RX-path latency (input synchronizer + SSP RX pipeline), so that byte wasn'tspi_is_readable()yet when CS was sampled high → the read returned one byte short →check_length()failed →PE03.Fix: after CS-high, drain any straggler bytes still arriving in the RX FIFO before returning, bounded by
RX_DRAIN_SETTLEconsecutive empty polls (reset on each read) so a finished/truncated transaction still exits promptly and can never hang. CS is high during the drain, so no next-transaction bytes can enter the FIFO.Diagnostic key finding: parity never failed — only length, almost all
got = expected − 1. So it was framing, not signal integrity.What's in this PR (5 commits)
panel: make DEVEL serial heartbeat non-blocking— the per-1000-message heartbeat could stall core 0 between transactions; now it builds one line and writes only when the USB-CDC FIFO has room.Retire panel_master harness; add high-speed SPI bench protocol—panel_master(≈250 kHz bitbang stand-in) removed; bench docs point at the real arena master; newpanel/bench/handoff-spi-highspeed-bench.md.panel: fix PE03 rejected frames — drain RX FIFO after CS-high— the fix above.panel: add SPI_DIAG diagnostic build + ignore sub-minimum runts— newpico_v021_spidiag/pico_v031_spidiagenvs (-DSPI_DIAG=1): behaves like production but silently tallies per-check fail counts, a per-command histogram, and a ring of the last 32 failures with got-vs-expected byte counts;ddumps in one burst,zzeros. Also: transactions belowMESSAGE_MINIMUM_SIZE(1–2 byte CS glitch / aborted clock) no longer raise a glyph that would blank a streaming display (still counted by SPI_DIAG).bench: fold canonical-spec facts into high-speed SPI protocol doc— controller clocks ~5 MHz today (g6_07); 30 MHz is aspirational/unmeasured; shared-CIPO + column-buffer topology; ≥0.5 ms CS-gap target.Validation (v0.3.1 panel + arena master, via SPI_DIAG)
got=exp−1, 5×got=1)Production builds (
pico_v021,pico_v031) and the*_spidiagbuilds all compile; the fix lives inpanel_spi_custom.cpp, so production carries it too.Notes / follow-ups
got=1runts are a real ~1-byte CS glitch or master short-transaction (the drain actually recovered them). Source unconfirmed — a Saleae peek or a note to Frank would close it; low priority since they no longer affect the display.panel/bench/handoff-spi-highspeed-bench.md).🤖 Generated with Claude Code