Problem
Subagent results are being lost or delayed because completion delivery depends on a synchronous direct announce into the parent lane, which fails or stalls when that lane is busy. Restart/drain windows and local loopback tick timeouts make this worse.
Findings so far
- dominant failure mode is announce-path design under congestion, not just transport instability
- drain/restart windows reject announces outright
- retries amplify congestion
- there is no durable spool/inbox protecting results after retry exhaustion
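To make "retries amplify congestion" concrete, here is a rough back-of-the-envelope sketch. It assumes each announce attempt fails independently with a fixed probability; `p_fail` and `max_retries` are hypothetical parameters, not values taken from the code or logs. Under that assumption, attempt i only happens if all earlier attempts failed, so the expected number of announces per completion is a short geometric sum.

```python
def expected_attempts(p_fail: float, max_retries: int) -> float:
    """Expected announce attempts per completion.

    Assumption (illustrative only): every attempt fails independently
    with probability p_fail, and we stop after the first success or
    after max_retries retries.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    # Attempt i (0-based) is made only if the i attempts before it failed.
    return sum(p_fail ** i for i in range(max_retries + 1))


if __name__ == "__main__":
    # A drain window that rejects every announce (p_fail = 1.0) turns
    # each completion into max_retries + 1 announces.
    print(expected_attempts(1.0, 3))
    # A healthy lane (p_fail = 0.0) costs exactly one announce.
    print(expected_attempts(0.0, 3))
```

In practice the failure probability is not fixed: extra announces add load to the parent lane, which is likely to push the failure rate up further. The fixed-probability model therefore understates the effect.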
Goal
Make completion durable first, delivery second.
Proposed direction
- persist failed completion payloads to a durable queue/inbox
- short-circuit direct announces during drain/restart states
- deliver later, in batches or opportunistically, once the requester lane is available
- instrument effective timeout, queue depth, and event-loop lag
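The direction above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `CompletionInbox`, `LaneState`, `complete`, the `run_id` key, and the SQLite schema are all hypothetical names chosen for the example. It also persists before the first announce attempt, which is a stricter reading of "durable first" than persisting only after a failure. Only queue depth is instrumented here; effective timeout and event-loop lag are left out.

```python
import json
import sqlite3
from enum import Enum
from typing import Callable

Announce = Callable[[str, dict], bool]  # returns True when the parent accepted it


class LaneState(Enum):
    READY = "ready"
    DRAINING = "draining"
    RESTARTING = "restarting"


class CompletionInbox:
    """Durable spool for completion payloads (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS inbox ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " run_id TEXT NOT NULL UNIQUE,"
            " payload TEXT NOT NULL)"
        )
        self._db.commit()

    def persist(self, run_id: str, payload: dict) -> None:
        # INSERT OR IGNORE keeps persistence idempotent per run_id.
        self._db.execute(
            "INSERT OR IGNORE INTO inbox (run_id, payload) VALUES (?, ?)",
            (run_id, json.dumps(payload)),
        )
        self._db.commit()

    def remove(self, run_id: str) -> None:
        self._db.execute("DELETE FROM inbox WHERE run_id = ?", (run_id,))
        self._db.commit()

    def depth(self) -> int:
        """Queue depth, one of the metrics worth exporting."""
        return self._db.execute("SELECT COUNT(*) FROM inbox").fetchone()[0]

    def flush(self, announce: Announce) -> int:
        """Deliver spooled results oldest first; stop at the first failure."""
        rows = self._db.execute(
            "SELECT run_id, payload FROM inbox ORDER BY id"
        ).fetchall()
        delivered = 0
        for run_id, payload in rows:
            if not announce(run_id, json.loads(payload)):
                break  # lane is blocked again; the rest stays spooled
            self.remove(run_id)
            delivered += 1
        return delivered


def complete(
    inbox: CompletionInbox,
    state: LaneState,
    run_id: str,
    payload: dict,
    announce: Announce,
    max_retries: int = 2,
) -> str:
    """Durable first, delivery second."""
    inbox.persist(run_id, payload)
    # Short-circuit: never hit direct announce while draining/restarting.
    if state in (LaneState.DRAINING, LaneState.RESTARTING):
        return "spooled"
    for _ in range(max_retries + 1):
        if announce(run_id, payload):
            inbox.remove(run_id)
            return "delivered"
    # Retries exhausted: the payload is still in the inbox, so nothing is lost.
    return "spooled"
```

A caller would invoke `inbox.flush(announce)` whenever the requester lane reports itself available again. Pass a file path instead of `":memory:"` to survive a process restart. Because `flush` removes a row only after the announce succeeds, a crash between the two steps could deliver the same result twice, so the receiving side would need to tolerate duplicates keyed on `run_id`.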
Acceptance criteria
- no completed subagent result can be silently lost after retry exhaustion
- during restart/draining states, completions are persisted instead of re-hammering direct announce
- blocked parent lanes delay delivery, but do not destroy result visibility
Notes
This should be traced in code, not just from logs. GitHub-native fix path only.