MUL-2489 fix(runtime): delete archived squads before runtime teardown#2955
MUL-2489 fix(runtime): delete archived squads before runtime teardown#2955tmih06 wants to merge 7 commits into
Conversation
…e teardown The DeleteAgentRuntime handler was failing with 500 'failed to clean up archived agents' because squad.leader_id has an ON DELETE RESTRICT FK on agent(id). When an archived agent was still referenced as a squad leader (even on an archived squad), the DELETE FROM agent query was blocked. Fix: add DeleteSquadsByArchivedAgentsOnRuntime query that removes squads whose leader_id points to an archived agent on the target runtime, and call it before DeleteArchivedAgentsByRuntime in the handler. Closes TMI-85
|
Someone is attempting to deploy a commit to the IndexLabs Team on Vercel. A member of the Team first needs to authorize it. |
Adds four tests around the DeleteSquadsByArchivedAgentsOnRuntime fix: * TestDeleteSquadsByArchivedAgentsOnRuntime_Query — query-level: deletes squads whose leader is an archived agent on the target runtime, leaves squads with active leaders or archived leaders on a different runtime alone, and is safe to call when nothing matches. Covers the archived- squad case that originally hid the FK blocker from `multica squad list`. * TestDeleteAgentRuntime_RemovesSquadsLedByArchivedAgents — handler end-to-end regression for TMI-85. Reverting the handler change makes this fail with the exact 500 'failed to clean up archived agents' the user reported. * TestDeleteAgentRuntime_NoSquadsRegression — happy path for runtimes whose archived agents were never squad leaders, ensuring the new step is a no-op there. * TestDeleteAgentRuntime_StillBlockedByActiveAgents — preserves the 409 CountActiveAgentsByRuntime guard so the active-agent contract isn't silently regressed by the new cleanup ordering. Refs TMI-85
|
Reviewed alongside #2954. FK analysis is right. I walked every FK into Placement is correct — between One concern worth addressing before merge — hard-delete bypasses the The normal squad-removal path (
Two options:
I'd lean toward (1) — smaller, more aligned with what the bug actually says. Nits (non-blocking):
Overall: approach is correct, FK analysis is correct, tests are good. Please consider (1) above before merging. |
Bohan-J
left a comment
There was a problem hiding this comment.
Thanks for the iteration on this — narrowing the query to AND archived_at IS NOT NULL does make the cleanup safer than the first version. However, after a second pass (combining feedback from another reviewer), I'm going to request changes: as it stands, the PR doesn't close the full "all agents archived → runtime delete returns 500" path that #2954 reports.
Must-fix: active squad + archived leader is still a reachable 500
This is the blocker. The state is reachable via the public API:
ArchiveAgent(server/internal/handler/agent.go:871-889) does not check whether the agent is currently a squad leader. So a user can archive an agent that still leads an active squad — leaving an active squad whoseleader_idpoints at an archived agent.DeleteAgentRuntimeonly gates on active agents (CountActiveAgentsByRuntime == 0). It does not check active squads.- The new
DeleteSquadsByArchivedAgentsOnRuntimenow intentionally skips active squads (AND archived_at IS NOT NULL) — and the test atruntime_archive_squad_cleanup_test.go:170-184explicitly asserts active squads are kept.
So when DeleteArchivedAgentsByRuntime runs, any active squad whose leader_id is one of those archived agents still triggers ON DELETE RESTRICT, and the handler returns the same failed to clean up archived agents 500 the issue is about. Repro: archive all agents on a runtime where at least one of them led a still-active squad, then DELETE /api/runtimes/{id}.
Two ways out — please pick one:
- (Preferred) Before
DeleteArchivedAgentsByRuntime, detect active squads whoseleader_idis an archived agent on this runtime, and return409 Conflictwith an actionable message (e.g. asking the caller to archive or rotate the leader of those squads first). This treats active squads as user-meaningful entities and refuses to silently hard-delete them. - If product wants runtime delete to also tear down those active squads, then the squad delete can't be a bare row delete —
issue.assignee_idandautopilot.assignee_idare polymorphic (assignee_type='squad') with no FK, so a hard delete leaves dangling references. You'd need to define and implement the migration: either rewrite those references (analogous toTransferSquadAssignees/TransferSquadAutopilotsToLeaderin the normalDeleteSquadpath) or null them out. Without that, the dangling-reference bug just moves from FK-visible to FK-invisible.
I'd lean (1) — smaller, matches user intent, and doesn't require deciding the polymorphic-cleanup story under this PR.
Test gap
The query-level tests are good. What's missing is a handler-level e2e for the case above: DeleteAgentRuntime against a runtime that has an active squad led by an archived agent. Today no test pins the final user-visible behavior for that path — please add one that asserts whichever resolution you take (clean 409 vs. successful tear-down with references migrated).
Nits (non-blocking)
- The error string
failed to clean up orphaned squadsis good in that it separates the failure mode from the existing one, but "orphaned" is slightly off — the squad isn't orphaned, its leader is archived. Something likefailed to clean up squads referencing archived agentsreads more directly. Pure nit, take it or leave it. - Transactionality:
PauseAutopilots → DeleteSquads → DeleteArchivedAgents → DeleteRuntimeare four independentExecs; a failure mid-chain leaves partial state. This isn't introduced by this PR (the chain already existed) and I don't think it should block merge, but worth a follow-up to wrap the last three in apgx.Tx.
Recap
- Direction is right and the test coverage on the query itself is solid.
- Blocker is the still-reachable 500 via active-squad-with-archived-leader.
- Once that path returns either a clean 409 or a fully-cleaned tear-down (with a handler test pinning the behavior), this should be good to go.
What does this PR do?
Fixes a 500 error when deleting a runtime via
DELETE /api/runtimes/{id}after all of its agents have been archived.DeleteAgentRuntimecleans up archived agents before removing the runtime row (DELETE FROM agent WHERE runtime_id = $1 AND archived_at IS NOT NULL), but that delete was blocked bysquad.leader_id REFERENCES agent(id) ON DELETE RESTRICTwhenever an archived agent on the runtime was still referenced as a squad leader — including by archived squads, which the user has no easy way to see (multica squad listfilters outarchived_at IS NOT NULL). The handler caught the FK error and returned500 {"error":"failed to clean up archived agents"}, leaving the runtime undeletable with no actionable signal.This PR adds a step that deletes squads whose
leader_idpoints to any archived agent on the target runtime, executed immediately before the existing archived-agent cleanup. Putting it in the handler (rather than asking the user to find and archive squads first) keeps the fix server-side and works for archived squads the user can't see in the UI.Related Issue
Closes #2954
Type of Change
Changes Made
server/pkg/db/queries/runtime.sql— newDeleteSquadsByArchivedAgentsOnRuntimequery:DELETE FROM squad WHERE leader_id IN (SELECT id FROM agent WHERE runtime_id = $1 AND archived_at IS NOT NULL)server/pkg/db/generated/runtime.sql.go— regenerated sqlc bindings for the new queryserver/internal/handler/runtime.go— callDeleteSquadsByArchivedAgentsOnRuntimeinDeleteAgentRuntimebeforeDeleteArchivedAgentsByRuntime; surface a distinct"failed to clean up orphaned squads"500 if the new step itself errorsHow to Test
CountActiveAgentsByRuntimereturns 0.DELETE /api/runtimes/{runtime-id}→500 {"error":"failed to clean up archived agents"}.200 OK. Verify withSELECT * FROM agent_runtime WHERE id = $1(gone),SELECT * FROM agent WHERE runtime_id = $1(gone), andSELECT * FROM squad WHERE leader_id… (squads pointing at the now-deleted archived agents are gone).Checklist
apps/web/features/landing/i18n/) and relevant docs (apps/docs/content/docs/)apps/docs/content/docs/developers/conventions.zh.mdx(terminology, mixed-rule fortask/issue/skill)Risks / notes for reviewers:
leader_id. That matches the existing semantics — a leader-less squad isn't a meaningful entity here, and the squads being removed are exactly those whose leader is an already-archived agent on a runtime being torn down. If we'd rather preserve the squad row (e.g. for audit), the alternative isUPDATE squad SET leader_id = NULL …, butsquad.leader_idis currentlyNOT NULLso that would also need a schema change.agent_runtime(id)(agent_task_queueCASCADE,chat_sessionSET NULL) and other FKs intoagent(id)were re-checked;squad.leader_idis the only remainingRESTRICTpath that could fail this handler.AI Disclosure
AI tool used: Multica Agent (Kiro auto model)
Prompt / approach:
Investigated
DeleteAgentRuntimeinserver/internal/handler/runtime.go, walked the FK graph fromagent_runtimeandagentto find what could still causeDELETE FROM agent … archived_at IS NOT NULLto fail after the handler's existing cleanup, identifiedsquad.leader_id … ON DELETE RESTRICTas the remaining culprit, then added the targeted pre-cleanup query and wired it into the handler.Screenshots (optional)
N/A — backend-only change.