Skip to content

MUL-2489 fix(runtime): delete archived squads before runtime teardown#2955

Open
tmih06 wants to merge 7 commits into
multica-ai:mainfrom
tmih06:agent/kiro-auto-model/77b3229a
Open

MUL-2489 fix(runtime): delete archived squads before runtime teardown#2955
tmih06 wants to merge 7 commits into
multica-ai:mainfrom
tmih06:agent/kiro-auto-model/77b3229a

Conversation

@tmih06
Copy link
Copy Markdown
Contributor

@tmih06 tmih06 commented May 20, 2026

What does this PR do?

Fixes a 500 error when deleting a runtime via DELETE /api/runtimes/{id} after all of its agents have been archived.

DeleteAgentRuntime cleans up archived agents before removing the runtime row (DELETE FROM agent WHERE runtime_id = $1 AND archived_at IS NOT NULL), but that delete was blocked by squad.leader_id REFERENCES agent(id) ON DELETE RESTRICT whenever an archived agent on the runtime was still referenced as a squad leader — including by archived squads, which the user has no easy way to see (multica squad list filters out archived_at IS NOT NULL). The handler caught the FK error and returned 500 {"error":"failed to clean up archived agents"}, leaving the runtime undeletable with no actionable signal.

This PR adds a step that deletes squads whose leader_id points to any archived agent on the target runtime, executed immediately before the existing archived-agent cleanup. Putting it in the handler (rather than asking the user to find and archive squads first) keeps the fix server-side and works for archived squads the user can't see in the UI.

Related Issue

Closes #2954

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Refactor / code improvement (no behavior change)
  • Documentation update
  • Tests (adding or improving test coverage)
  • CI / infrastructure

Changes Made

  • server/pkg/db/queries/runtime.sql — new DeleteSquadsByArchivedAgentsOnRuntime query: DELETE FROM squad WHERE leader_id IN (SELECT id FROM agent WHERE runtime_id = $1 AND archived_at IS NOT NULL)
  • server/pkg/db/generated/runtime.sql.go — regenerated sqlc bindings for the new query
  • server/internal/handler/runtime.go — call DeleteSquadsByArchivedAgentsOnRuntime in DeleteAgentRuntime before DeleteArchivedAgentsByRuntime; surface a distinct "failed to clean up orphaned squads" 500 if the new step itself errors

How to Test

  1. Pick a runtime that has at least one agent which is/was assigned as the leader of a squad (active or archived).
  2. Archive every agent on that runtime so CountActiveAgentsByRuntime returns 0.
  3. Before this PR: DELETE /api/runtimes/{runtime-id}500 {"error":"failed to clean up archived agents"}.
  4. After this PR: same call → 200 OK. Verify with SELECT * FROM agent_runtime WHERE id = $1 (gone), SELECT * FROM agent WHERE runtime_id = $1 (gone), and SELECT * FROM squad WHERE leader_id … (squads pointing at the now-deleted archived agents are gone).
  5. Regression check: deleting a runtime whose archived agents were never squad leaders still works (the new query is a no-op DELETE in that case).

Checklist

  • I have included a thinking path that traces from project context to this change
  • I have run tests locally and they pass
  • I have added or updated tests where applicable
  • If this change affects the UI, I have included before/after screenshots
  • I have updated relevant documentation to reflect my changes
  • If I added a new runtime / coding tool / UI tab, I synced the change to landing copy (apps/web/features/landing/i18n/) and relevant docs (apps/docs/content/docs/)
  • If this PR touches Chinese product copy, I checked it against apps/docs/content/docs/developers/conventions.zh.mdx (terminology, mixed-rule for task / issue / skill)
  • I have considered and documented any risks above
  • I will address all reviewer comments before requesting merge

Risks / notes for reviewers:

  • The new query deletes squads outright rather than nullifying leader_id. That matches the existing semantics — a leader-less squad isn't a meaningful entity here, and the squads being removed are exactly those whose leader is an already-archived agent on a runtime being torn down. If we'd rather preserve the squad row (e.g. for audit), the alternative is UPDATE squad SET leader_id = NULL …, but squad.leader_id is currently NOT NULL so that would also need a schema change.
  • Other tables referencing agent_runtime(id) (agent_task_queue CASCADE, chat_session SET NULL) and other FKs into agent(id) were re-checked; squad.leader_id is the only remaining RESTRICT path that could fail this handler.

AI Disclosure

AI tool used: Multica Agent (Kiro auto model)

Prompt / approach:
Investigated DeleteAgentRuntime in server/internal/handler/runtime.go, walked the FK graph from agent_runtime and agent to find what could still cause DELETE FROM agent … archived_at IS NOT NULL to fail after the handler's existing cleanup, identified squad.leader_id … ON DELETE RESTRICT as the remaining culprit, then added the targeted pre-cleanup query and wired it into the handler.

Screenshots (optional)

N/A — backend-only change.

…e teardown

The DeleteAgentRuntime handler was failing with 500 'failed to clean up
archived agents' because squad.leader_id has an ON DELETE RESTRICT FK on
agent(id). When an archived agent was still referenced as a squad leader
(even on an archived squad), the DELETE FROM agent query was blocked.

Fix: add DeleteSquadsByArchivedAgentsOnRuntime query that removes squads
whose leader_id points to an archived agent on the target runtime, and
call it before DeleteArchivedAgentsByRuntime in the handler.

Closes TMI-85
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

Someone is attempting to deploy a commit to the IndexLabs Team on Vercel.

A member of the Team first needs to authorize it.

Kiro and others added 3 commits May 20, 2026 17:31
Adds four tests around the DeleteSquadsByArchivedAgentsOnRuntime fix:

* TestDeleteSquadsByArchivedAgentsOnRuntime_Query — query-level: deletes
  squads whose leader is an archived agent on the target runtime, leaves
  squads with active leaders or archived leaders on a different runtime
  alone, and is safe to call when nothing matches. Covers the archived-
  squad case that originally hid the FK blocker from `multica squad list`.
* TestDeleteAgentRuntime_RemovesSquadsLedByArchivedAgents — handler
  end-to-end regression for TMI-85. Reverting the handler change makes
  this fail with the exact 500 'failed to clean up archived agents' the
  user reported.
* TestDeleteAgentRuntime_NoSquadsRegression — happy path for runtimes
  whose archived agents were never squad leaders, ensuring the new step
  is a no-op there.
* TestDeleteAgentRuntime_StillBlockedByActiveAgents — preserves the 409
  CountActiveAgentsByRuntime guard so the active-agent contract isn't
  silently regressed by the new cleanup ordering.

Refs TMI-85
@Bohan-J
Copy link
Copy Markdown
Collaborator

Bohan-J commented May 21, 2026

Reviewed alongside #2954.

FK analysis is right. I walked every FK into agent(id) and agent_runtime(id): squad.leader_id is the only remaining ON DELETE RESTRICT into agent(id). Everything else into agent(id) is CASCADE (agent_skill_run, agent_task_queue, daemon_connection, chat_session), and autopilot.assignee_id had its FK dropped in migration 096. So the targeted query in front of DeleteArchivedAgentsByRuntime is the right shape.

Placement is correct — between PauseAutopilotsByAgentAssignees and DeleteArchivedAgentsByRuntime. Distinct error code is good. Test coverage (cross-runtime isolation, idempotence, no-op regression, still-blocks-active-agents) is solid.

One concern worth addressing before merge — hard-delete bypasses the DeleteSquad transfer logic.

The normal squad-removal path (DeleteSquad in squad.go) is a soft-archive that runs TransferSquadAssignees and TransferSquadAutopilotsToLeader first, rewriting issue.assignee_id / autopilot.assignee_id from the squad to its leader. Those columns are polymorphic and have no FK, so a hard DELETE FROM squad does not error — but it does leave dangling assignee_type='squad', assignee_id=<deleted_squad_id> rows.

  • For archived squads (the case the user actually hit): the transfers were typically run at archive time, so this is fine.
  • For active squads whose leader has been archived (your query catches these too): the dangling refs are real. UI will resolve to "unknown squad," autopilot dispatch will skip with "assignee squad cannot be resolved."

Two options:

  1. Narrow the query to the case the user actually reported:

    WHERE leader_id IN (SELECT id FROM agent WHERE runtime_id = $1 AND archived_at IS NOT NULL)
      AND archived_at IS NOT NULL

    An active squad with an archived leader is a weird state that probably wants a separate, explicit handling path, not a silent cleanup during runtime delete.

  2. Or keep the broad scope but run the same TransferSquadAssignees / TransferSquadAutopilotsToLeader for each squad you delete, so polymorphic refs don't dangle.

I'd lean toward (1) — smaller, more aligned with what the bug actually says.

Nits (non-blocking):

  • The four delete steps (PauseAutopilots → DeleteSquads → DeleteArchivedAgents → DeleteRuntime) run outside a transaction. If step 3 fails after step 2 succeeds, the squads are gone but the runtime stays. Pre-existing, not introduced here, but this PR extends the chain — worth wrapping in a pgx.Tx in a follow-up.
  • failed to clean up orphaned squads reads slightly off — the squads aren't orphans, their leaders are archived. Something like failed to clean up squads referencing archived agents is more literal.

Overall: approach is correct, FK analysis is correct, tests are good. Please consider (1) above before merging.

@multica-eve multica-eve changed the title fix(runtime): delete squads referencing archived agents before runtim… MUL-2489 fix(runtime): delete archived squads before runtime teardown May 21, 2026
Copy link
Copy Markdown
Collaborator

@Bohan-J Bohan-J left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iteration on this — narrowing the query to AND archived_at IS NOT NULL does make the cleanup safer than the first version. However, after a second pass (combining feedback from another reviewer), I'm going to request changes: as it stands, the PR doesn't close the full "all agents archived → runtime delete returns 500" path that #2954 reports.

Must-fix: active squad + archived leader is still a reachable 500

This is the blocker. The state is reachable via the public API:

  • ArchiveAgent (server/internal/handler/agent.go:871-889) does not check whether the agent is currently a squad leader. So a user can archive an agent that still leads an active squad — leaving an active squad whose leader_id points at an archived agent.
  • DeleteAgentRuntime only gates on active agents (CountActiveAgentsByRuntime == 0). It does not check active squads.
  • The new DeleteSquadsByArchivedAgentsOnRuntime now intentionally skips active squads (AND archived_at IS NOT NULL) — and the test at runtime_archive_squad_cleanup_test.go:170-184 explicitly asserts active squads are kept.

So when DeleteArchivedAgentsByRuntime runs, any active squad whose leader_id is one of those archived agents still triggers ON DELETE RESTRICT, and the handler returns the same failed to clean up archived agents 500 the issue is about. Repro: archive all agents on a runtime where at least one of them led a still-active squad, then DELETE /api/runtimes/{id}.

Two ways out — please pick one:

  1. (Preferred) Before DeleteArchivedAgentsByRuntime, detect active squads whose leader_id is an archived agent on this runtime, and return 409 Conflict with an actionable message (e.g. asking the caller to archive or rotate the leader of those squads first). This treats active squads as user-meaningful entities and refuses to silently hard-delete them.
  2. If product wants runtime delete to also tear down those active squads, then the squad delete can't be a bare row delete — issue.assignee_id and autopilot.assignee_id are polymorphic (assignee_type='squad') with no FK, so a hard delete leaves dangling references. You'd need to define and implement the migration: either rewrite those references (analogous to TransferSquadAssignees / TransferSquadAutopilotsToLeader in the normal DeleteSquad path) or null them out. Without that, the dangling-reference bug just moves from FK-visible to FK-invisible.

I'd lean (1) — smaller, matches user intent, and doesn't require deciding the polymorphic-cleanup story under this PR.

Test gap

The query-level tests are good. What's missing is a handler-level e2e for the case above: DeleteAgentRuntime against a runtime that has an active squad led by an archived agent. Today no test pins the final user-visible behavior for that path — please add one that asserts whichever resolution you take (clean 409 vs. successful tear-down with references migrated).

Nits (non-blocking)

  • The error string failed to clean up orphaned squads is good in that it separates the failure mode from the existing one, but "orphaned" is slightly off — the squad isn't orphaned, its leader is archived. Something like failed to clean up squads referencing archived agents reads more directly. Pure nit, take it or leave it.
  • Transactionality: PauseAutopilots → DeleteSquads → DeleteArchivedAgents → DeleteRuntime are four independent Execs; a failure mid-chain leaves partial state. This isn't introduced by this PR (the chain already existed) and I don't think it should block merge, but worth a follow-up to wrap the last three in a pgx.Tx.

Recap

  • Direction is right and the test coverage on the query itself is solid.
  • Blocker is the still-reachable 500 via active-squad-with-archived-leader.
  • Once that path returns either a clean 409 or a fully-cleaned tear-down (with a handler test pinning the behavior), this should be good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: DELETE /api/runtimes/{id} returns 500 "failed to clean up archived agents" due to squad.leader_id FK constraint

2 participants