Add durable task queue and GitHub delivery idempotency #2

@romgenie

Local source: coven-github/issues/01-add-durable-task-queue-and-delivery-idempotency.md

Summary

coven-github should persist accepted webhook work before returning success, and should deduplicate GitHub webhook deliveries by the X-GitHub-Delivery header. The current development path uses an in-process tokio::sync::mpsc channel, which can drop tasks when the queue is full while the route still returns 200 OK to GitHub.

Current Evidence

  • crates/webhook/src/routes.rs validates the webhook, parses the event, maps it to a Task, and then calls state.task_tx.try_send(task).
  • If try_send fails, the route logs "task queue full - dropping task" and still returns StatusCode::OK.
  • crates/github/src/tasks.rs provides an in-memory task store backed by a HashMap.
  • README.md marks durable queue / task store as planned and required for hosted reliability.
  • DESIGN.md, ROADMAP.md, docs/security.md, and docs/hosted-mvp-plan.md all call out durable task state and delivery idempotency as hosted prerequisites.
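
The failure mode described in the bullets above can be reproduced in isolation. The following is a minimal sketch, not the project's route code: it uses the standard library's bounded std::sync::mpsc::sync_channel as a stand-in for the tokio channel, and the Task type, lossy_handler, and simulate_backpressure are hypothetical names introduced only for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the real task type.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub delivery_id: String,
}

/// Mirrors the behavior described above: a full queue is logged,
/// yet the handler still reports success (200) to the caller.
pub fn lossy_handler(tx: &SyncSender<Task>, task: Task) -> u16 {
    if let Err(TrySendError::Full(dropped)) = tx.try_send(task) {
        eprintln!("task queue full - dropping task {}", dropped.delivery_id);
    }
    200
}

/// Sends `n` tasks into a queue of the given capacity while no consumer is
/// running. Returns (responses that claimed success, tasks actually queued).
pub fn simulate_backpressure(capacity: usize, n: usize) -> (usize, usize) {
    let (tx, rx): (SyncSender<Task>, Receiver<Task>) = sync_channel(capacity);
    let claimed_ok = (0..n)
        .filter(|i| lossy_handler(&tx, Task { delivery_id: format!("d-{i}") }) == 200)
        .count();
    drop(tx); // close the channel so draining the receiver terminates
    (claimed_ok, rx.iter().count())
}
```

With a capacity of 2 and 5 incoming tasks, every response claims success while only two tasks are still in the queue. That gap is the silent loss this issue is about.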

Problem

GitHub considers a 200 OK webhook response a successful delivery. If coven-github drops the task after accepting the webhook, GitHub will not retry it, and the user sees a missing agent response with no obvious recovery path. Likewise, without idempotency keyed on the delivery ID, a GitHub redelivery can create duplicate tasks, duplicate comments, duplicate branches, or duplicate PRs.

This is acceptable for local smoke testing, but it is not acceptable for a hosted GitHub App that promises reliable issue, PR, or commit review.

Impact

  • Accepted tasks can disappear under backpressure.
  • Process restarts lose queued/running/done task history.
  • GitHub webhook redelivery can run the same task more than once.
  • CovenCave cannot reliably reconstruct task state after a server restart.
  • Billing, audit, and support cannot answer whether a task was accepted, ignored, failed, retried, or duplicated.

Proposed Design

Introduce a durable task subsystem with these records:

  • webhook_deliveries: delivery id, event type, installation id, repository id/name, action, received time, payload hash, routing result, processing state.
  • tasks: task id, delivery id, installation id, repository id/name, familiar id, task kind, target issue/PR/commit, target head SHA, policy snapshot, state, retry count, timestamps.
  • task_attempts: task id, attempt number, worker id, start/end time, failure category, result pointer.
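
As a rough illustration of the three records above, they could be expressed as plain Rust types. This is a sketch under assumptions: the field types (u64 ids, String hashes, SystemTime timestamps), the Target enum, and the Task::queued_for constructor are all hypothetical, and the issue does not prescribe a storage engine or schema.

```rust
use std::time::SystemTime;

/// States named in the acceptance criteria of this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState { Received, Queued, Running, Failed, Completed, Ignored }

/// Hypothetical encoding of "target issue/PR/commit".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Target { Issue(u64), PullRequest(u64), Commit(String) }

#[derive(Debug, Clone)]
pub struct WebhookDelivery {
    pub delivery_id: String, // value of the X-GitHub-Delivery header
    pub event_type: String,
    pub installation_id: u64,
    pub repository_id: u64,
    pub repository_name: String,
    pub action: Option<String>,
    pub received_at: SystemTime,
    pub payload_hash: String,
    pub routing_result: Option<String>,
    pub processing_state: String, // free-form here; an enum may fit better
}

#[derive(Debug, Clone)]
pub struct Task {
    pub task_id: u64,
    pub delivery_id: String,
    pub installation_id: u64,
    pub repository_id: u64,
    pub repository_name: String,
    pub familiar_id: String,
    pub kind: String,
    pub target: Target,
    pub target_head_sha: Option<String>,
    pub policy_snapshot: String,
    pub state: TaskState,
    pub retry_count: u32,
    pub created_at: SystemTime,
    pub updated_at: SystemTime,
}

#[derive(Debug, Clone)]
pub struct TaskAttempt {
    pub task_id: u64,
    pub attempt_number: u32,
    pub worker_id: String,
    pub started_at: SystemTime,
    pub ended_at: Option<SystemTime>,
    pub failure_category: Option<String>,
    pub result_pointer: Option<String>,
}

impl Task {
    /// Builds a task in the queued state, copying routing keys from the delivery.
    pub fn queued_for(d: &WebhookDelivery, task_id: u64, familiar_id: &str,
                      kind: &str, target: Target) -> Task {
        let now = SystemTime::now();
        Task {
            task_id,
            delivery_id: d.delivery_id.clone(),
            installation_id: d.installation_id,
            repository_id: d.repository_id,
            repository_name: d.repository_name.clone(),
            familiar_id: familiar_id.to_owned(),
            kind: kind.to_owned(),
            target,
            target_head_sha: None,
            policy_snapshot: String::new(),
            state: TaskState::Queued,
            retry_count: 0,
            created_at: now,
            updated_at: now,
        }
    }
}
```

Carrying delivery_id on the task row is what lets a later lookup answer which delivery produced which task, and whether one was already produced.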

The webhook route should:

  1. Validate HMAC.
  2. Parse X-GitHub-Delivery.
  3. Persist the delivery before task dispatch.
  4. Check whether the delivery has already been routed.
  5. Persist the task in queued state.
  6. Enqueue by durable task id.
  7. Return 202 Accepted or 200 OK only after durable state exists.
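
The seven steps above can be sketched as a single function. This is an illustration only: MemoryStore is a hypothetical in-memory stand-in for whatever durable store is chosen, HMAC validation is reduced to a boolean computed elsewhere, and the 401 and 400 status codes for rejected requests are assumptions rather than decisions made in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the durable store and queue.
#[derive(Default)]
pub struct MemoryStore {
    /// delivery id -> task id once routed (None = received, not yet routed)
    pub deliveries: HashMap<String, Option<u64>>,
    /// task id -> state label
    pub tasks: HashMap<u64, &'static str>,
    /// queue of task ids awaiting a worker
    pub queue: Vec<u64>,
    pub next_task_id: u64,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Rejected(u16),  // request refused with this status
    Duplicate(u64), // delivery already routed to this task
    Accepted(u64),  // new task persisted and enqueued
}

pub fn handle_webhook(
    store: &mut MemoryStore,
    signature_valid: bool,      // step 1: result of HMAC validation
    delivery_id: Option<&str>,  // step 2: parsed X-GitHub-Delivery
) -> Outcome {
    if !signature_valid {
        return Outcome::Rejected(401);
    }
    let id = match delivery_id {
        Some(id) => id,
        None => return Outcome::Rejected(400),
    };
    // Step 3: persist the delivery before any dispatch.
    store.deliveries.entry(id.to_owned()).or_insert(None);
    // Step 4: a delivery that already produced a task is a replay.
    if let Some(Some(task_id)) = store.deliveries.get(id).copied() {
        return Outcome::Duplicate(task_id);
    }
    // Step 5: persist the task in the queued state.
    store.next_task_id += 1;
    let task_id = store.next_task_id;
    store.tasks.insert(task_id, "queued");
    store.deliveries.insert(id.to_owned(), Some(task_id));
    // Step 6: enqueue by task id only; the payload stays in the store.
    store.queue.push(task_id);
    // Step 7: success is reported only after the state above exists.
    Outcome::Accepted(task_id)
}
```

Replaying the same delivery id returns Duplicate with the original task id and leaves the queue unchanged. In a real store, steps 3 to 5 would presumably need to share a transaction or rely on a unique constraint on the delivery id.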

The worker should consume task ids from the durable queue, claim them atomically, and update task state after every phase.
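
One way to read "claim them atomically" is a compare-and-set on the task state: a worker may move a task from queued to running only if it is still queued at the moment of the update. The sketch below shows that idea with a mutex-guarded map; TaskTable, State, and race_for_task are hypothetical names, and a database-backed store would more likely use a conditional UPDATE than a process-local lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Clone, PartialEq)]
pub enum State {
    Queued,
    Running { worker: String },
}

/// Hypothetical task table; the mutex makes check-and-update one atomic step.
#[derive(Default)]
pub struct TaskTable {
    rows: Mutex<HashMap<u64, State>>,
}

impl TaskTable {
    pub fn insert_queued(&self, id: u64) {
        self.rows.lock().unwrap().insert(id, State::Queued);
    }

    /// Returns true only for the single caller that wins the claim.
    pub fn claim(&self, id: u64, worker: &str) -> bool {
        let mut rows = self.rows.lock().unwrap();
        if rows.get(&id) == Some(&State::Queued) {
            rows.insert(id, State::Running { worker: worker.to_owned() });
            true
        } else {
            false // unknown task, or already claimed by another worker
        }
    }

    pub fn state(&self, id: u64) -> Option<State> {
        self.rows.lock().unwrap().get(&id).cloned()
    }
}

/// Spawns `workers` threads that all try to claim task 1 and
/// returns how many of them succeeded.
pub fn race_for_task(workers: usize) -> usize {
    let table = Arc::new(TaskTable::default());
    table.insert_queued(1);
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let t = Arc::clone(&table);
            thread::spawn(move || t.claim(1, &format!("worker-{w}")))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

However many workers race, exactly one claim succeeds, which is the property that keeps a redelivered or re-enqueued task id from running twice.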

Acceptance Criteria

  • Webhook handling rejects, or explicitly marks, requests that lack an X-GitHub-Delivery header.
  • Replaying the same GitHub delivery id does not create a duplicate task.
  • A process restart after webhook acceptance does not lose the task.
  • A full worker queue does not silently drop work after returning success to GitHub.
  • Task states include at least received, queued, running, failed, completed, and ignored.
  • Tests cover duplicate delivery, queue-full behavior, restart/reload behavior, and unsupported event routing.
  • README status changes from planned to partial or implemented only after the durable path is actually wired.

Test Notes

Add webhook route tests that inject a delivery id and verify persistence before enqueue. Add integration tests with a fake queue/store that fails enqueue and assert the webhook response does not claim success unless the task can be recovered or retried.
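
A fake store and queue of the kind described above might look like the following sketch. The Store and Queue traits, the fakes, and the accept function are hypothetical and do not reflect existing coven-github APIs. The 503 response for a failed persist, and the choice to still answer 202 when only the enqueue fails (because the persisted row can be re-dispatched later), are assumptions about one acceptable policy.

```rust
use std::collections::BTreeMap;

pub trait Store {
    /// Persists a task in the queued state and returns its id.
    fn persist_queued(&mut self, delivery_id: &str) -> Result<u64, String>;
}

pub trait Queue {
    fn enqueue(&mut self, task_id: u64) -> Result<(), String>;
}

/// Fake store that can be told to fail every write.
#[derive(Default)]
pub struct FakeStore {
    pub fail: bool,
    pub tasks: BTreeMap<u64, String>, // task id -> delivery id
}

impl Store for FakeStore {
    fn persist_queued(&mut self, delivery_id: &str) -> Result<u64, String> {
        if self.fail {
            return Err("store unavailable".into());
        }
        let id = self.tasks.len() as u64 + 1;
        self.tasks.insert(id, delivery_id.to_owned());
        Ok(id)
    }
}

/// Fake queue that can be told to reject every enqueue.
#[derive(Default)]
pub struct FakeQueue {
    pub fail: bool,
    pub sent: Vec<u64>,
}

impl Queue for FakeQueue {
    fn enqueue(&mut self, task_id: u64) -> Result<(), String> {
        if self.fail {
            return Err("queue full".into());
        }
        self.sent.push(task_id);
        Ok(())
    }
}

/// Returns the HTTP status the webhook route would send.
pub fn accept(store: &mut dyn Store, queue: &mut dyn Queue, delivery_id: &str) -> u16 {
    let task_id = match store.persist_queued(delivery_id) {
        Ok(id) => id,
        // Nothing durable exists, so success must not be claimed.
        Err(_) => return 503,
    };
    // An enqueue failure is tolerated here: the task row is already
    // persisted as queued and can be picked up by a later re-dispatch.
    let _ = queue.enqueue(task_id);
    202
}
```

The assertions that matter are that a failing queue still leaves a recoverable row in the store, and that a failing store never yields a success status.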
