Multi-agent support from PR#163 + Support for Kimi2.5 and Minimax2.5 via Claude Code + Clean up#167
dpbmaverick98 wants to merge 50 commits into TinyAGI:main
What:
- Add three new tables: conversations, conversation_responses, conversation_pending_agents
- Add 11 new functions for conversation state management
- Follow existing SQLite patterns (WAL mode, transactions, indexes)
Why:
Previously, all conversation state was stored only in memory (Map<string, Conversation>). This meant that if the queue-processor crashed or was restarted during a team conversation, all active conversation state was lost. Agents would continue processing their messages, but the conversation would never complete because the pending counter and response aggregation were gone.
This change persists conversation state to SQLite, enabling:
1. Restart recovery - conversations can be resumed after a crash
2. State inspection - active conversations can be queried via API
3. Debuggability - conversation history is preserved
Assumptions:
- Conversations are short-lived (minutes, not days), so we don't need to persist the full Conversation object (Sets, Maps). We persist the minimal state needed to reconstruct: counters, IDs, and responses.
- Files referenced in conversations are not persisted (they're ephemeral).
- The existing in-memory conversations Map is still used for fast access during normal operation; the DB is the source of truth for recovery.
Pattern compliance:
- Uses transaction().immediate() for atomic operations (like claimNextMessage)
- Uses INSERT OR REPLACE for upserts
- Uses ON DELETE CASCADE for cleanup
- Follows existing naming conventions and timestamp formats
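A rough DDL sketch of what the three tables described above could look like. Only the table names, the status values, and the CASCADE/timestamp conventions come from the commits; every column name beyond those is an assumption.

```sql
-- Hypothetical schema sketch; actual columns in db.ts may differ.
CREATE TABLE IF NOT EXISTS conversations (
  id             TEXT PRIMARY KEY,
  team_id        TEXT,
  status         TEXT NOT NULL DEFAULT 'active',  -- 'active' | 'completed'
  pending        INTEGER NOT NULL DEFAULT 0,      -- agents still expected to respond
  total_messages INTEGER NOT NULL DEFAULT 0,
  updated_at     INTEGER NOT NULL                 -- ms epoch; used by pruning/recovery
);

CREATE TABLE IF NOT EXISTS conversation_responses (
  conversation_id TEXT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
  agent_id        TEXT NOT NULL,
  response        TEXT NOT NULL,
  created_at      INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS conversation_pending_agents (
  conversation_id TEXT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
  agent_id        TEXT NOT NULL,
  PRIMARY KEY (conversation_id, agent_id)
);
```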
…very
What:
- Remove agentProcessingChains Map that enforced sequential processing per agent
- Refactor processMessage to use fire-and-forget pattern for invokeAgent
- Add handleSimpleResponse and handleTeamResponse async handlers
- Add handleTeamError for error recovery in team contexts
- Add startup recovery logic to load active conversations from DB
- Add conversation pruning maintenance interval
Why:
Previously, the queue-processor used a Promise chain per agent (agentProcessingChains) to ensure messages were processed sequentially. This caused the "freeze" problem: if agent A was processing a long request (e.g., a 30s Claude API call), no other messages to agent A could be processed until it completed.
This change makes invokeAgent fire-and-forget:
1. processMessage starts invokeAgent and returns immediately
2. The response is handled asynchronously by handleSimpleResponse/handleTeamResponse
3. Multiple messages to the same agent can be in-flight simultaneously
4. The queue processor never blocks on slow API calls
Additionally, conversation state is now persisted to SQLite (from the previous commit) and recovered on startup. If the queue-processor restarts during a team conversation, it resumes where it left off instead of losing all state.
Assumptions:
- invokeAgent is idempotent enough that reprocessing after a crash is safe
- The DB transaction in decrementPendingInDb prevents race conditions
- The in-memory conversations Map is still used for fast access; the DB is for recovery
- Fire-and-forget is acceptable because we have retry logic via the dead letter queue
Breaking changes:
- Removed the per-agent sequential processing guarantee. Previously messages to the same agent were guaranteed to process sequentially; now they process concurrently. This is the desired behavior (no freezing), but it means agents must handle concurrent requests if they share state.
Pattern compliance:
- Uses async/await for response handlers (cleaner than callbacks)
- Uses DB functions from the previous commit for persistence
- Maintains existing event emission for observability
- Preserves all existing error handling and logging
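A minimal sketch of the fire-and-forget dispatch described above. The names here (Message shape, invokeAgentStub, the completed array) are illustrative stand-ins, not the actual queue-processor API; the point is only that processMessage returns before the agent call settles.

```typescript
// Illustrative sketch: dispatch a message without blocking the queue loop.
type Message = { id: string; text: string };

async function invokeAgentStub(msg: Message): Promise<string> {
  // Stand-in for the real (potentially slow) agent invocation.
  return `handled:${msg.id}`;
}

const completed: string[] = [];

function processMessage(msg: Message): void {
  // Fire-and-forget: start the agent call and return immediately.
  invokeAgentStub(msg)
    .then((response) => {
      // In the real code this routes to handleSimpleResponse/handleTeamResponse.
      completed.push(response);
    })
    .catch(() => {
      // In the real code errors go to handleTeamError / the dead letter queue.
      completed.push(`error:${msg.id}`);
    });
}
```

Because nothing awaits the chain, two messages to the same agent can be in flight at once, which is exactly the sequential-processing guarantee the Breaking changes section says is removed.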
What:
- Add src/lib/signals.ts with file-based signaling system
- Modify enqueueResponse to signal channel when response is ready
- Update Discord, Telegram, and WhatsApp clients to use push notifications
- Add 10-second fallback polling for reliability
Why:
Previously, channel clients polled /api/responses/pending every 1-2 seconds.
This caused unnecessary latency (average 0.5-1s delay) and wasted CPU/IO on
both the client and server.
This change implements push notifications via file system:
1. When enqueueResponse is called, it writes a signal file (.tinyclaw/signals/{channel}.ready)
2. Channel clients use fs.watch() to get notified immediately
3. Response latency drops from ~1s to near-zero
4. Fallback polling every 10s catches any missed signals
Assumptions:
- File system watch (fs.watch) is reliable enough for this use case
- Signal files are cleaned up after processing to prevent duplicate triggers
- 10-second fallback is acceptable for missed signals (rare)
- All three channel clients (Discord, Telegram, WhatsApp) are on the same machine
Trade-offs:
- File-based signaling only works for local processes (same machine)
- If we need distributed deployment later, this would need to be replaced
with something like Redis pub/sub or NATS
- File system watches can be unreliable on some platforms (we have fallback)
Pattern compliance:
- Uses existing TINYCLAW_HOME for signal directory
- Follows existing error handling patterns
- Maintains backward compatibility (polling still works)
- Clean shutdown with unwatch() on SIGINT/SIGTERM
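A sketch of the signal-file write/clear pair under the assumptions above. Only clearSignal and the `.tinyclaw/signals/{channel}.ready` path come from the commits; writeSignal and signalPath are assumed helper names, and the TINYCLAW_HOME fallback to the OS temp dir is for illustration only.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';
import * as os from 'node:os';

// Assumed layout: {TINYCLAW_HOME}/signals/{channel}.ready
const SIGNAL_DIR = path.join(process.env.TINYCLAW_HOME ?? os.tmpdir(), 'signals');

function signalPath(channel: string): string {
  return path.join(SIGNAL_DIR, `${channel}.ready`);
}

function writeSignal(channel: string): void {
  // Called from enqueueResponse; fs.watch() on SIGNAL_DIR wakes the channel client.
  fs.mkdirSync(SIGNAL_DIR, { recursive: true });
  fs.writeFileSync(signalPath(channel), String(Date.now()));
}

function clearSignal(channel: string): void {
  // Try-delete with selective error handling (see the later clearSignal commit).
  try {
    fs.unlinkSync(signalPath(channel));
  } catch (err: any) {
    if (err?.code !== 'ENOENT') throw err; // ENOENT is benign: already cleared
  }
}
```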
…ing guarantee
What:
- Make emitEvent() async to allow awaiting event listener completion
- Update EventListener type to support async listeners
- Add await to all emitEvent() calls in queue-processor.ts:
  - response_ready (handleSimpleResponse)
  - chain_handoff (handleTeamResponse)
  - team_chain_start (processMessage)
- Make completeConversation() async and await team_chain_end emission
- Wrap conversation recovery in async recoverConversations() function
- Move startup logging into async IIFE to properly await emitEvent
Why:
The visualizer relies on event ordering: chain_step_start → chain_step_done → response_ready. Without await, events could be emitted in order but processed out of order due to async listener scheduling. This was a critical issue found in the NATS implementation (missing awaits on publishEvent calls). The same pattern exists here - emitEvent was fire-and-forget, so the visualizer could receive events out of sequence under high concurrency.
By awaiting emitEvent, we guarantee:
1. Events are processed by listeners before continuing
2. The visualizer sees events in the correct order
3. SSE clients receive events sequentially
Assumptions:
- Event listeners are fast enough that awaiting them won't block processing
- The slight overhead of await is acceptable for ordering guarantees
- Listeners that need to be fire-and-forget should internally queue work
Breaking changes:
- emitEvent() now returns Promise<void> instead of void
- completeConversation() now returns Promise<void>
- Code using these functions must now await them
Pattern compliance:
- Matches the fix applied in the NATS branch (adding awaits to publishEvent)
- Uses async/await consistently throughout the codebase
- Maintains error handling (try/catch around await)
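The ordering guarantee can be sketched in a few lines. This is a minimal illustration of an awaitable emitter, assuming the EventListener shape described above; it is not the actual module's API.

```typescript
// Listeners may be sync or async; emitEvent awaits each in turn.
type EventListener = (event: string, data: unknown) => void | Promise<void>;

const listeners: EventListener[] = [];
const seen: string[] = [];

async function emitEvent(event: string, data: unknown): Promise<void> {
  // Awaiting every listener means callers that await emitEvent() can rely
  // on all listener work finishing before the next event is emitted.
  for (const listener of listeners) {
    await listener(event, data);
  }
}

listeners.push(async (event) => {
  await new Promise((r) => setTimeout(r, 5)); // simulate a slow async listener
  seen.push(event);
});
```

With a fire-and-forget emitEvent, the two setTimeout-delayed listeners could complete in either order; awaiting serializes them.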
What:
- Add next_retry_at column to messages table for scheduling retries
- Update failMessage() to calculate exponential backoff with jitter
- Update claimNextMessage() to respect the next_retry_at timestamp
- Add migration for existing databases (ALTER TABLE)
Why:
Previously, failed messages were immediately retried (status reset to 'pending'). Under high load or during outages, this caused a "thundering herd" problem: all failed messages would retry simultaneously, overwhelming the system.
This change implements exponential backoff with jitter:
- Retry 1: ~100ms delay
- Retry 2: ~200ms delay
- Retry 3: ~400ms delay
- Retry 4: ~800ms delay
- Retry 5: ~1600ms delay (capped at 30s)
Plus 0-100ms random jitter to spread out retries and prevent synchronized retry storms.
Assumptions:
- Messages that fail temporarily (rate limits, network blips) will succeed after a short delay
- Spreading retries over time is better than immediate retry
- 5 retries with exponential backoff is sufficient for transient failures
Implementation details:
- The ORDER BY clause prioritizes messages without next_retry_at (new messages)
- Then orders by next_retry_at to process the earliest scheduled first
- Messages with a future next_retry_at are skipped until their time comes
Pattern compliance:
- Uses the same transaction pattern as claimNextMessage for atomicity
- Maintains backward compatibility (next_retry_at is nullable)
- Follows existing logging conventions
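The schedule above reduces to one small function. A sketch under the stated numbers (100ms base, doubling, 30s cap, 0-100ms jitter); the constant and function names are assumptions, not the actual failMessage() internals.

```typescript
const BASE_RETRY_DELAY_MS = 100;
const MAX_RETRY_DELAY_MS = 30_000;
const JITTER_MS = 100;

function retryDelayMs(retryCount: number): number {
  // Retry 1 → ~100ms, retry 2 → ~200ms, ... doubling each time, capped at 30s.
  const backoff = Math.min(BASE_RETRY_DELAY_MS * 2 ** (retryCount - 1), MAX_RETRY_DELAY_MS);
  // 0-100ms of random jitter de-synchronizes retry storms.
  return backoff + Math.random() * JITTER_MS;
}
```

failMessage() would then store `Date.now() + retryDelayMs(retryCount)` into next_retry_at, and claimNextMessage() skips rows whose next_retry_at is still in the future.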
What:
- Add src/lib/heartbeat.ts with heartbeat read/write functions
- Queue-processor writes a heartbeat every 5 seconds with timestamp, pid, uptime
- Channel clients check heartbeat staleness in the fallback polling loop
- Atomic file write (temp + rename) to prevent corruption
- Clean shutdown removes the heartbeat file
Why:
File-based signaling (signals.ts) has no way to detect when the queue-processor crashes. If the queue-processor dies:
- Signal files stop being written (but clients don't know)
- Clients keep watching, unaware of the crash
- 10-second fallback polling continues but never gets new responses
With heartbeat monitoring:
- Channel clients detect a stale heartbeat (default: 15s threshold)
- A warning is logged when the queue-processor may have crashed
- Users can see the issue and restart the service
This is simpler than NATS's consumer iterator monitoring but achieves the same goal: detecting when the message processor is unhealthy.
Assumptions:
- A 5-second heartbeat interval is frequent enough for detection
- A 15-second staleness threshold (3 missed heartbeats) is reasonable
- File system timestamps are accurate enough for health checks
- Channel clients should log warnings but not auto-restart (user decision)
Pattern compliance:
- Uses the same TINYCLAW_HOME directory as other state files
- Follows existing error handling (log and continue)
- Atomic write pattern prevents corrupted heartbeat files
- Cleanup on SIGINT/SIGTERM for graceful shutdown
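The two halves (atomic write, staleness check) can be sketched as follows. The file location and function names are assumptions for illustration; the real module is src/lib/heartbeat.ts and uses TINYCLAW_HOME rather than the temp dir.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';
import * as os from 'node:os';

// Assumed path for the sketch; the real file lives under TINYCLAW_HOME.
const HEARTBEAT_FILE = path.join(os.tmpdir(), 'tinyclaw-heartbeat.json');

function writeHeartbeat(): void {
  const payload = JSON.stringify({
    timestamp: Date.now(),
    pid: process.pid,
    uptime: process.uptime(),
  });
  // Atomic write: write to a temp file, then rename over the target.
  const tmp = `${HEARTBEAT_FILE}.tmp`;
  fs.writeFileSync(tmp, payload);
  fs.renameSync(tmp, HEARTBEAT_FILE); // rename is atomic on POSIX filesystems
}

function isHeartbeatStale(thresholdMs = 15_000, now = Date.now()): boolean {
  try {
    const { timestamp } = JSON.parse(fs.readFileSync(HEARTBEAT_FILE, 'utf8'));
    return now - timestamp > thresholdMs;
  } catch {
    return true; // missing or corrupt heartbeat counts as stale
  }
}
```

A channel client would call isHeartbeatStale() from its 10-second fallback loop and log a warning when it returns true.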
What:
- Add MAX_MESSAGE_SIZE constant (1MB - Claude API limit)
- Add validateMessage() function to check message size
- Validate the message before both invokeAgent calls (simple and team contexts)
- Fail the message immediately with a clear error if too large
Why:
Previously, messages larger than 1MB would be sent to the Claude API, which would reject them with an error. The error would trigger retry logic, wasting resources on a message that can never succeed.
With validation:
1. Message size is checked before any API call
2. Oversized messages fail immediately (no retry)
3. A clear error message is logged for debugging
4. Prevents wasted API calls and retry cycles
Assumptions:
- 1MB is the appropriate limit for the Claude API
- Message size is the primary validation needed (other validations may be added)
- Failing immediately is better than retrying oversized messages
Pattern compliance:
- Uses existing failMessage() for consistency
- Logs the error with context for debugging
- Returns early (guard clause pattern)
- Non-breaking change (new validation, no API changes)
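A sketch of the guard. MAX_MESSAGE_SIZE and validateMessage() are named in the commit; the return shape is an assumption.

```typescript
const MAX_MESSAGE_SIZE = 1024 * 1024; // 1MB - Claude API limit

function validateMessage(text: string): { ok: boolean; error?: string } {
  // Measure encoded bytes, not characters: multi-byte UTF-8 counts toward the limit.
  const bytes = Buffer.byteLength(text, 'utf8');
  if (bytes > MAX_MESSAGE_SIZE) {
    return {
      ok: false,
      error: `Message too large: ${bytes} bytes (limit ${MAX_MESSAGE_SIZE})`,
    };
  }
  return { ok: true };
}
```

On a failed check, the caller would invoke failMessage() with the error and return early, so the message never reaches the API or the retry path.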
… with conversation lock
**What**
Added withConversationLock() to handleTeamResponse() and handleTeamError() to prevent
race conditions when multiple agents finish simultaneously for the same conversation.
**Why (Critical Race Condition)**
When Agent A and Agent B both complete for same conversation concurrently:
- Both call handleTeamResponse(conv) with shared conv object reference
- Both modify shared counters (conv.totalMessages++, decrementing conv.pending) without synchronization (NOT ATOMIC)
- Both can reach if (newPending === 0) and call completeConversation(conv) twice
- Results in: lost updates, duplicate completion events, corrupted state
Example timeline:
Agent A finishes → handleTeamResponse starts
- persistResponse(conv_id, agentA, responseA)
- conv.totalMessages++ (read=5, write=6)
Agent B finishes → handleTeamResponse starts (same conv reference)
- persistResponse(conv_id, agentB, responseB)
- conv.totalMessages++ (read=5, write=6) ← Lost Agent A's increment!
Result: conv.totalMessages = 6 instead of 7. Conversation state corrupted.
**Solution**
Wrapped function body with withConversationLock(conv.id) which:
- Serializes updates: only one agent modifies conv at a time
- Prevents concurrent modifications to same conversation
- Ensures only one agent reaches completion check
**Similar Fix Applied To**
- handleTeamResponse(): Wraps entire response handling logic
- handleTeamError(): Same pattern for error handling
**Assumptions**
1. Fire-and-forget pattern is maintained (invoke is still async)
2. Lock overhead acceptable (milliseconds per conversation)
3. Conversation objects exist long enough for all agents to complete
4. Lock gracefully handles conversation deletion by cleanup in conversation.ts
**Testing Considerations**
- Test with 3+ agents finishing within milliseconds of each other
- Verify team_chain_end event emitted exactly once
- Check conversation state consistency in database
- Monitor for deadlocks (lock implementation has timeout handling)
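A minimal sketch of a promise-chain lock keyed by conversation ID, matching the serialization behavior described above. This is an assumed shape; the real withConversationLock() in conversation.ts also has timeout handling, which is omitted here.

```typescript
// One promise chain per conversation ID; each caller queues behind the previous.
const locks = new Map<string, Promise<void>>();

function withConversationLock<T>(id: string, fn: () => Promise<T>): Promise<T> {
  const prev = locks.get(id) ?? Promise.resolve();
  const run = prev.then(fn);
  // Store a settled-safe tail so one failing handler doesn't poison the chain.
  locks.set(id, run.then(() => undefined, () => undefined));
  return run;
}
```

With this in place, the read-increment-write in handleTeamResponse runs to completion for one agent before the next agent's handler starts, so the lost-update timeline above cannot occur.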
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…or handling
**What**
Changed clearSignal() from check-then-delete pattern to try-delete with selective
error handling. Now ignores ENOENT errors when file is already deleted.
**Why (Race Condition)**
Previous implementation used check-then-delete pattern:
```typescript
if (fs.existsSync(signalFile)) {
fs.unlinkSync(signalFile); // TOCTOU: file deleted between check and delete
}
```
This creates a Time-Of-Check-Time-Of-Use (TOCTOU) race condition:
1. Process A checks: file exists
2. Process B checks: file exists
3. Process A deletes file
4. Process B tries to delete: ENOENT error
5. Error not caught, may propagate and crash
Additionally, fs.existsSync can be slow on high-latency filesystems.
**Solution**
Direct try-delete approach with selective error handling:
```typescript
try {
fs.unlinkSync(signalFile);
} catch (error: any) {
if (error?.code !== 'ENOENT') {
throw error; // Re-throw unexpected errors
}
// Ignore ENOENT: normal when another process deleted first
}
```
Benefits:
- Atomic delete operation (no TOCTOU window)
- Faster (one syscall instead of two)
- Graceful: ignores benign ENOENT
- Still fails on real errors (permissions, disk full, etc.)
**When This Occurs**
When multiple channel clients process responses simultaneously:
- Telegram client calls clearSignal('telegram')
- WhatsApp client calls clearSignal('whatsapp')
- If same signal file, both try to delete → first succeeds, second gets ENOENT
Current likelihood: Low (different channels have different files) but possible
if signal file corruption or manual cleanup happens concurrently.
**Assumptions**
1. ENOENT is expected and benign (file already deleted)
2. Other errors (EACCES, EIO) should propagate and fail loudly
3. fs.unlinkSync is atomic (POSIX guarantee)
4. Process has correct permissions to delete signal files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…resilience
**What**
Added recoverStaleConversations() function to detect and recover conversations
that are stuck in 'active' state, marking them as 'completed' so they can be
purged and don't cause memory leaks.
Called on startup and periodically (every 5 minutes) during normal operation.
**Why**
Conversations can become stuck in 'active' state if:
1. queue-processor crashes while agents are processing
2. Network failure prevents agent response from being persisted
3. Bug in agent handler prevents proper completion
4. Database corruption in conversation_pending_agents table
Without recovery:
- In-memory conversations Map grows unbounded
- Stuck conversations never emit team_chain_end event
- Users see conversation as "in progress" forever
- Memory leak: conversations never garbage collected
With recovery:
- Conversations marked as 'completed' after 30 min of inactivity
- Allows pruneOldConversations() to delete them
- Prevents memory leaks and orphaned conversations
- Teams can be retried by user if truly needed
**Implementation Details**
```typescript
export function recoverStaleConversations(staleThresholdMs = 30 * 60 * 1000): number {
const cutoff = Date.now() - staleThresholdMs;
return getDb().prepare(`
UPDATE conversations
SET status = 'completed'
WHERE status = 'active' AND updated_at < ?
`).run(cutoff).changes;
}
```
**Assumptions**
1. 30-minute threshold is reasonable for detecting stuck conversations
2. Marking as 'completed' is safe (responses already persisted to DB)
3. Periodic recovery (every 5 min) catches stuck conversations quickly
4. Users can retry conversation if legitimate work was interrupted
**Trade-offs**
- Possible data loss if agent is legitimately processing for 30+ min
(Mitigation: user can retry conversation, which is rare use case)
- Memory will grow to peak of ~30 min of stuck conversations
(Acceptable: periodic pruning cleans them up)
**Testing Considerations**
- Verify conversations marked as completed can be queried
- Check team_chain_end event emitted when recovery completes conversation
- Monitor logs for false positives (legitimate long-running conversations)
- Test crash scenarios to verify recovery works
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…documentation
**What**
Fixed two issues in recoverStaleConversations():
1. Don't update updated_at when marking as completed (keeps the original timestamp)
2. Enhanced documentation explaining why team_chain_end is NOT emitted
**Why (Issue 1: Pruning Timestamp Reset)**
Previous code:
```sql
UPDATE conversations
SET status = 'completed', updated_at = ?  -- WRONG: resets timestamp
WHERE status = 'active' AND updated_at < ?
```
Problem timeline:
- T=0: Conversation starts, updated_at = T0
- T=30min: Conversation gets stuck (no updates)
- T=30min: Recovery runs, marks completed, sets updated_at = T30
- T=30min+24h: pruneOldConversations() looks for updated_at < 24h ago
- Result: Conversation not pruned until T=30min+24h (stays in DB 24+ hours)
Better approach:
```sql
UPDATE conversations
SET status = 'completed'  -- keep original updated_at timestamp
WHERE status = 'active' AND updated_at < ?
```
Now pruning works correctly:
- Stale conversation marked completed at T=30min
- Original updated_at = T0 (30+ min ago)
- pruneOldConversations() deletes it when updated_at < 24h ago (works!)
**Why (Issue 2: Missing team_chain_end Event)**
Recovery completion is NOT a natural completion:
- Natural completion: all agents finish, responses aggregated, user gets the result
- Stale recovery: conversation abandoned after a crash, responses may be incomplete
Implications:
- The visualizer won't show recovery as "completed" (correct - it's artificial)
- Events are not sent (prevents false positives in monitoring)
- Users understand recovery = lost work, not success
Alternative considered: emit team_chain_end with a recovery flag
- Rejected: would confuse the visualizer and monitoring
- Recovery should be silent cleanup, not broadcast as completion
**Assumptions**
1. Keeping the original updated_at is correct behavior (allows proper pruning)
2. Silent recovery is acceptable (users can retry if needed)
3. The 30-minute stale threshold correctly identifies stuck conversations
4. Not emitting events prevents false positives in event-based systems
**Testing**
Verify:
1. Stale conversation marked as completed
2. The updated_at timestamp is NOT changed (still ~30min old)
3. pruneOldConversations() deletes it after 24h from the original time
4. No team_chain_end event in the logs for recovered conversations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…covery visibility
What:
- Reduce stale conversation threshold from 30min to 10min (Gap 1)
- Add getStaleConversations() to get details of stuck conversations
- Add WARN-level logging with team/conversation details on recovery (Gap 2)
- Emit crash_recovery event for visualizer/monitoring (Gap 2)
Why:
Gap 1 (Slow Detection): 30min threshold meant users could lose up to 30 minutes
of work if agent crashed. 10min reduces data loss window while still giving
slow agents reasonable grace period.
Gap 2 (Silent Recovery): Recovery was invisible (INFO level). Ops couldn't
tell if completion was normal or crash recovery. WARN logs + events provide
visibility for monitoring and alerting.
Implementation:
- getStaleConversations() returns {id, teamId, duration} for each stuck conv
- Startup recovery: WARN log with 🔴 CRASH RECOVERY prefix + event per conv
- Periodic recovery: WARN log with 🔴 PERIODIC RECOVERY prefix + events
- Events include conversationId, teamId, stuckForMs, recoveredAt/source
Assumptions:
- 10min is acceptable grace period for slow agents (2x NATS heartbeat)
- WARN level is appropriate for crash recovery (not ERROR since it's expected)
- Events emitted before actual recovery (state change happens after logging)
Risk: None (logging only, no behavior change)
Testing: Kill agent mid-processing, verify WARN logs + events after 10min
What:
- Add backup.sh script for daily automated SQLite backups
- Backups stored in ~/.tinyclaw/backups/ with 7-day retention
- Add database integrity check (PRAGMA integrity_check) on startup
- Copy WAL files if present (WAL mode consistency)
- Verify the backup is readable before considering it valid
Why:
The SQLite database is a single point of failure. Without backups, corruption or accidental deletion means total data loss. With backups, the worst case is losing the last 24 hours of conversation state (acceptable for production use).
Usage:
./backup.sh                      # Manual backup
crontab -e                       # Add to cron for daily backups
0 2 * * * /path/to/backup.sh     # Daily at 2 AM
Recovery:
cp ~/.tinyclaw/backups/tinyclaw_YYYYMMDD_HHMMSS.db ~/.tinyclaw/tinyclaw.db
rm ~/.tinyclaw/tinyclaw.db-wal ~/.tinyclaw/tinyclaw.db-shm 2>/dev/null || true
Assumptions:
- 7-day retention is sufficient for debugging corruption causes
- Daily backups are frequent enough (conversations are recoverable)
- Storage is cheap (~1-5 MB per backup, 7 backups = ~35 MB max)
- Manual recovery is acceptable (rare event, ops can handle it)
Risk: None (additive, no behavior changes)
Testing: Run backup.sh, verify 7 daily backups exist and are readable
What:
- Add outstanding_requests table with request_id, conversation_id, from_agent, to_agent
- Add status field: pending | acked | responded | failed | escalated
- Add deadline tracking: ack_deadline (5s default), response_deadline (5min default)
- Add retry tracking: retry_count, max_retries (default 3)
- Add 11 functions for request lifecycle management
Why:
This implements the primitive request-reply pattern with timeouts to solve the "ping pong" message drop problem. When agent A asks agent B to do something:
1. Create an outstanding request with deadlines
2. Agent B must ACK (acknowledge) within the timeout
3. Agent B must RESPOND with a result within the timeout
4. If deadlines expire → retry or escalate
This is how distributed systems worked before fancy protocols - just timeouts and retries at the application level.
Functions added:
- createOutstandingRequest() - Create a new request when a handoff happens
- acknowledgeRequest() - Agent B confirms receipt
- respondToRequest() - Agent B provides the result
- failRequest() - Mark permanent failure
- escalateRequest() - Escalate to a human
- getRequestsNeedingRetry() - Find expired pending requests
- getRequestsNeedingEscalation() - Find expired acked requests
- incrementRequestRetry() - Retry with a new deadline
- getRequest() - Lookup by ID
- getPendingRequestsForConversation() - Get all pending requests for a conversation
- pruneOldRequests() - Clean up old completed requests
Assumptions:
- A 5-second ACK timeout is reasonable for agent processing
- A 5-minute response timeout balances speed vs complex tasks
- 3 retries before escalation is sufficient
- SQLite is fast enough for this tracking (no separate service needed)
Pattern compliance:
- Uses the same SQLite patterns (WAL, transactions, indexes)
- Foreign key to the conversations table with CASCADE delete
- Timestamps in milliseconds (consistent with the rest of the codebase)
- Debug/Warn logging for observability
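The timeout/escalation decision described above can be sketched as a pure function. The OutstandingRequest shape and nextAction() are illustrative assumptions, not the actual db.ts API; the deadlines and retry count match the commit's defaults.

```typescript
type RequestStatus = 'pending' | 'acked' | 'responded' | 'failed' | 'escalated';

interface OutstandingRequest {
  status: RequestStatus;
  ackDeadline: number;      // ms epoch; 5s after creation by default
  responseDeadline: number; // ms epoch; 5min after creation by default
  retryCount: number;
  maxRetries: number;       // default 3 in this commit
}

function nextAction(req: OutstandingRequest, now: number): 'wait' | 'retry' | 'escalate' {
  if (req.status === 'pending' && now > req.ackDeadline) {
    // Never ACKed: retry until retries are exhausted, then escalate.
    return req.retryCount < req.maxRetries ? 'retry' : 'escalate';
  }
  if (req.status === 'acked' && now > req.responseDeadline) {
    // Received but never answered: escalate to a human.
    return 'escalate';
  }
  return 'wait';
}
```

A periodic checker would run this over all open requests and call incrementRequestRetry() or escalateRequest() accordingly.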
… handoffs
What:
- Modify enqueueInternalMessage() to create an outstanding request when agent A asks agent B to do something
- Include the request_id in the message payload as a [REQUEST:xxx] prefix
- Import createOutstandingRequest from db.ts
Why:
This is the integration point that actually uses the outstanding_requests table. Previously, agent handoffs were fire-and-forget - there was no tracking of whether the agent responds. Now we create a request with an ACK deadline (5s) and a response deadline (5min). The request_id in the message allows the receiving agent to acknowledge and respond.
Assumptions:
- Agents can parse the [REQUEST:xxx] prefix from messages
- A 5-second ACK timeout is enough for an agent to receive and parse
- A 5-minute response timeout is enough for an agent to process a task
Breaking changes:
- Internal messages now include a [REQUEST:xxx] prefix
- Backward compatible (old agents can ignore the prefix)
What:
- Add checkRequestTimeouts() function to detect expired ACK and response deadlines
- Import outstanding request functions from db.ts
- Extract and acknowledge the request_id from messages when an agent receives them
- Add request_escalated event for monitoring
- Integrate timeout checking into periodic maintenance (every 5 min)
Why:
This completes the request-reply pattern implementation:
1. When agent A sends a message to agent B, a request is created with deadlines
2. When agent B receives the message, the request is acknowledged (ACK)
3. If no ACK within 5s → retry with an extended deadline
4. If no response within 5min → escalate to a human
This prevents the ping pong drop problem by:
- Detecting when agent B doesn't receive the message (no ACK)
- Detecting when agent B receives but doesn't respond (timeout)
- Escalating instead of silently dropping
Assumptions:
- The request ID is in the format [REQUEST:xxx] at the start of the message
- Agents can still process messages even with the prefix (or we strip it)
- A 5-minute check interval is frequent enough for timeouts
- Escalation is logged and emitted as an event for monitoring
Breaking changes:
- None - this is additive monitoring on top of the existing flow
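Extracting the request ID on receipt is a small parsing step. A sketch assuming the [REQUEST:xxx] format stated above; parseRequestPrefix is an assumed helper name, not the actual queue-processor function.

```typescript
// Split "[REQUEST:xxx] rest of message" into the request ID and the body.
function parseRequestPrefix(text: string): { requestId: string | null; body: string } {
  const match = text.match(/^\[REQUEST:([^\]]+)\]\s*/);
  if (!match) {
    // No prefix: a plain message (old agents and user messages stay compatible).
    return { requestId: null, body: text };
  }
  return { requestId: match[1], body: text.slice(match[0].length) };
}
```

On receipt, the processor would call acknowledgeRequest(requestId) when the ID is present, then hand the stripped body to the agent.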
What:
- Import getPendingRequestsForConversation from db.ts
- In handleTeamResponse, check if the agent's response completes an outstanding request
- If a matching request is found (same conversation, agent was the target, status=acked), call respondToRequest to mark it complete
Why:
The request-reply pattern requires the response to be tracked. Previously we:
1. Created a request when agent A mentioned agent B
2. Acknowledged when agent B received it
3. But never marked it complete when agent B responded
This meant requests would stay in the 'acked' state forever, eventually escalating even though the agent actually responded. Now when agent B responds, we find the matching request and mark it complete.
Assumptions:
- The agent responds within the same conversation
- Only one pending request per agent per conversation (find() returns the first)
- The response content is stored in the request record for audit
Risk: Low - additive check, doesn't change response handling
Delete duplicate log('INFO', ) line.
Was introduced when adding request ACK handling code.
Risk: None (deleting duplicate)
Testing: Verify log appears once per message
…Request
Previously:
- acknowledgeRequest checked ack_deadline >= now
- respondToRequest checked response_deadline >= now
Problem: If the ACK/response arrives 1ms after the deadline, it silently fails. The timeout checker would retry, but the agent already processed it.
Fix: Remove deadline checks from the write path. Let the timeout checker handle expired requests. Accept valid work even if slightly late. Also added better logging for already-acked/already-responded cases.
Risk: Low - the timeout checker still runs, it just won't reject late-but-valid ACKs
Testing: Send a request, wait for the deadline, verify the ACK is still accepted
…e agent
Previously: find() returned the first match, potentially the wrong request.
Now:
- getPendingRequestsForConversation() orders by created_at ASC (FIFO)
- filter() returns all acked requests for the agent
- All matching requests are marked as responded
If agent B responds, it's responding to everything it was asked.
Risk: Low - marking more requests complete is safer than marking the wrong one
Testing: Have agent A mention agent B twice, verify both are marked complete
… checks
Changes:
- Bump max_retries from 3 to 5 for request retries (gives recoverStaleMessages more time)
- Add pruneOldRequests() to the maintenance loop (hourly cleanup)
- Run checkRequestTimeouts every 30s instead of every 5min (faster failure detection)
- Separate timeout checks from the main maintenance interval
Why:
- 3 retries was too aggressive given the 5min check interval
- Old requests never got cleaned up (memory leak)
- The 5min check interval meant a 5-10min delay detecting ACK timeouts
Risk: Low - additive maintenance, conservative retry bump
Testing: Verify timeout checks run every 30s, old requests pruned after 24h
What:
- Add docs/AGENT_COMMUNICATION_PROTOCOL.md with full protocol documentation
- Document the database schema, state machine, and API reference
- Document integration points (enqueue, processing, response, timeout)
- Add a configuration reference, monitoring guide, and troubleshooting section
- Add a design decisions section explaining the primitive approach vs A2A/ACP
- Update README.md to reference the new documentation
Why:
The request-reply protocol is a significant architectural addition. Without documentation, future maintainers won't understand:
- Why the outstanding_requests table exists
- How the timeout/escalation flow works
- When to use which API function
- How to debug issues
This documentation ensures the knowledge persists.
Assumptions:
- Documentation should be comprehensive enough for new team members
- Code examples should be copy-pasteable
- Design decisions should be explained (not just what, but why)
Risk: None (documentation only)
Testing: Verify the markdown renders correctly and links work
… ordering
The visualizer relies on event ordering. Without await, chain_step_start can race with chain_step_done, causing the UI to show stale state.
Risk: None (consistent with other awaited emitEvent calls)
Testing: Verify the visualizer shows the correct agent processing state
Previously: When agent B errored, the outstanding request stayed in the acked state and would eventually escalate via the timeout checker.
Now: When agent B errors, matching requests are proactively marked as failed. This gives failRequest() a caller and provides a cleaner audit trail.
Risk: Low - additive, marks state faster
Testing: Trigger an agent error, verify the request is marked failed, not escalated
- Remove unused getRequest import
- Move pruneOldRequests from the 5-min to the hourly interval (consistent with other prunes)
Risk: None (cleanup only)
Testing: Verify it builds, no runtime changes
- Add an error handling section documenting failRequest() usage
- Update the pruneOldRequests interval from 5min to 1 hour
- Add failRequest() to the API reference
- Update integration points to include handleTeamError
Risk: None (documentation only)
- Change repo from TinyAGI/tinyclaw to dpbmaverick98/tinyclaw
- Change default branch from main to sql-experiment
- Add --branch flag to git clone so it clones the correct branch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agents receiving teammate messages now see explicit instructions to respond using [@sender: reply] syntax, preventing responses from going directly to the user instead of back to the requesting agent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The chain_step_done event was dropped during the fire-and-forget refactor, breaking the visualizer which listens for it to mark agents as "done". Added emission in both handleSimpleResponse and handleTeamResponse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed GITHUB_REPO from personal fork to TinyAGI/tinyclaw and DEFAULT_BRANCH from sql-experiment to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents unhandled promise rejections from silently swallowing errors in fire-and-forget event emissions (message_received, agent_routed, crash_recovery, request_escalated). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sequential cp of .db + .db-wal + .db-shm is not safe for a live WAL-mode database. Replaced with sqlite3 .backup which guarantees a consistent snapshot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
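The fix uses the sqlite3 CLI's `.backup` dot-command; for illustration, the same online-backup guarantee is exposed by Python's stdlib sqlite3 module. A minimal sketch (paths and schema here are hypothetical, not tinyclaw's):

```python
import os
import sqlite3
import tempfile

# Create a small WAL-mode database standing in for the live queue DB.
src_path = os.path.join(tempfile.mkdtemp(), "queue.db")
dst_path = src_path + ".bak"

src = sqlite3.connect(src_path)
src.execute("PRAGMA journal_mode=WAL")
src.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
src.execute("INSERT INTO messages (body) VALUES ('hello')")
src.commit()

# Connection.backup copies pages under SQLite's locking protocol, so the
# snapshot is consistent even while other connections read/write src --
# unlike a sequential cp of .db + .db-wal + .db-shm.
dst = sqlite3.connect(dst_path)
src.backup(dst)

print(dst.execute("SELECT body FROM messages").fetchone()[0])  # hello
```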
Schema declared DEFAULT 3 but createOutstandingRequest() always inserts 5. Updated schema to match the actual value and docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What:
- Add 'kimi' and 'minimax' to provider union type comments
- Add 'apiKey?: string' to AgentConfig for per-agent API key override
- Add 'kimi' and 'minimax' sections to Settings.models with model and apiKey fields
- Update provider comment to include new providers

Why:
- Kimi and MiniMax require API key authentication via ANTHROPIC_AUTH_TOKEN
- Users may want different API keys per agent (e.g., different Kimi accounts)
- Two-level key resolution: agent-specific → global → error

Assumptions:
- API keys stored as plain text in settings.json (acceptable for local use)
- Only kimi2.5 and MiniMax-M2.5 models supported initially
- Provider uses Claude Code binary with custom ANTHROPIC_BASE_URL

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

…nimax

What:
- Add resolveApiKey(agent, settings) with two-level fallback: 1. agent-specific apiKey → 2. global provider apiKey → 3. empty
- Add getProviderBaseUrl(provider) returning custom endpoints for kimi/minimax
- Add providerRequiresApiKey(provider) boolean check
- Update getDefaultAgentFromModels() to handle kimi/minimax with defaults
- Update auto-detect provider logic to include kimi/minimax

Why:
- Centralize API key resolution logic for consistent behavior
- Support per-agent API key override (different accounts/keys per agent)
- Abstract provider-specific configuration (URLs, auth requirements)
- Enable runtime validation of API key presence

Assumptions:
- Kimi endpoint: https://api.kimi.com/v1
- MiniMax endpoint: https://api.minimax.io/anthropic
- Default models: kimi2.5, MiniMax-M2.5
- Empty string return allows caller to provide contextual error message

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
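The two-level fallback described above can be sketched as follows. This is an illustrative Python sketch; the real resolveApiKey is TypeScript in the config module, and the dict shapes mirror settings.json as described in these commits:

```python
def resolve_api_key(agent: dict, settings: dict) -> str:
    provider = agent.get("provider", "")
    # 1. Agent-specific override wins
    if agent.get("apiKey"):
        return agent["apiKey"]
    # 2. Fall back to the global per-provider key from settings.json
    global_key = settings.get("models", {}).get(provider, {}).get("apiKey")
    if global_key:
        return global_key
    # 3. Empty string: the caller raises a contextual error message
    return ""

agent = {"provider": "kimi"}
settings = {"models": {"kimi": {"apiKey": "sk-global"}}}
print(resolve_api_key(agent, settings))                          # sk-global
print(resolve_api_key({**agent, "apiKey": "sk-own"}, settings))  # sk-own
print(repr(resolve_api_key(agent, {})))                          # ''
```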
What:
- Add kimi/minimax branch in invokeAgent() with custom env setup
- Import resolveApiKey, getProviderBaseUrl, providerRequiresApiKey from config
- Set ANTHROPIC_AUTH_TOKEN, ANTHROPIC_BASE_URL, ANTHROPIC_MODEL env vars
- Add clear error message when API key is missing with fix instructions
- Add authentication error detection (401/unauthorized) with helpful message
- Use spawn with custom env instead of runCommand for env control

Why:
- Kimi/MiniMax require custom environment variables to work with Claude Code
- ANTHROPIC_BASE_URL redirects Claude Code to provider's API endpoint
- ANTHROPIC_AUTH_TOKEN is used instead of ANTHROPIC_API_KEY for these providers
- Per-agent API key support requires runtime resolution, not global env

Assumptions:
- Claude Code binary handles the Anthropic-compatible API protocol
- 5 minute timeout (3000000ms) sufficient for these providers
- Error message parsing for 401/auth errors is provider-agnostic enough

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Add Kimi (4) and MiniMax (5) to provider selection menu
- Add API key prompt with validation for kimi/minimax providers
- Add model selection for kimi2.5 and MiniMax-M2.5
- Add per-agent API key support in additional agents flow
- Show masked global key (sk-...xxxx) with option to override
- Validate per-agent keys with HTTP check
- Store API keys in settings.json models.kimi/minimax.apiKey
- Store per-agent API keys in agent.apiKey field

Why:
- Kimi and MiniMax require API key authentication
- Users need interactive setup for these providers
- Per-agent API keys allow different accounts/keys per agent
- Validation catches invalid keys early in setup process

Assumptions:
- curl available for validation (graceful fallback if not)
- Validation endpoints: api.kimi.com/v1/models, api.minimax.io/anthropic/v1/models
- Masked key format: sk-...xxxx (first 4 + last 4 chars)
- User can choose to continue even if validation fails

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Add Kimi (4) and MiniMax (5) to provider selection in agent_add()
- Add API key prompt with global key detection and override option
- Show masked global key (sk-...xxxx) when available
- Validate per-agent API keys with HTTP check
- Add kimi2.5 and MiniMax-M2.5 model selection
- Build agent JSON with optional apiKey field using jq

Why:
- Users can add agents with kimi/minimax providers via CLI
- Consistent with setup-wizard flow for API key handling
- Per-agent API keys allow different keys per agent
- Validation catches invalid keys at creation time

Assumptions:
- Global key lookup via .models.kimi.apiKey or .models.minimax.apiKey
- Same validation endpoints as setup-wizard
- jq handles conditional apiKey inclusion in JSON

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Update 'tinyclaw provider' command to support kimi/minimax with --api-key flag
- Add kimi and minimax cases with API key validation and storage
- Update 'tinyclaw model' command to support kimi2.5 and MiniMax-M2.5
- Show API key configuration status in provider display
- Parse --api-key and --model flags in any order
- Update help text with new provider/model examples
- Update agent provider help to include --api-key examples

Why:
- Users can switch providers via CLI with API key authentication
- Model command allows bulk-updating all agents of a provider
- Consistent CLI interface for all provider operations
- Help text guides users on correct syntax

Assumptions:
- API key is required when switching to kimi/minimax provider
- Flags can appear in any order (--api-key before or after --model)
- Help text serves as primary documentation for users

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Change Kimi base URL from https://api.kimi.com/v1 to https://api.kimi.com/coding
- This matches the correct endpoint from cc-mirror's provider configuration

Why:
- The /v1 endpoint returns 404, causing all Kimi invocations to fail
- /coding is the correct path for Kimi's Anthropic-compatible API

Assumptions:
- Kimi uses /coding path consistently across all API operations
- MiniMax endpoint at /anthropic is already correct

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Change ANTHROPIC_AUTH_TOKEN to ANTHROPIC_API_KEY for kimi/minimax
- Add CC_MIRROR_UNSET_AUTH_TOKEN to clear inherited AUTH_TOKEN
- Add missing model env vars: OPUS, HAIKU, SMALL_FAST
- Add CLAUDE_CONFIG_DIR per agent for conversation isolation
- Refactor runCommand to accept extraEnv parameter
- Use runCommand for kimi/minimax instead of duplicated spawn block
- Remove duplicate getSettings import
- Remove unused TINYCLAW_HOME import

Why:
- Kimi/MiniMax use API key auth, not auth token (Bearer vs header)
- Missing model env vars caused internal operations to route to wrong model
- CLAUDE_CONFIG_DIR prevents agent A from resuming agent B's session
- DRY: runCommand with extraEnv eliminates 30+ lines of duplicated code

Assumptions:
- All model aliases (sonnet, opus, haiku) map to same kimi/minimax model
- CC_MIRROR_UNSET_AUTH_TOKEN prevents auth header conflicts
- runCommand error handling is sufficient for auth error detection

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Change Kimi validation URL from /v1/models to /coding/models
- Add safer JSON building using temp files and jq for agent data

Why:
- The /v1 endpoint returns 404, causing false validation failures
- String interpolation of API keys into JSON risks injection if keys contain quotes

Assumptions:
- /coding/models endpoint exists and returns 200 for valid keys
- Temp file approach works across different shell environments

Note: Full JSON safety refactor recommended for future - this is a minimal fix

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
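A quick illustration of why raw string interpolation into JSON is the injection risk this commit describes, and why jq --arg style escaping (shown here with Python's json.dumps) fixes it:

```python
import json

# An API key containing a quote breaks naively interpolated JSON.
api_key = 'abc"xyz'
naive = '{"apiKey": "' + api_key + '"}'
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False
print(naive_ok)  # False: the generated settings file would be malformed

# json.dumps (like jq --arg) escapes the value, producing valid JSON.
safe = json.dumps({"apiKey": api_key})
print(json.loads(safe)["apiKey"] == api_key)  # True
```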
What:
- Change Kimi validation URL from /v1/models to /coding/models
- Remove pointless read -rp for model choice (was hardcoded anyway)
- Simplify model selection output for kimi/minimax

Why:
- Wrong validation URL caused false validation failures
- Unused AGENT_MODEL_CHOICE variable was confusing dead code

Assumptions:
- Single model per provider (kimi2.5, MiniMax-M2.5) for now

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Add while loop to parse --model and --api-key flags in any order
- Add kimi case with required --api-key validation
- Add minimax case with required --api-key validation
- Update help text to include kimi/minimax examples
- Use jq to safely set provider, model, and apiKey fields

Why:
- Users can change existing agents to kimi/minimax via CLI
- API key is required for these providers
- Flag parsing in any order matches main CLI behavior

Assumptions:
- API key is stored in agent config for per-agent override
- jq safely handles API key string escaping

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Change Kimi base URL from https://api.kimi.com/coding to https://api.kimi.com/coding/

Why:
- Without trailing slash, URL joining behavior is ambiguous
- Could produce https://api.kimi.com/codingv1/messages instead of /coding/v1/messages
- Trailing slash makes path concatenation unambiguous

Assumptions:
- Claude Code or underlying HTTP client does proper URL joining
- Kimi API accepts URLs with trailing slash

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
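The trailing-slash behavior can be checked against standard RFC 3986 URL resolution. Python's urljoin follows the rule the commit describes (whether Claude Code's HTTP client joins the same way is an assumption):

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment of the base is treated as
# a "file" and dropped during relative resolution -- /coding is lost.
print(urljoin("https://api.kimi.com/coding", "v1/messages"))
# https://api.kimi.com/v1/messages

# With the trailing slash, /coding/ is kept as a directory prefix.
print(urljoin("https://api.kimi.com/coding/", "v1/messages"))
# https://api.kimi.com/coding/v1/messages
```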
What:
- Remove CC_MIRROR_UNSET_AUTH_TOKEN: '1' (does nothing in tinyclaw)
- Add ANTHROPIC_AUTH_TOKEN: undefined to clear inherited value
- Change extraEnv type to Record<string, string | undefined>

Why:
- CC_MIRROR_UNSET_AUTH_TOKEN is a cc-mirror wrapper convention
- In tinyclaw, we directly control env via runCommand
- User may have ANTHROPIC_AUTH_TOKEN set in shell from main Claude setup
- Setting to undefined explicitly clears it to prevent conflicts with API_KEY

Assumptions:
- runCommand properly handles undefined values (deletes from env)
- Clearing AUTH_TOKEN prevents auth header conflicts

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
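A sketch of the intended extraEnv semantics, where an undefined value (None here) deletes the variable from the child environment instead of passing it through. Names are illustrative of tinyclaw's runCommand, not its actual code:

```python
import os

def build_child_env(extra_env: dict) -> dict:
    """Merge extra_env over the parent env; None removes an inherited key."""
    env = dict(os.environ)
    for key, value in extra_env.items():
        if value is None:
            env.pop(key, None)   # explicitly clear an inherited variable
        else:
            env[key] = value
    return env

# Simulate a shell where the user's main Claude setup exported AUTH_TOKEN.
os.environ["ANTHROPIC_AUTH_TOKEN"] = "inherited-from-shell"
env = build_child_env({
    "ANTHROPIC_API_KEY": "sk-agent-key",
    "ANTHROPIC_AUTH_TOKEN": None,  # prevent conflict with API-key auth
})
print("ANTHROPIC_AUTH_TOKEN" in env)  # False
print(env["ANTHROPIC_API_KEY"])       # sk-agent-key
```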
What:
- Remove standalone import { getSettings } from './config'
- Add getSettings to existing import from './config'
Why:
- Two separate import statements from same module is redundant
- Consolidating into single import is cleaner
Assumptions:
- No functional change, just code organization
Co-Authored-By: Kimi Claw <noreply@anthropic.com>
…allback

What:
- Replace string interpolation with jq -n --arg for all agent fields
- Add /tmp fallback for Linux compatibility
- Apply to both apiKey and non-apiKey agent creation paths

Why:
- String interpolation breaks if agent name contains quotes or backslashes
- jq --arg safely escapes all string values
- TMPDIR is not guaranteed on Linux systems

Assumptions:
- jq is available (already required by setup wizard)
- All field values are valid UTF-8 strings

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

…neration

What:
- Change default agent creation to use jq --arg (safe JSON)
- Store default agent in AGENTS_JSON variable instead of string fragment
- Merge additional agents into the default using jq reduce
- Change the final template to emit the agents field as a proper JSON object

Why:
- Old string fragment format was unsafe and inconsistent
- New approach produces valid JSON object for agents field
- Proper merge ensures default agent is preserved with additional agents

Assumptions:
- jq properly merges objects with the * operator
- Default agent always exists (created before additional agents loop)

Co-Authored-By: Kimi Claw <noreply@anthropic.com>

What:
- Change extraEnv parameter type from Record<string, string> to Record<string, string | undefined>

Why:
- TypeScript error: cannot pass Record<string, string | undefined> to parameter expecting Record<string, string>
- Allows ANTHROPIC_AUTH_TOKEN: undefined to be passed correctly

Assumptions:
- spawn() handles undefined values correctly (they get filtered out)

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
Greptile Summary

This PR is a substantial multi-feature merge that adds Kimi 2.5 and MiniMax M2.5 provider support and multi-agent isolation via per-agent CLAUDE_CONFIG_DIR.

Key concerns found during review:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User
    participant QP as queue-processor
    participant DB as SQLite DB
    participant A as Agent A
    participant B as Agent B
    participant CH as Channel Client
    U->>QP: Message arrives (channel)
    QP->>DB: claimNextMessage()
    QP->>A: invokeAgent(message)
    A-->>QP: response with [@B: do X]
    QP->>DB: createOutstandingRequest(A→B, 5s ACK / 5min response)
    QP->>DB: enqueueMessage([REQUEST:req_id]\nmessage for B)
    QP->>DB: persistConversation(conv)
    QP->>DB: persistResponse(A, response)
    QP->>B: invokeAgent([REQUEST:req_id]\nmessage)
    Note over QP,B: acknowledgeRequest(req_id) called<br/>before forwarding stripped message
    B-->>QP: response with [@A: result]
    QP->>DB: respondToRequest(req_id, response)
    QP->>DB: decrementPendingInDb()
    QP->>DB: markConversationCompleted()
    QP->>DB: enqueueResponse(final)
    QP->>DB: signalChannel(channel)
    DB-->>CH: fs.watch fires signal file
    CH->>QP: checkOutgoingQueue()
    CH->>U: Send response
    Note over QP: Timeout checker (every 30s)
    QP->>DB: getRequestsNeedingRetry()
    QP->>DB: incrementRequestRetry() or escalateRequest()
```
Last reviewed commit: 9dc18b1
```sh
if command -v curl > /dev/null 2>&1; then
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $API_KEY" "$VALIDATION_URL" 2>/dev/null || echo "000")
```
Wrong Kimi validation URL in global setup path
The global setup wizard validates Kimi API keys against https://api.kimi.com/v1/models, but lib/agents.sh (and the PR description's bug-fix list) specifies the correct Kimi endpoint is /coding/models. These two paths are inconsistent: the per-agent flow in agents.sh will succeed with a valid key (HTTP 200), while the global wizard here will likely return an unexpected status, showing a spurious warning or silently skipping validation.
Suggested fix, setting the correct URL before the request is made:

```sh
[ "$PROVIDER" = "kimi" ] && VALIDATION_URL="https://api.kimi.com/coding/models"
if command -v curl > /dev/null 2>&1; then
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $API_KEY" "$VALIDATION_URL" 2>/dev/null || echo "000")
```
```sh
    MODELS_SECTION='"models": { "provider": "anthropic", "anthropic": { "model": "'"${MODEL}"'" } }'
elif [ "$PROVIDER" = "opencode" ]; then
    MODELS_SECTION='"models": { "provider": "opencode", "opencode": { "model": "'"${MODEL}"'" } }'
elif [ "$PROVIDER" = "kimi" ]; then
    MODELS_SECTION='"models": { "provider": "kimi", "kimi": { "model": "'"${MODEL}"'", "apiKey": "'"${API_KEY}"'" } }'
```
JSON injection risk via unescaped API key in MODELS_SECTION
The PR description claims "JSON injection risk in MODELS_SECTION (API keys with special chars)" was fixed, but this new code still builds the kimi/minimax JSON by directly interpolating $API_KEY into a shell string. If the API key contains a ", \, or other JSON-special characters (e.g. abc"xyz), the generated settings file will be malformed JSON and TinyClaw will fail to parse it on startup.
The agents section was correctly converted to use jq --arg to safely escape values, but MODELS_SECTION was not updated consistently.
Replace both kimi and minimax branches with jq-based generation. Note that jq cannot emit a bare `"models": …` key/value fragment, so build the inner object with jq (which escapes the values) and prepend the key:

```sh
elif [ "$PROVIDER" = "kimi" ]; then
    KIMI_OBJ=$(jq -cn --arg model "$MODEL" --arg apiKey "$API_KEY" \
        '{provider: "kimi", kimi: {model: $model, apiKey: $apiKey}}')
    MODELS_SECTION='"models": '"$KIMI_OBJ"
elif [ "$PROVIDER" = "minimax" ]; then
    MINIMAX_OBJ=$(jq -cn --arg model "$MODEL" --arg apiKey "$API_KEY" \
        '{provider: "minimax", minimax: {model: $model, apiKey: $apiKey}}')
    MODELS_SECTION='"models": '"$MINIMAX_OBJ"
```

Note: the heredoc embedding also requires careful quoting; using a temp-file approach (write the full JSON with jq, then merge with jq -s '.[0] * .[1]') is the cleanest fix.
```diff
 /**
  * Emit a structured event — dispatched to in-memory listeners (e.g. SSE broadcast, plugins).
+ * Now async to allow listeners to complete before continuing.
  */
-export function emitEvent(type: string, data: Record<string, unknown>): void {
+export async function emitEvent(type: string, data: Record<string, unknown>): Promise<void> {
   for (const listener of eventListeners) {
-    try { listener(type, data); } catch { /* never break the queue processor */ }
+    try {
+      await listener(type, data);
+    } catch {
+      /* never break the queue processor */
+    }
```
Async emitEvent serialises listeners and may block the queue processor
emitEvent was changed from synchronous to sequentially awaiting each listener. This means a single slow listener (e.g. a plugin that makes an HTTP call) will now block all downstream processing in processMessage, including agent invocation and conversation bookkeeping.
Previously, listeners were fire-and-forget, so any one listener could never stall the queue. The safer refactor is to let all listeners run concurrently and wait for them in parallel:
Suggested replacement:

```ts
export async function emitEvent(type: string, data: Record<string, unknown>): Promise<void> {
  await Promise.allSettled(
    eventListeners.map(listener => {
      try { return Promise.resolve(listener(type, data)); } catch { return Promise.resolve(); }
    })
  );
}
```

Promise.allSettled ensures that a throwing/rejecting listener never propagates, while all listeners run in parallel instead of sequentially.
```ts
}

const agent = agents[agentId];

// Extract and acknowledge request_id if present (for agent handoff tracking)
const requestMatch = message.match(/^\[REQUEST:([^\]]+)\]\n?/);
if (requestMatch) {
  const requestId = requestMatch[1];
  if (acknowledgeRequest(requestId)) {
    log('INFO', `Request ${requestId} acknowledged by @${agentId}`);
  }
  // Remove request prefix from message before sending to agent
  message = message.replace(requestMatch[0], '');
}
```
Message size validation is missing in the team-context path
validateMessage() is called before agent invocation for non-team messages, but the team-context processing path (where teamContext is set) doesn't perform the same check. An oversized internal message forwarded between agents could reach invokeAgent without the 1 MB guard and either be silently truncated by the API or trigger a harder-to-diagnose error.
Consider extracting the validation check to a shared location before the if (!teamContext) branch so both paths benefit from it.
emitEvent() was sequentially awaiting each listener, meaning a slow listener (e.g. a plugin making an HTTP call) would block all downstream queue processing. Switch to Promise.allSettled() so listeners run concurrently while still awaiting completion before the caller proceeds. This preserves inter-event ordering (chain_step_start before agent invocation) without serializing intra-event listener execution. Reported by: Greptile (PR TinyAGI#167) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…m paths validateMessage() was only called for non-team messages. An oversized internal message forwarded between agents in a team conversation could bypass the 1MB guard and cause hard-to-diagnose errors from the API. Moved validation before the team/non-team branch so both paths benefit, and removed the now-redundant second validateMessage() call in the team path. Reported by: Greptile (PR TinyAGI#167) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create PR from wrong branch.
Summary
- Agent-to-agent communication via @TEAMMATE mentions
- API key resolution (per-agent → global); setup wizard, CLI commands, and invocation all updated
- Shell helpers (require_settings_file, get_agent_json) replacing 13 guard blocks and 8 jq queries; fixed N+1 statSync in chats API; consolidated double jq calls in CLI

What's included
- Kimi/MiniMax provider support (17 commits)
- Bug fixes (5 commits from audit)
- Cleanup (12 commits)

Test plan