
Multi-agent support from PR#163 + Support for Kimi2.5 and Minimax2.5 via Claude Code + Clean up #167

Closed
dpbmaverick98 wants to merge 50 commits into TinyAGI:main from dpbmaverick98:kimi-mini-support

Conversation

@dpbmaverick98

Summary

  • Multi-agent architecture from PR#163 — agents can have individual providers, models, working directories, and system prompts; conversations are isolated per-agent via CLAUDE_CONFIG_DIR;
    agent-to-agent communication via @TEAMMATE mentions
  • Kimi 2.5 and MiniMax M2.5 support — both providers expose Anthropic-compatible APIs, routed through Claude Code via ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY env vars; two-level API key fallback
    (per-agent → global); setup wizard, CLI commands, and invocation all updated
  • Codebase cleanup — removed duplicate provisioning logic (~60 lines), centralized scattered patterns (WORKSPACE_DEFAULT_PATH, generateId(), writeJsonFile()/readJsonFile(), Provider union type), added
    shell helpers (require_settings_file, get_agent_json) replacing 13 guard blocks and 8 jq queries, fixed N+1 statSync in chats API, consolidated double jq calls in CLI

What's included

Kimi/MiniMax provider support (17 commits)

  • src/lib/types.ts — Provider union type, apiKey field on AgentConfig, kimi/minimax sections in Settings
  • src/lib/config.ts — resolveApiKey(), getProviderBaseUrl(), provider auto-detection
  • src/lib/invoke.ts — extraEnv parameter on runCommand(), kimi/minimax branch in invokeAgent() with full env isolation
  • lib/agents.sh — interactive agent creation for kimi/minimax with API key prompts and validation
  • lib/setup-wizard.sh — global setup flow for kimi/minimax with jq-safe JSON generation
  • tinyclaw.sh — provider and model commands extended for all 5 providers

Bug fixes (5 commits from audit)

  • lib/agents.sh — critical elif after else bash syntax error that broke kimi/minimax model selection
  • lib/setup-wizard.sh — wrong Kimi validation URL (/v1/models → /coding/models) in global setup path
  • lib/setup-wizard.sh — JSON injection risk in MODELS_SECTION (API keys with special chars)
  • tinyclaw.sh — tinyclaw model display falling through to anthropic for kimi/minimax/opencode

Cleanup (12 commits)

  • Removed 60-line duplicate provisionAgentWorkspace() from routes/agents.ts
  • WORKSPACE_DEFAULT_PATH constant replaces 6 inline require('os').homedir() calls
  • writeJsonFile/readJsonFile helpers replace 4+2 inline JSON I/O sites
  • generateId() replaces 5 hand-rolled Math.random() patterns across routes and channels
  • parseJsonField() eliminates repeated ternary JSON.parse guards
  • Provider union type enforces valid provider strings at compile time
  • Shell helpers reduce agents.sh and teams.sh by ~60 lines
  • Performance: readdirSync({ withFileTypes: true }) in chats API, single-jq model display, single-write .env generation

Test plan

  • npm run build passes
  • bash -n lib/agents.sh && bash -n lib/setup-wizard.sh && bash -n tinyclaw.sh — all shell scripts parse cleanly
  • tinyclaw setup with kimi provider — validation URL hits /coding/models, API key with special chars written correctly
  • tinyclaw agent add with kimi/minimax — model selection shows correct defaults (kimi2.5 / MiniMax-M2.5)
  • tinyclaw model — displays correct provider/model for all 5 provider types
  • tinyclaw provider kimi --api-key — sets global key, falls back correctly per-agent
  • Send message to kimi/minimax agent — ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY injected, ANTHROPIC_AUTH_TOKEN cleared
  • Agent sessions isolated — -c flag resumes correct conversation per-agent via CLAUDE_CONFIG_DIR

Kimi Claw and others added 30 commits March 5, 2026 19:17
What:
- Add three new tables: conversations, conversation_responses, conversation_pending_agents
- Add 11 new functions for conversation state management
- Follow existing SQLite patterns (WAL mode, transactions, indexes)

Why:
Previously, all conversation state was stored only in memory (Map<string, Conversation>).
This meant that if the queue-processor crashed or was restarted during a team
conversation, all active conversation state was lost. Agents would continue
processing their messages, but the conversation would never complete because
the pending counter and response aggregation were gone.

This change persists conversation state to SQLite, enabling:
1. Restart recovery - conversations can be resumed after crash
2. State inspection - active conversations can be queried via API
3. Debuggability - conversation history is preserved

Assumptions:
- Conversations are short-lived (minutes, not days), so we don't need to
  persist the full Conversation object (Sets, Maps). We persist the minimal
  state needed to reconstruct: counters, IDs, and responses.
- Files referenced in conversations are not persisted (they're ephemeral).
- The existing in-memory conversations Map is still used for fast access
  during normal operation; the DB is the source of truth for recovery.

Pattern compliance:
- Uses transaction().immediate() for atomic operations (like claimNextMessage)
- Uses INSERT OR REPLACE for upserts
- Uses ON DELETE CASCADE for cleanup
- Follows existing naming conventions and timestamp formats
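The three tables can be pictured roughly as follows. This is a hypothetical reconstruction from the description above — column names and types are assumptions, not the actual migration:

```sql
-- Sketch only: minimal state needed to reconstruct a conversation
-- (counters, IDs, responses), per the commit message.
CREATE TABLE IF NOT EXISTS conversations (
    id         TEXT PRIMARY KEY,
    team_id    TEXT,
    status     TEXT NOT NULL DEFAULT 'active',  -- 'active' | 'completed'
    pending    INTEGER NOT NULL DEFAULT 0,      -- counter, not a Set
    created_at INTEGER NOT NULL,                -- ms timestamps, per codebase convention
    updated_at INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS conversation_responses (
    conversation_id TEXT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
    agent_id        TEXT NOT NULL,
    response        TEXT NOT NULL,
    created_at      INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS conversation_pending_agents (
    conversation_id TEXT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
    agent_id        TEXT NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_conversations_status ON conversations(status, updated_at);
```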
…very

What:
- Remove agentProcessingChains Map that enforced sequential processing per agent
- Refactor processMessage to use fire-and-forget pattern for invokeAgent
- Add handleSimpleResponse and handleTeamResponse async handlers
- Add handleTeamError for error recovery in team contexts
- Add startup recovery logic to load active conversations from DB
- Add conversation pruning maintenance interval

Why:
Previously, the queue-processor used a Promise chain per agent (agentProcessingChains)
to ensure messages were processed sequentially. This caused the "freeze" problem:
if agent A was processing a long request (e.g., 30s Claude API call), no other
messages to agent A could be processed until it completed.

This change makes invokeAgent fire-and-forget:
1. processMessage starts invokeAgent and returns immediately
2. The response is handled asynchronously by handleSimpleResponse/handleTeamResponse
3. Multiple messages to the same agent can be in-flight simultaneously
4. The queue processor never blocks on slow API calls

Additionally, conversation state is now persisted to SQLite (from previous commit)
and recovered on startup. This means if the queue-processor restarts during a team
conversation, it will resume where it left off instead of losing all state.

Assumptions:
- invokeAgent is idempotent enough that reprocessing after a crash is safe
- The DB transaction in decrementPendingInDb prevents race conditions
- In-memory conversations Map is still used for fast access; DB is for recovery
- Fire-and-forget is acceptable because we have retry logic via dead letter queue

Breaking changes:
- Removed per-agent sequential processing guarantee. Previously messages to the
  same agent were guaranteed to process sequentially. Now they process concurrently.
  This is actually the desired behavior (no freezing), but it means agents must
  handle concurrent requests if they share state.

Pattern compliance:
- Uses async/await for response handlers (cleaner than callbacks)
- Uses DB functions from previous commit for persistence
- Maintains existing event emission for observability
- Preserves all existing error handling and logging
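The fire-and-forget dispatch can be sketched as below; `processMessage`, `invokeAgent`, and the handler names follow the commit message, but the signatures are illustrative assumptions:

```typescript
type Message = { id: string; agentId: string; content: string; teamId?: string };

function processMessage(
  msg: Message,
  invokeAgent: (m: Message) => Promise<string>,
  onResponse: (m: Message, response: string) => Promise<void>,
  onError: (m: Message, err: unknown) => Promise<void>,
): void {
  // Start the (possibly slow) agent call but do NOT await it here: the queue
  // loop returns immediately and can claim the next message, so a 30s call
  // to one agent no longer freezes the queue.
  invokeAgent(msg)
    .then((response) => onResponse(msg, response))  // handleSimpleResponse / handleTeamResponse
    .catch((err) => onError(msg, err));             // handleTeamError / dead letter queue
}
```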
What:
- Add src/lib/signals.ts with file-based signaling system
- Modify enqueueResponse to signal channel when response is ready
- Update Discord, Telegram, and WhatsApp clients to use push notifications
- Add 10-second fallback polling for reliability

Why:
Previously, channel clients polled /api/responses/pending every 1-2 seconds.
This caused unnecessary latency (average 0.5-1s delay) and wasted CPU/IO on
both the client and server.

This change implements push notifications via file system:
1. When enqueueResponse is called, it writes a signal file (.tinyclaw/signals/{channel}.ready)
2. Channel clients use fs.watch() to get notified immediately
3. Response latency drops from ~1s to near-zero
4. Fallback polling every 10s catches any missed signals

Assumptions:
- File system watch (fs.watch) is reliable enough for this use case
- Signal files are cleaned up after processing to prevent duplicate triggers
- 10-second fallback is acceptable for missed signals (rare)
- All three channel clients (Discord, Telegram, WhatsApp) are on the same machine

Trade-offs:
- File-based signaling only works for local processes (same machine)
- If we need distributed deployment later, this would need to be replaced
  with something like Redis pub/sub or NATS
- File system watches can be unreliable on some platforms (we have fallback)

Pattern compliance:
- Uses existing TINYCLAW_HOME for signal directory
- Follows existing error handling patterns
- Maintains backward compatibility (polling still works)
- Clean shutdown with unwatch() on SIGINT/SIGTERM
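A minimal sketch of the file-based signaling, assuming the `.tinyclaw/signals/{channel}.ready` layout from the commit message (function names are illustrative):

```typescript
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const DEFAULT_SIGNAL_DIR = path.join(os.homedir(), '.tinyclaw', 'signals');

// Called from enqueueResponse: touching the file wakes any fs.watch()
// listener on the signal directory.
function signalChannel(channel: string, dir: string = DEFAULT_SIGNAL_DIR): void {
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(path.join(dir, `${channel}.ready`), String(Date.now()));
}

// Channel client side: push notification instead of 1-2s polling.
// A 10s fallback poll (not shown) catches missed watch events.
function watchSignals(channel: string, dir: string, onReady: () => void): fs.FSWatcher {
  fs.mkdirSync(dir, { recursive: true });
  return fs.watch(dir, (_eventType, filename) => {
    if (filename === `${channel}.ready`) onReady();
  });
}
```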
…ing guarantee

What:
- Make emitEvent() async to allow awaiting event listener completion
- Update EventListener type to support async listeners
- Add await to all emitEvent() calls in queue-processor.ts:
  - response_ready (handleSimpleResponse)
  - chain_handoff (handleTeamResponse)
  - team_chain_start (processMessage)
- Make completeConversation() async and await team_chain_end emission
- Wrap conversation recovery in async recoverConversations() function
- Move startup logging into async IIFE to properly await emitEvent

Why:
The visualizer relies on event ordering: chain_step_start → chain_step_done →
response_ready. Without await, events could be emitted in order but processed
out of order due to async listener scheduling.

This was a critical issue found in the NATS implementation (missing awaits on
publishEvent calls). The same pattern exists here - emitEvent was fire-and-forget,
so the visualizer could receive events out of sequence under high concurrency.

By awaiting emitEvent, we guarantee:
1. Events are processed by listeners before continuing
2. Visualizer sees events in correct order
3. SSE clients receive events sequentially

Assumptions:
- Event listeners are fast enough that awaiting them won't block processing
- The slight overhead of await is acceptable for ordering guarantees
- Listeners that need to be fire-and-forget should internally queue work

Breaking changes:
- emitEvent() now returns Promise<void> instead of void
- completeConversation() now returns Promise<void>
- Code using these functions must now await them

Pattern compliance:
- Matches the fix applied in NATS branch (adding awaits to publishEvent)
- Uses async/await consistently throughout the codebase
- Maintains error handling (try/catch around await)
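The awaited emission could look roughly like this; the listener registry shape is an assumption:

```typescript
type EventListener = (event: string, data: unknown) => void | Promise<void>;
const listeners: EventListener[] = [];

function onEvent(fn: EventListener): void {
  listeners.push(fn);
}

// Returning Promise<void> (the breaking change above) lets callers await
// listener completion, so events reach SSE clients in emission order.
async function emitEvent(event: string, data: unknown): Promise<void> {
  for (const fn of listeners) {
    try {
      await fn(event, data); // wait for each listener before continuing
    } catch (err) {
      console.error(`event listener failed for ${event}:`, err);
    }
  }
}
```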
What:
- Add next_retry_at column to messages table for scheduling retries
- Update failMessage() to calculate exponential backoff with jitter
- Update claimNextMessage() to respect next_retry_at timestamp
- Add migration for existing databases (ALTER TABLE)

Why:
Previously, failed messages were immediately retried (status reset to 'pending').
Under high load or during outages, this caused a "thundering herd" problem:
all failed messages would retry simultaneously, overwhelming the system.

This change implements exponential backoff with jitter:
- Retry 1: ~100ms delay
- Retry 2: ~200ms delay
- Retry 3: ~400ms delay
- Retry 4: ~800ms delay
- Retry 5: ~1600ms delay (capped at 30s)

Plus 0-100ms random jitter to spread out retries and prevent synchronized
retry storms.

Assumptions:
- Messages that fail temporarily (rate limits, network blips) will succeed
  after a short delay
- Spreading retries over time is better than immediate retry
- 5 retries with exponential backoff is sufficient for transient failures

Implementation details:
- ORDER BY clause prioritizes messages without next_retry_at (new messages)
- Then orders by next_retry_at to process earliest scheduled first
- Messages with future next_retry_at are skipped until their time comes

Pattern compliance:
- Uses same transaction pattern as claimNextMessage for atomicity
- Maintains backward compatibility (next_retry_at is nullable)
- Follows existing logging conventions
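The backoff schedule above reduces to a small pure function; the constants (100ms base, 30s cap, 0-100ms jitter) come from the commit message, while the function name is hypothetical:

```typescript
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 30_000;
const JITTER_MS = 100;

// Computes the next_retry_at timestamp for a failed message.
function nextRetryAt(retryCount: number, now: number = Date.now()): number {
  // Exponential: 100ms, 200ms, 400ms, 800ms, 1600ms ... capped at 30s.
  const backoff = Math.min(BASE_DELAY_MS * 2 ** (retryCount - 1), MAX_DELAY_MS);
  // 0-100ms random jitter spreads out synchronized retry storms.
  const jitter = Math.floor(Math.random() * JITTER_MS);
  return now + backoff + jitter;
}
```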
What:
- Add src/lib/heartbeat.ts with heartbeat read/write functions
- Queue-processor writes heartbeat every 5 seconds with timestamp, pid, uptime
- Channel clients check heartbeat staleness in fallback polling loop
- Atomic file write (temp + rename) to prevent corruption
- Clean shutdown removes heartbeat file

Why:
File-based signaling (signals.ts) has no way to detect when the queue-processor
crashes. If queue-processor dies:
- Signal files stop being written (but clients don't know)
- Clients keep watching, unaware of the crash
- 10-second fallback polling continues but never gets new responses

With heartbeat monitoring:
- Channel clients detect stale heartbeat (default: 15s threshold)
- Log warning when queue-processor may have crashed
- Users can see the issue and restart the service

This is simpler than NATS's consumer iterator monitoring but achieves the same
goal: detecting when the message processor is unhealthy.

Assumptions:
- 5-second heartbeat interval is frequent enough for detection
- 15-second staleness threshold (3 missed heartbeats) is reasonable
- File system timestamps are accurate enough for health checks
- Channel clients should log warnings but not auto-restart (user decision)

Pattern compliance:
- Uses same TINYCLAW_HOME directory as other state files
- Follows existing error handling (log and continue)
- Atomic write pattern prevents corrupted heartbeat files
- Cleanup on SIGINT/SIGTERM for graceful shutdown
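The atomic write and staleness check might be sketched as follows, assuming a JSON payload with the fields listed above (the file name and exact layout are guesses):

```typescript
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const DEFAULT_DIR = path.join(os.homedir(), '.tinyclaw');
const STALE_THRESHOLD_MS = 15_000; // 3 missed 5-second heartbeats

// Queue-processor side: write to a temp file, then rename. Rename is atomic
// on POSIX, so readers never observe a partially written heartbeat.
function writeHeartbeat(dir: string = DEFAULT_DIR): void {
  const payload = JSON.stringify({ timestamp: Date.now(), pid: process.pid, uptime: process.uptime() });
  const tmp = path.join(dir, 'heartbeat.json.tmp');
  fs.writeFileSync(tmp, payload);
  fs.renameSync(tmp, path.join(dir, 'heartbeat.json'));
}

// Channel-client side: a missing or old heartbeat means the queue-processor
// may have crashed; clients log a warning rather than auto-restarting.
function isHeartbeatStale(dir: string = DEFAULT_DIR, now: number = Date.now()): boolean {
  try {
    const { timestamp } = JSON.parse(fs.readFileSync(path.join(dir, 'heartbeat.json'), 'utf8'));
    return now - timestamp > STALE_THRESHOLD_MS;
  } catch {
    return true; // missing or unreadable heartbeat counts as stale
  }
}
```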
What:
- Add MAX_MESSAGE_SIZE constant (1MB - Claude API limit)
- Add validateMessage() function to check message size
- Validate message before both invokeAgent calls (simple and team contexts)
- Fail message immediately with clear error if too large

Why:
Previously, messages larger than 1MB would be sent to Claude API, which would
reject them with an error. The error would trigger retry logic, wasting
resources on a message that can never succeed.

With validation:
1. Message size checked before any API call
2. Oversized messages fail immediately (no retry)
3. Clear error message logged for debugging
4. Prevents wasted API calls and retry cycles

Assumptions:
- 1MB is the appropriate limit for Claude API
- Message size is the primary validation needed (other validations may be added)
- Failing immediately is better than retrying oversized messages

Pattern compliance:
- Uses existing failMessage() for consistency
- Logs error with context for debugging
- Returns early (guard clause pattern)
- Non-breaking change (new validation, no API changes)
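A minimal sketch of the size guard, assuming `MAX_MESSAGE_SIZE` is a byte count and the validator returns a result object (the real signature may differ):

```typescript
const MAX_MESSAGE_SIZE = 1_000_000; // ~1MB, the Claude API limit cited above

// Checked before invokeAgent: an oversized message can never succeed, so it
// is failed immediately instead of burning retries on guaranteed rejections.
function validateMessage(content: string): { ok: true } | { ok: false; error: string } {
  const bytes = Buffer.byteLength(content, 'utf8');
  if (bytes > MAX_MESSAGE_SIZE) {
    return { ok: false, error: `message is ${bytes} bytes, exceeds ${MAX_MESSAGE_SIZE} byte limit` };
  }
  return { ok: true };
}
```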
… with conversation lock

**What**
Added withConversationLock() to handleTeamResponse() and handleTeamError() to prevent
race conditions when multiple agents finish simultaneously for the same conversation.

**Why (Critical Race Condition)**
When Agent A and Agent B both complete for the same conversation concurrently:
- Both call handleTeamResponse(conv) with a shared conv object reference
- Both modify conv.totalMessages and conv.pending without synchronization (not atomic)
- Both can reach if (newPending === 0) and call completeConversation(conv) twice
- Results in: lost updates, duplicate completion events, corrupted state

Example timeline:
  Agent A finishes → handleTeamResponse starts
    - persistResponse(conv_id, agentA, responseA)
    - conv.totalMessages++ (read=5, write=6)
  Agent B finishes → handleTeamResponse starts (same conv reference)
    - persistResponse(conv_id, agentB, responseB)
    - conv.totalMessages++ (read=5, write=6)  ← Lost Agent A's increment!

Result: conv.totalMessages = 6 instead of 7. Conversation state corrupted.

**Solution**
Wrapped function body with withConversationLock(conv.id) which:
- Serializes updates: only one agent modifies conv at a time
- Prevents concurrent modifications to same conversation
- Ensures only one agent reaches completion check

**Similar Fix Applied To**
- handleTeamResponse(): Wraps entire response handling logic
- handleTeamError(): Same pattern for error handling

**Assumptions**
1. Fire-and-forget pattern is maintained (invoke is still async)
2. Lock overhead acceptable (milliseconds per conversation)
3. Conversation objects exist long enough for all agents to complete
4. Lock gracefully handles conversation deletion by cleanup in conversation.ts

**Testing Considerations**
- Test with 3+ agents finishing within milliseconds of each other
- Verify team_chain_end event emitted exactly once
- Check conversation state consistency in database
- Monitor for deadlocks (lock implementation has timeout handling)
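One plausible implementation of withConversationLock() is a per-conversation promise chain; this is a sketch of the serialization semantics described above, not the actual lock code:

```typescript
const locks = new Map<string, Promise<void>>();

// Serializes handlers per conversation id: a second caller for the same id
// waits until the first caller's critical section finishes.
async function withConversationLock<T>(id: string, fn: () => Promise<T>): Promise<T> {
  const prev = locks.get(id) ?? Promise.resolve();
  let release!: () => void;
  const next = new Promise<void>((resolve) => (release = resolve));
  locks.set(id, next);
  await prev;                 // wait for earlier holders of this conversation's lock
  try {
    return await fn();        // critical section: only one handler mutates conv
  } finally {
    release();
    if (locks.get(id) === next) locks.delete(id); // cleanup when last holder exits
  }
}
```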

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…or handling

**What**
Changed clearSignal() from check-then-delete pattern to try-delete with selective
error handling. Now ignores ENOENT errors when file is already deleted.

**Why (Race Condition)**
Previous implementation used check-then-delete pattern:
```typescript
if (fs.existsSync(signalFile)) {
    fs.unlinkSync(signalFile);  // TOCTOU: file deleted between check and delete
}
```

This creates a Time-Of-Check-Time-Of-Use (TOCTOU) race condition:
1. Process A checks: file exists
2. Process B checks: file exists
3. Process A deletes file
4. Process B tries to delete: ENOENT error
5. Error not caught, may propagate and crash

Additionally, fs.existsSync can be slow on high-latency filesystems.

**Solution**
Direct try-delete approach with selective error handling:
```typescript
try {
    fs.unlinkSync(signalFile);
} catch (error: any) {
    if (error?.code !== 'ENOENT') {
        throw error;  // Re-throw unexpected errors
    }
    // Ignore ENOENT: normal when another process deleted first
}
```

Benefits:
- Atomic delete operation (no TOCTOU window)
- Faster (one syscall instead of two)
- Graceful: ignores benign ENOENT
- Still fails on real errors (permissions, disk full, etc.)

**When This Occurs**
When multiple channel clients process responses simultaneously:
- Telegram client calls clearSignal('telegram')
- WhatsApp client calls clearSignal('whatsapp')
- If they shared a signal file, both would try to delete it → first succeeds, second gets ENOENT

Current likelihood: Low (different channels have different files) but possible
if signal file corruption or manual cleanup happens concurrently.

**Assumptions**
1. ENOENT is expected and benign (file already deleted)
2. Other errors (EACCES, EIO) should propagate and fail loudly
3. fs.unlinkSync is atomic (POSIX guarantee)
4. Process has correct permissions to delete signal files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…resilience

**What**
Added recoverStaleConversations() function to detect and recover conversations
that are stuck in 'active' state, marking them as 'completed' so they can be
purged and don't cause memory leaks.

Called on startup and periodically (every 5 minutes) during normal operation.

**Why**
Conversations can become stuck in 'active' state if:
1. queue-processor crashes while agents are processing
2. Network failure prevents agent response from being persisted
3. Bug in agent handler prevents proper completion
4. Database corruption in conversation_pending_agents table

Without recovery:
- In-memory conversations Map grows unbounded
- Stuck conversations never emit team_chain_end event
- Users see conversation as "in progress" forever
- Memory leak: conversations never garbage collected

With recovery:
- Conversations marked as 'completed' after 30 min of inactivity
- Allows pruneOldConversations() to delete them
- Prevents memory leaks and orphaned conversations
- Teams can be retried by user if truly needed

**Implementation Details**

```typescript
export function recoverStaleConversations(staleThresholdMs = 30 * 60 * 1000): number {
    const cutoff = Date.now() - staleThresholdMs;
    return getDb().prepare(`
        UPDATE conversations
        SET status = 'completed'
        WHERE status = 'active' AND updated_at < ?
    `).run(cutoff).changes;
}
```

**Assumptions**
1. 30-minute threshold is reasonable for detecting stuck conversations
2. Marking as 'completed' is safe (responses already persisted to DB)
3. Periodic recovery (every 5 min) catches stuck conversations quickly
4. Users can retry conversation if legitimate work was interrupted

**Trade-offs**
- Possible data loss if agent is legitimately processing for 30+ min
  (Mitigation: user can retry conversation, which is rare use case)
- Memory will grow to peak of ~30 min of stuck conversations
  (Acceptable: periodic pruning cleans them up)

**Testing Considerations**
- Verify conversations marked as completed can be queried
- Check team_chain_end event emitted when recovery completes conversation
- Monitor logs for false positives (legitimate long-running conversations)
- Test crash scenarios to verify recovery works

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…documentation

**What**
Fixed two issues in recoverStaleConversations():
1. Don't update updated_at when marking as completed (keeps original timestamp)
2. Enhanced documentation explaining why team_chain_end is NOT emitted

**Why Issue 1: Pruning Timestamp Reset**

Previous code:
```typescript
UPDATE conversations
SET status = 'completed', updated_at = ?  ← WRONG: resets timestamp
WHERE status = 'active' AND updated_at < ?
```

Problem timeline:
- T=0: Conversation starts, updated_at = T0
- T=30min: Conversation gets stuck (no updates)
- T=30min: Recovery runs, marks completed, sets updated_at = T30
- T=30min+24h: pruneOldConversations() looks for updated_at < 24h ago
- Result: Conversation not pruned until T=30min+24h (stays in DB 24+ hours)

Better approach:
```typescript
UPDATE conversations
SET status = 'completed'  ← Keep original updated_at timestamp
WHERE status = 'active' AND updated_at < ?
```

Now pruning works correctly:
- Stale conversation marked completed at T=30min
- Original updated_at = T0 (30+ min ago)
- pruneOldConversations() deletes it when updated_at < 24h ago (works!)

**Why Issue 2: Missing team_chain_end Event**

Recovery completion is NOT a natural completion:
- Natural completion: All agents finish, responses aggregated, user gets result
- Stale recovery: Conversation abandoned after crash, responses may be incomplete

Implications:
- Visualizer won't show recovery as "completed" (correct - it's artificial)
- Events not sent (prevents false positives in monitoring)
- Users understand recovery = lost work, not success

Alternative considered: Emit team_chain_end with recovery flag
- Rejected: Would confuse visualizer and monitoring
- Recovery should be silent cleanup, not broadcast as completion

**Assumptions**
1. Keeping original updated_at is correct behavior (allows proper pruning)
2. Silent recovery is acceptable (users can retry if needed)
3. 30-minute stale threshold correctly identifies stuck conversations
4. Not emitting events prevents false positives in event-based systems

**Testing**
Verify:
1. Stale conversation marked as completed
2. completed_at timestamp NOT changed (still ~30min old)
3. pruneOldConversations() deletes it after 24h from original time
4. No team_chain_end event in logs for recovered conversations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…covery visibility

What:
- Reduce stale conversation threshold from 30min to 10min (Gap 1)
- Add getStaleConversations() to get details of stuck conversations
- Add WARN-level logging with team/conversation details on recovery (Gap 2)
- Emit crash_recovery event for visualizer/monitoring (Gap 2)

Why:
Gap 1 (Slow Detection): 30min threshold meant users could lose up to 30 minutes
of work if agent crashed. 10min reduces data loss window while still giving
slow agents reasonable grace period.

Gap 2 (Silent Recovery): Recovery was invisible (INFO level). Ops couldn't
tell if completion was normal or crash recovery. WARN logs + events provide
visibility for monitoring and alerting.

Implementation:
- getStaleConversations() returns {id, teamId, duration} for each stuck conv
- Startup recovery: WARN log with 🔴 CRASH RECOVERY prefix + event per conv
- Periodic recovery: WARN log with 🔴 PERIODIC RECOVERY prefix + events
- Events include conversationId, teamId, stuckForMs, recoveredAt/source

Assumptions:
- 10min is acceptable grace period for slow agents (2x NATS heartbeat)
- WARN level is appropriate for crash recovery (not ERROR since it's expected)
- Events emitted before actual recovery (state change happens after logging)

Risk: None (logging only, no behavior change)
Testing: Kill agent mid-processing, verify WARN logs + events after 10min
What:
- Add backup.sh script for daily automated SQLite backups
- Backups stored in ~/.tinyclaw/backups/ with 7-day retention
- Add database integrity check (PRAGMA integrity_check) on startup
- Copy WAL files if present (WAL mode consistency)
- Verify backup is readable before considering it valid

Why:
SQLite database is a single point of failure. Without backups, corruption or
accidental deletion means total data loss. With backups, worst case is losing
last 24 hours of conversation state (acceptable for production use).

Usage:
  ./backup.sh                    # Manual backup
  crontab -e                     # Add to cron for daily backups
  0 2 * * * /path/to/backup.sh   # Daily at 2 AM

Recovery:
  cp ~/.tinyclaw/backups/tinyclaw_YYYYMMDD_HHMMSS.db ~/.tinyclaw/tinyclaw.db
  rm ~/.tinyclaw/tinyclaw.db-wal ~/.tinyclaw/tinyclaw.db-shm 2>/dev/null || true

Assumptions:
- 7-day retention is sufficient for debugging corruption causes
- Daily backups are frequent enough (conversations are recoverable)
- Storage is cheap (~1-5 MB per backup, 7 backups = ~35 MB max)
- Manual recovery is acceptable (rare event, ops can handle)

Risk: None (additive, no behavior changes)
Testing: Run backup.sh, verify 7 daily backups exist and are readable
What:
- Add outstanding_requests table with request_id, conversation_id, from_agent, to_agent
- Add status field: pending | acked | responded | failed | escalated
- Add deadline tracking: ack_deadline (5s default), response_deadline (5min default)
- Add retry tracking: retry_count, max_retries (default 3)
- Add 10 functions for request lifecycle management

Why:
This implements the primitive request-reply pattern with timeouts to solve the
"ping pong" message drop problem. When agent A asks agent B to do something:

1. Create outstanding request with deadlines
2. Agent B must ACK (acknowledge) within timeout
3. Agent B must RESPOND with result within timeout
4. If deadlines expire → retry or escalate

This is how distributed systems worked before fancy protocols - just timeouts
and retries at the application level.

Functions added:
- createOutstandingRequest() - Create new request when handoff happens
- acknowledgeRequest() - Agent B confirms receipt
- respondToRequest() - Agent B provides result
- failRequest() - Mark permanent failure
- escalateRequest() - Escalate to human
- getRequestsNeedingRetry() - Find expired pending requests
- getRequestsNeedingEscalation() - Find expired acked requests
- incrementRequestRetry() - Retry with new deadline
- getRequest() - Lookup by ID
- getPendingRequestsForConversation() - Get all pending for conv
- pruneOldRequests() - Cleanup old completed requests

Assumptions:
- 5 second ACK timeout is reasonable for agent processing
- 5 minute response timeout balances speed vs complex tasks
- 3 retries before escalation is sufficient
- SQLite is fast enough for this tracking (no separate service needed)

Pattern compliance:
- Uses same SQLite patterns (WAL, transactions, indexes)
- Foreign key to conversations table with CASCADE delete
- Timestamps in milliseconds (consistent with rest of codebase)
- Debug/Warn logging for observability
… handoffs

What:
- Modify enqueueInternalMessage() to create outstanding request when agent A asks agent B to do something
- Include request_id in the message payload as [REQUEST:xxx] prefix
- Import createOutstandingRequest from db.ts

Why:
This is the integration point that actually uses the outstanding_requests table.
Previously, agent handoffs were fire-and-forget - no tracking if agent responds.
Now we create request with ACK deadline (5s) and response deadline (5min).

The request_id in the message allows the receiving agent to acknowledge and respond.

Assumptions:
- Agents can parse [REQUEST:xxx] prefix from messages
- 5 second ACK timeout is enough for agent to receive and parse
- 5 minute response timeout is enough for agent to process task

Breaking changes:
- Internal messages now include [REQUEST:xxx] prefix
- Backward compatible (old agents can ignore the prefix)
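Hypothetical helpers for the [REQUEST:xxx] prefix; only the prefix format itself appears in the commit message, the rest is assumed:

```typescript
const REQUEST_PREFIX = /^\[REQUEST:([^\]]+)\]\s*/;

// Sender side (enqueueInternalMessage): tag the handoff with its request id.
function addRequestPrefix(requestId: string, message: string): string {
  return `[REQUEST:${requestId}] ${message}`;
}

// Receiver side: pull out the id for acknowledgeRequest(); old agents that
// don't know the prefix still see a usable message body.
function extractRequestId(message: string): { requestId: string | null; body: string } {
  const match = message.match(REQUEST_PREFIX);
  if (!match) return { requestId: null, body: message };
  return { requestId: match[1], body: message.slice(match[0].length) };
}
```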
What:
- Add checkRequestTimeouts() function to detect expired ACK and response deadlines
- Import outstanding request functions from db.ts
- Extract and acknowledge request_id from messages when agent receives them
- Add request_escalated event for monitoring
- Integrate timeout checking into periodic maintenance (every 5 min)

Why:
This completes the request-reply pattern implementation:
1. When agent A sends message to agent B, request is created with deadlines
2. When agent B receives message, request is acknowledged (ACK)
3. If no ACK within 5s → retry with extended deadline
4. If no response within 5min → escalate to human

This prevents the ping pong drop problem by:
- Detecting when agent B doesn't receive the message (no ACK)
- Detecting when agent B receives but doesn't respond (timeout)
- Escalating instead of silently dropping

Assumptions:
- Request ID is in format [REQUEST:xxx] at start of message
- Agents can still process messages even with prefix (or we strip it)
- 5 minute check interval is frequent enough for timeouts
- Escalation is logged and emitted as event for monitoring

Breaking changes:
- None - this is additive monitoring on top of existing flow
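The retry/escalate decision in steps 3-4 can be modeled as a pure function over a request row; the names and defaults follow the earlier commit, but the shape is an assumption (the real code queries SQLite):

```typescript
type RequestStatus = 'pending' | 'acked' | 'responded' | 'failed' | 'escalated';

interface OutstandingRequest {
  status: RequestStatus;
  ackDeadline: number;      // ms epoch; 5s default per the commit
  responseDeadline: number; // ms epoch; 5min default per the commit
  retryCount: number;
  maxRetries: number;
}

type TimeoutAction = 'none' | 'retry' | 'escalate';

function timeoutAction(req: OutstandingRequest, now: number): TimeoutAction {
  if (req.status === 'pending' && now > req.ackDeadline) {
    // No ACK: retry with an extended deadline until retries run out,
    // then escalate to a human instead of silently dropping.
    return req.retryCount < req.maxRetries ? 'retry' : 'escalate';
  }
  if (req.status === 'acked' && now > req.responseDeadline) {
    return 'escalate'; // received but never answered
  }
  return 'none'; // not expired, or already responded/failed/escalated
}
```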
What:
- Import getPendingRequestsForConversation from db.ts
- In handleTeamResponse, check if agent's response completes an outstanding request
- If matching request found (same conversation, agent was target, status=acked),
  call respondToRequest to mark it complete

Why:
The request-reply pattern requires the response to be tracked. Previously we:
1. Created request when agent A mentioned agent B
2. Acknowledged when agent B received
3. But never marked complete when agent B responded

This meant requests would stay in 'acked' state forever, eventually escalating
even though the agent actually responded.

Now when agent B responds, we find the matching request and mark it complete.

Assumptions:
- Agent responds within the same conversation
- Only one pending request per agent per conversation (find() returns first)
- Response content is stored in the request record for audit

Risk: Low - additive check, doesn't change response handling
Delete duplicate log('INFO', ) line.
Was introduced when adding request ACK handling code.

Risk: None (deleting duplicate)
Testing: Verify log appears once per message
…Request

Previously:
- acknowledgeRequest checked ack_deadline >= now
- respondToRequest checked response_deadline >= now

Problem: If ACK/response arrives 1ms after deadline, silently fails.
The timeout checker would retry, but agent already processed it.

Fix: Remove deadline checks from write path. Let timeout checker handle
expired requests. Accept valid work even if slightly late.

Also added better logging for already-acked/already-responded cases.

Risk: Low - timeout checker still runs, just won't reject late-but-valid ACKs
Testing: Send request, wait for deadline, verify ACK still accepted
…e agent

Previously: find() returned first match, potentially wrong request.

Now:
- getPendingRequestsForConversation() orders by created_at ASC (FIFO)
- filter() returns all acked requests for the agent
- All matching requests marked as responded

If agent B responds, it's responding to everything it was asked.

Risk: Low - marking more requests complete is safer than wrong one
Testing: Have agent A mention agent B twice, verify both marked complete
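The FIFO-plus-filter behaviour above can be sketched as follows. The record shape and helper names are assumptions for illustration, not the actual db.ts API.

```typescript
interface Req { id: string; targetAgent: string; status: string; createdAt: number; }

// created_at ASC ⇒ oldest request first (FIFO), as the commit describes.
function ackedRequestsFor(reqs: Req[], agentId: string): Req[] {
  return reqs
    .filter(r => r.targetAgent === agentId && r.status === 'acked')
    .sort((a, b) => a.createdAt - b.createdAt);
}

// Mark every acked request aimed at this agent as responded: if agent B
// responds, it is responding to everything it was asked.
function markAllResponded(reqs: Req[], agentId: string): number {
  const matches = ackedRequestsFor(reqs, agentId);
  for (const r of matches) r.status = 'responded';
  return matches.length;
}
```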
… checks

Changes:
- Bump max_retries from 3 to 5 for request retries (gives recoverStaleMessages more time)
- Add pruneOldRequests() to maintenance loop (hourly cleanup)
- Run checkRequestTimeouts every 30s instead of 5min (faster failure detection)
- Separate timeout checks from main maintenance interval

Why:
- 3 retries was too aggressive given 5min check interval
- Old requests never got cleaned up (memory leak)
- 5min check interval meant 5-10min delay detecting ACK timeouts

Risk: Low - additive maintenance, conservative retry bump
Testing: Verify timeout checks run every 30s, old requests pruned after 24h
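The split scheduling above can be sketched as two independent timers. Intervals follow the commit text; the function bodies here are placeholders, not the real maintenance code.

```typescript
const THIRTY_SECONDS = 30_000;
const ONE_HOUR = 3_600_000;

function checkRequestTimeouts(): void { /* retry or escalate overdue requests */ }
function pruneOldRequests(): void { /* delete requests older than 24h */ }

function startMaintenance(): {
  timeoutTimer: ReturnType<typeof setInterval>;
  pruneTimer: ReturnType<typeof setInterval>;
} {
  // Fast loop: detect missed ACKs within ~30s instead of up to 5-10min.
  const timeoutTimer = setInterval(checkRequestTimeouts, THIRTY_SECONDS);
  // Slow loop: hourly cleanup keeps the requests table from growing unbounded.
  const pruneTimer = setInterval(pruneOldRequests, ONE_HOUR);
  return { timeoutTimer, pruneTimer };
}
```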
What:
- Add docs/AGENT_COMMUNICATION_PROTOCOL.md with full protocol documentation
- Document database schema, state machine, API reference
- Document integration points (enqueue, processing, response, timeout)
- Add configuration reference, monitoring guide, troubleshooting
- Add design decisions section explaining why primitive approach vs A2A/ACP
- Update README.md to reference new documentation

Why:
The request-reply protocol is a significant architectural addition. Without
documentation, future maintainers won't understand:
- Why outstanding_requests table exists
- How the timeout/escalation flow works
- When to use which API function
- How to debug issues

This documentation ensures the knowledge persists.

Assumptions:
- Documentation should be comprehensive enough for new team members
- Code examples should be copy-pasteable
- Design decisions should be explained (not just what, but why)

Risk: None (documentation only)
Testing: Verify markdown renders correctly, links work
… ordering

The visualizer relies on event ordering. Without await, chain_step_start
can race with chain_step_done, causing UI to show stale state.

Risk: None (consistent with other awaited emitEvent calls)
Testing: Verify visualizer shows correct agent processing state
Previously: When agent B errored, outstanding request stayed in acked state,
would eventually escalate via timeout checker.

Now: When agent B errors, matching requests are proactively marked as failed.
This gives failRequest() a caller and provides cleaner audit trail.

Risk: Low - additive, marks state faster
Testing: Trigger agent error, verify request marked failed not escalated
- Remove unused getRequest import
- Move pruneOldRequests from 5-min to hourly interval (consistent with other prunes)

Risk: None (cleanup only)
Testing: Verify builds, no runtime changes
- Add error handling section documenting failRequest() usage
- Update pruneOldRequests interval from 5min to 1 hour
- Add failRequest() to API reference
- Update integration points to include handleTeamError

Risk: None (documentation only)
- Change repo from TinyAGI/tinyclaw to dpbmaverick98/tinyclaw
- Change default branch from main to sql-experiment
- Add --branch flag to git clone so it clones the correct branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agents receiving teammate messages now see explicit instructions to
respond using [@sender: reply] syntax, preventing responses from going
directly to the user instead of back to the requesting agent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The chain_step_done event was dropped during the fire-and-forget refactor,
breaking the visualizer which listens for it to mark agents as "done".
Added emission in both handleSimpleResponse and handleTeamResponse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed GITHUB_REPO from personal fork to TinyAGI/tinyclaw and
DEFAULT_BRANCH from sql-experiment to main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Devain Pal Bansal and others added 20 commits March 6, 2026 14:15
Prevents unhandled promise rejections from silently swallowing errors
in fire-and-forget event emissions (message_received, agent_routed,
crash_recovery, request_escalated).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sequential cp of .db + .db-wal + .db-shm is not safe for a live
WAL-mode database. Replaced with sqlite3 .backup which guarantees
a consistent snapshot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Schema declared DEFAULT 3 but createOutstandingRequest() always
inserts 5. Updated schema to match the actual value and docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What:
- Add 'kimi' and 'minimax' to provider union type comments
- Add 'apiKey?: string' to AgentConfig for per-agent API key override
- Add 'kimi' and 'minimax' sections to Settings.models with model and apiKey fields
- Update provider comment to include new providers

Why:
- Kimi and MiniMax require API key authentication via ANTHROPIC_AUTH_TOKEN
- Users may want different API keys per agent (e.g., different Kimi accounts)
- Two-level key resolution: agent-specific → global → error

Assumptions:
- API keys stored as plain text in settings.json (acceptable for local use)
- Only kimi2.5 and MiniMax-M2.5 models supported initially
- Provider uses Claude Code binary with custom ANTHROPIC_BASE_URL

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
…nimax

What:
- Add resolveApiKey(agent, settings) with two-level fallback:
  1. Agent-specific apiKey → 2. Global provider apiKey → 3. Empty
- Add getProviderBaseUrl(provider) returning custom endpoints for kimi/minimax
- Add providerRequiresApiKey(provider) boolean check
- Update getDefaultAgentFromModels() to handle kimi/minimax with defaults
- Update auto-detect provider logic to include kimi/minimax

Why:
- Centralize API key resolution logic for consistent behavior
- Support per-agent API key override (different accounts/keys per agent)
- Abstract provider-specific configuration (URLs, auth requirements)
- Enable runtime validation of API key presence

Assumptions:
- Kimi endpoint: https://api.kimi.com/v1
- MiniMax endpoint: https://api.minimax.io/anthropic
- Default models: kimi2.5, MiniMax-M2.5
- Empty string return allows caller to provide contextual error message

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
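The two-level fallback described above is small enough to sketch directly. The Settings shape is abbreviated and the real config.ts may differ.

```typescript
type Provider = 'anthropic' | 'opencode' | 'kimi' | 'minimax';

interface AgentConfig { provider: Provider; apiKey?: string; }
interface Settings { models: Partial<Record<Provider, { model?: string; apiKey?: string }>>; }

function resolveApiKey(agent: AgentConfig, settings: Settings): string {
  // 1. agent-specific key wins, 2. fall back to the global provider key,
  // 3. empty string lets the caller raise a contextual error.
  return agent.apiKey ?? settings.models[agent.provider]?.apiKey ?? '';
}

function providerRequiresApiKey(provider: Provider): boolean {
  return provider === 'kimi' || provider === 'minimax';
}
```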
What:
- Add kimi/minimax branch in invokeAgent() with custom env setup
- Import resolveApiKey, getProviderBaseUrl, providerRequiresApiKey from config
- Set ANTHROPIC_AUTH_TOKEN, ANTHROPIC_BASE_URL, ANTHROPIC_MODEL env vars
- Add clear error message when API key is missing with fix instructions
- Add authentication error detection (401/unauthorized) with helpful message
- Use spawn with custom env instead of runCommand for env control

Why:
- Kimi/MiniMax require custom environment variables to work with Claude Code
- ANTHROPIC_BASE_URL redirects Claude Code to provider's API endpoint
- ANTHROPIC_AUTH_TOKEN is used instead of ANTHROPIC_API_KEY for these providers
- Per-agent API key support requires runtime resolution, not global env

Assumptions:
- Claude Code binary handles the Anthropic-compatible API protocol
- 5 minute timeout (300000ms) sufficient for these providers

- Error message parsing for 401/auth errors is provider-agnostic enough

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Add Kimi (4) and MiniMax (5) to provider selection menu
- Add API key prompt with validation for kimi/minimax providers
- Add model selection for kimi2.5 and MiniMax-M2.5
- Add per-agent API key support in additional agents flow
- Show masked global key (sk-...xxxx) with option to override
- Validate per-agent keys with HTTP check
- Store API keys in settings.json models.kimi/minimax.apiKey
- Store per-agent API keys in agent.apiKey field

Why:
- Kimi and MiniMax require API key authentication
- Users need interactive setup for these providers
- Per-agent API keys allow different accounts/keys per agent
- Validation catches invalid keys early in setup process

Assumptions:
- curl available for validation (graceful fallback if not)
- Validation endpoints: api.kimi.com/v1/models, api.minimax.io/anthropic/v1/models
- Masked key format: sk-...xxxx (first 4 + last 4 chars)
- User can choose to continue even if validation fails

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
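The sk-...xxxx masking described above (first 4 + last 4 chars) is done in shell in the wizard; the same logic in TypeScript, as an illustration:

```typescript
function maskKey(key: string): string {
  if (key.length <= 8) return key; // too short to usefully mask
  return `${key.slice(0, 4)}...${key.slice(-4)}`;
}
```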
What:
- Add Kimi (4) and MiniMax (5) to provider selection in agent_add()
- Add API key prompt with global key detection and override option
- Show masked global key (sk-...xxxx) when available
- Validate per-agent API keys with HTTP check
- Add kimi2.5 and MiniMax-M2.5 model selection
- Build agent JSON with optional apiKey field using jq

Why:
- Users can add agents with kimi/minimax providers via CLI
- Consistent with setup-wizard flow for API key handling
- Per-agent API keys allow different keys per agent
- Validation catches invalid keys at creation time

Assumptions:
- Global key lookup via .models.kimi.apiKey or .models.minimax.apiKey
- Same validation endpoints as setup-wizard
- jq handles conditional apiKey inclusion in JSON

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Update 'tinyclaw provider' command to support kimi/minimax with --api-key flag
- Add kimi and minimax cases with API key validation and storage
- Update 'tinyclaw model' command to support kimi2.5 and MiniMax-M2.5
- Show API key configuration status in provider display
- Parse --api-key and --model flags in any order
- Update help text with new provider/model examples
- Update agent provider help to include --api-key examples

Why:
- Users can switch providers via CLI with API key authentication
- Model command allows bulk-updating all agents of a provider
- Consistent CLI interface for all provider operations
- Help text guides users on correct syntax

Assumptions:
- API key is required when switching to kimi/minimax provider
- Flags can appear in any order (--api-key before or after --model)
- Help text serves as primary documentation for users

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Change Kimi base URL from https://api.kimi.com/v1 to https://api.kimi.com/coding
- This matches the correct endpoint from cc-mirror's provider configuration

Why:
- The /v1 endpoint returns 404, causing all Kimi invocations to fail
- /coding is the correct path for Kimi's Anthropic-compatible API

Assumptions:
- Kimi uses /coding path consistently across all API operations
- MiniMax endpoint at /anthropic is already correct

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Change ANTHROPIC_AUTH_TOKEN to ANTHROPIC_API_KEY for kimi/minimax
- Add CC_MIRROR_UNSET_AUTH_TOKEN to clear inherited AUTH_TOKEN
- Add missing model env vars: OPUS, HAIKU, SMALL_FAST
- Add CLAUDE_CONFIG_DIR per agent for conversation isolation
- Refactor runCommand to accept extraEnv parameter
- Use runCommand for kimi/minimax instead of duplicated spawn block
- Remove duplicate getSettings import
- Remove unused TINYCLAW_HOME import

Why:
- Kimi/MiniMax use API key auth, not auth token (Bearer vs header)
- Missing model env vars caused internal operations to route to wrong model
- CLAUDE_CONFIG_DIR prevents agent A from resuming agent B's session
- DRY: runCommand with extraEnv eliminates 30+ lines of duplicated code

Assumptions:
- All model aliases (sonnet, opus, haiku) map to same kimi/minimax model
- CC_MIRROR_UNSET_AUTH_TOKEN prevents auth header conflicts
- runCommand error handling is sufficient for auth error detection

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
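The per-agent environment this commit describes can be sketched as below. The variable names follow the commit text, but the exact set (and names) of model-alias variables Claude Code consumes is an assumption, as is the helper name.

```typescript
function buildKimiMinimaxEnv(opts: {
  apiKey: string;
  baseUrl: string;
  model: string;
  agentConfigDir: string;
}): Record<string, string | undefined> {
  return {
    // API key auth (not auth token) for these providers.
    ANTHROPIC_API_KEY: opts.apiKey,
    // Redirect Claude Code to the provider's Anthropic-compatible endpoint.
    ANTHROPIC_BASE_URL: opts.baseUrl,
    // Point every model alias at the provider's single model so internal
    // operations don't route to a nonexistent Anthropic model name.
    ANTHROPIC_MODEL: opts.model,
    ANTHROPIC_SMALL_FAST_MODEL: opts.model,
    // Per-agent config dir keeps conversations isolated between agents.
    CLAUDE_CONFIG_DIR: opts.agentConfigDir,
    // Explicitly clear any inherited token so it can't conflict with the key.
    ANTHROPIC_AUTH_TOKEN: undefined,
  };
}
```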
What:
- Change Kimi validation URL from /v1/models to /coding/models
- Add safer JSON building using temp files and jq for agent data

Why:
- The /v1 endpoint returns 404, causing false validation failures
- String interpolation of API keys into JSON risks injection if keys contain quotes

Assumptions:
- /coding/models endpoint exists and returns 200 for valid keys
- Temp file approach works across different shell environments

Note: Full JSON safety refactor recommended for future - this is a minimal fix

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Change Kimi validation URL from /v1/models to /coding/models
- Remove pointless read -rp for model choice (was hardcoded anyway)
- Simplify model selection output for kimi/minimax

Why:
- Wrong validation URL caused false validation failures
- Unused AGENT_MODEL_CHOICE variable was confusing dead code

Assumptions:
- Single model per provider (kimi2.5, MiniMax-M2.5) for now

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Add while loop to parse --model and --api-key flags in any order
- Add kimi case with required --api-key validation
- Add minimax case with required --api-key validation
- Update help text to include kimi/minimax examples
- Use jq to safely set provider, model, and apiKey fields

Why:
- Users can change existing agents to kimi/minimax via CLI
- API key is required for these providers
- Flag parsing in any order matches main CLI behavior

Assumptions:
- API key is stored in agent config for per-agent override
- jq safely handles API key string escaping

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Change Kimi base URL from https://api.kimi.com/coding to https://api.kimi.com/coding/

Why:
- Without trailing slash, URL joining behavior is ambiguous
- Could produce https://api.kimi.com/codingv1/messages instead of /coding/v1/messages
- Trailing slash makes path concatenation unambiguous

Assumptions:
- Claude Code or underlying HTTP client does proper URL joining
- Kimi API accepts URLs with trailing slash

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
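The trailing-slash behaviour above is standard WHATWG URL resolution, which Node's URL class implements; whether Claude Code's HTTP client joins URLs this way is the commit's assumption, but the resolution rule itself can be demonstrated:

```typescript
// Without the trailing slash, the base's last path segment is replaced:
const withoutSlash = new URL('v1/messages', 'https://api.kimi.com/coding').toString();
// → 'https://api.kimi.com/v1/messages'

// With the trailing slash, the base path is kept and extended:
const withSlash = new URL('v1/messages', 'https://api.kimi.com/coding/').toString();
// → 'https://api.kimi.com/coding/v1/messages'
```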
What:
- Remove CC_MIRROR_UNSET_AUTH_TOKEN: '1' (does nothing in tinyclaw)
- Add ANTHROPIC_AUTH_TOKEN: undefined to clear inherited value
- Change extraEnv type to Record<string, string | undefined>

Why:
- CC_MIRROR_UNSET_AUTH_TOKEN is a cc-mirror wrapper convention
- In tinyclaw, we directly control env via runCommand
- User may have ANTHROPIC_AUTH_TOKEN set in shell from main Claude setup
- Setting to undefined explicitly clears it to prevent conflicts with API_KEY

Assumptions:
- runCommand properly handles undefined values (deletes from env)
- Clearing AUTH_TOKEN prevents auth header conflicts

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
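The "undefined clears the variable" convention above can be sketched as an env merge. The helper name is illustrative; the real runCommand may delegate this to spawn(), which in modern Node also skips undefined env values.

```typescript
function mergeEnv(
  base: Record<string, string | undefined>,
  extra: Record<string, string | undefined>,
): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const [k, v] of Object.entries(base)) {
    if (v !== undefined) merged[k] = v;
  }
  for (const [k, v] of Object.entries(extra)) {
    if (v === undefined) delete merged[k]; // explicit undefined clears inherited vars
    else merged[k] = v;
  }
  return merged;
}
```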
What:
- Remove standalone import { getSettings } from './config'
- Add getSettings to existing import from './config'

Why:
- Two separate import statements from same module is redundant
- Consolidating into single import is cleaner

Assumptions:
- No functional change, just code organization

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
…allback

What:
- Replace string interpolation with jq -n --arg for all agent fields
- Add /tmp fallback for Linux compatibility
- Apply to both apiKey and non-apiKey agent creation paths

Why:
- String interpolation breaks if agent name contains quotes or backslashes
- jq --arg safely escapes all string values
- TMPDIR is not guaranteed on Linux systems

Assumptions:
- jq is available (already required by setup wizard)
- All field values are valid UTF-8 strings

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
…neration

What:
- Change default agent creation to use jq --arg (safe JSON)
- Store default agent in AGENTS_JSON variable instead of string fragment
- Merge additional agents into default using jq reduce with
- Change final template from  to agents:

Why:
- Old string fragment format was unsafe and inconsistent
- New approach produces valid JSON object for agents field
- Proper merge ensures default agent is preserved with additional agents

Assumptions:
- jq properly merges objects with * operator
- Default agent always exists (created before additional agents loop)

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
What:
- Change extraEnv parameter type from Record<string, string> to Record<string, string | undefined>

Why:
- TypeScript error: cannot pass Record<string, string | undefined> to parameter expecting Record<string, string>
- Allows ANTHROPIC_AUTH_TOKEN: undefined to be passed correctly

Assumptions:
- spawn() handles undefined values correctly (they get filtered out)

Co-Authored-By: Kimi Claw <noreply@anthropic.com>
@greptile-apps
greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR is a substantial multi-feature merge that adds Kimi 2.5 and MiniMax M2.5 provider support, multi-agent isolation via CLAUDE_CONFIG_DIR, a primitive request-reply protocol for reliable agent handoffs (with ACKs, retries, and escalation), conversation persistence for crash recovery, file-based push signalling to replace 1-second polling, and a range of code-quality cleanups (helper consolidation, generateId(), writeJsonFile/readJsonFile, etc.).

Key concerns found during review:

  • JSON injection in lib/setup-wizard.sh MODELS_SECTION — the Kimi and MiniMax apiKey values are interpolated directly into a shell string to form JSON, so any key containing " or \ will produce a malformed settings.json at setup time. The agents section was correctly migrated to jq --arg, but MODELS_SECTION was not.
  • Wrong Kimi validation URL in lib/setup-wizard.sh global path — the global setup wizard hits /v1/models, while lib/agents.sh (and the PR's own stated bug-fix list) use the correct /coding/models endpoint. The inconsistency means validation will likely return a non-200 status in the wizard even for valid keys.
  • Sequential await in emitEvent (src/lib/logging.ts) — converting the event emitter to await each listener in series means a single slow plugin listener will now stall the queue processor's entire processing loop; Promise.allSettled for parallel execution is the safer approach.
  • Message validation skipped for team-context messages — the 1 MB validateMessage() guard is only applied to non-team messages; oversized internal agent-to-agent messages bypass this check.

Confidence Score: 3/5

  • Safe to merge with caution — the JSON injection bug in the setup wizard can corrupt settings files for Kimi/MiniMax users, and the wrong validation URL means API key validation silently fails in the global setup path.
  • The core TypeScript logic (invoke, db, queue-processor, signals) is well-structured and the major multi-agent and persistence features appear sound. However, two bugs in setup-wizard.sh directly affect the new Kimi/MiniMax user-facing setup flow (one of the headline features), and the async emitEvent change introduces a latency risk in the queue processor that is harder to test.
  • lib/setup-wizard.sh (JSON injection + wrong URL) and src/lib/logging.ts (sequential async emitEvent) need attention before merging.

Important Files Changed

Filename Overview
lib/setup-wizard.sh Adds Kimi/MiniMax provider support to the global setup wizard; contains two bugs: wrong Kimi validation URL (/v1/models instead of /coding/models) and JSON injection risk in MODELS_SECTION where the API key is interpolated directly into a shell string instead of being escaped via jq --arg.
lib/agents.sh Adds interactive agent creation and provider-switching support for Kimi/MiniMax with correct /coding/models validation URL and safe jq --arg JSON building; the elif-after-else syntax bug noted in the PR description is correctly fixed.
src/lib/logging.ts emitEvent changed from synchronous fire-and-forget to sequential await, meaning a slow listener will now stall all downstream queue processing; should use Promise.allSettled for parallel execution instead.
src/lib/invoke.ts Adds Kimi/MiniMax invocation via Claude Code with custom ANTHROPIC_BASE_URL and full env isolation (ANTHROPIC_AUTH_TOKEN cleared); two-level API key fallback and 401 error handling are implemented correctly.
src/lib/db.ts Large addition of conversation persistence tables (conversations, conversation_responses, conversation_pending_agents) and outstanding request tracking (outstanding_requests) with proper FK constraints, indexes, and migration guards; exponential backoff with jitter added to failMessage.
src/queue-processor.ts Major refactor extracts handleSimpleResponse, handleTeamResponse, and handleTeamError for async processing; adds [REQUEST:xxx] prefix parsing, conversation persistence, and timeout tracking; message validation missing in team-context path.
src/lib/conversation.ts Integrates outstanding request creation into enqueueInternalMessage; completeConversation made async; [REQUEST:xxx] prefix correctly prepended before enqueueing to target agent.
tinyclaw.sh Extends provider and model commands for kimi/minimax; fixes falling-through display logic and single-jq model display; flag parsing converted to a while loop to support arbitrary flag ordering.

Sequence Diagram

sequenceDiagram
    participant U as User
    participant QP as queue-processor
    participant DB as SQLite DB
    participant A as Agent A
    participant B as Agent B
    participant CH as Channel Client

    U->>QP: Message arrives (channel)
    QP->>DB: claimNextMessage()
    QP->>A: invokeAgent(message)
    A-->>QP: response with [@B: do X]

    QP->>DB: createOutstandingRequest(A→B, 5s ACK / 5min response)
    QP->>DB: enqueueMessage([REQUEST:req_id]\nmessage for B)
    QP->>DB: persistConversation(conv)
    QP->>DB: persistResponse(A, response)

    QP->>B: invokeAgent([REQUEST:req_id]\nmessage)
    Note over QP,B: acknowledgeRequest(req_id) called<br/>before forwarding stripped message
    B-->>QP: response with [@A: result]

    QP->>DB: respondToRequest(req_id, response)
    QP->>DB: decrementPendingInDb()
    QP->>DB: markConversationCompleted()
    QP->>DB: enqueueResponse(final)
    QP->>DB: signalChannel(channel)

    DB-->>CH: fs.watch fires signal file
    CH->>QP: checkOutgoingQueue()
    CH->>U: Send response

    Note over QP: Timeout checker (every 30s)
    QP->>DB: getRequestsNeedingRetry()
    QP->>DB: incrementRequestRetry() or escalateRequest()

Last reviewed commit: 9dc18b1

Comment on lines +148 to +149
if command -v curl > /dev/null 2>&1; then
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $API_KEY" "$VALIDATION_URL" 2>/dev/null || echo "000")

Wrong Kimi validation URL in global setup path

The global setup wizard validates Kimi API keys against https://api.kimi.com/v1/models, but lib/agents.sh (and the PR description's bug-fix list) specifies the correct Kimi endpoint is /coding/models. These two paths are inconsistent: the per-agent flow in agents.sh will succeed with a valid key (HTTP 200), while the global wizard here will likely return an unexpected status, showing a spurious warning or silently skipping validation.

Suggested change (the override must run before curl reads $VALIDATION_URL):

if command -v curl > /dev/null 2>&1; then
    [ "$PROVIDER" = "kimi" ] && VALIDATION_URL="https://api.kimi.com/coding/models"
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $API_KEY" "$VALIDATION_URL" 2>/dev/null || echo "000")

Comment on lines 529 to +533
MODELS_SECTION='"models": { "provider": "anthropic", "anthropic": { "model": "'"${MODEL}"'" } }'
elif [ "$PROVIDER" = "opencode" ]; then
MODELS_SECTION='"models": { "provider": "opencode", "opencode": { "model": "'"${MODEL}"'" } }'
elif [ "$PROVIDER" = "kimi" ]; then
MODELS_SECTION='"models": { "provider": "kimi", "kimi": { "model": "'"${MODEL}"'", "apiKey": "'"${API_KEY}"'" } }'

JSON injection risk via unescaped API key in MODELS_SECTION

The PR description claims "JSON injection risk in MODELS_SECTION (API keys with special chars)" was fixed, but this new code still builds the kimi/minimax JSON by directly interpolating $API_KEY into a shell string. If the API key contains a ", \, or other JSON-special characters (e.g. abc"xyz), the generated settings file will be malformed JSON and TinyClaw will fail to parse it on startup.

The agents section was correctly converted to use jq --arg to safely escape values, but MODELS_SECTION was not updated consistently.

Replace both kimi and minimax branches with jq-based generation:

elif [ "$PROVIDER" = "kimi" ]; then
    # jq -nc emits a compact, properly escaped JSON object; prepending the
    # literal key keeps MODELS_SECTION a fragment like the other branches
    MODELS_SECTION='"models": '$(jq -nc \
        --arg model "$MODEL" \
        --arg apiKey "$API_KEY" \
        '{provider: "kimi", kimi: {model: $model, apiKey: $apiKey}}')
elif [ "$PROVIDER" = "minimax" ]; then
    MODELS_SECTION='"models": '$(jq -nc \
        --arg model "$MODEL" \
        --arg apiKey "$API_KEY" \
        '{provider: "minimax", minimax: {model: $model, apiKey: $apiKey}}')

Note: the heredoc embedding also requires careful quoting; using a temp-file approach (write full JSON with jq then use jq -s '.[0] * .[1]' to merge) is the cleanest fix.

Comment on lines 25 to +34
 * Emit a structured event — dispatched to in-memory listeners (e.g. SSE broadcast, plugins).
 * Now async to allow listeners to complete before continuing.
 */
-export function emitEvent(type: string, data: Record<string, unknown>): void {
-  for (const listener of eventListeners) {
-    try { listener(type, data); } catch { /* never break the queue processor */ }
+export async function emitEvent(type: string, data: Record<string, unknown>): Promise<void> {
+  for (const listener of eventListeners) {
+    try {
+      await listener(type, data);
+    } catch {
+      /* never break the queue processor */
+    }

Async emitEvent serialises listeners and may block the queue processor

emitEvent was changed from synchronous to sequentially await-ing each listener. This means a single slow listener (e.g. a plugin that makes an HTTP call) will now block all downstream processing in processMessage, including agent invocation and conversation bookkeeping.

Previously, listeners were fire-and-forget, so any one listener could never stall the queue. The safer refactor is to let all listeners run concurrently and wait for them in parallel:

Suggested change

-export async function emitEvent(type: string, data: Record<string, unknown>): Promise<void> {
-  for (const listener of eventListeners) {
-    try {
-      await listener(type, data);
-    } catch {
-      /* never break the queue processor */
-    }
-  }
-}
+export async function emitEvent(type: string, data: Record<string, unknown>): Promise<void> {
+  await Promise.allSettled(
+    eventListeners.map(listener => {
+      try { return Promise.resolve(listener(type, data)); } catch { return Promise.resolve(); }
+    })
+  );
+}

Promise.allSettled ensures that a throwing/rejecting listener never propagates, while all listeners run in parallel instead of sequentially.

Comment on lines 394 to +407
}

const agent = agents[agentId];

// Extract and acknowledge request_id if present (for agent handoff tracking)
const requestMatch = message.match(/^\[REQUEST:([^\]]+)\]\n?/);
if (requestMatch) {
  const requestId = requestMatch[1];
  if (acknowledgeRequest(requestId)) {
    log('INFO', `Request ${requestId} acknowledged by @${agentId}`);
  }
  // Remove request prefix from message before sending to agent
  message = message.replace(requestMatch[0], '');
}

Message size validation is missing in the team-context path

validateMessage() is called before agent invocation for non-team messages, but the team-context processing path (where teamContext is set) doesn't perform the same check. An oversized internal message forwarded between agents could reach invokeAgent without the 1 MB guard and either be silently truncated by the API or trigger a harder-to-diagnose error.

Consider extracting the validation check to a shared location before the if (!teamContext) branch so both paths benefit from it.

dpbmaverick98 pushed a commit to dpbmaverick98/tinyclaw that referenced this pull request Mar 8, 2026
emitEvent() was sequentially awaiting each listener, meaning a slow
listener (e.g. a plugin making an HTTP call) would block all downstream
queue processing. Switch to Promise.allSettled() so listeners run
concurrently while still awaiting completion before the caller proceeds.

This preserves inter-event ordering (chain_step_start before agent
invocation) without serializing intra-event listener execution.

Reported by: Greptile (PR TinyAGI#167)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dpbmaverick98 pushed a commit to dpbmaverick98/tinyclaw that referenced this pull request Mar 8, 2026
…m paths

validateMessage() was only called for non-team messages. An oversized
internal message forwarded between agents in a team conversation could
bypass the 1MB guard and cause hard-to-diagnose errors from the API.

Moved validation before the team/non-team branch so both paths benefit,
and removed the now-redundant second validateMessage() call in the team
path.

Reported by: Greptile (PR TinyAGI#167)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dpbmaverick98
Author

Created this PR from the wrong branch.
