
Unified agent logging and fixes queue processor #171

Open
michaelerobertsjr wants to merge 6 commits into TinyAGI:main from michaelerobertsjr:t3code/pino-unified-agent-logging

Conversation

@michaelerobertsjr (Contributor)

Summary

Fix the queue processor so pending messages drain continuously instead of advancing only when a new message is enqueued, and document options for handling non-delivery response channels (heartbeat, web, and cli).

Changes

  • Updated the queue processor to reschedule queue draining after each per-agent processing chain completes
  • Added startup queue draining so existing pending rows begin processing even without new enqueue events
  • Kept stale-message recovery draining aligned with the new queue drain scheduler
  • Added documentation covering response delivery/ack options for heartbeat, web, and cli
  • Verified the live backlog now decreases after restart instead of remaining stuck behind heartbeat accumulation

Type of Change

  • Bug fix
  • New feature
  • Enhancement to existing feature
  • Breaking change
  • Documentation update
  • Refactor / code cleanup

Testing

  • Ran npm run build:main
  • Restarted TinyClaw locally with ./tinyclaw.sh restart
  • Verified the agent backlog began draining after restart (pending count decreased and processing advanced)
  • Confirmed queue.log showed consecutive heartbeat messages being processed without requiring fresh enqueue events

Checklist

  • I have tested these changes locally
  • My changes don't introduce new warnings or errors
  • I have updated documentation if needed

Mike Roberts added 2 commits March 7, 2026 18:44
- Switch queue, API, channels, daemon, and heartbeat to structured NDJSON logging
- Add log rotation (10 MB, 5 archives) and expose merged history via `GET /api/logs`
- Wire `LOG_LEVEL` through runtime config and update docs/TinyOffice log references
- Route API logs to a dedicated `api` runtime and make log reads async, filtered, and scalable
- Add normalized log levels in shell scripts and preserve severity in structured logs
- Ensure queue backlog drains reliably via microtask scheduling and stale-message recovery triggers
- Centralize graceful shutdown/error exits for Discord, Telegram, and WhatsApp clients
- Add queue DB query/count helpers and document response delivery options for non-delivery channels
@michaelerobertsjr changed the title from "T3code/pino unified agent logging" to "Unified agent logging and fixes queue processor" on Mar 8, 2026
greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes the queue processor's continuous-drain behavior and unifies all runtime logging under pino's structured NDJSON format, replacing ad-hoc log() calls across every component with a consistent createLogger / logError API.

Key changes:

  • Queue drain fix: scheduleQueueDrain() (via queueMicrotask) is called after each per-agent processing chain completes and at startup, so the backlog decreases continuously rather than only advancing on new message:enqueued events.
  • Unified structured logging — pino loggers with rotating file streams (10 MB / 5 archives) replace plain appendFileSync calls across the queue processor, API server, and all three channel clients.
  • Events module: onEvent / emitEvent extracted from logging.ts into a new src/lib/events.ts to cleanly separate live-event dispatch from file-based logging.
  • /api/logs and /api/queue/rows — new endpoints expose filterable structured log entries and typed queue row queries to the frontend; TinyOffice gains a Queue Browser tab.
  • Documentation: docs/RESPONSE-DELIVERY-OPTIONS.md documents the unacked heartbeat/web/cli response problem with three concrete design options.
  • Issues found:
    • LogLevel in src/lib/logging.ts omits 'trace' and 'fatal', which pino can emit at runtime; the frontend type in api.ts already includes all six levels, creating an inconsistency.
    • mergeEntryLists in logging.ts uses Date.parse without a NaN guard: a single malformed time field will permanently block all subsequent entries from that runtime list being merged into the output.
    • PUT /api/agents/:id persists the agent to settings.json unconditionally, then returns HTTP 500 if workspace provisioning fails — clients treating the 500 as a total failure will be wrong since the record is already saved.
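The drain-scheduling pattern described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names mirror the review (scheduleQueueDrain, processQueue), but the internals here are assumptions.

```typescript
// Minimal sketch of microtask-based queue draining; internals are assumptions.
let drainScheduled = false;
let drainRuns = 0; // illustration only: counts how often a drain ran

function processQueue(): void {
  drainRuns += 1;
  // Real code would fetch pending rows, append processMessage() to each
  // per-agent promise chain, and call scheduleQueueDrain() again when a
  // chain completes, so the backlog keeps draining without fresh
  // message:enqueued events.
}

function scheduleQueueDrain(): void {
  if (drainScheduled) return; // coalesce bursts of triggers into one drain
  drainScheduled = true;
  queueMicrotask(() => {
    drainScheduled = false;
    processQueue();
  });
}

// Called at startup, on message:enqueued, after each chain completes,
// and after stale-message recovery finds rows to retry.
scheduleQueueDrain();
```

The coalescing flag is the key design choice: many triggers within one tick schedule only a single drain pass, while the post-chain re-trigger keeps the loop alive until the backlog is empty.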

Confidence Score: 4/5

  • Safe to merge with minor follow-up items; the core queue-drain fix and logging refactor are solid.
  • The queue-drain bug fix is correct and the pino migration is thorough. Three minor issues were found: a type-narrowness gap in LogLevel, an edge-case NaN bug in the multi-runtime log merger, and a misleading HTTP 500 response when agent provisioning fails (settings are already saved). None of these are regressions or blockers, but two of the three (NaN merge + 500 response) can cause observable incorrect behavior in edge cases.
  • Follow-up files: src/lib/logging.ts (LogLevel type and mergeEntryLists NaN guard) and src/server/routes/agents.ts (HTTP 500 vs saved-settings mismatch).

Important Files Changed

  • src/queue-processor.ts
    Core bug fix: adds scheduleQueueDrain() via queueMicrotask after each per-agent chain completes and at startup, ensuring the backlog drains continuously rather than only on new enqueue events. Logging migrated to pino. Change is correct and well-scoped.
  • src/lib/logging.ts
    Major rewrite: introduces pino-based structured logging with RotatingFileStream, createLogger, readLogEntries, and multi-runtime log merging. Two issues found: the LogLevel type excludes the 'trace' and 'fatal' pino levels, and mergeEntryLists can silently drop entries after any entry with an unparseable timestamp.
  • src/lib/events.ts
    New file extracting onEvent/emitEvent from the former logging module. Clean separation; the unsubscribe function returned by onEvent is a useful addition. No issues found.
  • src/lib/db.ts
    Adds a conversation_id column migration, typed row interfaces (QueueMessageRow, QueueResponseRow, QueueRowCounts), getQueueMessages, getQueueResponses, and getQueueRowCounts. LIKE search uses a proper ESCAPE clause, and normalizeSearchTerm escapes special characters. No issues found.
  • src/server/routes/agents.ts
    Provisioning failure now returns HTTP 500 and ok: false, but mutateSettings is called before provisioning, so the agent is persisted regardless. Clients treating the 500 as a total failure will be wrong. See inline comment.
  • src/server/routes/logs.ts
    Endpoint rewritten to call readLogEntries with rich filtering (source, level, channel, agentId, messageId, conversationId, search). Limit clamped to [1, 1000] with a NaN guard; the previous thread's NaN issue is addressed here.
  • src/server/routes/queue.ts
    New /api/queue/rows endpoint with filterable message/response queries and aggregate counts. Limit clamped to [1, 500] with a NaN guard. No issues found.
  • lib/common.sh
    Adds structured shell logging helpers (write_structured_log, rotate_log_file, normalize_log_level, log_level_priority). The jq -nc call produces a trailing newline (noted in a prior thread). LOG_DIR must be set in the calling shell before log() is invoked, consistent with prior usage.
  • src/channels/discord-client.ts
    Replaces the local log() with createLogger/logError. unhandledRejection now logs and continues (per the previous review thread's resolution); uncaughtException still exits via shutdownDiscord(1). No new issues.
  • tinyoffice/src/app/logs/page.tsx
    New Queue Browser tab with status toggles, search, row counts, and message/response cards. Renders structured LogEntry objects with level/source/component badges, replacing the old raw-line display. No issues found.
  • docs/RESPONSE-DELIVERY-OPTIONS.md
    New documentation file clearly describing the unacked heartbeat/web/cli response problem and three design options with pros/cons. A well-structured reference for future work.
  • tinyoffice/src/lib/api.ts
    Adds typed interfaces (QueueMessageRow, QueueResponseRow, QueueRowsResponse, LogEntry) and a getQueueRows API function. LogEntry.level correctly includes all six pino levels (trace through fatal), unlike the backend LogLevel type.
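The limit clamping mentioned for the logs and queue routes can be sketched as follows; clampLimit is a hypothetical helper name chosen for illustration, not the code in the PR.

```typescript
// Hypothetical sketch of the limit clamping described for /api/logs and
// /api/queue/rows: parse the query param, guard against NaN, clamp to [1, max].
function clampLimit(raw: string | undefined, fallback: number, max: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(parsed)) return fallback; // NaN guard from the review thread
  return Math.min(Math.max(parsed, 1), max); // clamp to [1, max]
}
```

With max = 1000 this matches the [1, 1000] behavior described for /api/logs, and with max = 500 the [1, 500] behavior for /api/queue/rows.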

Sequence Diagram

sequenceDiagram
    participant API as API Server<br/>(messages route)
    participant QE as queueEvents
    participant QP as processQueue()
    participant Chain as agentProcessingChain
    participant DB as SQLite DB
    participant Log as RotatingFileStream<br/>(pino)

    Note over QP: Startup
    QP->>DB: recoverStaleMessages()
    QP->>QP: scheduleQueueDrain() [microtask]
    QP->>DB: getPendingMessages()
    QP->>Chain: build per-agent promise chain

    Note over API,QE: New inbound message
    API->>DB: enqueueMessage()
    API->>QE: emit message:enqueued
    QE->>QP: scheduleQueueDrain() [microtask]
    QP->>DB: getPendingMessages()
    QP->>Chain: append processMessage() to chain
    Chain->>Log: logger.info(structured entry)
    Chain->>DB: dbCompleteMessage()
    Chain->>QP: scheduleQueueDrain() [microtask — keeps draining]

    Note over QP,DB: Stale recovery (every 5 min)
    QP->>DB: recoverStaleMessages()
    DB-->>QP: count > 0
    QP->>QP: scheduleQueueDrain()

Comments Outside Diff (3)

  1. src/lib/logging.ts, line 2202

    LogLevel type is narrower than what pino can emit

    LogLevel is defined as 'debug' | 'info' | 'warn' | 'error', but pino natively emits two additional level labels: 'trace' and 'fatal'. Because formatters.level is set to (label) => ({ level: label }), the serialized JSON always contains the pino label string — so 'fatal' level entries can appear in the log files at runtime.

    The frontend api.ts already accounts for all six levels:

    level: "trace" | "debug" | "info" | "warn" | "error" | "fatal";

    As it stands, any 'fatal' entry read back through readLogEntries carries a type of LogLevel that technically excludes 'fatal', which can mislead callers that switch or discriminate on the level field. Aligning the backend type with the frontend would close this gap.
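A sketch of the widened backend type, matching the six labels the frontend already declares; isLogLevel is a hypothetical guard added here for illustration, not part of the PR.

```typescript
// Widened to all six default pino level labels, matching the frontend api.ts type.
type LogLevel = "trace" | "debug" | "info" | "warn" | "error" | "fatal";

const LOG_LEVELS: readonly LogLevel[] = ["trace", "debug", "info", "warn", "error", "fatal"];

// Hypothetical runtime guard for entries read back from NDJSON log files,
// so callers that switch on `level` can narrow safely.
function isLogLevel(value: unknown): value is LogLevel {
  return typeof value === "string" && (LOG_LEVELS as readonly string[]).includes(value);
}
```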

  2. src/server/routes/agents.ts, line 3206-3210

    HTTP 500 returned but agent settings are already persisted

    mutateSettings (which writes settings.json) is called unconditionally before workspace provisioning is attempted. If provisionAgentWorkspace throws, the agent IS saved to settings, but the route returns HTTP 500 with ok: false. Any client that treats the 500 as a complete failure and retries will create duplicate writes, and any client that shows an error to the user will confuse them — a subsequent page refresh will reveal the agent was actually created.

    Consider returning a 200 (or 207 Multi-Status) with a structured partial-success body so the caller knows the record was saved but the workspace setup needs attention:

    // After provisioning attempt:
    if (provisionError) {
        return c.json({
            ok: true,       // record was saved
            provisioned: false,
            provisionSteps,
            provisionError,
            agent: settings.agents![agentId],
            warning: 'Agent saved but workspace provisioning failed',
        }, 200);  // or 207
    }

    Returning 500 is semantically "server encountered an unrecoverable error", which is at odds with the fact that the primary operation (persisting the agent) succeeded.

  3. src/lib/logging.ts, line 2543-2572

    mergeEntryLists silently drops all subsequent entries in a list if one entry has an unparseable timestamp

    const parsed = Date.parse(String(entry.time));
    if (parsed > nextTime) { ... }

    Date.parse returns NaN for any malformed or missing time value. Because NaN > x is always false for any x (including -Infinity), an entry with a bad timestamp can never "win" the comparison. Its list position is never advanced, which permanently blocks every later entry in the same runtime list from being selected — they stay behind the stuck entry until all other lists are exhausted and the loop exits early.

    In the normal case (pino ISO timestamps and shell date -u output) this won't fire, but a single corrupted NDJSON line in a log file is enough to silently truncate the merged output from that source.

    A minimal guard:

    const parsed = Date.parse(String(entry.time));
    if (Number.isNaN(parsed)) {
        // Skip entries with unreadable timestamps; advance position so
        // later entries in this list are not permanently blocked.
        positions[index] += 1;
        continue;  // restart inner loop
    }
    if (parsed > nextTime) { ... }

    Alternatively, treat a NaN timestamp as 0 (epoch) so the entry sorts last but is still emitted.
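A self-contained sketch of the guarded merge, assuming ascending chronological output over per-runtime lists; the real mergeEntryLists signature and entry shape may differ.

```typescript
interface LogEntry {
  time: string;
  msg: string;
}

// Hypothetical k-way chronological merge across per-runtime entry lists.
// Entries with unparseable timestamps are skipped, so a single corrupted
// NDJSON line can no longer block the rest of its list.
function mergeEntryLists(lists: LogEntry[][]): LogEntry[] {
  const positions = lists.map(() => 0);
  const merged: LogEntry[] = [];
  for (;;) {
    let nextIndex = -1;
    let nextTime = Infinity;
    for (let i = 0; i < lists.length; i++) {
      // Advance past entries whose timestamps do not parse.
      while (positions[i] < lists[i].length) {
        const parsed = Date.parse(String(lists[i][positions[i]].time));
        if (Number.isNaN(parsed)) {
          positions[i] += 1; // skip: never let NaN block later entries
          continue;
        }
        if (parsed < nextTime) {
          nextTime = parsed;
          nextIndex = i;
        }
        break; // head of this list is valid; compare against other lists
      }
    }
    if (nextIndex === -1) break; // all lists exhausted
    merged.push(lists[nextIndex][positions[nextIndex]]);
    positions[nextIndex] += 1;
  }
  return merged;
}
```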

Last reviewed commit: 00e0639

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9af11f40d4


Comment on lines +474 to +478
if (options.messageId) {
clauses.push('message_id = ?');
params.push(options.messageId);
}
const searchTerm = normalizeSearchTerm(options.search);


P2: Apply conversationId filter to response queue queries

/api/queue/rows accepts a conversationId filter and passes it through, but getQueueResponses never applies that field, so response rows from unrelated conversations are still returned when callers filter by conversation. This makes the queue browser/API output internally inconsistent (messages are filtered, responses are not) and can mislead debugging for team handoff chains that share channels/senders.
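A sketch of the missing response-side filter, mirroring the message-side clause building shown in the snippet above; buildResponseWhere is a hypothetical helper for illustration, not the actual getQueueResponses code.

```typescript
interface ResponseQueryOptions {
  messageId?: string;
  conversationId?: string;
}

// Hypothetical clause builder mirroring the message-side query construction,
// with the conversation_id clause the response side was missing.
function buildResponseWhere(options: ResponseQueryOptions): { where: string; params: string[] } {
  const clauses: string[] = [];
  const params: string[] = [];
  if (options.messageId) {
    clauses.push("message_id = ?");
    params.push(options.messageId);
  }
  if (options.conversationId) {
    clauses.push("conversation_id = ?"); // the filter getQueueResponses never applied
    params.push(options.conversationId);
  }
  return { where: clauses.length ? `WHERE ${clauses.join(" AND ")}` : "", params };
}
```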


- Store and expose `conversationId` for queued responses across DB, API, and UI types
- Include `conversationId` when enqueueing completed conversation responses and allow filtering/searching by it
- Prevent extra blank JSON log lines, validate/clamp log limits, and lower noisy queue status logs to debug
- Stop auto-shutdown on unhandled rejections in Discord/Telegram/WhatsApp clients
@michaelerobertsjr (Contributor, Author)

Follow-up fixes from the review have been addressed in the commits above.

Verification performed:

  • npm run build:main
  • npm run build in tinyoffice
  • bash -n lib/common.sh lib/heartbeat-cron.sh lib/daemon.sh tinyclaw.sh

Runtime verification:

  • Restarted TinyClaw and confirmed agent backlog began draining (pending decreased and processing advanced) without requiring fresh enqueue events.

@michaelerobertsjr (Contributor, Author)

@greptileai please take a look

Mike Roberts added 2 commits March 8, 2026 14:32
- Source heartbeat cron logging from `lib/common.sh` to reuse level parsing and rotation
- Replace inline heartbeat JSON logging with `write_structured_log`
- Tighten TypeScript logging source types to runtime log files and document safe stream rotation
- Catch ENOENT when reading rotated log files to avoid transient failures
- Split Discord message excerpt into debug logs and stop logging pairing codes
- Return HTTP 500 from PUT /api/agents/:id when provisioning fails
@michaelerobertsjr (Contributor, Author)

@greptileai please have a look

- Escape `%`, `_`, and `\` in queue search terms and use `LIKE ... ESCAPE '\\'`
- Make log level filtering case-insensitive in API parsing and log matching
- Preserve explicit `limit=0` by checking `limit !== undefined` in TinyOffice API params
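The escaping described in the first bullet above might look like the following; this is a hypothetical sketch, since the actual normalizeSearchTerm body is not shown in this thread.

```typescript
// Hypothetical sketch of the escaping described above: escape \, %, and _
// so user input matches literally under `LIKE ? ESCAPE '\'`, then wrap with
// wildcards for substring search.
function normalizeSearchTerm(search: string | undefined): string | undefined {
  const trimmed = search?.trim();
  if (!trimmed) return undefined;
  const escaped = trimmed.replace(/[\\%_]/g, (ch) => `\\${ch}`);
  return `%${escaped}%`;
}
```

Escaping the backslash itself is essential: without it, a trailing `\` in user input could swallow the closing `%` wildcard.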
@michaelerobertsjr (Contributor, Author)

@greptileai please take a final pass

@michaelerobertsjr (Contributor, Author)

@greptileai are there any additional comments?

greptile-apps bot commented Mar 8, 2026

The only remaining open item is the P2 conversationId filter issue in db.ts:497. It would make the queue browser consistent (currently messages are filtered by conversation but responses aren't), but it's not blocking. React to that comment if you want it included in this PR—otherwise, this looks good to merge.

@michaelerobertsjr (Contributor, Author)

That comment is stale. I already fixed it in this PR.

The response-side conversationId issue was addressed by:

  • adding conversation_id to responses in src/lib/db.ts
  • persisting it from team conversation completion in src/lib/conversation.ts
  • applying the filter in getQueueResponses() in src/lib/db.ts

So the queue browser is now consistent for new response rows. The only caveat is that historical rows created before the migration still have conversation_id = NULL, so they can't be retroactively filtered by conversation.
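The additive-migration pattern behind that caveat can be illustrated as follows; pendingMigrations is a hypothetical helper, and the real migration in src/lib/db.ts may look different.

```typescript
// Hypothetical helper: given the existing column names of a responses table,
// return the statements an additive migration would still need to run.
// The new column is nullable, which is why historical rows keep
// conversation_id = NULL and simply fall out of conversation-filtered queries.
function pendingMigrations(existingColumns: string[]): string[] {
  const statements: string[] = [];
  if (!existingColumns.includes("conversation_id")) {
    statements.push("ALTER TABLE queue_responses ADD COLUMN conversation_id TEXT");
  }
  return statements;
}
```

Checking for the column first makes the migration idempotent: running it against an already-migrated database produces no statements.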
