
Unified agent logging and fixes queue processor #171

Open
michaelerobertsjr wants to merge 6 commits into TinyAGI:main from michaelerobertsjr:t3code/pino-unified-agent-logging

Conversation

@michaelerobertsjr (Contributor)

Summary

Fix the queue processor so pending messages drain continuously instead of advancing only when a new message is enqueued, and document options for handling non-delivery response channels (heartbeat, web, and cli).

Changes

  • Updated the queue processor to reschedule queue draining after each per-agent processing chain completes
  • Added startup queue draining so existing pending rows begin processing even without new enqueue events
  • Kept stale-message recovery draining aligned with the new queue drain scheduler
  • Added documentation covering response delivery/ack options for heartbeat, web, and cli
  • Verified the live backlog now decreases after restart instead of remaining stuck behind heartbeat accumulation

Type of Change

  • Bug fix
  • New feature
  • Enhancement to existing feature
  • Breaking change
  • Documentation update
  • Refactor / code cleanup

Testing

  • Ran npm run build:main
  • Restarted TinyClaw locally with ./tinyclaw.sh restart
  • Verified the agent backlog began draining after restart (pending count decreased and processing advanced)
  • Confirmed queue.log showed consecutive heartbeat messages being processed without requiring fresh enqueue events

Checklist

  • I have tested these changes locally
  • My changes don't introduce new warnings or errors
  • I have updated documentation if needed

Mike Roberts added 2 commits March 7, 2026 18:44
- Switch queue, API, channels, daemon, and heartbeat to structured NDJSON logging
- Add log rotation (10 MB, 5 archives) and expose merged history via `GET /api/logs`
- Wire `LOG_LEVEL` through runtime config and update docs/TinyOffice log references
- Route API logs to a dedicated `api` runtime and make log reads async, filtered, and scalable
- Add normalized log levels in shell scripts and preserve severity in structured logs
- Ensure queue backlog drains reliably via microtask scheduling and stale-message recovery triggers
- Centralize graceful shutdown/error exits for Discord, Telegram, and WhatsApp clients
- Add queue DB query/count helpers and document response delivery options for non-delivery channels
@michaelerobertsjr changed the title from "T3code/pino unified agent logging" to "Unified agent logging and fixes queue processor" on Mar 8, 2026
greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes the queue processor's continuous-drain behavior and unifies all runtime logging under pino's structured NDJSON format, replacing ad-hoc log() calls across every component with a consistent createLogger / logError API.

Key changes:

  • Queue drain fix: scheduleQueueDrain() (via queueMicrotask) is called after each per-agent processing chain completes and at startup, so the backlog decreases continuously rather than only advancing on new message:enqueued events.
  • Unified structured logging — pino loggers with rotating file streams (10 MB / 5 archives) replace plain appendFileSync calls across the queue processor, API server, and all three channel clients.
  • Events module: onEvent / emitEvent extracted from logging.ts into a new src/lib/events.ts to cleanly separate live-event dispatch from file-based logging.
  • /api/logs and /api/queue/rows — new endpoints expose filterable structured log entries and typed queue row queries to the frontend; TinyOffice gains a Queue Browser tab.
  • Documentation: docs/RESPONSE-DELIVERY-OPTIONS.md documents the unacked heartbeat/web/cli response problem with three concrete design options.
  • Issues found:
    • LogLevel in src/lib/logging.ts omits 'trace' and 'fatal', which pino can emit at runtime; the frontend type in api.ts already includes all six levels, creating an inconsistency.
    • mergeEntryLists in logging.ts uses Date.parse without a NaN guard: a single malformed time field will permanently block all subsequent entries from that runtime list being merged into the output.
    • PUT /api/agents/:id persists the agent to settings.json unconditionally, then returns HTTP 500 if workspace provisioning fails — clients treating the 500 as a total failure will be wrong since the record is already saved.
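The drain-scheduling pattern described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the names mirror the review (scheduleQueueDrain, processQueue), but the internals here are assumptions.

```typescript
// Minimal sketch of microtask-based queue draining; internals are assumptions.
let drainScheduled = false;
let drainRuns = 0; // illustration only: counts how often a drain ran

function processQueue(): void {
  drainRuns += 1;
  // Real code would fetch pending rows, append processMessage() to each
  // per-agent promise chain, and call scheduleQueueDrain() again when a
  // chain completes, so the backlog keeps draining without fresh
  // message:enqueued events.
}

function scheduleQueueDrain(): void {
  if (drainScheduled) return; // coalesce bursts of triggers into one drain
  drainScheduled = true;
  queueMicrotask(() => {
    drainScheduled = false;
    processQueue();
  });
}

// Called at startup, on message:enqueued, after each chain completes,
// and after stale-message recovery finds rows to retry.
scheduleQueueDrain();
```

The coalescing flag is the key design choice: many triggers within one tick schedule only a single drain pass, while the post-chain re-trigger keeps the loop alive until the backlog is empty.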

Confidence Score: 4/5

  • Safe to merge with minor follow-up items; the core queue-drain fix and logging refactor are solid.
  • The queue-drain bug fix is correct and the pino migration is thorough. Three minor issues were found: a type-narrowness gap in LogLevel, an edge-case NaN bug in the multi-runtime log merger, and a misleading HTTP 500 response when agent provisioning fails (settings are already saved). None of these are regressions or blockers, but two of the three (NaN merge + 500 response) can cause observable incorrect behavior in edge cases.
  • Follow-up files: src/lib/logging.ts (LogLevel type and mergeEntryLists NaN guard) and src/server/routes/agents.ts (HTTP 500 vs saved-settings mismatch).

Important Files Changed

  • src/queue-processor.ts
    Core bug fix: adds scheduleQueueDrain() via queueMicrotask after each per-agent chain completes and at startup, ensuring the backlog drains continuously rather than only on new enqueue events. Logging migrated to pino. Change is correct and well-scoped.
  • src/lib/logging.ts
    Major rewrite: introduces pino-based structured logging with RotatingFileStream, createLogger, readLogEntries, and multi-runtime log merging. Two issues found: the LogLevel type excludes the 'trace' and 'fatal' pino levels, and mergeEntryLists can silently drop entries after any entry with an unparseable timestamp.
  • src/lib/events.ts
    New file extracting onEvent/emitEvent from the former logging module. Clean separation; the unsubscribe function returned by onEvent is a useful addition. No issues found.
  • src/lib/db.ts
    Adds a conversation_id column migration, typed row interfaces (QueueMessageRow, QueueResponseRow, QueueRowCounts), getQueueMessages, getQueueResponses, and getQueueRowCounts. LIKE search uses a proper ESCAPE clause, and normalizeSearchTerm escapes special characters. No issues found.
  • src/server/routes/agents.ts
    Provisioning failure now returns HTTP 500 and ok: false, but mutateSettings is called before provisioning, so the agent is persisted regardless. Clients treating the 500 as a total failure will be wrong. See inline comment.
  • src/server/routes/logs.ts
    Endpoint rewritten to call readLogEntries with rich filtering (source, level, channel, agentId, messageId, conversationId, search). Limit clamped to [1, 1000] with a NaN guard; the previous thread's NaN issue is addressed here.
  • src/server/routes/queue.ts
    New /api/queue/rows endpoint with filterable message/response queries and aggregate counts. Limit clamped to [1, 500] with a NaN guard. No issues found.
  • lib/common.sh
    Adds structured shell logging helpers (write_structured_log, rotate_log_file, normalize_log_level, log_level_priority). The jq -nc call produces a trailing newline (noted in a prior thread). LOG_DIR must be set in the calling shell before log() is invoked, consistent with prior usage.
  • src/channels/discord-client.ts
    Replaces the local log() with createLogger/logError. unhandledRejection now logs and continues (per the previous review thread's resolution); uncaughtException still exits via shutdownDiscord(1). No new issues.
  • tinyoffice/src/app/logs/page.tsx
    New Queue Browser tab with status toggles, search, row counts, and message/response cards. Renders structured LogEntry objects with level/source/component badges, replacing the old raw-line display. No issues found.
  • docs/RESPONSE-DELIVERY-OPTIONS.md
    New documentation file clearly describing the unacked heartbeat/web/cli response problem and three design options with pros/cons. A well-structured reference for future work.
  • tinyoffice/src/lib/api.ts
    Adds typed interfaces (QueueMessageRow, QueueResponseRow, QueueRowsResponse, LogEntry) and a getQueueRows API function. LogEntry.level correctly includes all six pino levels (trace through fatal), unlike the backend LogLevel type.
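The limit clamping mentioned for the logs and queue routes can be sketched as follows; clampLimit is a hypothetical helper name chosen for illustration, not the code in the PR.

```typescript
// Hypothetical sketch of the limit clamping described for /api/logs and
// /api/queue/rows: parse the query param, guard against NaN, clamp to [1, max].
function clampLimit(raw: string | undefined, fallback: number, max: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(parsed)) return fallback; // NaN guard from the review thread
  return Math.min(Math.max(parsed, 1), max); // clamp to [1, max]
}
```

With max = 1000 this matches the [1, 1000] behavior described for /api/logs, and with max = 500 the [1, 500] behavior for /api/queue/rows.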

Sequence Diagram

sequenceDiagram
    participant API as API Server<br/>(messages route)
    participant QE as queueEvents
    participant QP as processQueue()
    participant Chain as agentProcessingChain
    participant DB as SQLite DB
    participant Log as RotatingFileStream<br/>(pino)

    Note over QP: Startup
    QP->>DB: recoverStaleMessages()
    QP->>QP: scheduleQueueDrain() [microtask]
    QP->>DB: getPendingMessages()
    QP->>Chain: build per-agent promise chain

    Note over API,QE: New inbound message
    API->>DB: enqueueMessage()
    API->>QE: emit message:enqueued
    QE->>QP: scheduleQueueDrain() [microtask]
    QP->>DB: getPendingMessages()
    QP->>Chain: append processMessage() to chain
    Chain->>Log: logger.info(structured entry)
    Chain->>DB: dbCompleteMessage()
    Chain->>QP: scheduleQueueDrain() [microtask — keeps draining]

    Note over QP,DB: Stale recovery (every 5 min)
    QP->>DB: recoverStaleMessages()
    DB-->>QP: count > 0
    QP->>QP: scheduleQueueDrain()

Comments Outside Diff (3)

  1. src/lib/logging.ts, line 2202

    LogLevel type is narrower than what pino can emit

    LogLevel is defined as 'debug' | 'info' | 'warn' | 'error', but pino natively emits two additional level labels: 'trace' and 'fatal'. Because formatters.level is set to (label) => ({ level: label }), the serialized JSON always contains the pino label string — so 'fatal' level entries can appear in the log files at runtime.

    The frontend api.ts already accounts for all six levels:

    level: "trace" | "debug" | "info" | "warn" | "error" | "fatal";

    As it stands, any 'fatal' entry read back through readLogEntries carries a type of LogLevel that technically excludes 'fatal', which can mislead callers that switch or discriminate on the level field. Aligning the backend type with the frontend would close this gap.
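A sketch of the widened backend type, matching the six labels the frontend already declares; isLogLevel is a hypothetical guard added here for illustration, not part of the PR.

```typescript
// Widened to all six default pino level labels, matching the frontend api.ts type.
type LogLevel = "trace" | "debug" | "info" | "warn" | "error" | "fatal";

const LOG_LEVELS: readonly LogLevel[] = ["trace", "debug", "info", "warn", "error", "fatal"];

// Hypothetical runtime guard for entries read back from NDJSON log files,
// so callers that switch on `level` can narrow safely.
function isLogLevel(value: unknown): value is LogLevel {
  return typeof value === "string" && (LOG_LEVELS as readonly string[]).includes(value);
}
```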

  2. src/server/routes/agents.ts, line 3206-3210

    HTTP 500 returned but agent settings are already persisted

    mutateSettings (which writes settings.json) is called unconditionally before workspace provisioning is attempted. If provisionAgentWorkspace throws, the agent IS saved to settings, but the route returns HTTP 500 with ok: false. Any client that treats the 500 as a complete failure and retries will create duplicate writes, and any client that shows an error to the user will confuse them — a subsequent page refresh will reveal the agent was actually created.

    Consider returning a 200 (or 207 Multi-Status) with a structured partial-success body so the caller knows the record was saved but the workspace setup needs attention:

    // After provisioning attempt:
    if (provisionError) {
        return c.json({
            ok: true,       // record was saved
            provisioned: false,
            provisionSteps,
            provisionError,
            agent: settings.agents![agentId],
            warning: 'Agent saved but workspace provisioning failed',
        }, 200);  // or 207
    }

    Returning 500 is semantically "server encountered an unrecoverable error", which is at odds with the fact that the primary operation (persisting the agent) succeeded.

  3. src/lib/logging.ts, line 2543-2572

    mergeEntryLists silently drops all subsequent entries in a list if one entry has an unparseable timestamp

    const parsed = Date.parse(String(entry.time));
    if (parsed > nextTime) { ... }

    Date.parse returns NaN for any malformed or missing time value. Because NaN > x is always false for any x (including -Infinity), an entry with a bad timestamp can never "win" the comparison. Its list position is never advanced, which permanently blocks every later entry in the same runtime list from being selected — they stay behind the stuck entry until all other lists are exhausted and the loop exits early.

    In the normal case (pino ISO timestamps and shell date -u output) this won't fire, but a single corrupted NDJSON line in a log file is enough to silently truncate the merged output from that source.

    A minimal guard:

    const parsed = Date.parse(String(entry.time));
    if (Number.isNaN(parsed)) {
        // Skip entries with unreadable timestamps; advance position so
        // later entries in this list are not permanently blocked.
        positions[index] += 1;
        continue;  // restart inner loop
    }
    if (parsed > nextTime) { ... }

    Alternatively, treat a NaN timestamp as 0 (epoch) so the entry sorts last but is still emitted.
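A self-contained sketch of the guarded merge, assuming ascending chronological output over per-runtime lists; the real mergeEntryLists signature and entry shape may differ.

```typescript
interface LogEntry {
  time: string;
  msg: string;
}

// Hypothetical k-way chronological merge across per-runtime entry lists.
// Entries with unparseable timestamps are skipped, so a single corrupted
// NDJSON line can no longer block the rest of its list.
function mergeEntryLists(lists: LogEntry[][]): LogEntry[] {
  const positions = lists.map(() => 0);
  const merged: LogEntry[] = [];
  for (;;) {
    let nextIndex = -1;
    let nextTime = Infinity;
    for (let i = 0; i < lists.length; i++) {
      // Advance past entries whose timestamps do not parse.
      while (positions[i] < lists[i].length) {
        const parsed = Date.parse(String(lists[i][positions[i]].time));
        if (Number.isNaN(parsed)) {
          positions[i] += 1; // skip: never let NaN block later entries
          continue;
        }
        if (parsed < nextTime) {
          nextTime = parsed;
          nextIndex = i;
        }
        break; // head of this list is valid; compare against other lists
      }
    }
    if (nextIndex === -1) break; // all lists exhausted
    merged.push(lists[nextIndex][positions[nextIndex]]);
    positions[nextIndex] += 1;
  }
  return merged;
}
```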

Last reviewed commit: 00e0639

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9af11f40d4


Comment on lines +474 to +478
if (options.messageId) {
clauses.push('message_id = ?');
params.push(options.messageId);
}
const searchTerm = normalizeSearchTerm(options.search);


P2: Apply conversationId filter to response queue queries

/api/queue/rows accepts a conversationId filter and passes it through, but getQueueResponses never applies that field, so response rows from unrelated conversations are still returned when callers filter by conversation. This makes the queue browser/API output internally inconsistent (messages are filtered, responses are not) and can mislead debugging for team handoff chains that share channels/senders.
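A sketch of the missing response-side filter, mirroring the message-side clause building shown in the snippet above; buildResponseWhere is a hypothetical helper for illustration, not the actual getQueueResponses code.

```typescript
interface ResponseQueryOptions {
  messageId?: string;
  conversationId?: string;
}

// Hypothetical clause builder mirroring the message-side query construction,
// with the conversation_id clause the response side was missing.
function buildResponseWhere(options: ResponseQueryOptions): { where: string; params: string[] } {
  const clauses: string[] = [];
  const params: string[] = [];
  if (options.messageId) {
    clauses.push("message_id = ?");
    params.push(options.messageId);
  }
  if (options.conversationId) {
    clauses.push("conversation_id = ?"); // the filter getQueueResponses never applied
    params.push(options.conversationId);
  }
  return { where: clauses.length ? `WHERE ${clauses.join(" AND ")}` : "", params };
}
```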


- Store and expose `conversationId` for queued responses across DB, API, and UI types
- Include `conversationId` when enqueueing completed conversation responses and allow filtering/searching by it
- Prevent extra blank JSON log lines, validate/clamp log limits, and lower noisy queue status logs to debug
- Stop auto-shutdown on unhandled rejections in Discord/Telegram/WhatsApp clients
@michaelerobertsjr (Contributor, Author)

Follow-up fixes from the review have been addressed in the commits above.

Verification performed:

  • npm run build:main
  • npm run build in tinyoffice
  • bash -n lib/common.sh lib/heartbeat-cron.sh lib/daemon.sh tinyclaw.sh

Runtime verification:

  • Restarted TinyClaw and confirmed agent backlog began draining (pending decreased and processing advanced) without requiring fresh enqueue events.

@michaelerobertsjr (Contributor, Author)

@greptileai please take a look

Mike Roberts added 2 commits March 8, 2026 14:32
- Source heartbeat cron logging from `lib/common.sh` to reuse level parsing and rotation
- Replace inline heartbeat JSON logging with `write_structured_log`
- Tighten TypeScript logging source types to runtime log files and document safe stream rotation
- Catch ENOENT when reading rotated log files to avoid transient failures
- Split Discord message excerpt into debug logs and stop logging pairing codes
- Return HTTP 500 from PUT /api/agents/:id when provisioning fails
@michaelerobertsjr (Contributor, Author)

@greptileai please have a look

- Escape `%`, `_`, and `\` in queue search terms and use `LIKE ... ESCAPE '\\'`
- Make log level filtering case-insensitive in API parsing and log matching
- Preserve explicit `limit=0` by checking `limit !== undefined` in TinyOffice API params
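The escaping described in the first bullet above might look like the following; this is a hypothetical sketch, since the actual normalizeSearchTerm body is not shown in this thread.

```typescript
// Hypothetical sketch of the escaping described above: escape \, %, and _
// so user input matches literally under `LIKE ? ESCAPE '\'`, then wrap with
// wildcards for substring search.
function normalizeSearchTerm(search: string | undefined): string | undefined {
  const trimmed = search?.trim();
  if (!trimmed) return undefined;
  const escaped = trimmed.replace(/[\\%_]/g, (ch) => `\\${ch}`);
  return `%${escaped}%`;
}
```

Escaping the backslash itself is essential: without it, a trailing `\` in user input could swallow the closing `%` wildcard.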
@michaelerobertsjr (Contributor, Author)

@greptileai please take a final pass

@michaelerobertsjr (Contributor, Author)

@greptileai are there any additional comments?

greptile-apps bot commented Mar 8, 2026

The only remaining open item is the P2 conversationId filter issue in db.ts:497. It would make the queue browser consistent (currently messages are filtered by conversation but responses aren't), but it's not blocking. React to that comment if you want it included in this PR—otherwise, this looks good to merge.

@michaelerobertsjr (Contributor, Author)

That comment is stale. I already fixed it in this PR.

The response-side conversationId issue was addressed by:

  • adding conversation_id to responses in src/lib/db.ts
  • persisting it from team conversation completion in src/lib/conversation.ts
  • applying the filter in getQueueResponses() in src/lib/db.ts

So the queue browser is now consistent for new response rows. The only caveat is that historical rows created before the migration still have conversation_id = NULL, so they can't be retroactively filtered by conversation.
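The additive-migration pattern behind that caveat can be illustrated as follows; pendingMigrations is a hypothetical helper, and the real migration in src/lib/db.ts may look different.

```typescript
// Hypothetical helper: given the existing column names of a responses table,
// return the statements an additive migration would still need to run.
// The new column is nullable, which is why historical rows keep
// conversation_id = NULL and simply fall out of conversation-filtered queries.
function pendingMigrations(existingColumns: string[]): string[] {
  const statements: string[] = [];
  if (!existingColumns.includes("conversation_id")) {
    statements.push("ALTER TABLE queue_responses ADD COLUMN conversation_id TEXT");
  }
  return statements;
}
```

Checking for the column first makes the migration idempotent: running it against an already-migrated database produces no statements.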
