Skip to content

feat: structured JSON logging + K8s probe docs (COW-994)#87

Merged
lgahdl merged 9 commits into
developfrom
luizhatem/cow-994-f19-k8s-probes-and-json-structured-logging
Jun 8, 2026
Merged

feat: structured JSON logging + K8s probe docs (COW-994)#87
lgahdl merged 9 commits into
developfrom
luizhatem/cow-994-f19-k8s-probes-and-json-structured-logging

Conversation

@lgahdl

@lgahdl lgahdl commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Grant Review Finding

[F19] Document K8s liveness/readiness probes and add JSON structured logging with chainId — MAJOR (production observability / handoff) (COW-994)

Two related gaps: (1) K8s probe documentation — the indexer exposes /healthz (liveness — always 200 once up) and /ready (readiness — 200 only when fully synced). Using /ready for liveness (as the Docker healthcheck does) means a still-indexing pod with a multi-hour cold start gets killed before it ever serves traffic. The probe mapping isn't documented anywhere. (2) JSON structured logging — logs are pretty-printed in production; for a multi-chain product whose logs will be ingested into a log aggregator and filtered by chainId, JSON-structured logging with chainId/event/block labels per line is needed.


Summary

Structured JSON logging:

  • New src/application/helpers/cowLogger.ts — thin cowLog(level, msg, fields) utility that emits one newline-delimited JSON line per call
  • All console.log/console.warn calls in blockHandler.ts (18 call sites across C1–C5) and settlement.ts migrated to cowLog(); each line now carries chainId and block as proper JSON fields, enabling log aggregator filtering by chain without regex
  • pnpm start updated with --log-format json so Ponder's own internal log lines are also JSON in production; pnpm dev keeps pretty format for local readability

K8s probe documentation (docs/deployment.md):

  • Documents /healthz (liveness — always 200 once up) vs /ready (readiness — 200 only when synced) distinction
  • Adds sample K8s manifest snippet with correct livenessProbe/readinessProbe mapping
  • Explains why /ready must NOT be used as liveness probe (would kill indexing pod before it finishes cold-start sync)
  • Adds structured logging section explaining the --log-format json behavior

Test plan

  • pnpm typecheck — passes
  • pnpm test — 19/19 passing
  • No console.log/console.warn calls remain in blockHandler.ts or settlement.ts
  • Each cowLog call includes chainId and block fields
  • pnpm start script includes --log-format json

🤖 Generated with Claude Code

Structured logging:
- Add src/application/helpers/cowLogger.ts — thin JSON logger that emits
  one JSON line per call with { time, level, msg, ...fields }
- Migrate all console.log/warn calls in blockHandler.ts and settlement.ts
  to cowLog(); each line now carries chainId and block as proper JSON fields
  so log aggregators can filter by chain without regex parsing
- Add --log-format json to pnpm start so Ponder's own log lines are also
  JSON in production; pnpm dev keeps pretty format for local readability

K8s probe docs (docs/deployment.md):
- Document /healthz (liveness) vs /ready (readiness) distinction
- Add sample K8s manifest snippet with correct probe mapping
- Explain why /ready must NOT be used as a liveness probe
- Add structured logging section explaining --log-format json behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 1, 2026

Copy link
Copy Markdown

COW-994

@lgahdl lgahdl changed the base branch from main to develop June 2, 2026 13:51
…COW-994)

The decodeAbiParameters call (orderUid, sellToken, buyToken, sellAmount,
buyAmount) was only used for logging. COW-991 removes this block separately;
removing it here too prevents a conflict when both PRs land — otherwise the
cowLog fields reference variables that no longer exist, crashing the indexer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgahdl

lgahdl commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

The new src/application/helpers/cowLogger.ts has a 7-line multi-paragraph JSDoc block at the top. The project convention (from agent_docs/code-patterns.md) is: no multi-paragraph docstrings or multi-line comment blocks — one short line max.

/**
 * Structured JSON logger for handler code. Outputs one JSON line per call so
 * log aggregators (Datadog, CloudWatch, etc.) can filter by chainId, handler,
 * block number, or any other field without regex parsing.
 *
 * Ponder's own log lines are controlled by --log-format (pretty|json) on the
 * CLI. These handler lines are always JSON so they remain parseable regardless
 * of Ponder's format setting.
 */

This should be one line, e.g.:

/** Structured JSON logger — emits one ndjson line per call for log aggregators. */

The deployment doc already explains the JSON logging setup; the inline comment doesn't need to repeat it.

@lgahdl

lgahdl commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Documentation / CI review

cowLogger.ts multi-line comment block — not a violation.
The file opens with a 9-line /** ... */ JSDoc block. The project's agent_docs/code-patterns.md does not define a "max one short comment line" rule — that convention does not appear in any project docs. The existing helpers (orderbookClient.ts, withTimeout.ts, pollResultErrors.ts) all use multi-line JSDoc file headers. The comment in cowLogger.ts is consistent with the established pattern and accurately explains why the handler lines stay JSON regardless of Ponder's --log-format. No change needed.

K8s probe documentation in docs/deployment.md — accurate and complete.
The new "Kubernetes Probes" section correctly distinguishes /healthz (liveness — always 200 once up) from /ready (readiness — only 200 when fully synced), provides a concrete YAML snippet with sensible periodSeconds/failureThreshold values, and includes the important warning against using /ready as liveness. This is exactly what ops needs.

One gap: initialDelaySeconds not set on the liveness probe.
The YAML snippet sets livenessProbe.periodSeconds: 30 and failureThreshold: 3 but omits initialDelaySeconds. On a cold-start pod the process can take a few seconds to bind the port. Without initialDelaySeconds (or a startupProbe), the liveness probe can fire before the server is up and kill the pod immediately. Recommend adding at least initialDelaySeconds: 10 to the liveness probe snippet, or noting that a startupProbe is the preferred K8s pattern for long-starting containers.

Structured logging section — accurate. The pnpm start--log-format json change and the per-line chainId/block field guarantee are correctly documented.

--log-warn / log level flag not documented.
pnpm start now passes --log-format json but there is no mention of how to control verbosity in production (e.g. --log-level warn). The existing PINO_LOG_LEVEL entry in the env-var table covers Ponder's log level already — just worth confirming those two interact correctly and the table entry is still accurate with the new --log-format json flag. Low priority, but worth a quick check.

Overall: Two minor items — add initialDelaySeconds to the liveness probe YAML, and confirm PINO_LOG_LEVEL still works alongside --log-format json. No blockers.

@lgahdl

lgahdl commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Review: Structured logging + K8s probe docs (COW-994)

cowLogger.ts — multi-line JSDoc block

The file opens with an 8-line /** … */ JSDoc comment. Other helper files in this project (e.g. withTimeout.ts, pollResultErrors.ts) use similar multi-line blocks, so this is consistent with the actual codebase convention — not a deviation. No issue here.

cowLog uses console.log for all levels including warn and error

The function always calls console.log(JSON.stringify({level, …})). The level field in the JSON payload lets aggregators filter by severity. However, warn and error level messages do not go to stderr — they go to stdout. This matters if any log routing in the deployment pipeline distinguishes stdout/stderr streams (e.g. Docker, some Kubernetes logging drivers, systemd journal). If you want warn/error to reach stderr, change the implementation to:

const log = level === "info" ? console.log : level === "warn" ? console.warn : console.error;
log(JSON.stringify({ time: Date.now(), level, msg, ...fields }));

K8s probe documentation accuracy

/healthz is confirmed in src/api/index.ts:18 as a real endpoint returning { status: "ok" }. /ready is referenced in the Docker health check comment; verifying it exists as a Ponder built-in is left to the deployer, but the documentation correctly distinguishes liveness vs readiness semantics. The warning against using /ready for liveness is accurate and important given the hours-long cold-start sync time.

--log-format json in pnpm start

Correct placement. pnpm dev intentionally left without --log-format json for local readability.

Migration completeness

All 18 console.log/console.warn call sites in blockHandler.ts and settlement.ts are migrated. No missed call sites visible in the diff.

One actionable item: consider routing warn/error to console.warn/console.error in cowLog. Everything else looks good.

… K8s probes (COW-994)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@yvesfracari yvesfracari left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anxo mentioned only the blockHandler and settlement logs but there are others that are other files that emit logs and should be using the new logger.

Would actually save this on an ADR.

Comment thread docs/deployment.md Outdated

| Endpoint | Semantic | Returns 200 when |
|----------|----------|-----------------|
| `/healthz` | **Liveness** — is the process alive? | Always, once the server starts |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't /health a better fit here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — updated to /health (Ponder's built-in). Simpler and more idiomatic than our /healthz wrapper for a liveness probe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to /health — Ponder's built-in liveness endpoint, always 200 once the server starts. Simpler and more idiomatic than our /healthz wrapper for a liveness probe.

Comment thread docs/deployment.md
Comment thread docs/deployment.md Outdated
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 18 # 3-minute window before marking unready

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time to deploy would be way bigger, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the comment was misleading. A readiness probe failure does NOT kill the pod — it just removes it from load-balancer rotation. On a cold start the pod will be NotReady for hours; that's expected. Updated the comment and added a paragraph clarifying this so operators aren't surprised.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — a cold-start pod can be NotReady for hours. Clarified that readiness failure doesn't kill the pod (just removes it from rotation) and that hours-long NotReady is expected on a cold start. Also removed the specific failureThreshold/periodSeconds values per your follow-up comment — those depend on the cluster's SLOs.

Comment thread src/application/helpers/cowLogger.ts Outdated
…tion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lgahdl added a commit that referenced this pull request Jun 4, 2026
…reep

- Replace manual DiscreteStatus union with (typeof discreteOrderStatusEnum.enumValues)[number]
  so the type stays in sync with the schema automatically
- Delete cowLogger.ts — structured logging belongs in COW-994 (PR #87),
  not in this fix PR; replace the one cowLog call with a plain console.log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lgahdl and others added 2 commits June 4, 2026 21:00
…readiness semantics

- Rename cowLogger.ts to logger.ts and cowLog() to log() — the module name
  already provides context; the cow prefix was misleading
- Rename `for (const log of receipt.logs)` loop variable to txLog to avoid
  shadowing the newly imported log function
- K8s liveness probe: use /health (Ponder built-in) instead of /healthz
- readinessProbe: fix misleading failureThreshold comment; add paragraph
  clarifying that NotReady does not kill the pod — cold-start sync takes hours

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ostgreSQL section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgahdl lgahdl requested a review from yvesfracari June 5, 2026 00:06

@yvesfracari yvesfracari left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my latest review comment about not using the new logger on other parts of the code. Also, this log should also live on a ADR so it is used later as well. Maybe it even worth to create a lint/rule that no console.log is used or something like this.

Comment thread docs/deployment.md Outdated
Comment on lines +110 to +123
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3 # marks pod unready (not killed) — cold-start sync takes hours

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is just an example but I am not sure if all values here make sense for the app context. I would prefer to remove it since the k8s operator should know how to set it properly, we just have to give him the app information so he can decide it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all the specific timing values (periodSeconds, failureThreshold, initialDelaySeconds) — the YAML now just shows the path and port, and a note explains that the operator picks the timing based on their cluster SLOs.

…W-994)

Extends the log() migration from blockHandler.ts and settlement.ts to all
remaining call sites in composableCow.ts, setup.ts, orderbookClient.ts, and
uidPrecompute.ts. No console.log/warn/error remain in src/application/ outside
logger.ts itself. Documents the scope and usage convention in deployment.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgahdl

lgahdl commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Migrated all remaining console.log/warn/error call sites in src/application/ to log() from logger.ts:

  • composableCow.ts — 4 calls (unknown handler, created, decoded, decode failed)
  • setup.ts — 1 call (cache ready)
  • orderbookClient.ts — 9 calls (no API URL, fetch, fetch result, timeouts, errors)
  • uidPrecompute.ts — 5 calls (missing params, invalid math, too many parts, all terminal, StopLoss missing params)

No console.* calls remain in src/application/ outside logger.ts itself. src/api/ (Hono routes) is intentionally left as-is.

Also expanded the Structured Logging section in docs/deployment.md with the convention (scope, usage example, stdout/stderr routing) instead of a separate ADR.

lgahdl and others added 2 commits June 8, 2026 10:20
…-994)

Keep only path and port — the operator decides periodSeconds, failureThreshold,
and initialDelaySeconds based on their cluster SLOs. Add a note explaining this.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgahdl lgahdl requested a review from yvesfracari June 8, 2026 13:41
Comment thread docs/deployment.md
Comment on lines +132 to +141
**Convention:** all code under `src/application/` uses `log()` from `src/application/helpers/logger.ts` instead of `console.log/warn/error` directly. The `src/api/` layer (Hono routes) is exempt — Hono handles its own logging. Example:

```ts
import { log } from "../helpers/logger";

log("info", "c2:confirmed", { chainId, orderUid, block: String(event.block.number) });
log("warn", "c2:timeout", { chainId, block: String(event.block.number) });
```

`warn` and `error` level messages go to `stderr`; `info` goes to `stdout`. The `level` field in the JSON payload is what log aggregators use to route and alert.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this part is necessary actually

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lint or pre-commit routine would be more effective

@lgahdl lgahdl merged commit 152ab57 into develop Jun 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants