
release: To Prod #1252

Merged

joelorzet merged 46 commits into prod from staging on May 14, 2026

Conversation


@suisuss suisuss commented May 14, 2026

No description provided.

suisuss and others added 28 commits May 13, 2026 15:21
…outing regressions

Replace ENCODING_ERROR_RE with DISPATCH_FAILURE_RE, which also rejects the
"missing revert data" / data="0x" pattern -- the KEEP-456 failure mode
where the previous loose guard silently tolerated calls into a
non-existent SuperToken proxy.

- create-pool: convert to expect(msg).toBe("") (verified simulates on Sepolia)
- connect-pool/wrap/unwrap/create-flow/update-flow/delete-flow/
  distribute/distribute-flow/update-member-units: tighten to
  DISPATCH_FAILURE_RE (their inputs cannot satisfy positive simulation
  without on-chain state)
- Add 5 always-on regex regression tests so CI catches future
  "simplifications" of DISPATCH_FAILURE_RE that would silently restore
  the KEEP-456 hole

Note: the ticket's suggested CALL_EXCEPTION.*data="0x" pattern is
order-dependent and does not match real ethers v6 error strings (data=
comes before code=CALL_EXCEPTION). Matching data="0x" alone is still
precise, because real reverts have hex content between the quotes.
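The alternation described above can be sketched in a few lines; the actual DISPATCH_FAILURE_RE in the test suite may differ, so treat the regex body and sample strings below as illustrative:

```typescript
// Illustrative sketch of the tightened guard; the real DISPATCH_FAILURE_RE
// may include more alternatives. It must catch both decode failures and
// the KEEP-456 empty-revert-data shape.
const DISPATCH_FAILURE_RE =
  /could not decode result data|missing revert data|data="0x"/;

// A misroute into a non-existent SuperToken proxy (KEEP-456 shape):
const misroute =
  'Error: missing revert data (action="estimateGas", data="0x", code=CALL_EXCEPTION, version=6.16.0)';

// A genuine business revert carries hex content between the quotes,
// so the data="0x" branch must NOT fire on it:
const businessRevert =
  'Error: execution reverted (data="0x08c379a0", code=CALL_EXCEPTION, version=6.16.0)';

console.log(DISPATCH_FAILURE_RE.test(misroute));       // → true
console.log(DISPATCH_FAILURE_RE.test(businessRevert)); // → false
```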
…esses against canonical metadata

- Append BNB Smart Chain (56) and Avalanche C-Chain (43114) to SUPERFLUID_CHAIN_IDS.
  Both use the canonical CFAv1/GDAv1 forwarder addresses and have existing chain
  DB rows in seed-chains.ts.
- Rewrite the SUPERFLUID_CHAIN_IDS docblock: drop the inaccurate "every chain
  Superfluid supports" claim, document the Avalanche Fuji (43113) CFA deviation,
  and point at the new cross-check test as the regression gate.
- Add a vitest cross-check that, for every chain in SUPERFLUID_CHAIN_IDS, asserts
  the pinned CFA/GDA address equals @superfluid-finance/metadata's
  contractsV1.cfaV1Forwarder / gdaV1Forwarder. Fuji-style deviant chains now fail
  at PR time instead of silently mis-routing.
- Add @superfluid-finance/metadata as a devDependency.
- Rename two test descriptions hardcoded to "six chains".
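The cross-check has roughly the following shape; the fixture below stands in for the per-chain contractsV1 lookup in @superfluid-finance/metadata, and the chain-id list and addresses are placeholders, not the real pinned values:

```typescript
// Hypothetical sketch of the cross-check; the real vitest reads
// @superfluid-finance/metadata and the real pinned forwarder constants.
type Forwarders = { cfaV1Forwarder: string; gdaV1Forwarder: string };

const CFA_FORWARDER_ADDRESS = "0xCFA_PINNED"; // placeholder
const GDA_FORWARDER_ADDRESS = "0xGDA_PINNED"; // placeholder
const SUPERFLUID_CHAIN_IDS = [1, 10, 56, 8453, 43114]; // illustrative subset

// Stand-in for the per-chain contractsV1 metadata lookup:
const canonical = new Map<number, Forwarders>(
  SUPERFLUID_CHAIN_IDS.map(
    (id): [number, Forwarders] => [
      id,
      { cfaV1Forwarder: "0xCFA_PINNED", gdaV1Forwarder: "0xGDA_PINNED" },
    ],
  ),
);

// The regression gate: every chain we claim to support must pin the
// canonical forwarder pair; a Fuji-style deviant chain fails at PR time.
for (const id of SUPERFLUID_CHAIN_IDS) {
  const meta = canonical.get(id)!;
  if (meta.cfaV1Forwarder !== CFA_FORWARDER_ADDRESS)
    throw new Error(`chain ${id}: CFA forwarder deviates from pinned address`);
  if (meta.gdaV1Forwarder !== GDA_FORWARDER_ADDRESS)
    throw new Error(`chain ${id}: GDA forwarder deviates from pinned address`);
}
console.log("all chains canonical");
```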
…nt-subscription failure

When the upstream WSS enters a half-open state — TCP/ping-pong alive,
getBlockNumber works, eth_subscribe accepted, but no newHeads delivered —
the existing connect-level fallback never fires because primary's
getBlockNumber keeps succeeding. KEEP-544's noBlockTimer correctly
reconnects every 300s, but every reconnect goes back to the same broken
URL.

Track consecutive BLOCK_ADVANCE_TIMEOUT_MS firings on the current URL as
silentReconnects. Reset on a real height advance in processBlockRange,
on start, and on stop. In reconnectWithBackoff, before the next connect
attempt, call maybeFlipUrlPreference() to swap currentUrlIndex when
silentReconnects >= SILENT_FAILOVER_THRESHOLD (default 2, env-tunable).
connect() now honours currentUrlIndex by reordering its candidate list
so the preferred URL is tried first while keeping primary/fallback
labels stable in logs. The existing primaryProbeTimer already covers
swapping back once primary recovers.

Reset the counter on flip so the new URL gets a full threshold of its
own before flipping back; if both URLs are silent, the monitor alternates
between them, which surfaces the real failure (both upstreams down) to
operators via the existing log signal.
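A minimal sketch of that bookkeeping, with the ChainMonitor timers, reconnect loop, and env parsing omitted; names follow the description above but the real implementation may differ:

```typescript
// Sketch only: the real code lives inside ChainMonitor and reads the
// threshold from an env var.
const SILENT_FAILOVER_THRESHOLD = 2; // default, env-tunable in the real code

class FailoverState {
  currentUrlIndex = 0; // 0 = primary, 1 = fallback
  silentReconnects = 0;

  // Called each time BLOCK_ADVANCE_TIMEOUT_MS fires with no height advance.
  onSilentReconnect(): void {
    this.silentReconnects += 1;
  }

  // Called from processBlockRange on a real height advance (and on start/stop).
  onBlockAdvance(): void {
    this.silentReconnects = 0;
  }

  // Called in reconnectWithBackoff before the next connect attempt.
  maybeFlipUrlPreference(): boolean {
    if (this.silentReconnects < SILENT_FAILOVER_THRESHOLD) return false;
    this.currentUrlIndex = this.currentUrlIndex === 0 ? 1 : 0;
    this.silentReconnects = 0; // new URL gets a full threshold of its own
    return true;
  }
}

const s = new FailoverState();
s.onSilentReconnect();          // 1st silent timeout: below threshold
s.maybeFlipUrlPreference();     // no flip yet
s.onSilentReconnect();          // 2nd: threshold reached
s.maybeFlipUrlPreference();     // flips and resets the counter
console.log(s.currentUrlIndex); // → 1 (fallback preferred)
```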
…ent metadata

The on-IPFS agent card and the .well-known/agent-card.json A2A endpoint
both shipped with a generic description ("Web3 workflow automation
platform...") and no inline skills, no keywords, no structured payment
fields. Result: keyword search against 8004scan for "keeperhub"
returned zero results; downstream indexers (agentcash search, x402scan)
have no searchable surface to match against.

Changes:

- Description rewritten to match the D-066 wedge wording: execution
  layer, x402 on Base, MPP on Tempo, Managed DeFi, onchain audit trail.
- Inline skills array (8 entries) on both surfaces with matching IDs:
  workflow_discovery, workflow_invocation, ai_workflow_generation,
  protocol_actions, onchain_execution, templates, execution_monitoring,
  reputation_feedback (new — completes the ERC-8004 feedback
  read/write symmetry).
- Per-skill descriptions and tags include the searchable terms agents
  query for (x402, mpp, usdc, base, tempo, aave, safe, defi, swap,
  bridge, stake, contract, etc.).
- keywords array (30 tags) added to data/agent-registry.json.
- Structured payment block (x402 on Base + MPP on Tempo with payTo
  addresses) added to data/agent-registry.json.
- New service entry pointing at /api/openapi.json (forthcoming) and
  https://docs.keeperhub.com.

After merge, run scripts/pin-agent-card.ts to pin the new content to
Pinata and scripts/update-agent-uri.ts to update the onchain tokenURI.

Pairs with techops-services/infrastructure PR #194 (Cloudflare bypass
for AI-vendor UAs on .well-known paths) so the live A2A endpoint is
reachable from Claude/ChatGPT/Perplexity browse-on-behalf-of-user.

The previous payment.payTo addresses implied 100% of per-call USDC
flows to the platform wallet. In reality KeeperHub is a multi-creator
marketplace: each listed workflow advertises its own payTo in the
402 response and at /api/mcp/workflows/<slug>, and settlement splits
70% creator / 30% platform per lib/earnings/queries.ts.

Replaces the misleading payment.payTo with:

  - "marketplace" block describing the multi-creator model, the
    platformFeePercent (30), creatorSharePercent (70), and the
    platformFeeRecipient address per chain.
  - "payment" block keeping the protocol/network/asset declarations
    (x402 on Base USDC, MPP on Tempo USDC.e) but dropping the
    overclaiming payTo — that's per-workflow runtime data, not
    agent-level identity data.

platformFeePercent is hardcoded today because the agent card is
static IPFS content; if the platform-fee env var changes, re-pin.
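For illustration, the 70/30 settlement split can be sketched as below; the function name and rounding rule (platform absorbs the dust) are assumptions, not lifted from lib/earnings/queries.ts:

```typescript
// Hypothetical sketch; lib/earnings/queries.ts is the source of truth.
// Amounts are USDC base units (6 decimals).
const PLATFORM_FEE_PERCENT = 30;
const CREATOR_SHARE_PERCENT = 70;
if (PLATFORM_FEE_PERCENT + CREATOR_SHARE_PERCENT !== 100)
  throw new Error("shares must sum to 100");

function splitSettlement(amountBaseUnits: number): { creator: number; platform: number } {
  const creator = Math.floor((amountBaseUnits * CREATOR_SHARE_PERCENT) / 100);
  // Assumption: the platform side absorbs any rounding dust.
  return { creator, platform: amountBaseUnits - creator };
}

const split = splitSettlement(1_000_000); // a 1 USDC per-call payment
console.log(split); // → { creator: 700000, platform: 300000 }
```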
…overability

feat: KEEP-554 enrich ERC-8004 agent card with skills, keywords, payment metadata
fix: update Aave V3 data provider addresses
…n in the metadata cross-check

Three assertions inside the canonical metadata cross-check block:

- Fuji (43113) is intentionally absent from SUPERFLUID_CHAIN_IDS.
- Fuji's CFAv1Forwarder address in @superfluid-finance/metadata is
  NOT equal to CFA_FORWARDER_ADDRESS -- this captures the documented
  deviation as a fact under test.
- Fuji's GDAv1Forwarder IS canonical -- documents the asymmetric
  shape of the deviation (CFA-only) and catches the rare case of
  GDA also drifting on Fuji.

Together these pin both the trap (CFA is non-canonical upstream) and
our response to it (Fuji is excluded from the chain list). If
Superfluid ever redeploys Fuji's CFA at the canonical address, the
"is not canonical" assertion fails -- and that failure is positive:
the trap is gone and Fuji can safely join SUPERFLUID_CHAIN_IDS.
…n-coverage

feat(superfluid): KEEP-463 expand chain coverage and cross-check addresses against canonical metadata

The previous wording suggested the blocker was finding a Superfluid pool
address. The actual blocker is that TEST_ADDRESS has no deployed code at
all -- the GDA dispatches into the pool address as a contract call, and
any deployed contract implementing the pool interface would suffice.
… tests

The previous regex shape tests pinned hardcoded ethers v6.16.0 sample
strings. A future ethers upgrade that changed error formatting would
have silently invalidated DISPATCH_FAILURE_RE without failing these
tests -- the on-chain block would lose its KEEP-456 guard and the only
signal would have been a real routing bug slipping into prod.

Replace each sample with a call to `ethers.makeError(...)` that
constructs the same shape we need to match. The assertion target is now
String(error), which mirrors what estimateGasError actually returns, so
the test is self-updating across ethers upgrades: if ethers changes
wording, these tests fail at upgrade time and reviewers update the
regex alongside the dependency bump.
…ntext

Tighten the bare `data="0x"` alternation to `,\s*data="0x"`, which
requires the field-separator comma that precedes top-level
CALL_EXCEPTION params. Nested JSON uses `"data": value` (with colon-
space), so the new anchor distinguishes top-level from nested without
false-positiving if a future ethers version (or quirky calldata)
produced a transaction object whose data was literally "0x" while the
top-level revert data was populated.

Add a regression test that exercises exactly that hypothetical: a
revert with populated top-level data but empty nested transaction.data
must NOT match -- it's a business revert, not a routing bug.
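The anchored branch and the hypothetical regression case can be sketched as follows (the other alternatives of the real DISPATCH_FAILURE_RE are elided here):

```typescript
// Only the anchored empty-revert-data branch, in isolation:
const EMPTY_TOPLEVEL_DATA_RE = /,\s*data="0x"/;

// Top-level empty revert data (KEEP-456 misroute): a comma-separated
// param in the CALL_EXCEPTION string.
const misroute =
  'Error: missing revert data (action="estimateGas", data="0x", code=CALL_EXCEPTION)';

// Hypothetical future shape: populated top-level data, but a nested
// transaction object whose data is literally "0x" -- nested JSON uses
// colon-space ("data": "0x"), so the comma anchor does not fire.
const businessRevert =
  'Error: execution reverted (data="0x08c379a0", transaction={ "to": "0xabc", "data": "0x" }, code=CALL_EXCEPTION)';

console.log(EMPTY_TOPLEVEL_DATA_RE.test(misroute));       // → true
console.log(EMPTY_TOPLEVEL_DATA_RE.test(businessRevert)); // → false
```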
Directly satisfies acceptance criterion #2 -- "no test passes when its
action is silently re-routed to a non-existent contract method" --
without relying on the regex-shape proxy.

Takes a currently-passing action (grant-flow-operator, which routes
correctly to the CFAv1Forwarder) and overrides its destination to
SEPOLIA_FUSDC: a real ERC20 contract that exists on chain but has no
Superfluid methods. Verified empirically: this dispatch produces
`Error: missing revert data ... code=CALL_EXCEPTION` -- exactly the
KEEP-456 shape. The new test asserts DISPATCH_FAILURE_RE matches it.

If anyone weakens or removes DISPATCH_FAILURE_RE in a way that lets
KEEP-456 through, this test fails against the live RPC. The regex-shape
tests (run in CI) guard the regex against silent strengthening; this
test (RPC-gated) guards against silent weakening by exercising a real
misroute end-to-end.
…nsaction

CI typecheck (TS2741) caught three transaction objects in the regex
shape tests that were missing the required `data` field on
CallExceptionTransaction. Local tsc with skipLibCheck did not surface
this. No behavioral change -- the regex tests assert on the top-level
error string, not the nested transaction shape.
…-tests

test: KEEP-459 harden on-chain dispatch assertions against routing regressions

WebSocketProvider.ready in ethers v6 is a synchronous boolean getter, not
a Promise; awaiting it resolves immediately and lets openProvider proceed
before the ws upgrade actually completes. Replace with getBlockNumber(),
which internally calls _waitUntilReady(). Since _waitUntilReady() never
rejects on socket failure (only resolves on open), wrap the call in a
Promise.race against an explicit ws-error listener and a 10s connect
timeout, matching the pattern PR #988 used in the block dispatcher.

Without the race, an unreachable host (DNS NXDOMAIN, ECONNREFUSED) hangs
the connect attempt indefinitely instead of walking to the fallback URL.

Update the two MockProvider stubs in unit tests to expose getBlockNumber
in place of the unused ready Promise, and baseline the no-subscriber
heartbeat assertion after connect so the initial connect probe is not
counted as a heartbeat ping.
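The race pattern looks roughly like this, sketched against a minimal provider interface rather than the real ethers WebSocketProvider; the interface, names, and stub are illustrative:

```typescript
// Minimal stand-in for the relevant WebSocketProvider surface.
interface WsLike {
  getBlockNumber(): Promise<number>; // resolves only once the ws is open
  onSocketError(handler: (err: Error) => void): void;
}

const CONNECT_TIMEOUT_MS = 10_000;

function openProvider(provider: WsLike, timeoutMs = CONNECT_TIMEOUT_MS): Promise<number> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("ws connect timeout")), timeoutMs);
  });
  const socketError = new Promise<never>((_, reject) => {
    provider.onSocketError(reject); // DNS NXDOMAIN, ECONNREFUSED, ...
  });
  // Without the race, an unreachable host would hang getBlockNumber()
  // indefinitely instead of letting the caller walk to the fallback URL.
  return Promise.race([provider.getBlockNumber(), socketError, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Usage with a stub whose socket never opens:
const dead: WsLike = {
  getBlockNumber: () => new Promise<number>(() => {}),
  onSocketError: (h) => { setTimeout(() => h(new Error("ECONNREFUSED")), 50); },
};
openProvider(dead).catch((e) => console.log(String(e))); // → Error: ECONNREFUSED
```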
…ready-await

fix: KEEP-349 race connect against ws-error + timeout in openProvider
joelorzet and others added 5 commits May 14, 2026 11:50
…itor liveness, reconnect, and SQS enqueue paths

Expose a per-chain Prometheus surface at :3000/metrics so operators can alert
on silent subscriptions without waiting for users to report broken workflows.
The block-dispatcher previously emitted only console logs and a /health
boolean; today's prod incident only surfaced because a user noticed their
workflow stopped firing several hours later.

Metrics added (all keeperhub_block_dispatcher_*):

  Gauges (per chain, except chains_monitored)
    seconds_since_last_block        scrape-time, primary alert signal
    socket_age_seconds              scrape-time, debug
    is_alive                        mirrors ChainMonitor.isAlive()
    is_reconnecting                 reconnect-with-backoff in progress
    has_active_subscription         eth_subscribe completed
    current_url_index               0=primary, 1=fallback (KEEP-557 flips)
    silent_reconnects_current       consecutive block-advance timeouts
    last_processed_block            highest block processed
    workflows_tracked               block-trigger workflows per chain
    chains_monitored                pod-level

  Counters
    blocks_received_total           rate gives delivery rate per chain
    blocks_matched_total            workflow trigger fires (no workflow_id label)
    ws_closes_total                 reason: upstream_close|pong_timeout|
                                    block_advance_timeout|socket_age_recycle|
                                    silent_failover|ping_send_failure|
                                    primary_probe_recovered
    reconnects_total                outcome: success|exhausted
    url_flips_total                 direction: to_fallback|to_primary
    sqs_enqueue_total               outcome: success|error
    unhandled_rejections_total      ethers v6 eth_unsubscribe etc.

  Histograms
    reconnect_duration_ms           handleDisconnect to subscription_active
    block_lag_seconds               wall_clock - block.timestamp on receive

Wiring is a single new lib/metrics.ts module with one process-wide prom-client
Registry. ChainMonitor calls thin record*/set* helpers at each lifecycle
point; no business logic depends on prom-client. seconds_since_last_block and
socket_age_seconds use prom-client collect() callbacks so they always reflect
the latest value at scrape time without per-block emission.

Deploy: prometheus.io/scrape annotations added to staging and prod
block-dispatcher-values.yaml. The /metrics endpoint is wired to the same
express server that already serves /health, no new container port.

Tests: 14 new cases across metrics.test.ts (counters, gauges, scrape-time
callbacks, histograms) and chain-monitor.test.ts (integration: emit block ->
counters and gauges advance; ws close -> ws_close counter labeled correctly).
80/80 vitest pass.

Docs: lib/metrics/METRICS_REFERENCE.md gets a new section 6 (BLOCK DISPATCHER)
documenting every metric with description, labels, and alert thresholds.

Out of scope (separate PR in techops-infrastructure repo): Grafana dashboard
JSON, Terraform alert rules, and ALERTS_REFERENCE.md entries that consume
these metrics.

prom-client gauges retain the last value of every label combination they
have ever seen. Without an explicit .remove({chain}) call when a
ChainMonitor stops, the seconds_since_last_block (and every other
per-chain gauge) would emit the chain's last value forever — including
chains whose workflows were deleted or chains the reconciler tore down
because they were zombie. That would cause the new 'Block Dispatcher
Chain Silent' alert to fire indefinitely on chains we no longer monitor.

forgetChain now removes every per-chain gauge labelset so the chain
disappears entirely from /metrics output. ChainMonitor.stop() no longer
needs to flip individual gauges to 0 first — forgetChain wipes the
labels regardless. Counters and histograms keep their cumulative
history; rate()/increase() handle counter resets correctly anyway.

Test updated to cover all nine per-chain gauges and asserts an
unaffected chain is not impacted by the cleanup. 80/80 pass.
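The cleanup semantics can be sketched with a minimal stand-in for prom-client's labeled Gauge (the real module calls prom-client's set/remove directly; the stand-in only reproduces the behavior that matters here, labelsets persisting until removed):

```typescript
// Stand-in for a prom-client Gauge with a single "chain" label.
class LabeledGauge {
  private values = new Map<string, number>();
  set(labels: { chain: string }, value: number): void {
    this.values.set(labels.chain, value);
  }
  remove(labels: { chain: string }): void {
    this.values.delete(labels.chain);
  }
  // What /metrics would expose for this gauge family:
  expose(): string[] {
    return [...this.values].map(([chain, v]) => `gauge{chain="${chain}"} ${v}`);
  }
}

const secondsSinceLastBlock = new LabeledGauge(); // one of the nine per-chain gauges
secondsSinceLastBlock.set({ chain: "base" }, 4);
secondsSinceLastBlock.set({ chain: "polygon" }, 7);

// forgetChain: wipe every per-chain labelset so a torn-down chain
// disappears from /metrics instead of emitting its last value forever.
function forgetChain(chain: string, gauges: LabeledGauge[]): void {
  for (const g of gauges) g.remove({ chain });
}

forgetChain("polygon", [secondsSinceLastBlock]);
console.log(secondsSinceLastBlock.expose()); // → [ 'gauge{chain="base"} 4' ]
```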
…-silent-wss-failover

fix(block-dispatcher): KEEP-557 auto-failover to fallback WSS on silent-subscription failure
…-metrics

feat(block-dispatcher): KEEP-557 Prometheus metrics for chain monitor liveness, reconnects, and SQS enqueue
@joelorzet joelorzet merged commit aaaea67 into prod May 14, 2026
42 checks passed
