…outing regressions
Replace ENCODING_ERROR_RE with DISPATCH_FAILURE_RE, which also rejects the
"missing revert data" / data="0x" pattern -- the KEEP-456 failure mode
where the previous loose guard silently tolerated calls into a
non-existent SuperToken proxy.
- create-pool: convert to expect(msg).toBe("") (verified: simulates successfully on Sepolia)
- connect-pool/wrap/unwrap/create-flow/update-flow/delete-flow/
distribute/distribute-flow/update-member-units: tighten to
DISPATCH_FAILURE_RE (their inputs cannot satisfy positive simulation
without on-chain state)
- Add 5 always-on regex regression tests so CI catches future
"simplifications" of DISPATCH_FAILURE_RE that would silently restore
the KEEP-456 hole
Note: the ticket's suggested CALL_EXCEPTION.*data="0x" pattern is
order-dependent and does not match real ethers v6 error strings (data=
comes before code=CALL_EXCEPTION). Use data="0x" alone -- precise
because real reverts have hex content between the quotes.
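For illustration, a minimal sketch of the guard under the constraints above
(the repo's actual DISPATCH_FAILURE_RE and the exact ethers wording may
differ; the sample string is abbreviated and hypothetical):

```ts
// Hypothetical sketch -- not the repo's actual regex or error text.
// Rejects the KEEP-456 shape: "missing revert data", or a top-level revert
// payload that is the empty "0x"; genuine business reverts carry hex
// selector bytes between the quotes, so they miss the second branch.
const DISPATCH_FAILURE_RE = /missing revert data|data="0x"/;

// Abbreviated ethers v6-style string: data= appears before
// code=CALL_EXCEPTION, which is why the ticket's ordered
// CALL_EXCEPTION.*data="0x" pattern never matches.
const sample =
  'Error: missing revert data (action="estimateGas", data=null, ' +
  'reason=null, code=CALL_EXCEPTION, version=6.x)';

console.log(DISPATCH_FAILURE_RE.test(sample)); // true, via the first branch
```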
…esses against canonical metadata
- Append BNB Smart Chain (56) and Avalanche C-Chain (43114) to
  SUPERFLUID_CHAIN_IDS. Both use the canonical CFAv1/GDAv1 forwarder
  addresses and have existing chain DB rows in seed-chains.ts.
- Rewrite the SUPERFLUID_CHAIN_IDS docblock: drop the inaccurate "every
  chain Superfluid supports" claim, document the Avalanche Fuji (43113)
  CFA deviation, and point at the new cross-check test as the regression
  gate.
- Add a vitest cross-check that, for every chain in SUPERFLUID_CHAIN_IDS,
  asserts the pinned CFA/GDA address equals @superfluid-finance/metadata's
  contractsV1.cfaV1Forwarder / gdaV1Forwarder. Fuji-style deviant chains
  now fail at PR time instead of silently mis-routing.
- Add @superfluid-finance/metadata as a devDependency.
- Rename two test descriptions hardcoded to "six chains".
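A minimal sketch of the cross-check, assuming the metadata package's default
export exposes a `networks` array with `contractsV1` entries and that the
pinned constants live in a local module (the import path and constant names
are assumptions):

```ts
// Hypothetical sketch of the cross-check test; the real file may differ.
import { describe, expect, it } from "vitest";
import sfMeta from "@superfluid-finance/metadata";

import {
  CFA_FORWARDER_ADDRESS,
  GDA_FORWARDER_ADDRESS,
  SUPERFLUID_CHAIN_IDS,
} from "./superfluid-constants"; // assumed module path

describe("Superfluid forwarders match canonical metadata", () => {
  it.each(SUPERFLUID_CHAIN_IDS)("chain %d uses canonical forwarders", (chainId) => {
    const network = sfMeta.networks.find((n) => n.chainId === chainId);
    expect(network, `chain ${chainId} missing from metadata`).toBeDefined();

    // Fuji-style deviant chains fail here at PR time instead of mis-routing.
    expect(network!.contractsV1.cfaV1Forwarder.toLowerCase())
      .toBe(CFA_FORWARDER_ADDRESS.toLowerCase());
    expect(network!.contractsV1.gdaV1Forwarder.toLowerCase())
      .toBe(GDA_FORWARDER_ADDRESS.toLowerCase());
  });
});
```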
…nt-subscription failure
When the upstream WSS enters a half-open state — TCP/ping-pong alive,
getBlockNumber works, eth_subscribe accepted, but no newHeads delivered —
the existing connect-level fallback never fires because primary's
getBlockNumber keeps succeeding. KEEP-544's noBlockTimer correctly
reconnects every 300s, but every reconnect goes back to the same broken
URL.
Track consecutive BLOCK_ADVANCE_TIMEOUT_MS firings on the current URL as
silentReconnects. Reset on a real height advance in processBlockRange, on
start, and on stop. In reconnectWithBackoff, before the next connect
attempt, call maybeFlipUrlPreference() to swap currentUrlIndex when
silentReconnects >= SILENT_FAILOVER_THRESHOLD (default 2, env-tunable).
connect() now honours currentUrlIndex by reordering its candidate list so
the preferred URL is tried first while keeping primary/fallback labels
stable in logs.
The existing primaryProbeTimer already covers swapping back once primary
recovers. Reset the counter on flip so the new URL gets a full threshold
of its own before flipping back; if both are silent, the monitor
alternates between the two URLs, which surfaces the real failure (both
upstreams down) to operators via the existing log signal.
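A minimal sketch of the flip logic, using the field and method names above
(the real ChainMonitor wiring may differ):

```ts
// Hypothetical sketch of the silent-failover bookkeeping described above.
const SILENT_FAILOVER_THRESHOLD = Number(
  process.env.SILENT_FAILOVER_THRESHOLD ?? 2,
);

class ChainMonitor {
  private urls: string[] = [];   // [primary, fallback]
  private currentUrlIndex = 0;   // 0 = primary, 1 = fallback
  private silentReconnects = 0;  // consecutive BLOCK_ADVANCE_TIMEOUT_MS firings

  // Called from reconnectWithBackoff before the next connect attempt.
  private maybeFlipUrlPreference(): void {
    if (this.silentReconnects < SILENT_FAILOVER_THRESHOLD) return;
    this.currentUrlIndex = this.currentUrlIndex === 0 ? 1 : 0;
    // Give the new URL a full threshold of its own before flipping back.
    this.silentReconnects = 0;
  }

  // connect() tries the preferred URL first; primary/fallback labels stay stable.
  private candidateUrls(): string[] {
    return this.currentUrlIndex === 0 ? [...this.urls] : [...this.urls].reverse();
  }

  // Reset on a real height advance (processBlockRange), on start, and on stop.
  private resetSilentReconnects(): void {
    this.silentReconnects = 0;
  }
}
```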
…ent metadata
The on-IPFS agent card and the .well-known/agent-card.json A2A endpoint
both shipped with a generic description ("Web3 workflow automation
platform...") and no inline skills, no keywords, no structured payment
fields. Result: keyword search against 8004scan for "keeperhub"
returned zero results; downstream indexers (agentcash search, x402scan)
have no searchable surface to match against.
Changes:
- Description rewritten to match the D-066 wedge wording: execution
layer, x402 on Base, MPP on Tempo, Managed DeFi, onchain audit trail.
- Inline skills array (8 entries) on both surfaces with matching IDs:
workflow_discovery, workflow_invocation, ai_workflow_generation,
protocol_actions, onchain_execution, templates, execution_monitoring,
reputation_feedback (new — completes the ERC-8004 feedback
read/write symmetry).
- Per-skill descriptions and tags include the searchable terms agents
query for (x402, mpp, usdc, base, tempo, aave, safe, defi, swap,
bridge, stake, contract, etc.).
- keywords array (30 tags) added to data/agent-registry.json.
- Structured payment block (x402 on Base + MPP on Tempo with payTo
addresses) added to data/agent-registry.json.
- New service entry pointing at /api/openapi.json (forthcoming) and
https://docs.keeperhub.com.
After merge, run scripts/pin-agent-card.ts to pin the new content to
Pinata and scripts/update-agent-uri.ts to update the onchain tokenURI.
Pairs with techops-services/infrastructure PR #194 (Cloudflare bypass
for AI-vendor UAs on .well-known paths) so the live A2A endpoint is
reachable from Claude/ChatGPT/Perplexity browse-on-behalf-of-user.
The previous payment.payTo addresses implied that 100% of per-call USDC
flows to the platform wallet. In reality KeeperHub is a multi-creator
marketplace: each listed workflow advertises its own payTo in the
402 response and at /api/mcp/workflows/<slug>, and settlement splits
70% creator / 30% platform per lib/earnings/queries.ts.
Replaces the misleading payment.payTo with:
- "marketplace" block describing the multi-creator model, the
platformFeePercent (30), creatorSharePercent (70), and the
platformFeeRecipient address per chain.
- "payment" block keeping the protocol/network/asset declarations
(x402 on Base USDC, MPP on Tempo USDC.e) but dropping the
overclaiming payTo — that's per-workflow runtime data, not
agent-level identity data.
platformFeePercent is hardcoded today because the agent card is
static IPFS content; if the platform-fee env var changes, re-pin.
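A minimal sketch of the new registry blocks, with illustrative field names
and placeholder addresses rather than the actual data/agent-registry.json
content:

```ts
// Hypothetical sketch of the marketplace/payment excerpt described above.
const agentRegistryExcerpt = {
  marketplace: {
    model: "multi-creator",
    platformFeePercent: 30,   // hardcoded; re-pin if the env var changes
    creatorSharePercent: 70,
    // Per-chain platform fee recipients (placeholder addresses).
    platformFeeRecipient: {
      base: "0x0000000000000000000000000000000000000000",
      tempo: "0x0000000000000000000000000000000000000000",
    },
  },
  payment: {
    // Protocol/network/asset declarations only -- no agent-level payTo,
    // because the effective payTo is per-workflow runtime data served by
    // the 402 response and /api/mcp/workflows/<slug>.
    methods: [
      { protocol: "x402", network: "base", asset: "USDC" },
      { protocol: "mpp", network: "tempo", asset: "USDC.e" },
    ],
  },
};
```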
…overability feat: KEEP-554 enrich ERC-8004 agent card with skills, keywords, payment metadata
fix: update Aave V3 data provider addresses
…n in the metadata cross-check
Three assertions inside the canonical metadata cross-check block:
- Fuji (43113) is intentionally absent from SUPERFLUID_CHAIN_IDS.
- Fuji's CFAv1Forwarder address in @superfluid-finance/metadata is NOT
  equal to CFA_FORWARDER_ADDRESS -- this captures the documented deviation
  as a fact under test.
- Fuji's GDAv1Forwarder IS canonical -- documents the asymmetric shape of
  the deviation (CFA-only) and catches the rare case of GDA also drifting
  on Fuji.
Together these pin both the trap (CFA is non-canonical upstream) and our
response to it (Fuji is excluded from the chain list). If Superfluid ever
redeploys Fuji's CFA at the canonical address, the "is not canonical"
assertion fails -- and that failure is positive: the trap is gone and Fuji
can safely join SUPERFLUID_CHAIN_IDS.
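A minimal sketch of the three assertions, reusing the assumed constants and
metadata accessor from the cross-check sketch above:

```ts
// Hypothetical sketch of the Fuji deviation assertions; the real test may differ.
import { expect, it } from "vitest";
import sfMeta from "@superfluid-finance/metadata";

import {
  CFA_FORWARDER_ADDRESS,
  GDA_FORWARDER_ADDRESS,
  SUPERFLUID_CHAIN_IDS,
} from "./superfluid-constants"; // assumed module path

const FUJI_CHAIN_ID = 43113;

it("pins the Fuji CFA deviation and our response to it", () => {
  // Our response: Fuji is deliberately excluded from the routed chain list.
  expect(SUPERFLUID_CHAIN_IDS).not.toContain(FUJI_CHAIN_ID);

  const fuji = sfMeta.networks.find((n) => n.chainId === FUJI_CHAIN_ID);
  expect(fuji).toBeDefined();

  // The trap: Fuji's CFA forwarder is NOT the canonical address.
  expect(fuji!.contractsV1.cfaV1Forwarder.toLowerCase())
    .not.toBe(CFA_FORWARDER_ADDRESS.toLowerCase());

  // The deviation is CFA-only: GDA is still canonical on Fuji.
  expect(fuji!.contractsV1.gdaV1Forwarder.toLowerCase())
    .toBe(GDA_FORWARDER_ADDRESS.toLowerCase());
});
```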
…n-coverage feat(superfluid): KEEP-463 expand chain coverage and cross-check addresses against canonical metadata
The previous wording suggested the blocker was finding a Superfluid pool address. The actual blocker is that TEST_ADDRESS has no deployed code at all -- the GDA dispatches into the pool address as a contract call, and any deployed contract implementing the pool interface would suffice.
… tests
The previous regex shape tests pinned hardcoded ethers v6.16.0 sample
strings. A future ethers upgrade that changed error formatting would have
silently invalidated DISPATCH_FAILURE_RE without failing these tests --
the on-chain block would lose its KEEP-456 guard and the only signal
would have been a real routing bug slipping into prod.
Replace each sample with a call to `ethers.makeError(...)` that constructs
the same shape we need to match. The assertion target is now
String(error), which mirrors what estimateGasError actually returns, so
the test is self-updating across ethers upgrades: if ethers changes
wording, these tests fail at upgrade time and reviewers update the regex
alongside the dependency bump.
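A minimal sketch of the self-updating shape test, assuming
DISPATCH_FAILURE_RE is importable from the module under test (the info
fields shown are the ones ethers v6's CALL_EXCEPTION type requires):

```ts
// Hypothetical sketch; module path and test phrasing are illustrative.
import { describe, expect, it } from "vitest";
import { makeError } from "ethers";

import { DISPATCH_FAILURE_RE } from "./dispatch-failure"; // assumed path

describe("DISPATCH_FAILURE_RE shape", () => {
  it("matches an ethers-constructed missing-revert-data error", () => {
    // makeError builds the same error shape estimateGas produces, so the
    // serialized wording tracks whichever ethers version is installed.
    const error = makeError("missing revert data", "CALL_EXCEPTION", {
      action: "estimateGas",
      data: null,
      reason: null,
      invocation: null,
      revert: null,
      transaction: {
        to: "0x0000000000000000000000000000000000000001",
        from: "0x0000000000000000000000000000000000000002",
        data: "0x", // the field CI's TS2741 flagged when omitted
      },
    });

    // Assert against String(error), mirroring what estimateGasError returns;
    // if a future ethers changes the wording, this fails at upgrade time.
    expect(String(error)).toMatch(DISPATCH_FAILURE_RE);
  });
});
```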
…ntext
Tighten the bare `data="0x"` alternation to `,\s*data="0x"`, which
requires the field-separator comma that precedes top-level CALL_EXCEPTION
params. Nested JSON uses `"data": value` (colon plus space), so the new
anchor distinguishes top-level from nested without false-positiving if a
future ethers version (or quirky calldata) produced a transaction object
whose data was literally "0x" while the top-level revert data was
populated.
Add a regression test that exercises exactly that hypothetical: a revert
with populated top-level data but an empty nested transaction.data must
NOT match -- it's a business revert, not a routing bug.
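A minimal illustration of the anchored alternation (error strings are
abbreviated and hypothetical; the full DISPATCH_FAILURE_RE also matches
"missing revert data"):

```ts
// Hypothetical illustration of the top-level vs nested distinction.
const TOP_LEVEL_EMPTY_DATA = /,\s*data="0x"/;

// Top-level CALL_EXCEPTION params: a field-separator comma precedes data="0x".
const misroute =
  'execution reverted (action="estimateGas", data="0x", reason=null, ' +
  'code=CALL_EXCEPTION, version=6.x)';

// Business revert: populated top-level data, but the nested transaction's
// calldata happens to be literally "0x" -- nested JSON uses `"data": value`.
const businessRevert =
  'execution reverted (action="estimateGas", data="0x08c379a0...", ' +
  'transaction={ "data": "0x", "to": "0x..." }, code=CALL_EXCEPTION)';

console.log(TOP_LEVEL_EMPTY_DATA.test(misroute));       // true
console.log(TOP_LEVEL_EMPTY_DATA.test(businessRevert)); // false
```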
Directly satisfies acceptance criterion #2 -- "no test passes when its action is silently re-routed to a non-existent contract method" -- without relying on the regex-shape proxy. Takes a currently-passing action (grant-flow-operator, which routes correctly to the CFAv1Forwarder) and overrides its destination to SEPOLIA_FUSDC: a real ERC20 contract that exists on chain but has no Superfluid methods. Verified empirically: this dispatch produces `Error: missing revert data ... code=CALL_EXCEPTION` -- exactly the KEEP-456 shape. The new test asserts DISPATCH_FAILURE_RE matches it. If anyone weakens or removes DISPATCH_FAILURE_RE in a way that lets KEEP-456 through, this test fails against the live RPC. The regex-shape tests (run in CI) guard the regex against silent strengthening; this test (RPC-gated) guards against silent weakening by exercising a real misroute end-to-end.
…nsaction
CI typecheck (TS2741) caught three transaction objects in the regex shape
tests that were missing the required `data` field on
CallExceptionTransaction. Local tsc with skipLibCheck did not surface
this. No behavioral change -- the regex tests assert on the top-level
error string, not the nested transaction shape.
…-tests test: KEEP-459 harden on-chain dispatch assertions against routing regressions
WebSocketProvider.ready in ethers v6 is a synchronous boolean getter, not a Promise; awaiting it resolves immediately and lets openProvider proceed before the ws upgrade actually completes. Replace with getBlockNumber(), which internally calls _waitUntilReady(). Since _waitUntilReady() never rejects on socket failure (only resolves on open), wrap the call in a Promise.race against an explicit ws-error listener and a 10s connect timeout, matching the pattern PR #988 used in the block dispatcher. Without the race, an unreachable host (DNS NXDOMAIN, ECONNREFUSED) hangs the connect attempt indefinitely instead of walking to the fallback URL. Update the two MockProvider stubs in unit tests to expose getBlockNumber in place of the unused ready Promise, and baseline the no-subscriber heartbeat assertion after connect so the initial connect probe is not counted as a heartbeat ping.
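A minimal sketch of the raced probe, assuming the monitor constructs the
underlying ws socket itself and hands it to WebSocketProvider; openProbe and
CONNECT_TIMEOUT_MS are illustrative names, not the repo's actual identifiers:

```ts
// Hypothetical sketch of the connect probe described above.
import { WebSocketProvider } from "ethers";
import WebSocket from "ws";

const CONNECT_TIMEOUT_MS = 10_000;

async function openProbe(url: string): Promise<WebSocketProvider> {
  const socket = new WebSocket(url);
  const provider = new WebSocketProvider(socket);

  let timer: NodeJS.Timeout | undefined;
  try {
    await Promise.race([
      // getBlockNumber() internally calls _waitUntilReady(), which only
      // resolves on open and never rejects on socket failure...
      provider.getBlockNumber(),
      // ...so surface socket-level failures (DNS NXDOMAIN, ECONNREFUSED)
      new Promise<never>((_, reject) => socket.once("error", reject)),
      // ...and cap the attempt so an unreachable host walks to the fallback URL.
      new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`ws connect timed out after ${CONNECT_TIMEOUT_MS}ms`)),
          CONNECT_TIMEOUT_MS,
        );
      }),
    ]);
    return provider;
  } finally {
    if (timer) clearTimeout(timer);
  }
}
```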
…ready-await fix: KEEP-349 race connect against ws-error + timeout in openProvider
…itor liveness, reconnect, and SQS enqueue paths
Expose a per-chain Prometheus surface at :3000/metrics so operators can alert
on silent subscriptions without waiting for users to report broken workflows.
The block-dispatcher previously emitted only console logs and a /health
boolean; today's prod incident only surfaced because a user noticed their
workflow stopped firing several hours later.
Metrics added (all keeperhub_block_dispatcher_*):
Gauges (per chain, except chains_monitored):
- seconds_since_last_block: scrape-time; primary alert signal
- socket_age_seconds: scrape-time; debug
- is_alive: mirrors ChainMonitor.isAlive()
- is_reconnecting: reconnect-with-backoff in progress
- has_active_subscription: eth_subscribe completed
- current_url_index: 0=primary, 1=fallback (KEEP-557 flips)
- silent_reconnects_current: consecutive block-advance timeouts
- last_processed_block: highest block processed
- workflows_tracked: block-trigger workflows per chain
- chains_monitored: pod-level
Counters:
- blocks_received_total: rate gives delivery rate per chain
- blocks_matched_total: workflow trigger fires (no workflow_id label)
- ws_closes_total: reason = upstream_close | pong_timeout |
  block_advance_timeout | socket_age_recycle | silent_failover |
  ping_send_failure | primary_probe_recovered
- reconnects_total: outcome = success | exhausted
- url_flips_total: direction = to_fallback | to_primary
- sqs_enqueue_total: outcome = success | error
- unhandled_rejections_total: ethers v6 eth_unsubscribe etc.
Histograms:
- reconnect_duration_ms: handleDisconnect to subscription_active
- block_lag_seconds: wall_clock - block.timestamp on receive
Wiring is a single new lib/metrics.ts module with one process-wide prom-client
Registry. ChainMonitor calls thin record*/set* helpers at each lifecycle
point; no business logic depends on prom-client. seconds_since_last_block and
socket_age_seconds use prom-client collect() callbacks so they always reflect
the latest value at scrape time without per-block emission.
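A minimal sketch of one scrape-time gauge, following the helper naming above
(lib/metrics.ts may differ in detail):

```ts
// Hypothetical sketch of the scrape-time gauge wiring described above.
import { Gauge, Registry } from "prom-client";

export const registry = new Registry();

// Last observed block arrival per chain, updated by a thin record* helper.
const lastBlockSeenAt = new Map<string, number>();

export function recordBlockReceived(chain: string): void {
  lastBlockSeenAt.set(chain, Date.now());
}

const secondsSinceLastBlock: Gauge<"chain"> = new Gauge({
  name: "keeperhub_block_dispatcher_seconds_since_last_block",
  help: "Seconds since the last block was received, per chain",
  labelNames: ["chain"],
  registers: [registry],
  // collect() runs at scrape time, so the value is always current without
  // per-block gauge emission.
  collect() {
    for (const [chain, seenAt] of lastBlockSeenAt) {
      secondsSinceLastBlock.set({ chain }, (Date.now() - seenAt) / 1000);
    }
  },
});
```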
Deploy: prometheus.io/scrape annotations added to staging and prod
block-dispatcher-values.yaml. The /metrics endpoint is wired to the same
express server that already serves /health, no new container port.
Tests: 14 new cases across metrics.test.ts (counters, gauges, scrape-time
callbacks, histograms) and chain-monitor.test.ts (integration: emit block ->
counters and gauges advance; ws close -> ws_close counter labeled correctly).
80/80 vitest pass.
Docs: lib/metrics/METRICS_REFERENCE.md gets a new section 6 (BLOCK DISPATCHER)
documenting every metric with description, labels, and alert thresholds.
Out of scope (separate PR in techops-infrastructure repo): Grafana dashboard
JSON, Terraform alert rules, and ALERTS_REFERENCE.md entries that consume
these metrics.
prom-client gauges retain the last value of every label combination they
have ever seen. Without an explicit .remove({chain}) call when a
ChainMonitor stops, the seconds_since_last_block (and every other
per-chain gauge) would emit the chain's last value forever — including
chains whose workflows were deleted or chains the reconciler tore down
because they were zombies. That would cause the new 'Block Dispatcher
Chain Silent' alert to fire indefinitely on chains we no longer monitor.
forgetChain now removes every per-chain gauge labelset so the chain
disappears entirely from /metrics output. ChainMonitor.stop() no longer
needs to flip individual gauges to 0 first — forgetChain wipes the
labels regardless. Counters and histograms keep their cumulative
history; rate()/increase() handle counter resets correctly anyway.
Test updated to cover all nine per-chain gauges and to assert that an
unaffected chain is not impacted by the cleanup. 80/80 pass.
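A minimal sketch of the cleanup, with an illustrative gauge list standing in
for the nine per-chain gauges named above:

```ts
// Hypothetical sketch of forgetChain; lib/metrics.ts may differ in detail.
import { Gauge } from "prom-client";

// Every per-chain gauge registered by the metrics module (illustrative).
const perChainGauges: Gauge<"chain">[] = [
  /* seconds_since_last_block, socket_age_seconds, is_alive, ... */
];

// Called when the reconciler tears a chain down or its workflows are
// deleted: removing the labelset means the chain disappears from /metrics,
// so the 'Block Dispatcher Chain Silent' alert cannot fire on a chain we
// no longer monitor. Counters and histograms keep their cumulative history;
// rate()/increase() handle resets correctly anyway.
export function forgetChain(chain: string): void {
  for (const gauge of perChainGauges) {
    gauge.remove(chain);
  }
}
```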
…-silent-wss-failover fix(block-dispatcher): KEEP-557 auto-failover to fallback WSS on silent-subscription failure
…-metrics feat(block-dispatcher): KEEP-557 Prometheus metrics for chain monitor liveness, reconnects, and SQS enqueue