[umbrella][P1] telegram channel reliability (#245 follow-up)
Goal: kill three reproducible failure modes that Vincent hit today (telegram pairing & DM reliability across multi-node restarts).
Origin: Vincent's /goal, "telegram optimization, I give up" (original: "telegram 优化,我服了"). Dispatched as the deep cut of #245 connectivity hardening, plus a regression caught today on TMCode负责人 where allowFrom was wiped by an upgrade/restart cycle.
Key insight that resets the design space (vs #245 task E's earlier verdict):
connectTelegram is a self-contained polling loop inside agent-node (agent-node/src/cli.ts:2289, started unconditionally at module top-level by line 2530:
for (const channel of TELEGRAM_CHANNELS) connectTelegram(channel)).
It is not a Claude-code-cli internals concern — Claude never touches the poller. Every failure mode below is fixable in our own codebase (agent-node + agent-network), zero upstream PR dependency. The earlier "Claude-internal channel lifecycle is hard" framing was about a different surface (the resume-time MCP channel reattach), not the bot poller.
P1 — allowFrom persistence regression (today's TMCode负责人 incident)
Smoking gun — agent-network/bin/cli.ts:1465-1481:
function writeTelegramChannelConfig(nodeId, botToken, allowId) {
...
writeFileSync(join(channelDir, "access.json"), JSON.stringify({
dmPolicy: "allowlist",
allowFrom: [allowId], // ← obliterates any pre-existing allowFrom
groups: {}, // ← obliterates any group assignments
pending: {}, // ← obliterates any pending pairings
}, null, 2) + "\n");
}
Called from two sites that are part of routine UX flows:
- cli.ts:1868, the anet node create wizard (initial setup)
- cli.ts:4486, anet channel add telegram <node> ... (CLI re-add, frequently re-run when troubleshooting)
Either path destroys user data on re-run. An allowFrom list that has been built up gradually over weeks is reset to a single id (or an empty list) the next time the operator re-runs the wizard or channel add.
Hard constraint to land first (P1, blocker for the rest):
writeTelegramChannelConfig MUST read the existing access.json if it exists and merge:
- keep prior dmPolicy unless explicitly overridden
- allowFrom: union with the new id, never replace
- keep prior groups and pending
- --bot-token only writes .env when explicitly passed
- Regression test in CI: seed a fake access.json with 3 ids in allowFrom + 2 entries in pending + 1 entry in groups, run anet channel add telegram <node> --bot-token X --allow Y, assert the merged file still contains the original 3 + new Y in allowFrom, with the original pending and groups intact.
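The merge rule above can be sketched as a pure function that writeTelegramChannelConfig would call before writing. This is a minimal sketch, not the actual implementation: the name mergeAccessConfig, the string type of the ids, and the "allowlist" default are assumptions; only the four field names come from the existing access.json.

```typescript
// Shape of access.json as written today; the field types are assumptions.
interface AccessConfig {
  dmPolicy: string;
  allowFrom: string[];
  groups: Record<string, unknown>;
  pending: Record<string, unknown>;
}

// Hypothetical helper. `existing` is the parsed prior access.json,
// or null when the file does not exist yet.
function mergeAccessConfig(
  existing: Partial<AccessConfig> | null,
  allowId: string | null,
  dmPolicyOverride?: string,
): AccessConfig {
  const prior = existing ?? {};
  // Union: copy the prior list and append the new id only if it is missing.
  const allowFrom = (prior.allowFrom ?? []).slice();
  if (allowId !== null && allowFrom.indexOf(allowId) === -1) {
    allowFrom.push(allowId);
  }
  return {
    dmPolicy: dmPolicyOverride ?? prior.dmPolicy ?? "allowlist",
    allowFrom,
    groups: prior.groups ?? {},   // preserved, never reset
    pending: prior.pending ?? {}, // preserved, never reset
  };
}
```

Keeping the merge free of file I/O means the CI regression test can cover the logic directly, with a thin wrapper doing the read and write.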
Mode 1 — poller does not start after restart / channel not re-attached
Three independent root causes; need targeted fixes (not one mega-rewrite).
1a. anet channel add after anet node start — poller never picks up new channel
agent-node/src/cli.ts:446 evaluates TELEGRAM_CHANNELS at module top-level. Adding a channel to config.json while agent-node is running has zero effect; anet node resume doesn't help (it wakes Claude, not agent-node).
- Fix A2 (low, deferred): agent-node watches config.json mtime → re-init TELEGRAM_CHANNELS and call connectTelegram(newChannel) for any new entry. Edge cases (token rotation, removal) make this fiddly; deferred behind A1.
1b. Bot token invalid → process.exit(1) (cli.ts:2303) kills the whole agent-node
- agent-node dies → tmux respawns it → same bad token → dies again → restart-storm.
- Fix A3 (low): replace process.exit(1) with commhub_report_status({ status: "warn", note: "telegram bot token invalid (<channel-dir>)" }) + return from connectTelegram. agent-node stays alive, other channels keep working, the failing channel surfaces in anet doctor.
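Fix A3 amounts to a guard at the top of connectTelegram that reports and bails out instead of exiting. A minimal sketch under stated assumptions: shouldStartPolling is a hypothetical name, tokenValid stands in for whatever check currently leads to process.exit(1), and report stands in for commhub_report_status, whose real signature is not shown here.

```typescript
// Payload shape copied from the call proposed in Fix A3.
interface StatusReport {
  status: string;
  note: string;
}

// Hypothetical guard. Returns true when polling should proceed; on an invalid
// token it reports a warning and returns false so the caller can `return`
// from connectTelegram without taking down the process.
function shouldStartPolling(
  tokenValid: boolean,
  channelDir: string,
  report: (r: StatusReport) => void,
): boolean {
  if (tokenValid) return true;
  report({
    status: "warn",
    note: "telegram bot token invalid (" + channelDir + ")",
  });
  return false;
}
```

Because nothing exits, tmux has nothing to respawn, which is what breaks the restart storm.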
1c. Poller silently dead inside the running process
- The outer while (true) catches getUpdates network errors (cli.ts:2357-2359), but an exception thrown inside the queue drain (e.g. a telegramSend failure causing an unhandled promise rejection at line 2354) can leave processing = false with no caller. The loop continues, but the queue never drains.
- Fix A1 (low, primary): agent-node maintains <channel-dir>/health.json with {lastPoll, offset, msgsSinceBoot, lastError}. A watchdog timer in agent-node checks every 30s: if now - lastPoll > 90s → log + re-invoke connectTelegram(channel) + bump a restart counter. Counter exposed in anet channel status.
Sub-task scope estimate: ~80 LOC across A1 + A3, mostly in agent-node. A2 deferred — watchdog covers the symptom, hot-reload is a nice-to-have.
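The A1 watchdog check can be sketched as a staleness test plus a tick function that the 30s timer would call. The names isPollerStale and watchdogTick are hypothetical, and treating an unparsable lastPoll as stale is an assumption, not a decision taken in this issue.

```typescript
// Subset of health.json that the watchdog needs.
interface PollerHealth {
  lastPoll: string; // ISO timestamp of the most recent poll
}

const STALE_AFTER_MS = 90 * 1000; // threshold from Fix A1

// True when the last poll is older than the threshold.
// Assumption: an unparsable timestamp counts as stale.
function isPollerStale(
  health: PollerHealth,
  nowMs: number,
  staleAfterMs: number = STALE_AFTER_MS,
): boolean {
  const last = Date.parse(health.lastPoll);
  if (isNaN(last)) return true;
  return nowMs - last > staleAfterMs;
}

// One watchdog pass over all channels. `restart` stands in for
// re-invoking connectTelegram(channel) and bumping the restart counter.
function watchdogTick(
  healthByChannel: Record<string, PollerHealth>,
  nowMs: number,
  restart: (channel: string) => void,
): string[] {
  const restarted: string[] = [];
  for (const channel of Object.keys(healthByChannel)) {
    if (isPollerStale(healthByChannel[channel], nowMs)) {
      restart(channel);
      restarted.push(channel);
    }
  }
  return restarted;
}
```

In agent-node this would be driven by something like setInterval(() => watchdogTick(readAllHealth(), Date.now(), restartChannel), 30000), where readAllHealth and restartChannel are placeholders for code that does not exist yet.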
Mode 2 — pairing UX (pairing + allowlist)
access.json was designed with dmPolicy / pending / groups (agent-network/bin/cli.ts:1474-1479), but agent-node/src/cli.ts:2204-2208 only ever reads allowFrom. dmPolicy and pending are written once and never read — a half-implemented pairing UX. Result:
- allowFrom.length === 0 → wide open (everyone accepted).
- allowFrom.length > 0 → strict allowlist; non-allowed senders are silently dropped. The bot gives no signal and the operator never sees the attempt, which is exactly Vincent's "can't reach it" ("够不着") complaint.
B1 — implement pending pairing (medium, primary)
Reject path in telegramAllowed:
- Write pending[<userId>] = { firstSeen, lastMessage, label, attempts } to access.json.
- Reply to the user: "等待管理员批准" ("waiting for admin approval") + a copy of their own user id so they can show the operator.
- Log to commhub with status waiting_pairing.
- Throttle: at most one "等待批准" ("waiting for approval") message per <userId> per 24 h.
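The reject path above can be sketched as a pure update of the pending map that also decides whether a reply is due. Two assumptions are baked in and are not part of the spec: lastMessage is read as the timestamp of the latest attempt, and an extra lastNotified field (hypothetical, additive to the schema) carries the 24 h throttle state. The function name is also hypothetical.

```typescript
interface PendingEntry {
  firstSeen: string;    // ISO timestamp of the first rejected message
  lastMessage: string;  // assumption: ISO timestamp of the latest attempt
  label: string;
  attempts: number;
  lastNotified: string; // assumption: extra field used only for throttling
}

const NOTIFY_INTERVAL_MS = 24 * 60 * 60 * 1000;

// Returns the updated pending map and whether the bot should send
// the "waiting for approval" reply for this attempt.
function recordRejectedSender(
  pending: Record<string, PendingEntry>,
  userId: string,
  label: string,
  nowMs: number,
): { pending: Record<string, PendingEntry>; shouldReply: boolean } {
  const nowIso = new Date(nowMs).toISOString();
  const prev: PendingEntry | undefined = pending[userId];
  const lastNotifiedMs = prev !== undefined ? Date.parse(prev.lastNotified) : NaN;
  const shouldReply =
    isNaN(lastNotifiedMs) || nowMs - lastNotifiedMs >= NOTIFY_INTERVAL_MS;
  const entry: PendingEntry = {
    firstSeen: prev !== undefined ? prev.firstSeen : nowIso,
    lastMessage: nowIso,
    label,
    attempts: (prev !== undefined ? prev.attempts : 0) + 1,
    lastNotified: prev !== undefined && !shouldReply ? prev.lastNotified : nowIso,
  };
  const next: Record<string, PendingEntry> = { ...pending };
  next[userId] = entry;
  return { pending: next, shouldReply };
}
```

The caller would persist the returned map into access.json, send the reply only when shouldReply is true, and log waiting_pairing either way.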
B3 — anet channel add wizard prompts the operator to pair (medium)
At creation:
- operator gives bot token, no --allow
- wizard prints: "打开 Telegram → 给 @<bot-username> 发任意消息(3 分钟内)→ 我把你的 id 写入 allowFrom" ("Open Telegram → send any message to @<bot-username> within 3 minutes → I will write your id into allowFrom")
- wizard polls getUpdates until first message arrives → writes that from.id to allowFrom
- on timeout, falls back to the manual --allow flow
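The core step of the wizard's polling loop can be sketched as a function that scans one getUpdates batch for the first sender and computes the next offset. The update shape follows the Telegram Bot API (update_id, message.from.id); the function name and the surrounding loop are assumptions.

```typescript
// Minimal subset of a Telegram Bot API Update object.
interface TelegramUpdate {
  update_id: number;
  message?: { from?: { id: number; username?: string } };
}

// Hypothetical helper: process one getUpdates batch. Returns the id of the
// first sender seen (or null) and the offset to pass to the next call, which
// the Bot API expects to be one greater than the highest update_id received.
function scanForFirstSender(
  updates: TelegramUpdate[],
  offset: number,
): { senderId: number | null; nextOffset: number } {
  let nextOffset = offset;
  let senderId: number | null = null;
  for (const u of updates) {
    if (u.update_id >= nextOffset) nextOffset = u.update_id + 1;
    if (senderId === null && u.message && u.message.from) {
      senderId = u.message.from.id;
    }
  }
  return { senderId, nextOffset };
}
```

The wizard would call getUpdates repeatedly with nextOffset until senderId is non-null or the 3 minute window expires, then either write the id to allowFrom or fall back to --allow.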
B4 — anet doctor surfaces pending pairings prominently (low)
doctor already enumerates channels per #245 task E. Add: for each channel, if Object.keys(access.pending).length > 0, print a yellow banner:
⚠️ 节点 TMCode负责人 有 2 个待批配对:
alice (id=12345, last seen 14:32) — 运行 `anet channel approve TMCode负责人 alice`
bob (id=67890, last seen 14:35) — 运行 `anet channel approve TMCode负责人 bob`
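The banner above is Chinese UI copy: its header reads "node TMCode负责人 has 2 pending pairings" and 运行 means "run". It comes with a new sub-command anet channel approve <node> <user> that moves the id from pending to allowFrom.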
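The approve step can be sketched as a pure function over the parsed access.json. Accepting either the numeric id or the label as <user> is an assumption drawn from the banner example (anet channel approve TMCode负责人 alice); the function name and the minimal PendingEntry shape are hypothetical.

```typescript
interface PendingEntry {
  label: string;
}

interface AccessState {
  allowFrom: string[];
  pending: Record<string, PendingEntry>;
}

// Hypothetical helper: move one pending user into allowFrom.
// `user` may be the pending key (user id) or its label.
// Returns a new object, or null when nothing matches.
function approvePending<T extends AccessState>(access: T, user: string): T | null {
  let matchedId: string | null = null;
  for (const id of Object.keys(access.pending)) {
    if (id === user || access.pending[id].label === user) {
      matchedId = id;
      break;
    }
  }
  if (matchedId === null) return null;
  const pending: Record<string, PendingEntry> = {};
  for (const id of Object.keys(access.pending)) {
    if (id !== matchedId) pending[id] = access.pending[id];
  }
  const allowFrom =
    access.allowFrom.indexOf(matchedId) === -1
      ? access.allowFrom.concat([matchedId])
      : access.allowFrom.slice();
  // Spread keeps dmPolicy, groups and any other fields untouched.
  return { ...access, allowFrom, pending };
}
```

Returning null on no match lets the CLI print a clear "no such pending user" error instead of silently writing an unchanged file.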
B2 — commhub_telegram_approve MCP tool (deferred)
Same intent as B4's CLI sub-command, but exposed as MCP so the operator's agent can approve without leaving its turn. Defer unless B4 proves too clicky.
Mode 3 — alive-or-dead visibility (poller liveness)
anet doctor + anet channel status (added in #245 task E) currently show configuration (access.json contents, dmPolicy, allowFrom). They do not show whether the poller is currently polling.
C1 — write <channel-dir>/health.json from agent-node (low)
Shared file with Mode 1 A1's watchdog. Fields: lastPoll (ISO timestamp), offset, msgsSinceBoot, lastError (object or null), restartCount (watchdog interventions).
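The health.json fields can be sketched as a record type plus an update applied after each getUpdates call. This is one possible reading, with assumptions labeled: lastPoll is bumped on every completed poll, successful or not, so it reflects loop liveness; lastError is sticky and carries its own timestamp; the shape of the lastError object and both function names are hypothetical.

```typescript
interface HealthRecord {
  lastPoll: string;        // ISO timestamp
  offset: number;
  msgsSinceBoot: number;
  lastError: { message: string; at: string } | null; // object shape is assumed
  restartCount: number;    // watchdog interventions
}

// Record written once when the poller starts.
function initialHealth(nowMs: number): HealthRecord {
  return {
    lastPoll: new Date(nowMs).toISOString(),
    offset: 0,
    msgsSinceBoot: 0,
    lastError: null,
    restartCount: 0,
  };
}

// Update applied after each getUpdates call. Pass errorMessage = null on success.
// Assumption: a success does not clear an earlier lastError.
function recordPoll(
  prev: HealthRecord,
  nowMs: number,
  offset: number,
  newMsgs: number,
  errorMessage: string | null,
): HealthRecord {
  const nowIso = new Date(nowMs).toISOString();
  return {
    lastPoll: nowIso,
    offset,
    msgsSinceBoot: prev.msgsSinceBoot + newMsgs,
    lastError:
      errorMessage === null ? prev.lastError : { message: errorMessage, at: nowIso },
    restartCount: prev.restartCount,
  };
}
```

agent-node would serialize the result with JSON.stringify and write it to <channel-dir>/health.json; writing to a temporary file and renaming it would keep readers such as anet doctor from seeing a half-written file.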
C2 — anet channel status shows freshness (low)
Decorate the existing anet channel status output:
- Last poll: 12s ago (green ✅ if < 60 s)
- Last poll: 4 min ago (yellow ⚠️ if 60 s – 10 min)
- Last poll: 4 h ago (red 🔴 if > 10 min), followed by the hint "→ poller likely dead, run anet node stop && anet node start"
- Watchdog restarts since boot: <n>
- Last error: <message> if non-null
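The freshness thresholds above can be sketched as a small classifier plus an age formatter. The function names are hypothetical, and putting an age of exactly 10 minutes in the yellow band is an assumption, since the spec only says "> 10 min" for red.

```typescript
type Freshness = "green" | "yellow" | "red";

const GREEN_BELOW_MS = 60 * 1000;     // < 60 s
const RED_ABOVE_MS = 10 * 60 * 1000;  // > 10 min

// Map the age of lastPoll to the colour band used by anet channel status.
function classifyFreshness(ageMs: number): Freshness {
  if (ageMs < GREEN_BELOW_MS) return "green";
  if (ageMs <= RED_ABOVE_MS) return "yellow";
  return "red";
}

// Render an age in the same style as the examples ("12s ago", "4 min ago", "4 h ago").
function formatAge(ageMs: number): string {
  const seconds = Math.floor(ageMs / 1000);
  if (seconds < 60) return seconds + "s ago";
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return minutes + " min ago";
  return Math.floor(minutes / 60) + " h ago";
}
```

Both anet channel status (C2) and anet doctor (C3) could share these two helpers so the bands never drift apart.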
C3 — anet doctor integrates health.json (low)
For each enumerated channel, read health.json. If stale → red entry + fix suggestion. If lastError non-null → surface it.
Sub-task scope estimate: ~50 LOC across C1 + C2 + C3, shares health.json with Mode 1 A1.
9 sub-tasks (proposed)
- P1-merge: writeTelegramChannelConfig reads-merge-writes (allowFrom union + preserve pending/groups/dmPolicy) + CI regression test
- P1-B1: implement pending pairing in telegramAllowed
- P1-B4: anet doctor pending banner + anet channel approve <node> <user>
- P2-A1: health.json watchdog (agent-node)
- P2-C1+C2+C3: health.json read paths (anet doctor / anet channel status)
- P3-A3: friendly bot-token-invalid path (no process.exit(1))
- P3-B3: anet channel add interactive pair wizard
- P4-A2: config.json hot-reload (optional, may be cut)
- integration: end-to-end test that spins up a Docker bot fixture and exercises the pair → allowlist → reject → pending → approve → allowed loop
Priority ordering (per Vincent's pain map)
- P1 (P1-merge + P1-B1 + P1-B4): resolves today's allowFrom wipe + the "silently can't reach it" ("够不着") complaint
- P2 (P2-A1 + P2-C1+C2+C3): resolves "can't tell whether it is alive or dead" ("看不出死活") + auto-recovery for in-process poller deaths
- P3 (P3-A3 + P3-B3): UX polish for invalid tokens + onboarding
- P4 (P4-A2): drop if not free
Constraints / non-goals
- All work lives in agent-node + agent-network. No Claude / upstream dependency.
- No breaking changes to access.json schema; existing nodes upgrade in place. The merge fix in P1-merge is itself a non-breaking read-then-write.
- No new MCP server. (B2 deferred behind B4's CLI.)
- Don't touch ~/.commhub or any prod hub state during dev/test. Use Docker fixtures.
Owner + ETA
- Owner: 通信工程马 (lead), may delegate one P2/P3 sub-task to 通信SDK马 depending on bandwidth.
- ETA: P1 (3 sub-tasks) — 1 evening, ~5-8 h. Full P2 — +1 evening. P3 — opportunistic. P4 — optional. No overnight rush.
Author-Agent: 通信工程马