[umbrella][P1] telegram channel reliability (#245 follow-up) #246

Description

@s2agi

Goal: kill three reproducible failure modes that Vincent hit today (telegram pairing & DM reliability across multi-node restarts).

Origin: Vincent /goal: "telegram optimization, I give up". Dispatched as the deep cut of #245 connectivity hardening, plus a regression caught today on TMCode负责人 where allowFrom got wiped by an upgrade/restart cycle.

Key insight that resets the design space (vs #245 task E's earlier verdict):

connectTelegram is a self-contained polling loop inside agent-node (agent-node/src/cli.ts:2289), started unconditionally at module top level at line 2530:
for (const channel of TELEGRAM_CHANNELS) connectTelegram(channel)
It is not a Claude-code-cli internals concern: Claude never touches the poller. Every failure mode below is fixable in our own codebase (agent-node + agent-network), with zero upstream PR dependency. The earlier "Claude-internal channel lifecycle is hard" framing was about a different surface (the resume-time MCP channel reattach), not the bot poller.


P1 — allowFrom persistence regression (today's TMCode负责人 incident)

Smoking gun (agent-network/bin/cli.ts:1465-1481):

function writeTelegramChannelConfig(nodeId, botToken, allowId) {
  ...
  writeFileSync(join(channelDir, "access.json"), JSON.stringify({
    dmPolicy: "allowlist",
    allowFrom: [allowId],         // ← obliterates any pre-existing allowFrom
    groups: {},                   // ← obliterates any group assignments
    pending: {},                  // ← obliterates any pending pairings
  }, null, 2) + "\n");
}

Called from two sites that are part of routine UX flows:

  • cli.ts:1868, the anet node create wizard (initial setup)
  • cli.ts:4486, anet channel add telegram <node> ... (CLI re-add, frequently re-run when troubleshooting)

Either path destroys user data on re-run. An allowFrom list that has been built up gradually over weeks is reset to a single id (or an empty list) the next time the operator re-runs the wizard or channel add.

Hard constraint to land first (P1, blocker for the rest):

  • writeTelegramChannelConfig MUST read the existing access.json if it exists and merge:
    • keep prior dmPolicy unless explicitly overridden
    • allowFrom: union with the new id, never replace
    • keep prior groups and pending
  • --bot-token only writes .env when explicitly passed
  • Regression test in CI: seed a fake access.json with 3 ids in allowFrom + 2 entries in pending + 1 entry in groups, run anet channel add telegram <node> --bot-token X --allow Y, assert the merged file still contains the original 3 + new Y in allowFrom, original pending and groups intact.
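The merge rules above can be sketched as a pure function, which also makes the CI regression test easy to write without touching the filesystem. This is a minimal sketch, not the current code: the AccessConfig shape is inferred from the fields named in this issue, ids are assumed to be strings, and mergeAccessConfig and its dmPolicyOverride parameter are hypothetical names. The real writeTelegramChannelConfig would parse the existing access.json (if any), pass it in as prior, and write the result back.

```typescript
// Assumed shape of access.json, inferred from the fields named in this issue.
interface AccessConfig {
  dmPolicy: string;
  allowFrom: string[];
  groups: Record<string, unknown>;
  pending: Record<string, unknown>;
}

// Hypothetical helper: `prior` is the parsed existing access.json,
// or null when the file does not exist yet.
function mergeAccessConfig(
  prior: Partial<AccessConfig> | null,
  allowId: string | undefined,
  dmPolicyOverride?: string,
): AccessConfig {
  // Union with the new id, never replace.
  const allowFrom = [...(prior?.allowFrom ?? [])];
  if (allowId !== undefined && !allowFrom.includes(allowId)) {
    allowFrom.push(allowId);
  }
  return {
    // Keep the prior dmPolicy unless explicitly overridden.
    dmPolicy: dmPolicyOverride ?? prior?.dmPolicy ?? "allowlist",
    allowFrom,
    // Keep prior groups and pending untouched.
    groups: { ...(prior?.groups ?? {}) },
    pending: { ...(prior?.pending ?? {}) },
  };
}
```

Because the function is idempotent, re-running anet channel add with the same --allow value would leave the file unchanged rather than duplicating the id.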

Mode 1: poller does not come up after restart / channel is not re-attached

Three independent root causes; need targeted fixes (not one mega-rewrite).

1a. anet channel add after anet node start — poller never picks up new channel

  • agent-node/src/cli.ts:446 evaluates TELEGRAM_CHANNELS at module top-level. Adding a channel to config.json while agent-node is running has zero effect; anet node resume doesn't help (it wakes Claude, not agent-node).
  • Fix A2 (low, deferred): agent-node watches config.json mtime → re-init TELEGRAM_CHANNELS and call connectTelegram(newChannel) for any new entry. Edge cases (token rotation, removal) make this fiddly; deferred behind A1.

1b. Bot token invalid → process.exit(1) (cli.ts:2303) kills the whole agent-node

  • agent-node dies → tmux respawns it → same bad token → dies again → restart-storm.
  • Fix A3 (low): replace process.exit(1) with commhub_report_status({ status: "warn", note: "telegram bot token invalid (<channel-dir>)" }) + return from connectTelegram. agent-node stays alive, other channels keep working, the failing channel surfaces in anet doctor.

1c. Poller silently dead inside the running process

  • The outer while (true) catches getUpdates network errors (cli.ts:2357-2359), but a thrown exception from inside the queue drain (e.g. telegramSend failure causing unhandled promise rejection at line 2354) can leave processing = false with no caller — the loop continues but the queue never drains.
  • Fix A1 (low, primary): agent-node maintains <channel-dir>/health.json with {lastPoll, offset, msgsSinceBoot, lastError}. A watchdog timer in agent-node checks every 30s: if now - lastPoll > 90s → log + re-invoke connectTelegram(channel) + bump a restart counter. Counter exposed in anet channel status.
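The staleness check behind Fix A1 could look roughly like the sketch below. It assumes health.json carries the four fields listed above with lastPoll as an ISO timestamp; isPollerStale and watchdogTick are hypothetical names, and the restart callback stands in for re-invoking connectTelegram(channel). In agent-node the tick would be driven by a 30 s timer that first reads health.json.

```typescript
// Assumed shape of <channel-dir>/health.json.
interface ChannelHealth {
  lastPoll: string; // ISO timestamp of the last completed poll
  offset: number;
  msgsSinceBoot: number;
  lastError: unknown;
}

const STALE_AFTER_MS = 90 * 1000; // threshold from Fix A1

// True when the poller has not reported a poll within the threshold.
function isPollerStale(
  health: ChannelHealth,
  nowMs: number,
  staleAfterMs: number = STALE_AFTER_MS,
): boolean {
  const last = Date.parse(health.lastPoll);
  // An unparseable timestamp is treated as stale, so the watchdog
  // errs on the side of restarting.
  if (Number.isNaN(last)) return true;
  return nowMs - last > staleAfterMs;
}

// One watchdog tick: restarts the poller if it is stale and returns
// the (possibly bumped) restart counter.
function watchdogTick(
  health: ChannelHealth,
  nowMs: number,
  restartCount: number,
  restart: () => void,
): number {
  if (!isPollerStale(health, nowMs)) return restartCount;
  restart();
  return restartCount + 1;
}
```

Keeping the decision separate from the timer and the file read means the 90 s threshold can be unit-tested with fixed timestamps.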

Sub-task scope estimate: ~80 LOC across A1 + A3, mostly in agent-node. A2 deferred — watchdog covers the symptom, hot-reload is a nice-to-have.


Mode 2: pairing UX (pairing + allowlist)

access.json was designed with dmPolicy / pending / groups (agent-network/bin/cli.ts:1474-1479), but agent-node/src/cli.ts:2204-2208 only ever reads allowFrom. dmPolicy and pending are written once and never read — a half-implemented pairing UX. Result:

  • allowFrom.length === 0 → wide open (everyone accepted).
  • allowFrom.length > 0 → strict allowlist; non-allowed senders are silently dropped. The bot gives no signal and the operator never sees the attempt, which is exactly Vincent's "can't get through" complaint.

B1 — implement pending pairing (medium, primary)

Reject path in telegramAllowed:

  • Write pending[<userId>] = { firstSeen, lastMessage, label, attempts } to access.json.
  • Reply to the user: "Waiting for administrator approval" + a copy of their own user id so they can show the operator.
  • Log to commhub with status waiting_pairing.
  • Throttle: at most one waiting-for-approval message per <userId> per 24 h.
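The bookkeeping for the reject path might be sketched as follows. Two things here are assumptions rather than part of the proposal: lastMessage is treated as the ISO timestamp of the latest attempt, and a hypothetical lastNotified field is added to each pending entry so the 24 h throttle survives restarts. recordPendingAttempt is likewise a hypothetical name; the caller would persist the returned map to access.json and send the reply only when notify is true.

```typescript
const NOTIFY_INTERVAL_MS = 24 * 60 * 60 * 1000; // one reply per user per 24 h

interface PendingEntry {
  firstSeen: string;    // ISO timestamp of the first rejected message
  lastMessage: string;  // ISO timestamp of the latest rejected message (assumed meaning)
  label: string;        // e.g. the sender's username
  attempts: number;
  lastNotified: string; // hypothetical field used only for throttling
}

// Records one rejected message and decides whether to send the
// waiting-for-approval reply. Returns a new map; the input is not mutated.
function recordPendingAttempt(
  pending: Record<string, PendingEntry>,
  userId: string,
  label: string,
  nowMs: number,
): { pending: Record<string, PendingEntry>; notify: boolean } {
  const nowIso = new Date(nowMs).toISOString();
  const prev: PendingEntry | undefined = pending[userId];
  const lastNotifiedMs = prev ? Date.parse(prev.lastNotified) : NaN;
  const notify =
    Number.isNaN(lastNotifiedMs) || nowMs - lastNotifiedMs >= NOTIFY_INTERVAL_MS;
  const entry: PendingEntry = {
    firstSeen: prev ? prev.firstSeen : nowIso,
    lastMessage: nowIso,
    label,
    attempts: (prev ? prev.attempts : 0) + 1,
    lastNotified: notify || !prev ? nowIso : prev.lastNotified,
  };
  return { pending: { ...pending, [userId]: entry }, notify };
}
```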

B3 — anet channel add wizard prompts the operator to pair (medium)

At creation:

  1. operator gives bot token, no --allow
  2. wizard prints: "Open Telegram → send any message to @<bot-username> (within 3 minutes) → I will write your id into allowFrom"
  3. wizard polls getUpdates until first message arrives → writes that from.id to allowFrom
  4. on timeout, fall back to the manual --allow flow

B4 — anet doctor surfaces pending pairings prominently (low)

doctor already enumerates channels per #245 task E. Add: for each channel, if Object.keys(access.pending).length > 0, print a yellow banner:

⚠️ Node TMCode负责人 has 2 pending pairings:
   alice (id=12345, last seen 14:32): run `anet channel approve TMCode负责人 alice`
   bob   (id=67890, last seen 14:35): run `anet channel approve TMCode负责人 bob`

plus a new sub-command anet channel approve <node> <user> that moves the id from pending to allowFrom.
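The move from pending to allowFrom could be sketched like this. The example banner passes a label (alice) as <user>, so the sketch assumes the sub-command should accept either the pending entry's label or its numeric id as a string; approvePending and the trimmed-down types are hypothetical.

```typescript
interface PendingUser {
  label: string;
}

interface AccessLists {
  allowFrom: string[];
  pending: Record<string, PendingUser>; // keyed by user id
}

// Moves one user from pending to allowFrom. `user` may be the id or the label.
// Returns a new object; throws if nothing in pending matches.
function approvePending(access: AccessLists, user: string): AccessLists {
  const match = Object.entries(access.pending).find(
    ([id, entry]) => id === user || entry.label === user,
  );
  if (!match) {
    throw new Error(`no pending pairing for "${user}"`);
  }
  const approvedId = match[0];

  // Copy every pending entry except the approved one.
  const remaining: Record<string, PendingUser> = {};
  for (const [id, entry] of Object.entries(access.pending)) {
    if (id !== approvedId) remaining[id] = entry;
  }

  const allowFrom = access.allowFrom.includes(approvedId)
    ? access.allowFrom
    : [...access.allowFrom, approvedId];
  return { allowFrom, pending: remaining };
}
```

If two pending users share a label, this sketch approves whichever comes first; a real implementation would probably want to refuse and ask for the id instead.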

B2 — commhub_telegram_approve MCP tool (deferred)

Same intent as B4's CLI sub-command, but exposed as MCP so the operator's agent can approve without leaving its turn. Defer unless B4 proves too clicky.


Mode 3: liveness visibility (poller liveness)

anet doctor + anet channel status (added in #245 task E) currently show configuration (access.json contents, dmPolicy, allowFrom). They do not show whether the poller is currently polling.

C1 — write <channel-dir>/health.json from agent-node (low)

Shared file with Mode 1 A1's watchdog. Fields: lastPoll (ISO timestamp), offset, msgsSinceBoot, lastError (object or null), restartCount (watchdog interventions).

C2 — anet channel status shows freshness (low)

Decorate the existing anet channel status output:

  • Last poll: 12s ago (green ✅ if < 60 s)
  • Last poll: 3 min ago (yellow ⚠️ if 60 s – 10 min)
  • Last poll: 4 h ago (red 🔴 if > 10 min) → poller likely dead, run anet node stop && anet node start
    • Watchdog restarts since boot: <n>
    • Last error: <message> if non-null
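The freshness thresholds above reduce to a small classifier plus an age formatter. This is only a sketch of the decision logic; classifyFreshness and formatAge are hypothetical names, and the rendering of colours and emoji is left to the existing anet channel status output.

```typescript
type Freshness = "green" | "yellow" | "red";

const SECOND_MS = 1000;
const MINUTE_MS = 60 * SECOND_MS;
const HOUR_MS = 60 * MINUTE_MS;

// green: < 60 s, yellow: 60 s to 10 min inclusive, red: > 10 min
function classifyFreshness(ageMs: number): Freshness {
  if (ageMs < MINUTE_MS) return "green";
  if (ageMs <= 10 * MINUTE_MS) return "yellow";
  return "red";
}

// Formats an age the way the examples above do: "12s ago", "3 min ago", "4 h ago".
function formatAge(ageMs: number): string {
  if (ageMs < MINUTE_MS) return `${Math.floor(ageMs / SECOND_MS)}s ago`;
  if (ageMs < HOUR_MS) return `${Math.floor(ageMs / MINUTE_MS)} min ago`;
  return `${Math.floor(ageMs / HOUR_MS)} h ago`;
}
```

The age itself would come from Date.now() minus the parsed lastPoll in health.json, so the same two helpers can serve both anet channel status and anet doctor.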

C3 — anet doctor integrates health.json (low)

For each enumerated channel, read health.json. If stale → red entry + fix suggestion. If lastError non-null → surface it.

Sub-task scope estimate: ~50 LOC across C1 + C2 + C3, shares health.json with Mode 1 A1.


9 sub-tasks (proposed)

  1. P1-merge: writeTelegramChannelConfig reads-merges-writes (allowFrom union + preserve pending/groups/dmPolicy) + CI regression test
  2. P1-B1: implement pending pairing in telegramAllowed
  3. P1-B4: anet doctor pending banner + anet channel approve <node> <user>
  4. P2-A1: health.json watchdog (agent-node)
  5. P2-C1+C2+C3: health.json read paths (anet doctor / anet channel status)
  6. P3-A3: friendly bot-token-invalid path (no process.exit(1))
  7. P3-B3: anet channel add interactive pair wizard
  8. P4-A2: config.json hot-reload (optional, may be cut)
  9. integration: end-to-end test: spin up Docker bot fixture, exercise pair → allowlist → reject → pending → approve → allowed loop

Priority ordering (per Vincent's pain map)

  • P1: P1-merge + P1-B1 + P1-B4: resolves today's allowFrom wipe + the "silently can't get through" complaint
  • P2: P2-A1 + P2-C1+C2+C3: resolves "can't tell whether it's alive" + auto-recovery for in-process poller deaths
  • P3: P3-A3 + P3-B3 — UX polish for invalid tokens + onboarding
  • P4: P4-A2 — drop if not free

Constraints / non-goals

  • All work lives in agent-node + agent-network. No Claude / upstream dependency.
  • No breaking changes to access.json schema — existing nodes upgrade in place. The merge fix in P1-merge is itself a non-breaking read-then-write.
  • No new MCP server. (B2 deferred behind B4's CLI.)
  • Don't touch ~/.commhub or any prod hub state during dev/test. Use Docker fixtures.

Owner + ETA

  • Owner: 通信工程马 (lead), may delegate one P2/P3 sub-task to 通信SDK马 depending on bandwidth.
  • ETA: P1 (3 sub-tasks) — 1 evening, ~5-8 h. Full P2 — +1 evening. P3 — opportunistic. P4 — optional. No overnight rush.

Author-Agent: 通信工程马
