# [umbrella][P1] telegram channel reliability (#245 follow-up)

**Goal**: kill three reproducible failure modes that Vincent hit today (telegram pairing & DM reliability across multi-node restarts).

**Origin**: Vincent /goal — "Telegram optimization, I give up". Dispatched as the deep cut of #245 connectivity hardening, plus a regression caught today on TMCode负责人 where `allowFrom` got wiped by an upgrade/restart cycle.

**Key insight that resets the design space** (vs #245 task E's earlier verdict):

`connectTelegram` is a self-contained polling loop **inside `agent-node`** (`agent-node/src/cli.ts:2289`). It is started unconditionally at module top level on line 2530:
`for (const channel of TELEGRAM_CHANNELS) connectTelegram(channel)`.
It is **not** a Claude-code-cli internals concern: Claude never touches the poller. Every failure mode below is fixable in our own codebase (`agent-node` + `agent-network`), with zero upstream PR dependency. The earlier "Claude-internal channel lifecycle is hard" framing was about a different surface (the resume-time MCP channel reattach), not the bot poller.

---

## P1 — `allowFrom` persistence regression (today's TMCode负责人 incident)

**Smoking gun** — `agent-network/bin/cli.ts:1465-1481`:
```ts
function writeTelegramChannelConfig(nodeId, botToken, allowId) {
  ...
  writeFileSync(join(channelDir, "access.json"), JSON.stringify({
    dmPolicy: "allowlist",
    allowFrom: [allowId],         // ← obliterates any pre-existing allowFrom
    groups: {},                   // ← obliterates any group assignments
    pending: {},                  // ← obliterates any pending pairings
  }, null, 2) + "\n");
}
```

Called from two sites that are part of routine UX flows:
- `cli.ts:1868` — `anet node create` wizard (initial setup)
- `cli.ts:4486` — `anet channel add telegram <node> ...` (CLI re-add, frequently re-run when troubleshooting)

**Either path destroys user data on re-run.** An `allowFrom` list that has been built up gradually over weeks is reset to a single id (or an empty list) the next time the operator re-runs the wizard or `channel add`.

**Hard constraint to land first** (P1, blocker for the rest):
- `writeTelegramChannelConfig` MUST read the existing `access.json` if it exists and merge:
  - keep prior `dmPolicy` unless explicitly overridden
  - **`allowFrom`: union with the new id, never replace**
  - keep prior `groups` and `pending`
- `--bot-token` only writes `.env` when explicitly passed
- **Regression test** in CI: seed a fake access.json with 3 ids in `allowFrom` + 2 entries in `pending` + 1 entry in `groups`, run `anet channel add telegram <node> --bot-token X --allow Y`, assert the merged file still contains the original 3 + new Y in `allowFrom`, original `pending` and `groups` intact.
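
The merge rules above can be sketched as a pure function that `writeTelegramChannelConfig` might call before `writeFileSync`. This is a minimal sketch, not the actual patch: `mergeAccessConfig` and the `AccessConfig` type are hypothetical names, and storing ids as strings is an assumption (the source does not state the id type).

```typescript
// Hypothetical shape of access.json, based on the fields quoted above.
// Assumption: ids are stored as strings.
interface AccessConfig {
  dmPolicy: string;
  allowFrom: string[];
  groups: Record<string, unknown>;
  pending: Record<string, unknown>;
}

// Merge a newly supplied id into whatever access.json already holds.
// `existing` is null when no access.json exists yet.
function mergeAccessConfig(
  existing: Partial<AccessConfig> | null,
  allowId?: string,
  dmPolicyOverride?: string,
): AccessConfig {
  const prior: Partial<AccessConfig> = existing ?? {};
  // Union: keep every prior id, append the new one only if it is absent.
  const allowFrom = [...(prior.allowFrom ?? [])];
  if (allowId !== undefined && allowFrom.indexOf(allowId) === -1) {
    allowFrom.push(allowId);
  }
  return {
    // Keep the prior policy unless the caller explicitly overrides it.
    dmPolicy: dmPolicyOverride ?? prior.dmPolicy ?? "allowlist",
    allowFrom,
    groups: prior.groups ?? {},
    pending: prior.pending ?? {},
  };
}
```

Keeping the merge pure (no file I/O) would let the CI regression test exercise it directly with the seeded 3 ids, 2 pending entries, and 1 group, independent of the CLI wiring.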

---

## Mode 1 — poller does not start after restart / channel not re-attached

Three independent root causes; need targeted fixes (not one mega-rewrite).

### 1a. `anet channel add` after `anet node start` — poller never picks up new channel
- `agent-node/src/cli.ts:446` evaluates `TELEGRAM_CHANNELS` at **module top-level**. Adding a channel to `config.json` while agent-node is running has zero effect; `anet node resume` doesn't help (it wakes Claude, not agent-node).
- **Fix A2** *(low, deferred)*: agent-node watches `config.json` mtime → re-init `TELEGRAM_CHANNELS` and call `connectTelegram(newChannel)` for any new entry. Edge cases (token rotation, removal) make this fiddly; deferred behind A1.

### 1b. Bot token invalid → `process.exit(1)` (cli.ts:2303) kills the whole agent-node
- agent-node dies → tmux respawns it → same bad token → dies again → restart-storm.
- **Fix A3** *(low)*: replace `process.exit(1)` with `commhub_report_status({ status: "warn", note: "telegram bot token invalid (<channel-dir>)" })` + `return` from `connectTelegram`. agent-node stays alive, other channels keep working, the failing channel surfaces in `anet doctor`.
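
A rough sketch of the A3 control flow, assuming the token check has already produced a boolean. `connectOrReport`, `reportStatus`, and `startPolling` are hypothetical stand-ins (the real code would call `commhub_report_status` and the existing polling loop); only the "report and return instead of exit" shape is the point.

```typescript
// Hypothetical status payload, mirroring the call quoted in Fix A3.
type ChannelStatus = { status: "warn"; note: string };

// Returns true if the poller was started, false if the channel was skipped.
// Nothing here terminates the process, so other channels keep running.
function connectOrReport(
  channelDir: string,
  tokenValid: boolean,
  reportStatus: (s: ChannelStatus) => void,
  startPolling: (dir: string) => void,
): boolean {
  if (!tokenValid) {
    reportStatus({
      status: "warn",
      note: `telegram bot token invalid (${channelDir})`,
    });
    return false; // previously: process.exit(1)
  }
  startPolling(channelDir);
  return true;
}

// Usage: one bad channel does not stop the good one.
const reports: ChannelStatus[] = [];
const started: string[] = [];
const channels = [
  { dir: "channels/a", tokenValid: false },
  { dir: "channels/b", tokenValid: true },
];
for (const c of channels) {
  connectOrReport(c.dir, c.tokenValid, (s) => reports.push(s), (d) => started.push(d));
}
```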

### 1c. Poller silently dead inside the running process
- The outer `while (true)` catches `getUpdates` network errors (cli.ts:2357-2359), but a thrown exception from inside the queue drain (e.g. `telegramSend` failure causing unhandled promise rejection at line 2354) can leave `processing = false` with no caller — the loop continues but the queue never drains.
- **Fix A1** *(low, primary)*: agent-node maintains `<channel-dir>/health.json` with `{lastPoll, offset, msgsSinceBoot, lastError}`. A watchdog timer in agent-node checks every 30s: if `now - lastPoll > 90s` → log + re-invoke `connectTelegram(channel)` + bump a restart counter. Counter exposed in `anet channel status`.
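
The staleness check in A1 could look roughly like the sketch below. The `ChannelHealth` type, `isPollerStale`, and `watchdogTick` are hypothetical names; `restartCount` is taken from the field list in C1, and the restart callback stands in for re-invoking `connectTelegram(channel)`.

```typescript
// Hypothetical in-memory form of <channel-dir>/health.json.
interface ChannelHealth {
  lastPoll: string; // ISO timestamp of the last completed poll
  offset: number;
  msgsSinceBoot: number;
  lastError: unknown;
  restartCount: number;
}

const STALE_AFTER_MS = 90_000; // "now - lastPoll > 90s"

function isPollerStale(health: ChannelHealth, nowMs: number): boolean {
  return nowMs - Date.parse(health.lastPoll) > STALE_AFTER_MS;
}

// One watchdog pass. Returns the health record to persist afterwards.
function watchdogTick(
  health: ChannelHealth,
  nowMs: number,
  restart: () => void,
): ChannelHealth {
  if (!isPollerStale(health, nowMs)) return health;
  restart(); // e.g. re-invoke connectTelegram(channel)
  return { ...health, restartCount: health.restartCount + 1 };
}
```

In agent-node this would presumably be driven by a 30 s `setInterval`, with the returned record written back to `health.json`.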

**Sub-task scope estimate**: ~80 LOC across A1 + A3, mostly in agent-node. **A2 deferred** — watchdog covers the symptom, hot-reload is a nice-to-have.

---

## Mode 2 — pairing UX (pairing + allowlist)

`access.json` was **designed** with `dmPolicy / pending / groups` (`agent-network/bin/cli.ts:1474-1479`), but `agent-node/src/cli.ts:2204-2208` only ever reads `allowFrom`. `dmPolicy` and `pending` are written once and never read — a half-implemented pairing UX. Result:

- `allowFrom.length === 0` → wide open (everyone accepted).
- `allowFrom.length > 0` → strict allowlist; non-allowed senders are **silently dropped**. The bot gives no signal and the operator never sees the attempt, which is exactly Vincent's "can't reach it" complaint.

### B1 — implement pending pairing *(medium, primary)*
Reject path in `telegramAllowed`:
- Write `pending[<userId>] = { firstSeen, lastMessage, label, attempts }` to access.json.
- Reply to the user: `"Waiting for admin approval"` + a copy of their own user id so they can show the operator.
- Log to commhub with status `waiting_pairing`.
- Throttle: at most one waiting-for-approval reply per `<userId>` per 24 h.
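
A minimal sketch of the bookkeeping on the reject path, under stated assumptions: `recordRejectedSender` is a hypothetical helper, `lastMessage` is treated as the timestamp of the most recent message, and `lastNotified` is an extra field introduced here only to implement the 24 h throttle.

```typescript
// Pending entry as listed in B1, plus a hypothetical `lastNotified`
// timestamp used for throttling replies.
interface PendingEntry {
  firstSeen: string;
  lastMessage: string;
  label: string;
  attempts: number;
  lastNotified?: string;
}

const NOTIFY_INTERVAL_MS = 24 * 60 * 60 * 1000;

// Records a rejected sender and decides whether to send the
// waiting-for-approval reply. Does not mutate its input.
function recordRejectedSender(
  pending: Record<string, PendingEntry>,
  userId: string,
  label: string,
  nowMs: number,
): { pending: Record<string, PendingEntry>; notify: boolean } {
  const nowIso = new Date(nowMs).toISOString();
  const prior: PendingEntry | undefined = pending[userId];
  const lastNotified = prior?.lastNotified;
  const notify =
    lastNotified === undefined ||
    nowMs - Date.parse(lastNotified) >= NOTIFY_INTERVAL_MS;
  const entry: PendingEntry = {
    firstSeen: prior ? prior.firstSeen : nowIso,
    lastMessage: nowIso,
    label,
    attempts: (prior ? prior.attempts : 0) + 1,
    lastNotified: notify ? nowIso : lastNotified,
  };
  return { pending: { ...pending, [userId]: entry }, notify };
}
```

The caller would send the reply and log `waiting_pairing` only when `notify` is true, then persist the returned `pending` map into `access.json`.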

### B3 — `anet channel add` wizard prompts the operator to pair *(medium)*
At creation:
1. operator gives bot token, no `--allow`
2. wizard prints: `"Open Telegram → send any message to @<bot-username> (within 3 minutes) → I will write your id into allowFrom"`
3. wizard polls `getUpdates` until first message arrives → writes that `from.id` to `allowFrom`
4. timeout returns to manual `--allow` flow
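
The wizard steps above can be sketched as follows. The `TelegramUpdate` type covers only the fields of a Telegram Bot API update that this sketch reads; `firstSenderId`, `waitForFirstSender`, and the injected `fetchUpdates` callback are hypothetical, with `fetchUpdates` assumed to wrap a long-polling `getUpdates` call so that the loop does not spin.

```typescript
// Subset of a Telegram update object used by this sketch.
interface TelegramUpdate {
  update_id: number;
  message?: { from?: { id: number; username?: string } };
}

// Returns the sender id of the first update that carries one, else null.
function firstSenderId(updates: TelegramUpdate[]): number | null {
  for (const u of updates) {
    const id = u.message?.from?.id;
    if (id !== undefined) return id;
  }
  return null;
}

// Polls until a sender shows up or the timeout elapses.
// A null result means: fall back to the manual --allow flow.
async function waitForFirstSender(
  fetchUpdates: (offset: number) => Promise<TelegramUpdate[]>,
  timeoutMs: number,
  now: () => number = Date.now,
): Promise<number | null> {
  const deadline = now() + timeoutMs;
  let offset = 0;
  while (now() < deadline) {
    const updates = await fetchUpdates(offset);
    const id = firstSenderId(updates);
    if (id !== null) return id;
    // Acknowledge updates without a sender so they are not re-delivered.
    if (updates.length > 0) {
      offset = updates[updates.length - 1].update_id + 1;
    }
  }
  return null;
}
```

With the 3 minute window from step 2, the wizard would call `waitForFirstSender(fetchUpdates, 180_000)` and write the returned id into `allowFrom` through the same merge path as `--allow`.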

### B4 — `anet doctor` surfaces pending pairings prominently *(low)*
`doctor` already enumerates channels per #245 task E. Add: for each channel, if `Object.keys(access.pending).length > 0`, print a yellow banner:
```
⚠️ Node TMCode负责人 has 2 pending pairings:
   alice (id=12345, last seen 14:32) — run `anet channel approve TMCode负责人 alice`
   bob   (id=67890, last seen 14:35) — run `anet channel approve TMCode负责人 bob`
```
plus a new sub-command `anet channel approve <node> <user>` that moves the id from `pending` to `allowFrom`.
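
One possible shape for the approve step, as a sketch only. `approvePending` and the types are hypothetical; matching `<user>` against either the id or the label is an assumption based on the banner example, which passes `alice` rather than the numeric id.

```typescript
interface PendingUser {
  label: string;
}

interface AccessFile {
  dmPolicy: string;
  allowFrom: string[];
  groups: Record<string, unknown>;
  pending: Record<string, PendingUser>;
}

// Moves one user from `pending` to `allowFrom`. `user` may be the id
// (the key in `pending`) or the label shown in the doctor banner.
function approvePending(access: AccessFile, user: string): AccessFile {
  const id = Object.keys(access.pending).find(
    (key) => key === user || access.pending[key].label === user,
  );
  if (id === undefined) {
    throw new Error(`no pending pairing for "${user}"`);
  }
  const pending = { ...access.pending };
  delete pending[id];
  const allowFrom =
    access.allowFrom.indexOf(id) === -1
      ? [...access.allowFrom, id]
      : access.allowFrom;
  return { ...access, allowFrom, pending };
}
```

Throwing on an unknown user lets the CLI print a clear error instead of silently writing an unchanged file.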

### B2 — `commhub_telegram_approve` MCP tool *(deferred)*
Same intent as B4's CLI sub-command, but exposed as MCP so the operator's agent can approve without leaving its turn. Defer unless B4 proves too clicky.

---

## Mode 3 — liveness visibility (poller liveness)

`anet doctor` + `anet channel status` (added in #245 task E) currently show *configuration* (access.json contents, dmPolicy, allowFrom). They do **not** show whether the poller is currently polling.

### C1 — write `<channel-dir>/health.json` from agent-node *(low)*
Shared file with Mode 1 A1's watchdog. Fields: `lastPoll` (ISO timestamp), `offset`, `msgsSinceBoot`, `lastError` (object or null), `restartCount` (watchdog interventions).

### C2 — `anet channel status` shows freshness *(low)*
Decorate the existing `anet channel status` output:
- `Last poll: 12s ago` (green ✅ if < 60 s)
- `Last poll: 4 min ago` (yellow ⚠️ if 60 s – 10 min)
- `Last poll: 4 h ago` (red 🔴 if > 10 min) — `→ poller likely dead, run anet node stop && anet node start`
- + `Watchdog restarts since boot: <n>`
- + `Last error: <message>` if non-null
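
The thresholds above reduce to a small classifier. This is a sketch with hypothetical names (`classifyFreshness`, `formatAge`); the boundaries follow the list: under 60 s is green, 60 s up to and including 10 min is yellow, anything older is red.

```typescript
type Freshness = "green" | "yellow" | "red";

// Maps the age of the last poll to the colour used in the status output.
function classifyFreshness(ageMs: number): Freshness {
  if (ageMs < 60_000) return "green";
  if (ageMs <= 600_000) return "yellow";
  return "red";
}

// Renders an age the way the examples above do: "12s ago", "4 min ago", "4 h ago".
function formatAge(ageMs: number): string {
  const seconds = Math.floor(ageMs / 1000);
  if (seconds < 60) return `${seconds}s ago`;
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes} min ago`;
  return `${Math.floor(minutes / 60)} h ago`;
}
```

`anet channel status` would compute `ageMs` as the current time minus `Date.parse(health.lastPoll)` and print `Last poll: ${formatAge(ageMs)}` in the matching colour.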

### C3 — `anet doctor` integrates `health.json` *(low)*
For each enumerated channel, read `health.json`. If stale → red entry + fix suggestion. If `lastError` non-null → surface it.

**Sub-task scope estimate**: ~50 LOC across C1 + C2 + C3, shares `health.json` with Mode 1 A1.

---

## 9 sub-tasks (proposed)

1. **P1-merge** — `writeTelegramChannelConfig` reads-merge-writes (allowFrom union + preserve pending/groups/dmPolicy) + **CI regression test**
2. **P1-B1** — implement pending pairing in `telegramAllowed`
3. **P1-B4** — `anet doctor` pending banner + `anet channel approve <node> <user>`
4. **P2-A1** — health.json watchdog (`agent-node`)
5. **P2-C1+C2+C3** — health.json read paths (`anet doctor` / `anet channel status`)
6. **P3-A3** — friendly bot-token-invalid path (no `process.exit(1)`)
7. **P3-B3** — `anet channel add` interactive pair wizard
8. **P4-A2** — `config.json` hot-reload (optional, may be cut)
9. **integration** — end-to-end test: spin up Docker bot fixture, exercise pair → allowlist → reject → pending → approve → allowed loop

## Priority ordering (per Vincent's pain map)

- **P1**: P1-merge + P1-B1 + P1-B4 — resolves today's `allowFrom` wipe + the "silently unreachable" complaint
- **P2**: P2-A1 + P2-C1+C2+C3 — resolves "can't tell whether it is alive" + auto-recovery for in-process poller deaths
- **P3**: P3-A3 + P3-B3 — UX polish for invalid tokens + onboarding
- **P4**: P4-A2 — drop if not free

## Constraints / non-goals

- All work lives in `agent-node` + `agent-network`. **No Claude / upstream dependency**.
- No breaking changes to `access.json` schema — existing nodes upgrade in place. The merge fix in P1-merge is itself a non-breaking read-then-write.
- No new MCP server. (B2 deferred behind B4's CLI.)
- Don't touch `~/.commhub` or any prod hub state during dev/test. Use Docker fixtures.

## Owner + ETA

- **Owner**: 通信工程马 (lead), may delegate one P2/P3 sub-task to 通信SDK马 depending on bandwidth.
- **ETA**: P1 (3 sub-tasks) — 1 evening, ~5-8 h. Full P2 — +1 evening. P3 — opportunistic. P4 — optional. No overnight rush.

---

Author-Agent: 通信工程马

