Polling gets stuck in a failure loop after long uptime (httpx Pool timeout / Server disconnected); only killing the process recovers it #79

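## Problem

After running for a long time, CCBot v0.1.0 (commit 865ab89) enters a "hung" state: the process stays alive, but Telegram polling stops working entirely and no new messages arrive. The only way to recover is to kill the process and let launchd restart it. After a restart it works for a short while, then falls back into the same state.

## Symptoms

The bot process does not exit (`launchctl list` shows status 0), and calling the Telegram API directly with `curl` (`getMe`) succeeds, **but every polling call and request made inside the bot times out**. Two kinds of errors, timeouts and server disconnects, repeat in stderr: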

```
telegram.error.TimedOut: Pool timeout: All connections in the connection pool are occupied. 
Request was *not* sent to Telegram. Consider adjusting the connection pool size or the pool timeout.

httpcore.RemoteProtocolError: Server disconnected without sending a response.
telegram.error.NetworkError: httpx.RemoteProtocolError: Server disconnected without sending a response.

telegram.error.TimedOut: Timed out
ccbot.handlers.status_polling - DEBUG - Topic probe error for @2: Timed out
ccbot.handlers.status_polling - DEBUG - Topic probe error for @3: Timed out
...
```

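`Pool timeout: All connections in the connection pool are occupied` is the key line: it looks like the httpx connection pool fills up and its connections are never reclaimed, so every later request fails to get a connection.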
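
To make the suspected failure mode concrete, here is a minimal toy model, not ccbot's or httpx's actual code. It assumes a pool of two slots (an `asyncio.Semaphore`) and a hypothetical bug where the error path returns without releasing its slot. `LeakyPool`, `request`, and `simulate` are names invented for this sketch.

```python
import asyncio


class LeakyPool:
    """Toy connection pool: a semaphore with a fixed number of slots."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)

    async def acquire(self, timeout: float) -> None:
        # Mirrors a pool timeout: give up if no slot frees up in time.
        await asyncio.wait_for(self._slots.acquire(), timeout)

    def release(self) -> None:
        self._slots.release()


async def request(pool: LeakyPool, fail: bool) -> str:
    await pool.acquire(timeout=0.05)
    if fail:
        # Hypothetical bug: the error path returns without releasing the slot.
        return "error"
    pool.release()
    return "ok"


async def simulate() -> list[str]:
    pool = LeakyPool(size=2)
    results = []
    # Two failing requests leak both slots; later requests can never acquire one.
    for fail in (True, True, False, False):
        try:
            results.append(await request(pool, fail))
        except asyncio.TimeoutError:
            results.append("pool timeout")
    return results


print(asyncio.run(simulate()))
# prints ['error', 'error', 'pool timeout', 'pool timeout']
```

If something like this is happening, the state would be permanent for the lifetime of the process, which would match the observation that only a restart helps.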

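## Reproduction frequency (measured)

I set up a launchd watchdog that runs every 5 minutes and restarts the bot with `launchctl kickstart` when stderr has been silent for more than 120 s. Restart counts since I installed it yesterday: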

```
Yesterday (2026-04-25): 9 restarts
Today (2026-04-26) as of 13:30: 62 restarts
```

Some of today's restart times (one rescue roughly every 5 minutes, meaning the bot hangs again right after each restart):

```
09:17 09:23 09:28 09:33 09:38 09:43 09:48 09:53
10:03 10:08 10:13 10:18 10:23 10:28 10:33 10:38 10:43 10:48 10:53 10:58
11:03 11:08 11:18 11:23 11:28 11:34 11:44 11:49 11:54 11:59
12:04 12:09 12:14 12:19 12:24 12:29 12:34 12:39 12:44 12:49 12:54 12:59
13:04 13:09 13:14 13:19 13:25 13:30
```

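There are occasional healthy stretches of several hours (no hang between 03:21 and 07:51 last night), but once the problem is triggered the bot enters a "dies every 5 minutes" loop.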
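
For reference, the gaps between the restart times listed above can be tallied with a short script. This is only arithmetic on the timestamps in this report; `RESTARTS`, `to_minutes`, and `gap_histogram` are names made up for the sketch.

```python
from collections import Counter

# Restart times copied from the list above (HH:MM, all on 2026-04-26).
RESTARTS = """
09:17 09:23 09:28 09:33 09:38 09:43 09:48 09:53
10:03 10:08 10:13 10:18 10:23 10:28 10:33 10:38 10:43 10:48 10:53 10:58
11:03 11:08 11:18 11:23 11:28 11:34 11:44 11:49 11:54 11:59
12:04 12:09 12:14 12:19 12:24 12:29 12:34 12:39 12:44 12:49 12:54 12:59
13:04 13:09 13:14 13:19 13:25 13:30
""".split()


def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def gap_histogram(times: list[str]) -> Counter:
    """Count how often each gap (in minutes) occurs between consecutive times."""
    minutes = [to_minutes(t) for t in times]
    return Counter(later - earlier for earlier, later in zip(minutes, minutes[1:]))


histogram = gap_histogram(RESTARTS)
for gap, count in sorted(histogram.items()):
    print(f"{gap:>2} min gap: {count} times")
# prints:
#  5 min gap: 41 times
#  6 min gap: 3 times
# 10 min gap: 3 times
```

Of the 47 gaps, 44 are 5 or 6 minutes, so in this window the bot was restarted on nearly every watchdog run.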

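## What I tried

- `MONITOR_POLL_INTERVAL=10` (default 2): no noticeable change in frequency
- Rolled back to `MONITOR_POLL_INTERVAL=2`: same result
- Restarting the bot process: fine for a short while, then the hang recurs within minutes

So `MONITOR_POLL_INTERVAL` appears to have little effect on how often the hang occurs.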

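## Environment

- macOS 26.4.1
- Python 3.13.7
- CCBot v0.1.0 (installed with `uv tool install`, commit 865ab89)
- python-telegram-bot with the httpx backend
- Network: mainland China, outbound traffic routed through Clash Verge in TUN mode (other httpx-based services work fine; only ccbot's polling keeps timing out)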

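## Request

I hope the author can take a look at the httpx connection pool handling:

1. Is there an error path where a connection is not released correctly (for example, the connection not returning to the pool after a `Server disconnected` error)?
2. Could the client be rebuilt proactively when a fatal state such as `Pool timeout` is detected?
3. Alternatively, could a setting such as `HTTPX_POOL_LIMIT` be exposed so users can tune the pool?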
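
As a rough illustration of the second suggestion, here is a minimal sketch of a wrapper that rebuilds its client after several consecutive pool failures. It is synchronous and library-agnostic for brevity, and it does not reflect how ccbot or python-telegram-bot are structured. `PoolExhausted`, `SelfHealingClient`, and `max_failures` are hypothetical names, with `PoolExhausted` standing in for whatever error signals pool exhaustion.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class PoolExhausted(Exception):
    """Stand-in for an error that signals an exhausted connection pool."""


class SelfHealingClient(Generic[T]):
    """Rebuilds the wrapped client after too many consecutive pool failures."""

    def __init__(self, factory: Callable[[], T], max_failures: int) -> None:
        self._factory = factory
        self._max_failures = max_failures
        self._failures = 0
        self.rebuilds = 0
        self.client = factory()

    def call(self, func: Callable[[T], object]) -> object:
        try:
            result = func(self.client)
        except PoolExhausted:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Replace the old client (and its pool) with a fresh one.
                self.client = self._factory()
                self.rebuilds += 1
                self._failures = 0
            raise
        # Any success resets the consecutive-failure counter.
        self._failures = 0
        return result


def always_exhausted(client: object) -> None:
    raise PoolExhausted("Pool timeout")


# Demo: plain object() instances stand in for real HTTP clients.
healer = SelfHealingClient(factory=object, max_failures=3)
first_client = healer.client
for _ in range(3):
    try:
        healer.call(always_exhausted)
    except PoolExhausted:
        pass
print(healer.rebuilds, healer.client is first_client)
# prints: 1 False
```

A real version would also need to close the old client before replacing it and would have to be async; whether that is practical inside ccbot is for the author to judge.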

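## Temporary workaround (shared for anyone hitting the same problem)

I wrote a launchd watchdog that runs every 5 minutes and restarts the bot automatically, which keeps the bot usable overall: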

```bash
#!/bin/bash
# ~/.ccbot/watchdog.sh - detect a hung bot and restart it
LOG="$HOME/.ccbot/stderr.log"
LABEL="com.dazhi.ccbot"

# Bot job not listed by launchctl: leave it to launchd
launchctl list 2>/dev/null | grep -q "$LABEL" || exit 0

# stderr.log silent for more than 120 s: treat as hung
# (BSD stat on macOS first, GNU stat as a fallback; bail out if the log is missing)
mtime=$(stat -f %m "$LOG" 2>/dev/null || stat -c %Y "$LOG" 2>/dev/null) || exit 0
age=$(( $(date +%s) - mtime ))
[ "$age" -gt 120 ] && launchctl kickstart -k "gui/$(id -u)/$LABEL" && exit 0

# At least 50 of the last 100 log lines are network errors: treat as hung
errors=$(tail -100 "$LOG" 2>/dev/null | grep -cE 'Timed out|RemoteProtocolError|Server disconnected|ConnectError|ReadError')
[ "$errors" -ge 50 ] && launchctl kickstart -k "gui/$(id -u)/$LABEL"
```

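This only stops the bleeding, though; the underlying connection pool problem still needs to be fixed inside ccbot.

Thanks to the author! The overall design of the tool is great (the tmux bridge plus interoperability with desktop sessions); this connection pool problem just makes long-running use unstable. I hope it can be fixed.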

