
retention: archive stale results/*.txt on startup (D half of flood fix) #354

Open

sonichi wants to merge 3 commits into main from feat/retention-sweep

Conversation

@sonichi
Owner

@sonichi commented Apr 15, 2026

Summary

First half of the D+C DM-fallback redesign you approved after the 2026-04-15 DM-flood post-mortem. Lands D (separate retention policy) as a standalone hygiene PR. C (event-driven fs.watch wiring for the fallback delivery path) will stack on top as a follow-up.

What this does

  • src/archive-stale-results.py (new, 85 lines) — one-shot script that walks results/*.txt, moves any file whose mtime is older than RETENTION_HOURS (default 24) into results/archive-YYYY-MM-DD/. Honors DRY_RUN=1 for preview. Skips files already under archive-* subdirs and the non-.txt calls/ subdirectory.
  • src/startup.sh (+6 lines) — invokes the script right after mkdir -p tasks results data, i.e. before any long-running service starts iterating results/. On fresh boots the backlog gets swept out of the delivery path at startup.
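A minimal sketch of the sweep described above, assuming a straightforward implementation (the function name sweep_stale_results and its exact structure are illustrative; only RETENTION_HOURS, DRY_RUN, and the archive-YYYY-MM-DD layout come from this PR):

```python
import shutil
import time
from datetime import date
from pathlib import Path


def sweep_stale_results(results_dir: str, retention_hours: int = 24,
                        dry_run: bool = False) -> list:
    """Move top-level results/*.txt files older than retention_hours
    into results/archive-YYYY-MM-DD/. Returns the names moved (or that
    would be moved, when dry_run is True)."""
    root = Path(results_dir)
    cutoff = time.time() - retention_hours * 3600
    archive = root / f"archive-{date.today().isoformat()}"
    moved = []
    for f in sorted(root.glob("*.txt")):  # non-recursive: subdirs like
        if not f.is_file():               # calls/ and archive-* are skipped
            continue
        if f.stat().st_mtime < cutoff:
            if not dry_run:
                archive.mkdir(exist_ok=True)
                shutil.move(str(f), str(archive / f.name))
            moved.append(f.name)
    return moved
```

Because Path.glob("*.txt") does not recurse, files already moved into an archive-* directory can never be matched again on later runs.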

Why this first, independent of any DM-fallback rewiring

Today's incident proved results/ accumulates long-dead files indefinitely (task-, question-, briefing-, insight-, and friction- prefixed files from voice work that was never spoken because voice wasn't connected). Any future code that scans results/ for delivery — reverted #352, #349's polling loop, or anything we do next — will flood again on first tick unless the directory is actually kept small. This PR makes the directory small independent of the fallback question. Even if we never wire fallback again, a bounded results/ is an unambiguous win.

Retention window = 24h

Judgment call, overridable per-run via RETENTION_HOURS=N:

  • <1h is too aggressive (kills same-day results)
  • 7d doesn't bound aggressively enough
  • 24h = "older than one full day is effectively dead mail"

Test plan

  • bash -n src/startup.sh clean
  • ast.parse clean on archive-stale-results.py
  • Dry run (DRY_RUN=1) against fresh clean results/ — "no stale files to archive"
  • Fabricated a 2-day-old results/task-test-old.txt, dry-run correctly flagged it, real run moved it to results/archive-2026-04-15/
  • results/calls/ subdirectory is skipped (f.is_file() check)

Not in this PR

  • C (event-driven fs.watch wiring) — stacks on this as follow-up
  • Any new DM-fallback delivery code — still paused per post-mortem

🤖 Generated with Claude Code

Fixes #362

First half of the D+C DM-fallback redesign owner approved after the
2026-04-15 DM-flood post-mortem. This PR lands D (separate retention
policy) as a standalone hygiene win. C (event-driven fs.watch wiring
for the DM fallback) will stack on top as a follow-up PR.

### What this does

- Adds `src/archive-stale-results.py` — a small one-shot script that
  walks `results/*.txt`, moves any file whose mtime is older than
  `$RETENTION_HOURS` (default 24) under `results/archive-YYYY-MM-DD/`.
  Files inside existing `archive-*` subdirectories are never touched.
  Honors `DRY_RUN=1` for safe preview.
- Invokes it from `src/startup.sh` right after `mkdir -p tasks results
  data` — i.e. before any long-running service starts iterating
  `results/`. Any accumulated backlog from a prior quiet day gets
  swept out of the service iteration paths at boot time.

### Why this first, before any DM fallback rewiring

Today's incident proved that `results/` accumulates long-dead files
indefinitely: task-*, question-*, briefing-*, insight-*, friction-*
from voice-originated work that was never spoken because voice wasn't
connected at the time. Any code that scans `results/` for delivery —
my reverted #352 task-bridge wiring, #349's discord-bridge polling
loop, or any future design — is going to re-deliver the entire
backlog on first tick after a cold boot unless the directory is
actually kept small.

D (retention) fixes the underlying hygiene concern independently of
the DM-fallback question. Even if we never wire fallback again, the
directory staying bounded is an unambiguous win. With D in place, C
(the event-driven follow-up) can use `fs.watch` without needing a
complex "pre-existing files at boot" escape hatch — D has already
moved them out of the way.

### Retention window rationale

Defaulted to **24 hours**. This is a judgment call:

- Too short (e.g. 1h) — risks archiving same-day results the user
  might still want the voice agent to speak when they reconnect.
- Too long (e.g. 7d) — bounds the accumulation less aggressively.
- 24h feels right: anything older than one full day is effectively
  dead mail. Overridable per-run via `RETENTION_HOURS=N`.

### Verification

- `bash -n src/startup.sh` clean
- `ast.parse` clean on `archive-stale-results.py`
- Dry run against fresh clean results/: "no stale files to archive"
- Fabricated a 2-day-old test file, DRY_RUN=1 correctly flagged it,
  then real run correctly moved it to `results/archive-2026-04-15/`.
- Does not affect `results/calls/` (subdirectory, skipped by
  `f.is_file()` check).
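The "fabricate a 2-day-old file" step above can be reproduced in a few lines (a sketch; the file name matches the PR text, everything else is illustrative):

```python
import os
import time
from pathlib import Path

# Create a results file and back-date its mtime (and atime) by two
# days, so a 24h retention cutoff will flag it as stale.
results = Path("results")
results.mkdir(exist_ok=True)
stale = results / "task-test-old.txt"
stale.write_text("fabricated stale result\n")

two_days_ago = time.time() - 2 * 86400
os.utime(stale, (two_days_ago, two_days_ago))
```

After this, a DRY_RUN=1 pass should report the file, and a real run should move it under results/archive-YYYY-MM-DD/ for the current date.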

### What's not in this PR

- **C (event-driven fs.watch wiring)** — stacks on this. Will replace
  the current `readdirSync(RESULT_DIR)` scan in task-bridge.ts with
  an fs.watch listener that only reacts to newly-created files.
  Separate PR, coming next.
- **Full architectural redo of DM fallback delivery** — out of scope;
  the pause-on-DM-fallback rule from the post-mortem still holds.
  This PR is pure hygiene.

### References

- Post-mortem: `notes/post-mortem-dm-flood-2026-04-15.md`
- Feedback memory: `feedback_results_dir_is_graveyard.md`
- Owner's D+C decision: Discord 23:20 local

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonichi (Owner, Author) left a comment


MacBook review: LGTM. Clean retention sweep — archives stale results/*.txt on startup before services can iterate them. Configurable via RETENTION_HOURS env var, DRY_RUN support, date-stamped archive dirs. Good defensive fix for the DM flood class of bug.

liususan091219 added a commit that referenced this pull request Apr 16, 2026
…#354

Each script reproduces the bug (before the fix) and verifies it's resolved
(after the fix). All POCs pass on current main.

- poc-pr353-open-file.sh (11/11) — 18s polling timeout in open_file
- poc-pr355-subtitled-pending.sh (9/9) — false positive subtitled_pending
- poc-pr332-team-tier-revert.sh (9/9) — team-tier -C /tmp broke codex
- poc-pr325-bodhi-dep.sh (7/7) — bodhi dep pointed at deleted repo
- poc-pr354-retention-sweep.sh — retention sweep for stale results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@josuachavesolmos left a comment


Reviewed by Sutando proactive loop.

APPROVE — core logic is sound. All downstream consumers verified safe (discord-bridge, telegram-bridge, daily-insight, agent-api — none recurse into subdirectories). Placement in startup.sh is correct (after mkdir, before services).

Two optional hardening suggestions:

  • [MEDIUM] Guard against RETENTION_HOURS=0 (archives all in-flight results). Clamp with max(1, RETENTION_HOURS).
  • [MEDIUM] Wrap int() parse in try/except so invalid values fall back to 24h instead of silently skipping via || true.

Neither is blocking — safe to merge as-is.
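Both suggestions above could be folded into one defensive parser (a sketch; parse_retention_hours is a hypothetical name, not code from this PR):

```python
def parse_retention_hours(raw, default=24):
    """Parse a RETENTION_HOURS value defensively: fall back to the
    default on missing or non-integer input, and clamp to at least
    1 hour so RETENTION_HOURS=0 cannot archive in-flight results."""
    try:
        hours = int(raw)
    except (TypeError, ValueError):
        return default
    return max(1, hours)
```

With this, a bad value degrades to the 24h default inside Python rather than being swallowed by a shell-level || true.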

@liususan091219 (Collaborator)

POC: bash scripts/poc-pr354-retention-sweep.sh. Script in PR #358. Issue: #362

liususan091219 and others added 2 commits April 15, 2026 21:14
Reproduces the results/ unbounded growth bug and verifies the archival fix.
Run: bash scripts/test-pr354-retention-sweep.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

results/ directory grows unbounded — causes DM flood on scan
