Merged
43 changes: 30 additions & 13 deletions .agent-plan.md
@@ -6,19 +6,20 @@

 ## Mainline Status

-- Last merged PR on main: `#217` — secret redaction in persisted discovery state, the root-cause
-fix for the seed-time secret-leak incident (below). `redact_secrets()` strips the literal values
-of credential-named env vars (`DENBUST_*`/`ANTHROPIC_API_KEY`/Supabase/object-store/Kaggle/HF —
-primary, format-agnostic) plus credential shapes (URL key params, JWTs, header tokens, JSON
-secret fields, `AIza`/`Bearer`/`sk-` — backstop) from every discovery error string and from the
-run/backfill-batch/metrics snapshot writers, so an API error that echoes a key never reaches
-state. Threat-model tested across the project's secret types. This is the last step of the
-search-backstop code (`UNIFY-PR-05`) plus the incident fix.
+- Last merged PR on main: `#220` (`GUARD-PR-SECRET-SCAN`, closes #218) — the three-layer
+[gitleaks](https://github.com/gitleaks/gitleaks) secret-scan guard (the outer defense following
+the seed-time leak incident below): a shared `.gitleaks.toml`, a `pre-commit` pre-push hook, a
+fail-closed `scripts/state-run.sh` scan before each state push, and a Claude Code
+`PreToolUse`/`Bash` hook that blocks an agent-issued `git push` carrying a secret. Builds on the
+root-cause fix `#217`, which made `redact_secrets()` strip credential values (env-var literals —
+primary, format-agnostic — plus URL/JWT/header/`AIza`/`Bearer`/`sk-` shapes — backstop) from every
+discovery error string and the run/backfill-batch/metrics snapshot writers, so an API error that
+echoes a key never reaches state.
 - Next planned PR: `UNIFY-PR-06` (operational, go-live) — re-enable only the non-scraping workflows
 on schedules (discover ≥daily for the backstop, daily-review, monthly-report, release, backup,
 squash); the scraping ingest / backfill-scrape jobs stay local since GitHub never scrapes.
-Deferred until **(a)** the state-push secret-scan guard (issue #218) is in place and **(b)** a
-manual dispatch verifies the seeded state end to end — both prompted by the incident below. The
+Deferred until a manual dispatch verifies the seeded state end to end — prompted by the incident
+below. The state-push secret-scan guard (issue #218) has landed (`GUARD-PR-SECRET-SCAN`). The
 state repo `DataHackIL/tfht_enforce_idx_state` is **seeded** from local `data/news_items`
 (27,568 candidates + queues/attempts/verdicts/budget/yield + backfill_batches/runs/metrics,
 recovered from orphaned `.jsonl.gz` to plain JSONL; excluded: prefilter models + decision-log
@@ -331,13 +332,29 @@
 defers to a recent local search regardless of clock ordering). A zero-run day now finishes
 non-fatal.
 Covered by ledger + config + discover-job tests.
+- [done] `GUARD-PR-SECRET-SCAN` (#220, closes #218): three-surface [gitleaks](https://github.com/gitleaks/gitleaks)
+secret-scan guard (the industry tool, not ad-hoc regex), the structural follow-up to the seed-time
+leak incident below. A repo-root `.gitleaks.toml` (default ruleset + a no-entropy `AIza` Google
+rule) with a **narrow, per-rule allowlist**: because the leak rode in *inside* the candidate-data
+JSONL, those paths are not blanket-skipped — only the catch-all `generic-api-key` rule is suppressed
+there (and `jwt` too on captured fixtures), so a real key or JWT in the candidate stream is still
+caught. Three guards: **(1)** the enforced `scripts/state-run.sh` `scan_state_for_secrets()`,
+which scans the **staged** diff before every commit/push, **fails closed** when gitleaks or the
+config is absent (`STATE_RUN_SKIP_SECRET_SCAN=1` is the discouraged escape hatch), and distinguishes
+leaks from operational errors; **(2)** a best-effort `pre-commit` `gitleaks` hook at
+`stages: [pre-push]`; **(3)** a best-effort Claude Code `PreToolUse`/`Bash` hook
+(`.claude/settings.json` → `scripts/hooks/gitleaks_prepush_guard.py`) with shlex-based push
+detection (no fail-open on `--no-pager`/`-c`/`-C` forms), per-`-C` target scanning, and an
+explicit script-relative `--config`. CI installs gitleaks so the guard's blocking tests run rather
+than skip. Documented under AGENTS.md “Secret Scanning”. This satisfies gate (a) of `UNIFY-PR-06`.
 - [next] `UNIFY-PR-06` (operational, go-live): the state repo is **seeded** (key-scrubbed after the
 incident below) with the recovered core state from local `data/news_items` (27,568 candidates +
 queues/attempts/verdicts/budget/yield + backfill_batches/runs/metrics; excluded: prefilter models
 + decision logs, engine_query_cache, and the 119 MB candidate_provenance which exceeds GitHub's
-100 MB file limit). Remaining: re-enable only the non-scraping workflows on schedules (discover
-≥daily, daily-review, monthly-report, release, backup, squash) — gated on the state-push
-secret-scan guard (issue #218) plus a manual dispatch verifying the seeded state end to end.
+100 MB file limit). The state-push secret-scan guard (`GUARD-PR-SECRET-SCAN`, #218) is now in
+place. Remaining: re-enable only the non-scraping workflows on schedules (discover ≥daily,
+daily-review, monthly-report, release, backup, squash) — gated only on a manual dispatch
+verifying the seeded state end to end.
 Scraping ingest / backfill-scrape stay local. Seed-time incident: a live Google CSE key was
 captured into a run's `errors[]` from a CSE-403 URL and seeded to the public repo; it was rotated,
 the public history was purged, and redaction landed (#217). (Parked scrape→classify decouple:
17 changes: 17 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,17 @@
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python3 \"$CLAUDE_PROJECT_DIR/scripts/hooks/gitleaks_prepush_guard.py\"",
            "if": "Bash(git push:*)",
            "statusMessage": "Secret-scanning before push (gitleaks)…"
          }
        ]
      }
    ]
  }
}
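
The plan entry for this hook describes shlex-based push detection that does not fail open on the `--no-pager`/`-c`/`-C` forms. The sketch below illustrates that idea only; it is not the contents of `scripts/hooks/gitleaks_prepush_guard.py`. The names `find_git_push` and `OPTS_WITH_VALUE` are hypothetical, and the sketch assumes a single `git` invocation rather than a chained shell command.

```python
import shlex

# Global git options that take their value as the NEXT token (assumed subset).
OPTS_WITH_VALUE = {"-C", "-c", "--git-dir", "--work-tree", "--namespace"}


def find_git_push(command):
    """Return (is_push, repo_dir) for one shell command string.

    repo_dir is the value of the last `-C <path>` seen before the
    subcommand, or None when no `-C` was given.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: not parseable, so report "not a push" here.
        # A real guard may prefer to block instead (assumption).
        return (False, None)
    if not tokens or tokens[0] != "git":
        return (False, None)

    repo_dir = None
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in OPTS_WITH_VALUE:
            # Skip the option AND its value so a value such as
            # "alias.x=push" is never mistaken for the subcommand.
            if tok == "-C" and i + 1 < len(tokens):
                repo_dir = tokens[i + 1]
            i += 2
            continue
        if tok.startswith("-"):
            # Flag without a separate value, e.g. --no-pager.
            i += 1
            continue
        # First non-option token is the git subcommand.
        return (tok == "push", repo_dir)
    return (False, None)
```

Walking the global options before looking for the subcommand is what keeps `git --no-pager -C some/repo push` from slipping past a naive `startswith("git push")` check, and the captured `-C` value tells the caller which repository to scan.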
9 changes: 9 additions & 0 deletions .github/workflows/ci-test.yml
@@ -144,6 +144,15 @@ jobs:
          python -m pip install --upgrade pip
          python -m pip install -e ".[dev]"

      # Required so the gitleaks-gated secret-scan guard tests actually run in CI
      # instead of silently skipping (they assert the guard blocks real keys).
      - name: Install gitleaks
        run: |
          VERSION=8.30.1
          curl -fsSL "https://github.com/gitleaks/gitleaks/releases/download/v${VERSION}/gitleaks_${VERSION}_linux_x64.tar.gz" \
            | sudo tar -xz -C /usr/local/bin gitleaks
          gitleaks version

      - name: Run integration tests
        run: pytest -q tests/integration --cov --cov-report=

6 changes: 4 additions & 2 deletions .gitignore
@@ -211,8 +211,10 @@ __marimo__/
 # Local agent overrides
 LOCAL_AGENTS.md

-# Claude Code
-.claude/
+# Claude Code — ignore everything under .claude/ except the shared, checked-in
+# settings.json (which carries the team-wide secret-scan pre-push hook).
+.claude/*
+!.claude/settings.json

 .DS_Store

51 changes: 51 additions & 0 deletions .gitleaks.toml
@@ -0,0 +1,51 @@
# gitleaks config for denbust — used by the git pre-push hook, the Claude Code
# pre-push hook, and the state-run push guard (scripts/state-run.sh).
#
# Strategy: the industry default ruleset, plus a strict (no-entropy) rule for
# Google API keys (the kind that leaked — gitleaks' default rule applies an
# entropy gate that can miss them).
#
# Allowlisting is deliberately NARROW. The seed-time leak rode in as a Google
# key inside a discovery `errors[]` field — i.e. *inside* the candidate-data
# JSONL. So we must NOT blanket-skip those files: a blanket path allowlist would
# blind the scanner at exactly the incident's location. Instead we suppress only
# the catch-all `generic-api-key` rule (which false-positives on news URLs/titles/
# snippets) on the bulk data paths, while every high-signal provider rule — our
# strict Google rule, `jwt` (the Supabase-JWT incident class), AWS/GitHub/Slack/
# etc. — stays ACTIVE there. Captured third-party HTML fixtures additionally trip
# `jwt` (foreign ad/analytics tokens, not our secrets), so they suppress both.

title = "denbust"

[extend]
useDefault = true

[[rules]]
id = "google-api-key-strict"
description = "Google API key (e.g. Custom Search) — matched without an entropy gate"
regex = '''AIza[0-9A-Za-z\-_]{35}'''
keywords = ["AIza"]

# Bulk news-candidate data (article URLs/titles/snippets). Suppress ONLY the
# generic/entropy catch-all; keep every provider-key rule active so a real key or
# JWT captured into this stream — the incident vector — is still caught.
[[allowlists]]
description = "News-candidate data — suppress generic-api-key noise only"
targetRules = ["generic-api-key"]
paths = [
    '''.*/candidates/latest_candidates\.jsonl$''',
    '''.*/candidates/retry_queue\.jsonl$''',
    '''.*/candidates/backfill_queue\.jsonl$''',
    '''.*/candidates/candidate_provenance\.jsonl$''',
    '''.*/candidates/scrape_attempts\.jsonl$''',
    '''.*/candidates/triage_decisions\.jsonl$''',
    '''.*/candidates/engine_query_cache/.*''',
    '''.*/prefilter/.*''',
]

# Captured-page test fixtures carry third-party ad/analytics tokens (not ours)
# that trip both the generic catch-all and the JWT rule.
[[allowlists]]
description = "Captured-page test fixtures — third-party tokens, not our secrets"
targetRules = ["generic-api-key", "jwt"]
paths = ['''tests/fixtures/.*''']
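
To see what the `google-api-key-strict` pattern accepts, the sketch below applies the same regex with Python's `re` module. gitleaks evaluates the pattern with Go's regex engine, so this is only an approximation of its behaviour; the key is synthetic, and the error string is an illustrative stand-in for the CSE-403 URL from the incident, not real output.

```python
import re

# Same pattern as the google-api-key-strict rule: the literal prefix "AIza"
# followed by exactly 35 characters from [0-9A-Za-z_-]. No entropy check.
GOOGLE_KEY_STRICT = re.compile(r"AIza[0-9A-Za-z\-_]{35}")


def find_google_keys(text):
    """Return every substring of text that has the Google API key shape."""
    return GOOGLE_KEY_STRICT.findall(text)


# Synthetic, deliberately low-entropy key: an entropy-gated rule could skip
# it, while a purely shape-based rule still matches.
fake_key = "AIza" + "a" * 35

# Illustrative error string echoing the key as a URL query parameter.
error = f"403 from https://www.googleapis.com/customsearch/v1?key={fake_key}&q=x"

matches = find_google_keys(error)
```

Because the rule checks shape alone, it flags any 39-character token with the `AIza` prefix, which is the trade-off the config comments accept in exchange for not missing a real key.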
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -19,6 +19,14 @@ repos:
      - id: ruff
      - id: ruff-format

  # Secret scan before push (the "general git" pre-push hook). gitleaks uses the
  # repo-root .gitleaks.toml. Runs on pre-push so it gates what leaves the machine.
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.30.1
    hooks:
      - id: gitleaks
        stages: [pre-push]

  - repo: local
    hooks:
      - id: mypy
36 changes: 36 additions & 0 deletions AGENTS.md
@@ -144,6 +144,42 @@ scraping budget. Full protocol: `docs/batch_scraping_protocol.md`.
- `DENBUST_*`
- Never add secrets to fixtures, examples, docs, or workflow YAML.

## Secret Scanning (defense in depth)

Secrets are scanned with [gitleaks](https://github.com/gitleaks/gitleaks) (the
industry tool, not ad-hoc regex). Install it before pushing:

```bash
brew install gitleaks # or: see github.com/gitleaks/gitleaks releases
```

Configuration lives in the repo-root `.gitleaks.toml`: the default ruleset plus
a no-entropy Google API-key rule. Allowlisting is **deliberately narrow** —
because the seed-time leak rode in *inside* the candidate-data JSONL, those
paths are **not** blanket-skipped. We suppress only the catch-all
`generic-api-key` rule there (it false-positives on news URLs/titles/snippets);
every provider rule (the strict Google rule, `jwt`, AWS/GitHub/Slack/…) stays
active, so a real key or JWT captured into the candidate stream is still caught.
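
The per-rule allowlisting described above can be modelled in a few lines of Python. This is a simplified illustration of the intended policy, not gitleaks' own matching logic; `ALLOWLISTS` and `is_suppressed` are hypothetical names, and only a subset of the configured paths is shown.

```python
import re

# Simplified model of the two [[allowlists]] blocks in .gitleaks.toml:
# each entry suppresses ONLY the listed rules, and only on matching paths.
ALLOWLISTS = [
    {
        "rules": {"generic-api-key"},
        "paths": [
            r".*/candidates/latest_candidates\.jsonl$",
            r".*/prefilter/.*",
        ],
    },
    {
        "rules": {"generic-api-key", "jwt"},
        "paths": [r"tests/fixtures/.*"],
    },
]


def is_suppressed(rule_id, path):
    """True if a finding from rule_id at path would be allowlisted."""
    for entry in ALLOWLISTS:
        if rule_id not in entry["rules"]:
            continue
        if any(re.search(pattern, path) for pattern in entry["paths"]):
            return True
    return False
```

Under this model a `generic-api-key` hit in `state/candidates/latest_candidates.jsonl` is dropped, while a `google-api-key-strict` or `jwt` hit in the same file is still reported, which is the property the narrow allowlist is meant to preserve.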

Three guards run gitleaks against this config:

1. **state-run push guard** (`scripts/state-run.sh`) — the **enforced** layer
(runs locally *and* in CI). Scans the *staged* diff before each commit/push to
the public state repo and **fails closed**: if gitleaks is missing or
`.gitleaks.toml` cannot be found it refuses to push rather than degrade to
gitleaks' weaker defaults. Override only in an emergency with
`STATE_RUN_SKIP_SECRET_SCAN=1` (not recommended).
2. **git pre-push hook** (`pre-commit`, `stages: [pre-push]`) — best-effort, gates
ordinary `git push`. Enable with `pre-commit install --install-hooks` (the
repo sets `default_install_hook_types: [pre-commit, pre-push]`).
3. **Claude Code pre-push hook** (`.claude/settings.json` → `scripts/hooks/gitleaks_prepush_guard.py`) —
best-effort; a `PreToolUse`/`Bash` hook that blocks an agent-issued `git push`
when gitleaks finds secrets in the pushed repo's tracked content.

The guard's blocking behaviour is regression-tested in
`tests/integration/test_state_run.py`; CI installs gitleaks so those tests run
rather than skip.
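
The fail-closed behaviour of the enforced layer can be summarised as a small decision function. The real guard is the shell function `scan_state_for_secrets()` in `scripts/state-run.sh`; the Python below is only a sketch of the policy described here. `scan_decision` and its return strings are hypothetical, and the order of the checks is an assumption.

```python
import os
import shutil


def scan_decision(config_path, env=None, which=shutil.which):
    """Decide what the push guard should do before a state push.

    Returns "skip", "scan", or a string starting with "refuse".
    `env` and `which` are injectable so the policy can be tested
    without touching the real environment.
    """
    env = os.environ if env is None else env

    # Discouraged emergency override named in the docs above.
    if env.get("STATE_RUN_SKIP_SECRET_SCAN") == "1":
        return "skip"

    # Fail closed: a missing scanner must block the push, not allow it.
    if which("gitleaks") is None:
        return "refuse: gitleaks not installed"

    # Fail closed: without the repo config the scan would fall back to
    # gitleaks' defaults, which lack the strict Google rule.
    if not os.path.isfile(config_path):
        return "refuse: config not found"

    return "scan"
```

The point of the policy is that every unexpected condition ends in a refusal, so the only way to push without a scan is the explicit override.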

## Testing Constraints

- No live network calls in tests.