diff --git a/.agent-plan.md b/.agent-plan.md index d3fa467..fca4306 100644 --- a/.agent-plan.md +++ b/.agent-plan.md @@ -6,19 +6,20 @@ ## Mainline Status -- Last merged PR on main: `#217` — secret redaction in persisted discovery state, the root-cause - fix for the seed-time secret-leak incident (below). `redact_secrets()` strips the literal values - of credential-named env vars (`DENBUST_*`/`ANTHROPIC_API_KEY`/Supabase/object-store/Kaggle/HF — - primary, format-agnostic) plus credential shapes (URL key params, JWTs, header tokens, JSON - secret fields, `AIza`/`Bearer`/`sk-` — backstop) from every discovery error string and from the - run/backfill-batch/metrics snapshot writers, so an API error that echoes a key never reaches - state. Threat-model tested across the project's secret types. This is the last step of the - search-backstop code (`UNIFY-PR-05`) plus the incident fix. +- Last merged PR on main: `#220` (`GUARD-PR-SECRET-SCAN`, closes #218) — the three-layer + [gitleaks](https://github.com/gitleaks/gitleaks) secret-scan guard (the outer defense following + the seed-time leak incident below): a shared `.gitleaks.toml`, a `pre-commit` pre-push hook, a + fail-closed `scripts/state-run.sh` scan before each state push, and a Claude Code + `PreToolUse`/`Bash` hook that blocks an agent-issued `git push` carrying a secret. Builds on the + root-cause fix `#217`, which made `redact_secrets()` strip credential values (env-var literals — + primary, format-agnostic — plus URL/JWT/header/`AIza`/`Bearer`/`sk-` shapes — backstop) from every + discovery error string and the run/backfill-batch/metrics snapshot writers, so an API error that + echoes a key never reaches state. - Next planned PR: `UNIFY-PR-06` (operational, go-live) — re-enable only the non-scraping workflows on schedules (discover ≥daily for the backstop, daily-review, monthly-report, release, backup, squash); the scraping ingest / backfill-scrape jobs stay local since GitHub never scrapes. 
- Deferred until **(a)** the state-push secret-scan guard (issue #218) is in place and **(b)** a - manual dispatch verifies the seeded state end to end — both prompted by the incident below. The + Deferred until a manual dispatch verifies the seeded state end to end — prompted by the incident + below. The state-push secret-scan guard (issue #218) has landed (`GUARD-PR-SECRET-SCAN`). The state repo `DataHackIL/tfht_enforce_idx_state` is **seeded** from local `data/news_items` (27,568 candidates + queues/attempts/verdicts/budget/yield + backfill_batches/runs/metrics, recovered from orphaned `.jsonl.gz` to plain JSONL; excluded: prefilter models + decision-log @@ -331,13 +332,29 @@ defers to a recent local search regardless of clock ordering). A zero-run day now finishes non-fatal. Covered by ledger + config + discover-job tests. +- [done] `GUARD-PR-SECRET-SCAN` (#220, closes #218): three-surface [gitleaks](https://github.com/gitleaks/gitleaks) + secret-scan guard (the industry tool, not ad-hoc regex), the structural follow-up to the seed-time + leak incident below. A repo-root `.gitleaks.toml` (default ruleset + a no-entropy `AIza` Google + rule) with a **narrow, per-rule allowlist**: because the leak rode in *inside* the candidate-data + JSONL, those paths are not blanket-skipped — only the catch-all `generic-api-key` rule is suppressed + there (and `jwt` too on captured fixtures), so a real key or JWT in the candidate stream is still + caught. 
Three guards: **(1)** the enforced `scripts/state-run.sh` `scan_state_for_secrets()`, + which scans the **staged** diff before every commit/push, **fails closed** when gitleaks or the + config is absent (`STATE_RUN_SKIP_SECRET_SCAN=1` is the discouraged escape hatch), and distinguishes + leaks from operational errors; **(2)** a best-effort `pre-commit` `gitleaks` hook at + `stages: [pre-push]`; **(3)** a best-effort Claude Code `PreToolUse`/`Bash` hook + (`.claude/settings.json` → `scripts/hooks/gitleaks_prepush_guard.py`) with shlex-based push + detection (no fail-open on `--no-pager`/`-c`/`-C` forms), per-`-C` target scanning, and an + explicit script-relative `--config`. CI installs gitleaks so the guard's blocking tests run rather + than skip. Documented under AGENTS.md “Secret Scanning”. This satisfies gate (a) of `UNIFY-PR-06`. - [next] `UNIFY-PR-06` (operational, go-live): the state repo is **seeded** (key-scrubbed after the incident below) with the recovered core state from local `data/news_items` (27,568 candidates + queues/attempts/verdicts/budget/yield + backfill_batches/runs/metrics; excluded: prefilter models + decision logs, engine_query_cache, and the 119 MB candidate_provenance which exceeds GitHub's - 100 MB file limit). Remaining: re-enable only the non-scraping workflows on schedules (discover - ≥daily, daily-review, monthly-report, release, backup, squash) — gated on the state-push - secret-scan guard (issue #218) plus a manual dispatch verifying the seeded state end to end. + 100 MB file limit). The state-push secret-scan guard (`GUARD-PR-SECRET-SCAN`, #218) is now in + place. Remaining: re-enable only the non-scraping workflows on schedules (discover ≥daily, + daily-review, monthly-report, release, backup, squash) — gated only on a manual dispatch + verifying the seeded state end to end. Scraping ingest / backfill-scrape stay local. 
Seed-time incident: a live Google CSE key was captured into a run's `errors[]` from a CSE-403 URL and seeded to the public repo; it was rotated, the public history was purged, and redaction landed (#217). (Parked scrape→classify decouple: diff --git a/.claude/settings.json b/.claude/settings.json new file mode 100644 index 0000000..58f86ed --- /dev/null +++ b/.claude/settings.json @@ -0,0 +1,17 @@ +{ + "hooks": { + "PreToolUse": [ + { + "matcher": "Bash", + "hooks": [ + { + "type": "command", + "command": "python3 \"$CLAUDE_PROJECT_DIR/scripts/hooks/gitleaks_prepush_guard.py\"", + "if": "Bash(git push:*)", + "statusMessage": "Secret-scanning before push (gitleaks)…" + } + ] + } + ] + } +} diff --git a/.github/workflows/ci-test.yml b/.github/workflows/ci-test.yml index b7f1b39..e7b405f 100644 --- a/.github/workflows/ci-test.yml +++ b/.github/workflows/ci-test.yml @@ -144,6 +144,15 @@ jobs: python -m pip install --upgrade pip python -m pip install -e ".[dev]" + # Required so the gitleaks-gated secret-scan guard tests actually run in CI + # instead of silently skipping (they assert the guard blocks real keys). + - name: Install gitleaks + run: | + VERSION=8.30.1 + curl -fsSL "https://github.com/gitleaks/gitleaks/releases/download/v${VERSION}/gitleaks_${VERSION}_linux_x64.tar.gz" \ + | sudo tar -xz -C /usr/local/bin gitleaks + gitleaks version + - name: Run integration tests run: pytest -q tests/integration --cov --cov-report= diff --git a/.gitignore b/.gitignore index 2a680be..1d0206c 100644 --- a/.gitignore +++ b/.gitignore @@ -211,8 +211,10 @@ __marimo__/ # Local agent overrides LOCAL_AGENTS.md -# Claude Code -.claude/ +# Claude Code — ignore everything under .claude/ except the shared, checked-in +# settings.json (which carries the team-wide secret-scan pre-push hook). 
+.claude/* +!.claude/settings.json .DS_Store diff --git a/.gitleaks.toml b/.gitleaks.toml new file mode 100644 index 0000000..297b032 --- /dev/null +++ b/.gitleaks.toml @@ -0,0 +1,51 @@ +# gitleaks config for denbust — used by the git pre-push hook, the Claude Code +# pre-push hook, and the state-run push guard (scripts/state-run.sh). +# +# Strategy: the industry default ruleset, plus a strict (no-entropy) rule for +# Google API keys (the kind that leaked — gitleaks' default rule applies an +# entropy gate that can miss them). +# +# Allowlisting is deliberately NARROW. The seed-time leak rode in as a Google +# key inside a discovery `errors[]` field — i.e. *inside* the candidate-data +# JSONL. So we must NOT blanket-skip those files: a blanket path allowlist would +# blind the scanner at exactly the incident's location. Instead we suppress only +# the catch-all `generic-api-key` rule (which false-positives on news URLs/titles/ +# snippets) on the bulk data paths, while every high-signal provider rule — our +# strict Google rule, `jwt` (the Supabase-JWT incident class), AWS/GitHub/Slack/ +# etc. — stays ACTIVE there. Captured third-party HTML fixtures additionally trip +# `jwt` (foreign ad/analytics tokens, not our secrets), so they suppress both. + +title = "denbust" + +[extend] +useDefault = true + +[[rules]] +id = "google-api-key-strict" +description = "Google API key (e.g. Custom Search) — matched without an entropy gate" +regex = '''AIza[0-9A-Za-z\-_]{35}''' +keywords = ["AIza"] + +# Bulk news-candidate data (article URLs/titles/snippets). Suppress ONLY the +# generic/entropy catch-all; keep every provider-key rule active so a real key or +# JWT captured into this stream — the incident vector — is still caught. 
+[[allowlists]] +description = "News-candidate data — suppress generic-api-key noise only" +targetRules = ["generic-api-key"] +paths = [ + '''.*/candidates/latest_candidates\.jsonl$''', + '''.*/candidates/retry_queue\.jsonl$''', + '''.*/candidates/backfill_queue\.jsonl$''', + '''.*/candidates/candidate_provenance\.jsonl$''', + '''.*/candidates/scrape_attempts\.jsonl$''', + '''.*/candidates/triage_decisions\.jsonl$''', + '''.*/candidates/engine_query_cache/.*''', + '''.*/prefilter/.*''', +] + +# Captured-page test fixtures carry third-party ad/analytics tokens (not ours) +# that trip both the generic catch-all and the JWT rule. +[[allowlists]] +description = "Captured-page test fixtures — third-party tokens, not our secrets" +targetRules = ["generic-api-key", "jwt"] +paths = ['''tests/fixtures/.*'''] diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 0a6ca6e..87689e5 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -19,6 +19,14 @@ repos: - id: ruff - id: ruff-format + # Secret scan before push (the "general git" pre-push hook). gitleaks uses the + # repo-root .gitleaks.toml. Runs on pre-push so it gates what leaves the machine. + - repo: https://github.com/gitleaks/gitleaks + rev: v8.30.1 + hooks: + - id: gitleaks + stages: [pre-push] + - repo: local hooks: - id: mypy diff --git a/AGENTS.md b/AGENTS.md index d7fcd27..0dc2d98 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -144,6 +144,42 @@ scraping budget. Full protocol: `docs/batch_scraping_protocol.md`. - `DENBUST_*` - Never add secrets to fixtures, examples, docs, or workflow YAML. +## Secret Scanning (defense in depth) + +Secrets are scanned with [gitleaks](https://github.com/gitleaks/gitleaks) (the +industry tool, not ad-hoc regex). Install it before pushing: + +```bash +brew install gitleaks # or: see github.com/gitleaks/gitleaks releases +``` + +Configuration lives in the repo-root `.gitleaks.toml`: the default ruleset plus +a no-entropy Google API-key rule. 
Allowlisting is **deliberately narrow** — +because the seed-time leak rode in *inside* the candidate-data JSONL, those +paths are **not** blanket-skipped. We suppress only the catch-all +`generic-api-key` rule there (it false-positives on news URLs/titles/snippets); +every provider rule (the strict Google rule, `jwt`, AWS/GitHub/Slack/…) stays +active, so a real key or JWT captured into the candidate stream is still caught. + +Three guards run gitleaks against this config: + +1. **state-run push guard** (`scripts/state-run.sh`) — the **enforced** layer + (runs locally *and* in CI). Scans the *staged* diff before each commit/push to + the public state repo and **fails closed**: if gitleaks is missing or + `.gitleaks.toml` cannot be found it refuses to push rather than degrade to + gitleaks' weaker defaults. Override only in an emergency with + `STATE_RUN_SKIP_SECRET_SCAN=1` (not recommended). +2. **git pre-push hook** (`pre-commit`, `stages: [pre-push]`) — best-effort, gates + ordinary `git push`. Enable with `pre-commit install --install-hooks` (the + repo sets `default_install_hook_types: [pre-commit, pre-push]`). +3. **Claude Code pre-push hook** (`.claude/settings.json` → `scripts/hooks/gitleaks_prepush_guard.py`) — + best-effort; a `PreToolUse`/`Bash` hook that blocks an agent-issued `git push` + when gitleaks finds secrets in the pushed repo's tracked content. + +The guard's blocking behaviour is regression-tested in +`tests/integration/test_state_run.py`; CI installs gitleaks so those tests run +rather than skip. + ## Testing Constraints - No live network calls in tests. diff --git a/scripts/hooks/gitleaks_prepush_guard.py b/scripts/hooks/gitleaks_prepush_guard.py new file mode 100755 index 0000000..46c419f --- /dev/null +++ b/scripts/hooks/gitleaks_prepush_guard.py @@ -0,0 +1,179 @@ +#!/usr/bin/env python3 +"""Claude Code PreToolUse hook: secret-scan before a `git push`. + +Reads the PreToolUse event JSON on stdin. 
When the Bash command is a ``git +push``, it runs gitleaks over the target repo's tracked content (using the +repo-root ``.gitleaks.toml``) and **blocks** the push (exit code 2) when secrets +are found, printing the redacted findings so Claude can fix them instead of +publishing. + +This is the best-effort "Claude ability" pre-push guard. The *enforced* secret +scanning lives in the ``scripts/state-run.sh`` push guard; the git pre-push +hook (pre-commit) and this hook are best-effort nets that stop a secret from +being pushed in the first place. + +Design notes: + * Push detection is shlex-tokenized across ``&&``/``;``/``|`` and tolerates git + global options (``--no-pager``, ``-c k=v``, ``-C <dir>``, env-var prefixes), + so it does not fail *open* on a push form an over-specific regex would miss. + * Each detected push is scanned in the directory named by its ``-C <dir>`` + (default the current dir), not a hardcoded ``.``. + * The config is resolved relative to this script and passed explicitly, so the + strict ruleset is always applied rather than relying on cwd auto-discovery. + +Exit codes (Claude Code hook protocol): 0 = allow, 2 = block. +""" + +from __future__ import annotations + +import json +import shlex +import shutil +import subprocess +import sys +import tempfile +from pathlib import Path + +# .gitleaks.toml lives at the repo root: scripts/hooks/ -> parents[2]. +_CONFIG = Path(__file__).resolve().parents[2] / ".gitleaks.toml" + +# git global options that consume the following token as their argument. +_OPTS_WITH_ARG = {"-C", "-c", "--git-dir", "--work-tree", "--namespace", "--exec-path"} + + +def _push_targets(command: str) -> list[str]: + """Return the scan directory for each `git push` in a shell command. 
+ + Splits on shell separators, then for each segment walks tokens: skip a + leading run of ``VAR=value`` env assignments, require ``git``, consume git + global options (tracking ``-C <dir>``), and if the resulting subcommand is + ``push`` record the target dir. Returns ``["."]`` as a conservative fallback + when the command cannot be tokenized but clearly contains a git push. + """ + try: + tokens = shlex.split(command, comments=False) + except ValueError: + # Unbalanced quotes etc. — be conservative: if it smells like a push, + # scan the current dir rather than waving it through. + return ["."] if ("git" in command and "push" in command) else [] + + targets: list[str] = [] + segment: list[str] = [] + for tok in (*tokens, "&&"): # sentinel flushes the final segment + if tok in ("&&", "||", "|", ";", "&", "\n"): + target = _push_target_for_segment(segment) + if target is not None: + targets.append(target) + segment = [] + else: + segment.append(tok) + return targets + + +def _push_target_for_segment(tokens: list[str]) -> str | None: + i = 0 + # Skip leading env-var assignments (e.g. `GIT_DIR=… git push`). + while i < len(tokens) and "=" in tokens[i] and not tokens[i].startswith("-"): + i += 1 + if i >= len(tokens) or tokens[i] != "git": + return None + i += 1 + cwd = "." + while i < len(tokens): + tok = tokens[i] + if tok == "-C" and i + 1 < len(tokens): + cwd = tokens[i + 1] + i += 2 + continue + if tok in _OPTS_WITH_ARG and i + 1 < len(tokens): + i += 2 + continue + if tok.startswith("-"): + i += 1 + continue + return cwd if tok == "push" else None + return None + + +def _scan(target: str) -> tuple[bool, str]: + """Scan ``target`` with gitleaks. 
Returns (has_leaks, redacted_details).""" + config_arg = ["--config", str(_CONFIG)] if _CONFIG.is_file() else [] + if not config_arg: + print( + f"gitleaks-prepush: WARNING — {_CONFIG} not found; scanning with gitleaks " + "defaults, which miss the low-entropy Google key class.", + file=sys.stderr, + ) + with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tf: + report = tf.name + try: + scan = subprocess.run( + [ + "gitleaks", + "git", + target, + *config_arg, + "--no-banner", + "--redact", + "--report-format", + "json", + "--report-path", + report, + ], + capture_output=True, + text=True, + ) + if scan.returncode == 0: + return False, "" + body = Path(report).read_text() if Path(report).exists() else "" + if '"RuleID"' in body: + return True, (scan.stderr or body).strip()[-2000:] + # Non-zero without findings = gitleaks operational error (e.g. not a git + # repo). Do not block the agent on a tool error; warn instead. + print( + f"gitleaks-prepush: gitleaks errored on '{target}' (exit " + f"{scan.returncode}); not blocking. {scan.stderr.strip()[-300:]}", + file=sys.stderr, + ) + return False, "" + finally: + Path(report).unlink(missing_ok=True) + + +def main() -> int: + try: + event = json.load(sys.stdin) + except (json.JSONDecodeError, ValueError): + return 0 # not parseable -> do not interfere + + if event.get("tool_name") != "Bash": + return 0 + command = (event.get("tool_input") or {}).get("command", "") + targets = _push_targets(command) + if not targets: + return 0 + + if shutil.which("gitleaks") is None: + print( + "gitleaks-prepush: gitleaks not installed; skipping pre-push secret scan " + "(install: 'brew install gitleaks').", + file=sys.stderr, + ) + return 0 + + for target in targets: + has_leaks, details = _scan(target) + if has_leaks: + print( + "BLOCKED by gitleaks-prepush: potential secrets detected in tracked " + f"content of '{target}'; refusing `git push`. Remove/rotate the secret " + "(and purge it from history) before pushing. 
Findings (redacted):\n" + details, + file=sys.stderr, + ) + return 2 # block the tool call + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/state-run.sh b/scripts/state-run.sh index fcf402a..6f2effa 100755 --- a/scripts/state-run.sh +++ b/scripts/state-run.sh @@ -123,6 +123,61 @@ push_state() { done } +# Secret-scan guard: refuse to commit/push state that gitleaks flags. The leak +# that motivated this rode in as a key inside a discovery `errors[]` field; +# redaction scrubs those at write time, and this is the enforced defense-in-depth +# net at the push boundary. It scans the *staged* diff (what is about to be +# committed), not the whole working tree, so it does not block on a pre-existing +# secret in an unchanged file and does not re-scan the full candidate store. +# +# Fails closed: if gitleaks is missing or the config cannot be found, it refuses +# to push rather than degrade to gitleaks' default ruleset (which misses the +# low-entropy Google key class our strict rule exists for). Override only in an +# emergency with STATE_RUN_SKIP_SECRET_SCAN=1. +scan_state_for_secrets() { + if [[ "${STATE_RUN_SKIP_SECRET_SCAN:-0}" == "1" ]]; then + echo "state-run: WARNING — secret scan skipped (STATE_RUN_SKIP_SECRET_SCAN=1)." >&2 + return 0 + fi + if ! command -v gitleaks >/dev/null 2>&1; then + echo "state-run: gitleaks not installed — refusing to push state unscanned." >&2 + echo " install it (e.g. 'brew install gitleaks') or, only if you must," >&2 + echo " set STATE_RUN_SKIP_SECRET_SCAN=1 to override (NOT recommended)." >&2 + return 1 + fi + local repo_root cfg + repo_root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" + cfg="$repo_root/.gitleaks.toml" + if [[ ! -f "$cfg" ]]; then + echo "state-run: .gitleaks.toml not found at $cfg — refusing to push." >&2 + echo " scanning with gitleaks defaults would miss the low-entropy Google key" >&2 + echo " class this config exists to catch. Restore the file before pushing." 
>&2 + return 1 + fi + + local report rc=0 + report="$(mktemp -t state-run-gitleaks.XXXXXX.json)" + # Scan only what is staged. gitleaks exit: 0 = clean, 1 = leaks found, other = + # operational error (which we also treat as fail-closed). + gitleaks git --staged "$STATE_REPO_DIR" \ + --config "$cfg" --no-banner --redact \ + --report-format json --report-path "$report" >/dev/null 2>&1 || rc=$? + + if [[ "$rc" -eq 0 ]]; then + rm -f "$report" + return 0 + fi + + if [[ -s "$report" ]] && grep -q '"RuleID"' "$report" 2>/dev/null; then + echo "state-run: SECRET DETECTED in staged state — refusing to commit/push." >&2 + grep -oE '"(RuleID|File|StartLine)": *[^,}]*' "$report" | head -40 >&2 || true + else + echo "state-run: gitleaks errored (exit $rc) — refusing to push (fail-closed)." >&2 + fi + rm -f "$report" + return 1 +} + acquire_state_repo_lock # 1. Bring the state repo to canonical HEAD. @@ -149,6 +204,10 @@ fi if [[ -z "$(git_state diff --cached --name-only)" ]]; then echo "state-run: no state changes to commit." else + if ! scan_state_for_secrets; then + echo "state-run: aborting before commit/push — secret scan failed." 
>&2 + exit 1 + fi git_state config user.name "${GIT_AUTHOR_NAME:-github-actions[bot]}" git_state config user.email "${GIT_AUTHOR_EMAIL:-41898282+github-actions[bot]@users.noreply.github.com}" git_state commit -m "${message:-chore(state): update ${subtrees[*]:-state}}" diff --git a/tests/integration/test_state_run.py b/tests/integration/test_state_run.py index eb43efd..12fafac 100644 --- a/tests/integration/test_state_run.py +++ b/tests/integration/test_state_run.py @@ -57,6 +57,7 @@ def _run_wrapper( message: str | None = None, offline: bool = False, no_fetch: bool = False, + scan_secrets: bool = False, ) -> subprocess.CompletedProcess[str]: args: list[str] = ["bash", str(WRAPPER)] for subtree in subtrees: @@ -76,6 +77,10 @@ def _run_wrapper( "GIT_AUTHOR_NAME": "tester", "GIT_AUTHOR_EMAIL": "tester@example.com", } + # The secret-scan guard fails closed when gitleaks is absent; tests that + # exercise wrapper mechanics (not the guard) opt out so they run anywhere. + if not scan_secrets: + env["STATE_RUN_SKIP_SECRET_SCAN"] = "1" return subprocess.run(args, env=env, capture_output=True, text=True) @@ -253,3 +258,107 @@ def test_rejected_push_recovers_via_refetch_rebase(tmp_path: Path) -> None: assert _remote_commit_count(remote) == 3 assert _remote_file(remote, "news_items/discover/other.jsonl") == "other" # not clobbered assert _remote_file(remote, "news_items/discover/mine.jsonl") == "mine" + + +_GITLEAKS = shutil.which("gitleaks") is not None +# A correctly-shaped fake Google key (AIza + 35 chars) — matches the strict rule +# in .gitleaks.toml. No real key is in this file. +_FAKE_GOOGLE_KEY = "AIza" + "B1cD3fGh4JkLmN0pQrStUvWxYz123456789" +# A structurally-valid but meaningless JWT (the Supabase-JWT incident class). +_FAKE_JWT = ( + "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c" +) +# A high-entropy assignment that trips gitleaks' catch-all `generic-api-key` +# rule — the kind of false positive news text produces. 
Suppressed on candidate +# paths, so it must NOT block a push. +_GENERIC_NOISE = "api_key = 8f3Hq9ZxR2bN7vKpL4wYtCgD6sErA1mU0oP" + + +@pytest.mark.skipif(not _GITLEAKS, reason="gitleaks not installed") +def test_secret_in_state_is_blocked_before_push(tmp_path: Path) -> None: + """A secret written into state is caught by gitleaks; nothing is committed/pushed.""" + remote = _make_remote(tmp_path) + result = _run_wrapper( + work=tmp_path / "work", + remote=remote, + scan_secrets=True, + subtrees=["news_items/discover"], + command=[ + "bash", + "-c", + f'echo \'{{"errors":["?key={_FAKE_GOOGLE_KEY}"]}}\' ' + '> "$DENBUST_STATE_ROOT/news_items/discover/run.json"', + ], + ) + assert result.returncode != 0 + assert "SECRET DETECTED" in result.stderr + assert _remote_commit_count(remote) == 1 # push blocked + assert _FAKE_GOOGLE_KEY not in result.stdout + result.stderr # not echoed (redacted) + + +@pytest.mark.skipif(not _GITLEAKS, reason="gitleaks not installed") +def test_clean_state_passes_the_secret_scan(tmp_path: Path) -> None: + """Clean state passes the scan and is pushed normally.""" + remote = _make_remote(tmp_path) + result = _run_wrapper( + work=tmp_path / "work", + remote=remote, + scan_secrets=True, + subtrees=["news_items/discover"], + command=[ + "bash", + "-c", + 'echo \'{"status":"ok","errors":[]}\' ' + '> "$DENBUST_STATE_ROOT/news_items/discover/run.json"', + ], + ) + assert result.returncode == 0, result.stderr + assert _remote_commit_count(remote) == 2 # clean push succeeded + + +@pytest.mark.skipif(not _GITLEAKS, reason="gitleaks not installed") +def test_generic_noise_in_candidate_data_is_not_false_flagged(tmp_path: Path) -> None: + """The generic-api-key catch-all is suppressed on candidate paths, so key-ish + news text does not block a push.""" + remote = _make_remote(tmp_path) + result = _run_wrapper( + work=tmp_path / "work", + remote=remote, + scan_secrets=True, + subtrees=["news_items/discover"], + command=[ + "bash", + "-c", + 'mkdir -p 
"$DENBUST_STATE_ROOT/news_items/discover/candidates"; ' + f'echo \'{{"snippet":"{_GENERIC_NOISE}"}}\' ' + '> "$DENBUST_STATE_ROOT/news_items/discover/candidates/latest_candidates.jsonl"', + ], + ) + assert result.returncode == 0, result.stderr # generic noise suppressed → not flagged + assert _remote_commit_count(remote) == 2 + + +@pytest.mark.skipif(not _GITLEAKS, reason="gitleaks not installed") +def test_real_key_in_candidate_data_is_blocked(tmp_path: Path) -> None: + """Regression for the seed-time incident: a real key/JWT captured *into* + candidate data — the exact leak vector — must still be caught. Only the + generic catch-all is suppressed on these paths, not the provider rules.""" + remote = _make_remote(tmp_path) + for secret in (_FAKE_GOOGLE_KEY, _FAKE_JWT): + result = _run_wrapper( + work=tmp_path / f"work_{secret[:8]}", + remote=remote, + scan_secrets=True, + subtrees=["news_items/discover"], + command=[ + "bash", + "-c", + 'mkdir -p "$DENBUST_STATE_ROOT/news_items/discover/candidates"; ' + f'echo \'{{"url":"x","errors":["403 ?key={secret}"]}}\' ' + '> "$DENBUST_STATE_ROOT/news_items/discover/candidates/latest_candidates.jsonl"', + ], + ) + assert result.returncode != 0, f"{secret[:8]} not blocked" + assert "SECRET DETECTED" in result.stderr + assert _remote_commit_count(remote) == 1 # push blocked, nothing committed + assert secret not in result.stdout + result.stderr # redacted diff --git a/tests/integration/test_state_squash.py b/tests/integration/test_state_squash.py index 1de4638..90a6e52 100644 --- a/tests/integration/test_state_squash.py +++ b/tests/integration/test_state_squash.py @@ -58,6 +58,9 @@ def _env(work: Path, remote: Path) -> dict[str, str]: "STATE_REPO_BRANCH": "main", "GIT_AUTHOR_NAME": "tester", "GIT_AUTHOR_EMAIL": "tester@example.com", + # These tests exercise squash/coexistence mechanics, not the secret-scan + # guard (which fails closed without gitleaks); opt out so they run anywhere. + "STATE_RUN_SKIP_SECRET_SCAN": "1", }