From b15bd35bb6a24da684b42c998032bfc458d90da3 Mon Sep 17 00:00:00 2001 From: Ahmet Abdullah Gultekin Date: Tue, 12 May 2026 17:40:33 +0000 Subject: [PATCH 1/2] infra(traefik+ops): XFF strip + OPERATOR_ACTIONS 2026-05-12 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 hygiene from 2026-05-12 senior reviews (backend, DB, infra, security): * infra/traefik: vendored copy of /opt/projects/infra/traefik/config/ with forwardedHeaders.trustedIPs: [] on both :80 and :443 entryPoints. RateLimitInterceptor.getClientIP in identity-core-api consumes `XFF.split(",")[0]` so the prior config (no forwardedHeaders block) let an attacker bypass every per-IP bucket (login, MFA, biometric, qr-generate) by setting their own X-Forwarded-For. Empty trustedIPs causes Traefik to strip incoming XFF and write its own using the peer IP. Internal Docker bridge (172.20.0.0/24) is NOT trusted because external clients never connect from that range — only Docker-network containers, and those don't set XFF. README.md documents the vendored-vs-live split and the sync workflow. * OPERATOR_ACTIONS_2026-05-12.md: 5 items agents shouldn't autonomously execute. Per-item severity, blast radius, maintenance window, dependencies, explicit commands: 1. audit_logs partman bootstrap (V57 was a silent no-op; runbook at infra/RUNBOOK_AUDIT_LOG_PARTMAN.md prepped Option A image) 2. RLS theatre (V25 left FORCE commented; 9 tables relforcerowsecurity=f; app role is postgres superuser → RLS bypassed) 3. web-app/.env.production still byte-identical to leaked literal 6bdedd2; live bundle is clean but rebuild-from-tree would regress 4. parent main fast-forward: master 220 ahead, main 134 ahead but all already merged via PR #51 — `git push origin master:main --force-with-lease` reconciles 5. 
HS512 kid hs-2026-04 revocation pending Team Auth-Java PR; rebuild api container after merge Companion api PR fix/2026-05-12-infra-hygiene ships V61 NOT NULL for audit_logs.tenant_id (locks down the V59 backfill). Co-Authored-By: Claude Opus 4.7 (1M context) --- OPERATOR_ACTIONS_2026-05-12.md | 370 +++++++++++++++++++++++++++++++ infra/traefik/README.md | 51 +++++ infra/traefik/config/dynamic.yml | 175 +++++++++++++++ infra/traefik/config/traefik.yml | 64 ++++++ 4 files changed, 660 insertions(+) create mode 100644 OPERATOR_ACTIONS_2026-05-12.md create mode 100644 infra/traefik/README.md create mode 100644 infra/traefik/config/dynamic.yml create mode 100644 infra/traefik/config/traefik.yml diff --git a/OPERATOR_ACTIONS_2026-05-12.md b/OPERATOR_ACTIONS_2026-05-12.md new file mode 100644 index 0000000..291a55d --- /dev/null +++ b/OPERATOR_ACTIONS_2026-05-12.md @@ -0,0 +1,370 @@ +# OPERATOR ACTIONS — 2026-05-12 + +Items surfaced by the 2026-05-12 senior reviews (backend, DB, infra, security) +that agents should not autonomously execute. Each is a checklist with explicit +commands, a maintenance-window estimate, and explicit dependencies. Severity +labels: + +- **CRITICAL** — exposes a live, exploitable security or correctness gap. +- **HIGH** — drift between deployed config and committed config; reviewers + cannot reason about prod from code. +- **MEDIUM** — hygiene + cosmetic; safe to defer but easy to land. + +--- + +## 1. audit_logs partitioning — V57 silent no-op (HIGH) + +**Background.** +The Flyway migration `V57__audit_logs_pg_partman.sql` is the one that hands +`public.audit_logs` to the `pg_partman` extension so partitions roll over +monthly with a 24-month retention. V57 runs to `success=t` in +`flyway_schema_history`, but the live postgres image +`pgvector/pgvector:pg17` does not bundle `pg_partman`. The first guard at the +top of V57 detects the missing extension, emits `RAISE WARNING`, and `RETURN`s +before the V40-fallback conversion runs. 
+ +Symptom on prod today (2026-05-12): +- `pg_class.relkind` for `audit_logs` is `'r'` (regular table), not `'p'` + (partitioned). +- 1168 rows in a single heap, no inheritance children. +- `partman.part_config` row for `public.audit_logs` does not exist. + +Memory entry `project_session_20260511` records this: commit `b32ca03` +("infra(scripts+v57): rotation scripts + V57 Option A pg_partman image +preparation") and the untracked file `/opt/projects/infra/RUNBOOK_AUDIT_LOG_PARTMAN.md` +already prep the fix path. The runbook is the authoritative recipe; this +section is the executive summary. + +**Blast radius.** +audit_logs growth becomes painful around 10-20M rows (current is 1168). +There is operational headroom of months at current write rate. Failure mode +when finally addressed = vacuum/index-scan slowdowns + the GDPR/KVKK +24-month purge has to be implemented manually as a `DELETE` instead of +`DROP PARTITION`. No data loss; just latency drift. + +**Maintenance window.** 15-30 minutes; postgres restart required for +`shared_preload_libraries = 'pg_partman_bgw,pg_cron'`. + +**Dependencies.** None on FIVUCSAS code. Operator owns the custom postgres +image build. + +**Suggested execution path (Option A from the runbook).** + +1. Create `/opt/projects/fivucsas/infra/postgres/Dockerfile` per the runbook + (`pgvector/pgvector:pg17` base + `postgresql-17-partman` + + `postgresql-17-cron`). +2. Swap the `image:` in `/opt/projects/fivucsas/docker-compose.prod.yml` + `postgres:` service for a `build:` block pointing at the new Dockerfile. +3. Rebuild: + ```bash + cd /opt/projects/fivucsas + docker compose -f docker-compose.prod.yml --env-file .env.prod build postgres + docker compose -f docker-compose.prod.yml --env-file .env.prod up -d postgres + ``` +4. 
After postgres is healthy, run the partman bootstrap on the existing + non-partitioned table: + ```sql + CREATE EXTENSION IF NOT EXISTS pg_partman; + CREATE EXTENSION IF NOT EXISTS pg_cron; + SELECT partman.create_parent( + p_parent_table := 'public.audit_logs', + p_control := 'created_at', + p_type := 'range', + p_interval := '1 month', + p_premake := 12, + p_start_partition := '2026-01-01' + ); + UPDATE partman.part_config + SET retention = '24 months', + retention_keep_table = false, + retention_keep_index = false + WHERE parent_table = 'public.audit_logs'; + ``` +5. **Alternative** — if you would rather not bootstrap a live table, mark + V57 as failed in `flyway_schema_history` and re-apply once the new + image is in place (Flyway will see the migration as new and run it + end-to-end with partman available). + +**Acceptance check.** +```sql +SELECT parent_table, partition_interval, premake, retention + FROM partman.part_config + WHERE parent_table = 'public.audit_logs'; +-- expect 1 row, interval='1 mon', premake=12, retention='24 months' +``` + +--- + +## 2. RLS theatre — every policy fail-open + app role is superuser (CRITICAL) + +**Background.** +The Flyway migration `V25__row_level_security.sql` enabled Row-Level +Security on 9 tables but left the `FORCE ROW LEVEL SECURITY` line commented +out. Every policy includes a `current_tenant_id() IS NULL` disjunct, which +returns true any time the session has not run `SET app.current_tenant_id`. +The application's JDBC URL connects as the `postgres` superuser, and +superusers bypass RLS unconditionally. Net effect: RLS is ENABLED in +`pg_class.relrowsecurity` but is functionally OFF. 
+ +Verified today: +```sql +SELECT relname, relrowsecurity, relforcerowsecurity + FROM pg_class + WHERE relname IN ('users','tenants','audit_logs','biometric_enrollments', + 'auth_flows','auth_flow_steps','user_enrollments', + 'oauth2_clients','refresh_tokens'); +-- all 9 rows: relrowsecurity=t, relforcerowsecurity=f +``` + +**Blast radius.** +A SQL injection (or a deliberate misuse of `JdbcTemplate.queryForList`) +that omits a `tenant_id =` predicate returns rows from every tenant. The +admin-IP whitelist on `/swagger-ui` and `/actuator` does not help here — +the entry point is the application code itself. + +**Maintenance window.** 30-60 minutes; requires postgres role creation, +GRANT statements, and a JDBC URL flip. Smoke-test downtime ~2 minutes +when the api container restarts. + +**Dependencies.** +- New non-superuser role (call it `fivucsas_app`) created and granted only + what is needed. +- `.env.prod` `SPRING_DATASOURCE_USERNAME` flipped from `postgres` to + `fivucsas_app`. +- After the role swap, `FORCE ROW LEVEL SECURITY` flipped on every + RLS-enabled table. +- Smoke-test all 10 auth methods + tenant admin endpoints in maintenance + window. + +**Suggested execution path.** + +1. Inside the maintenance window: + ```sql + CREATE ROLE fivucsas_app LOGIN PASSWORD ''; + GRANT CONNECT ON DATABASE identity_core TO fivucsas_app; + GRANT USAGE ON SCHEMA public TO fivucsas_app; + GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO fivucsas_app; + GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO fivucsas_app; + ALTER DEFAULT PRIVILEGES IN SCHEMA public + GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO fivucsas_app; + ``` +2. 
Flip FORCE on the 9 RLS tables in a single transaction (Flyway + migration `V62__rls_force.sql` recommended so the change is tracked): + ```sql + ALTER TABLE users FORCE ROW LEVEL SECURITY; + ALTER TABLE tenants FORCE ROW LEVEL SECURITY; + ALTER TABLE audit_logs FORCE ROW LEVEL SECURITY; + ALTER TABLE biometric_enrollments FORCE ROW LEVEL SECURITY; + ALTER TABLE auth_flows FORCE ROW LEVEL SECURITY; + ALTER TABLE auth_flow_steps FORCE ROW LEVEL SECURITY; + ALTER TABLE user_enrollments FORCE ROW LEVEL SECURITY; + ALTER TABLE oauth2_clients FORCE ROW LEVEL SECURITY; + ALTER TABLE refresh_tokens FORCE ROW LEVEL SECURITY; + ``` +3. Drop the `OR current_tenant_id() IS NULL` disjunct from each policy + in the same migration. The application sets `app.current_tenant_id` + via a JDBC interceptor on every transaction; absence means a code + path is wrong and should fail visibly, not return all tenants. +4. Edit `.env.prod`: + ``` + SPRING_DATASOURCE_USERNAME=fivucsas_app + SPRING_DATASOURCE_PASSWORD= + ``` +5. Rebuild + restart: + ```bash + cd /opt/projects/fivucsas/identity-core-api + docker compose -f docker-compose.prod.yml --env-file .env.prod up -d identity-core-api + ``` +6. Smoke-test: log in as a tenant admin from tenant A, hit + `/api/v1/audit-logs`, confirm zero tenant-B rows; same for + `/api/v1/users`. + +**Acceptance check.** +```sql +SELECT relname, relforcerowsecurity FROM pg_class + WHERE relname IN ('users','audit_logs','biometric_enrollments', + 'tenants','auth_flows','auth_flow_steps', + 'user_enrollments','oauth2_clients','refresh_tokens'); +-- all 9 rows: relforcerowsecurity=t +``` + +--- + +## 3. web-app/.env.production still byte-identical to leaked literal (HIGH) + +**Background.** +Commit `6bdedd2` (2026-04-30 morning, since-rotated) committed the +biometric API key plaintext into `web-app/.env.production`. 
The bio-side +key was rotated 2026-04-30 05:05 UTC and confirmed dead — the live value +is now `API_KEY_SECRET=fcb06b7…` (verified by the 2026-05-12 security +review). However the on-disk template at +`/opt/projects/fivucsas/web-app/.env.production` still contains the +leaked literal in `VITE_BIOMETRIC_API_KEY=…` form (2 occurrences, +verified today by `grep -c`). + +Because the variable has the `VITE_*` prefix, any subsequent +`npm run build` from this working tree would inline the dead key into +the bundle. The current production bundle does NOT reference the variable +(audited by the security reviewer) so there is no live exposure today, +but rebuilding-from-this-directory would regress that. + +**Blast radius.** +- Currently zero — the live key has been rotated and the live bundle does + not include the leaked literal. +- If someone rebuilds web-app without replacing the value first, the dead + literal lands back in `dist/` and gets deployed to Hostinger. + +**Maintenance window.** 5 minutes for the file edit. The git-history +rewrite (if pursued) is a coordination cost across collaborators with +local clones, not a maintenance window per se. + +**Dependencies.** +Team Web-Hygiene (separate parallel agent) is editing +`web-app/.env.production` to either a placeholder or the rotated value. +The git-history rewrite decision stays with the operator. + +**Operator decisions required.** + +1. **(a) On-disk value.** Confirm Team Web-Hygiene replaced the literal + with either `VITE_BIOMETRIC_API_KEY=__SET_AT_DEPLOY_TIME__` (placeholder) + or the rotated live value. Recommended: placeholder, so the rotated + key never sits in any tree that ships to GitHub or to a CI cache. + ```bash + grep -n "VITE_BIOMETRIC_API_KEY" /opt/projects/fivucsas/web-app/.env.production + # expect either placeholder or no leaked literal + ``` +2. **(b) Git history rewrite.** Decide whether to expunge `6bdedd2` from + history. 
This is destructive: + - Forces every collaborator to re-clone or run + `git filter-repo`-equivalent locally. + - Invalidates any commit-pinned references in CHANGELOG, PR + descriptions, and external docs. + - Recommended approach if you do pursue it: + ```bash + # WARNING: coordinate with all collaborators first. + cd /opt/projects/fivucsas/web-app + git filter-repo --invert-paths --path .env.production + # then force-push and notify the team. + ``` + - Recommendation: skip the rewrite. The key is dead, the bundle is + clean, and the cost of a force-push to a public repo with five + collaborators outweighs the marginal forensic benefit. + +--- + +## 4. Branch reconciliation: parent main is behind master (HIGH) + +**Background.** +The parent FIVUCSAS monorepo has two branches that should track each +other: + +- `master` — integration branch where PRs land. Today it is 220 commits + ahead of `main`. +- `main` — the GitHub default branch and the marketing target. It is + 134 commits ahead of master in raw `git log` terms, but every one of + those 134 commits was already merged into master via parent PR #51 + (the 2026-05-11 reconciliation PR). + +The 220-commit lead of master is the genuine integration drift; the +134-commit "lead" of main is illusory because they're the same commits +from a different merge angle. + +Verified today: +```bash +cd /opt/projects/fivucsas +git log --oneline main..master | wc -l # 220 +git log --oneline master..main | wc -l # 134 +``` + +Memory entry `project_session_20260511` notes that PR #51 was the +2026-05-11 reconciliation — its 134 commits were brought into master +but the operator deferred the reverse direction. + +**Blast radius.** +- GitHub PR UI defaults base-branch to `main`, so first-time contributors + may target main and have their PR confusingly rebased onto master + later. +- CI workflows that filter on `main` (only) are running against a stale + tree. 
+- Reviewers looking at https://github.com/Rollingcat-Software/FIVUCSAS see + a stale README/CLAUDE.md/ROADMAP. + +**Maintenance window.** 1 minute. Fast-forward push, no PR required. + +**Dependencies.** None on submodules — memory entry +`project_session_20260511` confirms submodule HEADs are already aligned. + +**Suggested execution path.** +```bash +cd /opt/projects/fivucsas +git fetch origin +# Sanity: check whether main is an ancestor of master. If this exits non-zero +# (master..main was 134 above), the push below overwrites main, not a fast-forward. +git merge-base --is-ancestor origin/main origin/master \ + && echo "OK: main is an ancestor of master, fast-forward safe." +# Apply: +git push origin master:main --force-with-lease +``` + +**Acceptance check.** +```bash +git log --oneline master..origin/main | wc -l # expect 0 +git log --oneline origin/main..master | wc -l # expect 0 +``` + +--- + +## 5. HS512 secret revocation — pending Team Auth-Java PR (MEDIUM) + +**Background.** +A historical HS512 JWT signing key (kid `hs-2026-04`) was rotated out of +service. Verification still accepts that kid because `HsKeyRegistry` +retains it for the no-logout rotation pattern (PR #64, 2026-05-04). +Team Auth-Java is shipping an explicit `revoked-kids` list in +`application-prod.yml` so verification refuses `hs-2026-04` outright. +Until that PR merges and the api container is rebuilt, tokens minted +with the leaked secret remain accepted. + +**Blast radius.** +Anyone who held a copy of the leaked HS512 secret can forge a JWT until +the revocation flips. The secret rotation date is the latest known +exposure boundary; the effective compromise window persists until rebuild. + +**Maintenance window.** Zero-downtime; api container rolling restart +~30 seconds. + +**Dependencies.** Team Auth-Java PR must merge first. 
After merge: + +```bash +cd /opt/projects/fivucsas/identity-core-api +git pull +docker compose -f docker-compose.prod.yml --env-file .env.prod build --no-cache identity-core-api +docker compose -f docker-compose.prod.yml --env-file .env.prod up -d identity-core-api +``` + +**Acceptance check.** +After rebuild, attempt verification with a token signed by the revoked +kid (use a stored prod-audit-log JWT from before the rotation if +available): +```bash +curl -sS -H "Authorization: Bearer " \ + https://api.fivucsas.com/api/v1/users/me +# expect 401 with body referencing "kid revoked" or generic invalid +``` + +--- + +## Quick reference: per-item severity + dependency matrix + +| # | Item | Severity | Mtn window | Blocked on | +|---|-------------------------------|----------|-------------|------------------------| +| 1 | audit_logs partman bootstrap | HIGH | 15-30 min | custom postgres image | +| 2 | RLS theatre | CRITICAL | 30-60 min | new postgres role + V62 migration | +| 3 | web-app .env.production leak | HIGH | 5 min | Team Web-Hygiene PR | +| 4 | parent main fast-forward | HIGH | 1 min | nothing | +| 5 | HS512 kid revocation | MEDIUM | rolling | Team Auth-Java PR | + +Recommended order if attacking all five in one session: +4 (instant, unblocks reviewers) → 3 (post-merge verify) → 5 (post-PR +rebuild) → 1 (maintenance window slot) → 2 (longer maintenance window, +covers RLS smoke-test). diff --git a/infra/traefik/README.md b/infra/traefik/README.md new file mode 100644 index 0000000..940e525 --- /dev/null +++ b/infra/traefik/README.md @@ -0,0 +1,51 @@ +# Traefik Config (Vendored Reference) + +The **live** Traefik configuration runs from `/opt/projects/infra/traefik/` +on the Hetzner host. That directory belongs to the `/opt/projects/` local +git repo (no remote) and is the source-of-truth Traefik mounts at runtime +(see `docker-compose.yml`, volumes `./config/traefik.yml` and +`./config/dynamic.yml`). 
+ +This `infra/traefik/` directory inside the FIVUCSAS repo is a **vendored +copy** so reviewers can diff Traefik changes alongside the rest of the +codebase. It is NOT mounted by Traefik directly. + +## Sync workflow + +After merging a change to this directory: + +```bash +# 1. Sync vendored copy -> live config +sudo cp /opt/projects/fivucsas/infra/traefik/config/traefik.yml \ + /opt/projects/infra/traefik/config/traefik.yml +sudo cp /opt/projects/fivucsas/infra/traefik/config/dynamic.yml \ + /opt/projects/infra/traefik/config/dynamic.yml + +# 2. Validate (Traefik watches dynamic.yml live; traefik.yml requires restart) +docker compose -f /opt/projects/infra/traefik/docker-compose.yml \ + --env-file /opt/projects/infra/traefik/.env config + +# 3. Apply +# dynamic.yml changes: zero-restart, picked up via inotify (`watch: true`) +# traefik.yml changes: require container restart +docker compose -f /opt/projects/infra/traefik/docker-compose.yml \ + --env-file /opt/projects/infra/traefik/.env restart traefik + +# 4. Verify access log writes peer IP, not client-supplied XFF +docker logs traefik 2>&1 | tail -20 +``` + +## XFF / Rate-Limit Hardening (2026-05-12) + +`entryPoints.{web,websecure}.forwardedHeaders.trustedIPs: []` ensures +Traefik strips any client-supplied `X-Forwarded-For` and overwrites it +with the connection peer IP. This is required because the backend +`RateLimitInterceptor.getClientIP` (identity-core-api) consumes +`XFF.split(",")[0]` without validating origin. Empty trustedIPs makes +the backend safe regardless of its parsing choice. + +If a CDN or upstream proxy is ever inserted in front of Traefik, add +its egress CIDRs to `trustedIPs`. The internal `proxy` Docker bridge +(`172.20.0.0/24`) is deliberately NOT listed — external clients never +connect from that range; only container-to-Traefik traffic does, and +those callers do not set `X-Forwarded-For`. 
diff --git a/infra/traefik/config/dynamic.yml b/infra/traefik/config/dynamic.yml new file mode 100644 index 0000000..6d565ff --- /dev/null +++ b/infra/traefik/config/dynamic.yml @@ -0,0 +1,175 @@ +http: + routers: + fivucsas-comtr-redirect: + rule: "Host(`fivucsas.com.tr`) || Host(`www.fivucsas.com.tr`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas + service: noop@internal + tls: + certResolver: letsencrypt + + fivucsas-online-redirect: + rule: "Host(`fivucsas.online`) || Host(`www.fivucsas.online`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas + service: noop@internal + tls: + certResolver: letsencrypt + + fivucsas-info-redirect: + rule: "Host(`fivucsas.info`) || Host(`www.fivucsas.info`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas + service: noop@internal + tls: + certResolver: letsencrypt + + fivucsas-www-redirect: + rule: "Host(`www.fivucsas.com`)" + entryPoints: + - websecure + middlewares: + - redirect-www-to-apex + service: noop@internal + tls: + certResolver: letsencrypt + + rollingcat-apex-redirect: + rule: "Host(`rollingcatsoftware.com`) || Host(`www.rollingcatsoftware.com`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas + service: noop@internal + tls: + certResolver: letsencrypt + + rollingcat-ica-redirect: + rule: "Host(`ica-fivucsas.rollingcatsoftware.com`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas-api + service: noop@internal + tls: + certResolver: letsencrypt + + rollingcat-bys-redirect: + rule: "Host(`bys-demo.rollingcatsoftware.com`)" + entryPoints: + - websecure + middlewares: + - redirect-to-fivucsas-demo + service: noop@internal + tls: + certResolver: letsencrypt + + # IN-H2 (2026-04-19): admin surface on api.fivucsas.com. + # Swagger UI, OpenAPI JSON, and Spring actuator endpoints are gated by + # admin-whitelist (IP allowlist). 
The docker-label router `identity-api@docker` + # keeps carrying the full Host(`api.fivucsas.com`) rule for public OAuth/auth/ + # API traffic; Traefik resolves to THIS router for matching admin paths + # because its combined Host+Path rule has higher specificity (longer match) + # than a bare Host rule. Kept deliberately narrow — public /oauth2/**, + # /auth/**, /api/v1/** routes stay on the docker-label router without ACL. + fivucsas-api-admin: + rule: "Host(`api.fivucsas.com`) && (PathPrefix(`/swagger-ui`) || PathPrefix(`/v3/api-docs`) || PathPrefix(`/actuator`) || Path(`/swagger-ui.html`))" + entryPoints: + - websecure + middlewares: + - admin-whitelist@file + - secure-headers@file + - noindex@file + - rate-limit@file + service: identity-api@docker + tls: + certResolver: letsencrypt + + # P4.2 / IN-M3 (2026-04-20): Grafana observability dashboard. + # The active router is created via docker labels on the grafana container + # (see infra/observability/docker-compose.yml). This file-provider entry + # is intentionally NOT live — it documents the routing contract so anyone + # grepping dynamic.yml for fivucsas hosts finds it. If you ever move the + # grafana container out of the compose stack, uncomment and point service + # to an explicit URL (file provider cannot resolve docker service names). 
+ # + # grafana-observability: + # rule: "Host(`grafana.fivucsas.com`)" + # entryPoints: + # - websecure + # middlewares: + # - admin-whitelist@file + # - secure-headers@file + # - rate-limit@file + # service: grafana@docker + # tls: + # certResolver: letsencrypt + + middlewares: + redirect-to-fivucsas: + redirectRegex: + regex: "^https?://[^/]+(.*)" + replacement: "https://fivucsas.com${1}" + permanent: true + + redirect-www-to-apex: + redirectRegex: + regex: "^https?://www\\.fivucsas\\.com(.*)" + replacement: "https://fivucsas.com${1}" + permanent: true + + redirect-to-fivucsas-api: + redirectRegex: + regex: "^https?://[^/]+(.*)" + replacement: "https://api.fivucsas.com${1}" + permanent: true + + redirect-to-fivucsas-demo: + redirectRegex: + regex: "^https?://[^/]+(.*)" + replacement: "https://demo.fivucsas.com${1}" + permanent: true + + secure-headers: + headers: + browserXssFilter: true + contentTypeNosniff: true + forceSTSHeader: true + stsIncludeSubdomains: true + stsPreload: true + stsSeconds: 31536000 + customFrameOptionsValue: "DENY" + referrerPolicy: "strict-origin-when-cross-origin" + permissionsPolicy: "camera=(self \"https://verify.fivucsas.com\"), microphone=(self \"https://verify.fivucsas.com\"), geolocation=(), payment=(), publickey-credentials-get=(self \"https://verify.fivucsas.com\"), publickey-credentials-create=(self \"https://verify.fivucsas.com\")" + + # SEO indexability gate (2026-05-11): the noindex header is split out of + # secure-headers so that secure-headers stays neutral on crawler signals. + # Attach `noindex@file` ONLY to surfaces that must not appear in SERPs + # (api.fivucsas.com + the admin/swagger/actuator router). Public surfaces + # — docs.fivucsas.com, status.fivucsas.com, mizan/sarnic marketing pages — + # intentionally do NOT attach this and remain indexable. verify.fivucsas.com + # already carries its own HTML `` noindex, so the header is + # redundant there (defense-in-depth only). 
+ noindex: + headers: + customResponseHeaders: + X-Robots-Tag: "noindex, nofollow, noarchive" + + rate-limit: + rateLimit: + average: 100 + burst: 200 + + admin-whitelist: + ipAllowList: + sourceRange: + - "127.0.0.1/32" + - "10.8.0.0/24" + - "193.140.73.0/24" + - "46.104.0.0/16" diff --git a/infra/traefik/config/traefik.yml b/infra/traefik/config/traefik.yml new file mode 100644 index 0000000..f6510fe --- /dev/null +++ b/infra/traefik/config/traefik.yml @@ -0,0 +1,64 @@ +api: + dashboard: true + insecure: false + +entryPoints: + web: + address: ":80" + # XFF hardening (2026-05-12): Traefik is directly internet-facing on :80 + # (no upstream proxy). trustedIPs is empty so Traefik strips any + # client-supplied X-Forwarded-* headers and writes its own using the + # connection's peer IP. This closes the per-IP rate-limit bypass surfaced + # by senior reviews (RateLimitInterceptor.getClientIP uses + # `XFF.split(",")[0]` and would otherwise honour an attacker-controlled + # value). + forwardedHeaders: + trustedIPs: [] + http: + redirections: + entryPoint: + to: websecure + scheme: https + + websecure: + address: ":443" + # XFF hardening (2026-05-12): same rationale as :80 entryPoint above. + # Empty trustedIPs means Traefik overwrites X-Forwarded-For with the + # peer IP on every request. If a CDN or upstream proxy is ever placed + # in front of Traefik, list its egress IPs / CIDRs here so legitimate + # forwarded headers are honoured. Internal Docker subnet (`proxy` + # network 172.20.0.0/24) is NOT listed because external clients never + # connect from that range — only the docker-network containers do, and + # those don't set X-Forwarded-For. 
+ forwardedHeaders: + trustedIPs: [] + http: + middlewares: + - secure-headers@file + - rate-limit@file + tls: + certResolver: letsencrypt + +providers: + docker: + endpoint: "http://docker-socket-proxy:2375" + exposedByDefault: false + network: proxy + file: + filename: /etc/traefik/dynamic.yml + watch: true + +certificatesResolvers: + letsencrypt: + acme: + email: rollingcat.help@gmail.com + storage: /acme.json + httpChallenge: + entryPoint: web + +log: + level: WARN + +accessLog: + filePath: /var/log/traefik/access.log + bufferingSize: 100 From b605579f7e2b41c274fbb4d492249d4ed79d9301 Mon Sep 17 00:00:00 2001 From: Ahmet Abdullah Gultekin Date: Tue, 12 May 2026 18:27:40 +0000 Subject: [PATCH 2/2] docs(operator): append items 6-10 from Bio-Python PR + JWT aud rebuild caveat --- OPERATOR_ACTIONS_2026-05-12.md | 211 ++++++++++++++++++++++++++++++++- 1 file changed, 207 insertions(+), 4 deletions(-) diff --git a/OPERATOR_ACTIONS_2026-05-12.md b/OPERATOR_ACTIONS_2026-05-12.md index 291a55d..207cd53 100644 --- a/OPERATOR_ACTIONS_2026-05-12.md +++ b/OPERATOR_ACTIONS_2026-05-12.md @@ -352,6 +352,23 @@ curl -sS -H "Authorization: Bearer " \ # expect 401 with body referencing "kid revoked" or generic invalid ``` +**Co-shipped behavior change to anticipate: JWT `aud` claim enforcement.** +Team Auth-Java PR #100 also binds and validates the `aud` claim on every +access token (default `fivucsas-api`). After the same api container +rebuild that activates the HS512 kid revocation, **every access token +currently in flight will fail validation** because pre-rebuild tokens +were minted without `aud`. The client SDK silently calls `/refresh` and +re-mints — so user-visible impact is zero, but `/refresh` traffic will +spike for ~15 minutes (one extra call per active session). Watch the +Loki dashboard for the spike to confirm it decays cleanly. 
If `/refresh`
+stays elevated after 30 minutes, something is wrong (likely a service
+account or background job holding a long-lived token without refresh
+logic — investigate, do not roll back).
+
+If a tenant needs a non-default audience, set `APP_SECURITY_JWT_AUDIENCE`
+in `.env.prod` BEFORE the rebuild; otherwise `fivucsas-api` is baked into
+the prod profile literal.
+
 ---
 
 ## Quick reference: per-item severity + dependency matrix
@@ -363,8 +380,194 @@ curl -sS -H "Authorization: Bearer " \
 | 3 | web-app .env.production leak | HIGH | 5 min | Team Web-Hygiene PR |
 | 4 | parent main fast-forward | HIGH | 1 min | nothing |
 | 5 | HS512 kid revocation | MEDIUM | rolling | Team Auth-Java PR |
+| 6 | DEEPFACE_FACENET512_SHA256 pin | HIGH | 2 min | Team Bio-Python PR |
+| 7 | Bio container rebuild | HIGH | 1-2 min | items 5 + 6 + Bio-Python PR |
+| 8 | ANTISPOOF_BLOCK_ENFORCE canary | MEDIUM | n/a | Bio-Python PR + item 7 |
+| 9 | EAR liveness model deploy | LOW | 5 min | item 7 |
+| 10 | identity-core-api puzzle proxy | LOW | 1 PR cycle | follow-up agent / dev |
+
+Recommended order if attacking all ten in one session:
+4 (instant) → 3 (post-merge verify) → 5 + 6 + 7 batched (single
+api+bio rebuild flips JWT aud + HS512 kid + SHA pin + ANTISPOOF block
+all at once) → 8 (24-48h soak then decide) → 1 (maintenance window) →
+2 (longer maintenance window, covers RLS smoke-test) → 9 + 10 (low-risk
+follow-ups).
+
+---
+
+## 6. DEEPFACE_FACENET512_SHA256 pinning — pending Team Bio-Python PR (HIGH)
+
+**Background.** Team Bio-Python PR #102 flips `DEEPFACE_SHA256_REQUIRED=true`
+by default, which raises `RuntimeError` on biometric-processor boot if
+`DEEPFACE_FACENET512_SHA256` is empty (was previously a WARN + skip). The
+running container's Facenet512 weight was hashed during the agent's
+investigation:
+
+```
+DEEPFACE_FACENET512_SHA256=3f76b5117a9ca574d536af8199e6720089eb4ad3dc7e93534496d88265de864f
+```
+
+The agent already wrote this value into the gitignored
+`/opt/projects/fivucsas/biometric-processor/.env.prod` on the host. Verify
+it's present before rebuilding:
+
+```bash
+grep '^DEEPFACE_FACENET512_SHA256=' /opt/projects/fivucsas/biometric-processor/.env.prod
+# Expected: DEEPFACE_FACENET512_SHA256=3f76b5117a9ca574...
+```
+
+If the value is missing or wrong, the bio container will fail-fast on
+startup — that is by design (defense in depth for model integrity per
+the FIVUCSAS ML Split D1-D4 decision).
+
+**Blast radius.** Without the pin set, the bio container does not start
+after the next rebuild → /verify and /enroll fail 5xx until the pin is
+in place.
+
+**Maintenance window.** 2 minutes to edit + verify.
+
+**Acceptance check.** After rebuild (item 7), `docker logs bio-api 2>&1 | grep -i 'sha256'` should show pin-validated, not fail-closed.
+
+---
+
+## 7. Bio container rebuild — coordinated with items 5 + 6 (HIGH)
+
+**Background.** PR #102 changes default behavior for three things at
+once: `DEEPFACE_SHA256_REQUIRED=true`, `ANTISPOOF_BLOCK_ENFORCE=true`, and
+a new `/api/v1/liveness/verify-challenge` endpoint. PR #102 also has a
+hard dependency on spoof-detector PR #18 (the new public-namespace shim
+that exposes `BlinkAnalyzer` + `compute_ear`). Merge order must be
+**spoof-detector #18 → bio #102 → web-app #90**, then rebuild bio.
+
+**Commands.**
+
+```bash
+# 1. Confirm prerequisites are merged in the correct order
+# spoof-detector main has the shim; bio main pins the submodule pointer past it
+cd /opt/projects/fivucsas/biometric-processor
+git pull && git submodule update --init --recursive
+git -C ../spoof-detector log --oneline -3 # expect commit 12da821 or its successor in main
+
+# 2. Pin DEEPFACE_FACENET512_SHA256 per item 6, then rebuild
+docker compose -f docker-compose.prod.yml --env-file .env.prod build --no-cache biometric-api
+docker compose -f docker-compose.prod.yml --env-file .env.prod up -d biometric-api
+```
+
+**Acceptance check.**
+
+```bash
+# Health probe
+curl -fsS https://api.fivucsas.com/api/v1/biometric/health || echo FAIL
+# Confirm anti-spoof block is enforcing
+docker logs bio-api 2>&1 | tail -50 | grep -E "ANTISPOOF_BLOCK_ENFORCE|sha256"
+```
+
+If the container crash-loops on fail-fast, the most common causes are:
+- Item 6 pin missing → SHA mismatch RuntimeError
+- spoof-detector pointer not advanced past PR #18 → ImportError on
+  `compute_ear` / `BlinkAnalyzer`
+- Submodule recursion not run → stale pointer
+
+---
+
+## 8. ANTISPOOF_BLOCK_ENFORCE canary decision (MEDIUM)
+
+**Background.** Bio PR #102 ships `ANTISPOOF_BLOCK_ENFORCE=true` as
+default. When the assembler verdict (or any sub-signal:
+`face_usability_block`, `hybrid_fusion_is_spoof`) returns
+`recommended_action='block'`, /verify will now return HTTP 403 with
+`{error_code:"ANTISPOOF_BLOCKED", reason:...}` instead of just logging.
+
+Before this PR the action was advisory only — anti-spoof has been
+running but not blocking. Flipping the switch is what the 2026-05-12
+ML review called out as a P0 ("anti-spoof is structurally unreachable
+from /verify because the verdict is advisory"). The change is correct;
+the canary question is how confident we are in the default thresholds
+and the sub-signal calibration.
+
+**Two acceptable rollout paths.**
+
+1. **Default-ON, monitor (recommended).** Leave `ANTISPOOF_BLOCK_ENFORCE=true`
+   in the image and watch the Loki dashboard for 24-48 hours, focusing on
+   the ratio of `ANTISPOOF_BLOCKED` responses on /verify. A spike to >2% of
+   /verify traffic implies a false-positive problem. The new EAR check is
+   OFF by default (see item 9), so such a spike would have to come from
+   face-usability or hybrid-fusion.
+2. **Flip-false-for-observation.** Add
+   `ANTISPOOF_BLOCK_ENFORCE=false` to `.env.prod`, redeploy, and consume
+   the bio audit log feed for 24-48 hours to count "would-have-blocked"
+   events. Flip back to true once the rate looks tolerable.
+
+**Blast radius.** False positives = real users get 403 on /verify.
+Existing audit logs already record the verdict, so the question is one
+of UX (cost of a wrongly-rejected verify call) versus security (cost of
+a spoof attempt that is not rejected).
+
+**Decision deadline.** Within 48h of bio container rebuild; otherwise
+the canary signal degrades.
+
+---
+
+## 9. EAR liveness model deployment (LOW)
+
+**Background.** Bio PR #102 wires `compute_ear` from
+`spoof_detector.infrastructure.analyzers.blink_analyzer` into /verify
+as a single-still-frame liveness check. Closed eyes → veto. The check
+defaults OFF (`ANTISPOOF_EAR_VETO_ENABLED=false`) until the
+MediaPipe FaceLandmarker model is deployed at
+`models/face_landmarker.task`. If the model is missing, the helper
+fails soft to None and the verdict is "no signal" (not "fail").
+
+**Commands.**
+
+```bash
+# Copy face_landmarker.task into the bio container's model volume.
+# Path inside container: ${FACE_LANDMARKER_MODEL_PATH} (e.g. /models/face_landmarker.task)
+# Download the canonical model from MediaPipe:
+# https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task
+docker cp face_landmarker.task bio-api:/models/face_landmarker.task
+# Or rebuild image with the model baked in via Dockerfile COPY
+
+# Then flip the flag in .env.prod
+echo 'ANTISPOOF_EAR_VETO_ENABLED=true' >> .env.prod
+# Set the SHA256 pin for the model
+echo 'FACE_LANDMARKER_MODEL_SHA256=' >> .env.prod
+
+# Recreate only bio-api so the new env applies (`restart` keeps the old env); no rebuild needed if the model is in a volume
+docker compose -f docker-compose.prod.yml --env-file .env.prod up -d biometric-api
+```
+
+**Acceptance check.** Trigger a `/verify` call with an eyes-closed selfie
+in the testbed; expect `eyes_closed: true` in the bio audit log
+verdict. Trigger a normal verify; expect `eyes_closed: false`.
+
+---
+
+## 10. identity-core-api puzzle proxy — follow-up PR (LOW)
+
+**Background.** Web-app PR #90 adds a `useBiometricPuzzleServer` hook
+that calls `POST /api/v1/biometric/puzzles/verify-challenge`. That
+proxy needs to live in identity-core-api (it forwards to bio
+`/api/v1/liveness/verify-challenge` with the bio API key). The agent
+deferred this because the identity-core-api repo was on the
+`fix/2026-05-12-infra-hygiene` branch with concurrent edits from another
+session.
+
+Until the proxy lands, FacePuzzle + HandGesturePuzzle soft-pass with a
+`console.warn` (404 → soft-pass per `useBiometricPuzzleServer.ts:140`).
+User-visible behavior is unchanged from pre-PR baseline.
+
+**Commands.** None — this is a follow-up coding task, not an operator
+action. Dispatch a small agent to add:
+
+```
+@PostMapping("/api/v1/biometric/puzzles/verify-challenge")
+  → BiometricProcessorClient.verifyChallenge(...)
+  → bio POST /api/v1/liveness/verify-challenge
+```
+
+with the same API-key middleware contract used by other bio proxies.
 
-Recommended order if attacking all five in one session:
-4 (instant, unblocks reviewers) → 3 (post-merge verify) → 5 (post-PR
-rebuild) → 1 (maintenance window slot) → 2 (longer maintenance window,
-covers RLS smoke-test).
+**Acceptance check.** After deploy, the web-app browser console should stop
+printing `[biometric-puzzles] /biometric/puzzles/verify-challenge proxy
+not deployed yet` warnings on first puzzle completion.