diff --git a/.env.example b/.env.example index 07e49c2..d13923d 100644 --- a/.env.example +++ b/.env.example @@ -26,6 +26,13 @@ REDIS_PASSWORD=change_me_in_production JWT_SECRET=your-super-secret-256-bit-key-change-this-in-production JWT_ACCESS_TOKEN_EXPIRATION=3600000 JWT_REFRESH_TOKEN_EXPIRATION=604800000 +# BE-H1: signing algorithm. Default RS256 (asymmetric, OIDC best practice). +# HS512 is verify-only for legacy tokens minted before the 2026-04-20 flip. +JWT_DEFAULT_ALGO=RS256 +JWT_RSA_KID=rs-2026-04 +# RSA key pair (PEM). REQUIRED in prod; auto-generated in dev profile if omitted. +JWT_RSA_PRIVATE_KEY_PEM= +JWT_RSA_PUBLIC_KEY_PEM= # ============================================================================ # Service URLs diff --git a/.github/workflows/deploy-landing.yml b/.github/workflows/deploy-landing.yml index 0978a4f..8238af0 100644 --- a/.github/workflows/deploy-landing.yml +++ b/.github/workflows/deploy-landing.yml @@ -1,8 +1,9 @@ name: Deploy Landing to Hostinger on: + workflow_dispatch: push: - branches: [master] + branches: [main, master] paths: - 'landing-website/**' - '.github/workflows/deploy-landing.yml' diff --git a/.gitignore b/.gitignore index 4e43560..c4578be 100644 --- a/.gitignore +++ b/.gitignore @@ -40,7 +40,6 @@ nul *.zip # Archive and backup folders -/archive/ /_backup_before_submodules/ # Claude metadata diff --git a/CICD_AUDIT_2026-05-04.md b/CICD_AUDIT_2026-05-04.md new file mode 100644 index 0000000..b834a8b --- /dev/null +++ b/CICD_AUDIT_2026-05-04.md @@ -0,0 +1,612 @@ +# FIVUCSAS CI/CD Pipeline Audit — 2026-05-04 + +**Auditor:** T-CICD-AUDIT (read-only review) +**Scope:** All `.github/workflows/*.yml` across the parent repo (`fivucsas`) and four submodules (`identity-core-api`, `biometric-processor`, `web-app`, `client-apps`). `landing-website` ships from the parent repo's `deploy-landing.yml`. 
iOS workflow audited but treated as out-of-scope per the project's permanent-Apple-OUT policy (`continue-on-error: true` in workflow itself). +**Methodology:** Read every workflow file at HEAD, query `gh run list` (last 30) per workflow, sample run-job traces, inspect branch protection + secret-scanning state via `gh api`, list org/repo runners. +**Verdict:** **CI is broken in two places where it matters most (api integration tests, bio CI as a whole). The self-hosted runner is functionally a single point of failure that has been failing silently for ~27 days. Branch protection is OFF on every repo, including ones that ship to prod. Deploy contract is half-CI half-operator with the contract documented nowhere.** + +--- + +## 1. Workflow Inventory + +### 1.1 fivucsas (parent) — 2 workflows + +| File | Trigger | Jobs | Runner | Path filter | +|---|---|---|---|---| +| `.github/workflows/ci.yml` | push/PR `master,main,develop` | `validate` (compose+nginx config) | `[self-hosted, linux, x64]` | `docker-compose*.yml`, `nginx/**` | +| `.github/workflows/deploy-landing.yml` | push `master` | `deploy` (npm ci → build → rsync to Hostinger) | `[self-hosted, linux, x64]` | `landing-website/**` | + +No gitleaks at the parent. No link-checker on docs. + +### 1.2 identity-core-api — 3 workflows + +| File | Trigger | Jobs | Runner | Notes | +|---|---|---|---|---| +| `.github/workflows/ci.yml` | push/PR `main` | `test` (Maven unit) → `integration-tests` (Testcontainers, `needs: test`) | unit on `ubuntu-latest`, IT on `[self-hosted, linux, x64]` | timeout 25m / 35m | +| `.github/workflows/deploy-hetzner.yml` | push `master,main` + dispatch | `deploy` (appleboy/ssh-action → `infra/deploy.sh build identity` + restart + `/actuator/health` retries) | `[self-hosted, linux, x64]` | timeout 8m, command_timeout 6m | +| `.github/workflows/gitleaks.yml` | push `main` + PR | `scan` (gitleaks v8.21.2 dir scan) | `ubuntu-latest` | very thin: `gitleaks dir . 
--redact --verbose`, no allowlist file | + +### 1.3 biometric-processor — 2 workflows + +| File | Trigger | Jobs | Runner | Notes | +|---|---|---|---|---| +| `.github/workflows/ci.yml` | push/PR `main,dev,feature/*` | `lint` → `test` (`needs: lint`) → `integration-test` (`needs: test`) → `frontend-build` (`needs: test`) ‖ `security` (`needs: lint`) | **all 5 jobs** on `[self-hosted, linux, x64]` | python 3.12; pip-audit `--strict || true` (warns only) | +| `.github/workflows/deploy-hetzner.yml` | push `main` + dispatch | `deploy` (SSH → build + restart) | `[self-hosted, linux, x64]` | timeout 10m | + +No gitleaks workflow despite memory note (`session_20260430.md`) claiming gitleaks shipped to "both repos". Verified at HEAD: bio is missing it. + +### 1.4 web-app — 4 workflows + +| File | Trigger | Jobs | Runner | Notes | +|---|---|---|---|---| +| `.github/workflows/ci.yml` | push/PR `main,master` | `build-and-test` (lint, tsc, test, vite build with `SKIP_MODEL_FETCH=1`), `code-quality` (npm audit, coverage; both `continue-on-error: true`) | `ubuntu-latest` | matrix `[22.x]` | +| `.github/workflows/e2e.yml` | schedule `0 2 * * *` (UTC) + dispatch | `e2e` (Playwright Chromium) — project=smoke nightly; dispatch can pick `authenticated` or `destructive` | `ubuntu-latest` | targets PROD `https://app.fivucsas.com` directly | +| `.github/workflows/deploy-hostinger.yml` | push `main` + dispatch | `deploy` (npm ci → write `.env.production` heredoc → vite build → rsync) | `ubuntu-latest` | targets `~/domains/app.fivucsas.com/public_html/` | +| `.github/workflows/gitleaks.yml` | push `main` + PR | `scan` (gitleaks v8.21.2) | `ubuntu-latest` | identical to api copy | + +### 1.5 client-apps — 3 workflows + +| File | Trigger | Jobs | Runner | Notes | +|---|---|---|---|---| +| `.github/workflows/android-build.yml` | push/PR `main,develop` + dispatch | `build` (assembleDebug by default; assembleRelease requires keystore secrets) | `ubuntu-latest` | dummy `google-services.json` 
baked at build time | +| `.github/workflows/ios-build.yml` | push/PR `main,develop` + dispatch | `build-framework` (`continue-on-error: true`) | `macos-latest` | non-blocking, scope is OUT | +| `.github/workflows/desktop-installers.yml` | dispatch + `desktop-v*` tags | `build-linux` (.deb) on `ubuntu-latest`, `build-windows` (.msi) on `windows-latest` | mixed | unsigned artifacts; SHA256 manifest produced | + +Total active workflows across the 5 repos: **14** (parent 2, api 3, bio 2, web 4, client 3). No workflow uses reusable workflows or composite actions — every workflow is monolithic. + +--- + +## 2. Health Check — Run-History Sweep + +Sampling: last 30 runs per workflow, `gh run list … --json conclusion,createdAt,updatedAt`, except where called out (bio = last 100). + +### 2.1 identity-core-api — `CI` (`ci.yml`) + +| Bucket | Count (last 30) | +|---|---| +| cancelled, push | 13 | +| cancelled, PR | 8 | +| (running/queued), PR | 7 | +| (running/queued), push | 1 | +| failure, PR | 1 | +| **success** | **0** | + +Of last 30 runs, **zero successes**. The Maven unit job typically passes in ~50–80s (verified on PR runs that completed before being superseded). The integration-tests job has been **skipped** on every recent successful PR-precursor run (because it's gated `needs: test` and has been completing in seconds with `conclusion: skipped` after the unit job's slot churn). + +**Critical:** Run #25303233032 had `Maven test (unit)` finish `success` at 2026-05-04 05:49:05Z, then `Integration tests (Testcontainers)` queued at 05:49:05Z and was **cancelled at 11:27:16Z — 5h 38m later — never having started a runner**. This is Task #55 in living color. Pattern: every push to `main` ships a new run, the new run cancels the previous one's pending IT job, and the IT job never actually executes. + +Successful CI runs on api in the last 30: **0**. Last green api CI on `main`: not found in last-100 sweep — needs forensic dig. 
+ +### 2.2 identity-core-api — `Deploy to Hetzner VPS` (`deploy-hetzner.yml`) + +Last 10 runs: **9 cancelled, 1 queued (just-now)**. Like CI, Deploy gets cancelled by the next push because of `concurrency: deploy-identity / cancel-in-progress: true`. With pushes happening every 15-30 min during active development, the deploy never actually runs to completion — except apparently one or two times. Last clear `success` conclusion in our sample: not present in last-30. Per the operator memory log (`session_20260502.md`), prod was rebuilt 2026-05-02 17:50 UTC by hand — not by CI. + +**Read:** the api deploy workflow is, for all practical purposes, a no-op stub. The CLAUDE.md says deploys happen via `docker compose -f docker-compose.prod.yml --env-file .env.prod build --no-cache identity-core-api` directly on the VPS by the operator. CI deploy is theatre. + +### 2.3 biometric-processor — `CI Pipeline` + +Last **100** runs distribution: 82 cancelled, 15 failure, **2 success**, 1 running. **Success rate: 2%**. + +Last green push run: `24072240675` — **2026-04-07** (27 days before this audit). Every push since has had its 5 jobs (`Lint & Type Check`, `Unit Tests`, `Integration Tests`, `Security Scan`, `Build Frontend`) — all pinned to `[self-hosted, linux, x64]` — cancelled at the **GitHub Actions 24h workflow timeout** (86402s, i.e. 24h plus 2s) because the self-hosted runner is not picking up jobs (see §5). + +Bio CI is, in effect, **disabled in steady state** even though devs see green checkmarks in PRs (the green is the Copilot review and the gitleaks-on-api copy; bio has no gitleaks and no working CI). New code lands without test execution. The `frontend-build` job that exercises `demo-ui/`, the `pip-audit` line, and the Bandit scan **have not run on a `main` push since 2026-04-07**. + +### 2.4 biometric-processor — `Deploy to Hetzner VPS` + +Same self-hosted runner problem: last 10 runs all cancelled. Operator deploys by hand. 
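The 86402s figure in §2.3 comes from the gap between a run's `createdAt` and `updatedAt`. As a rough sketch of how such runs could be flagged automatically, assuming records exported with `gh run list --json conclusion,createdAt,updatedAt`: the helper names and the `slack` tolerance below are hypothetical, and the first test record is invented for illustration. The second uses the two timestamps quoted in §2.1.

```python
from datetime import datetime

QUEUE_TIMEOUT_SECONDS = 24 * 60 * 60  # 86400 s, the 24h ceiling cited in §2.3


def parse_ts(value):
    """Parse an ISO-8601 timestamp such as "2026-05-04T05:49:05Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def duration_seconds(run):
    """Seconds between createdAt and updatedAt for one run record."""
    delta = parse_ts(run["updatedAt"]) - parse_ts(run["createdAt"])
    return int(delta.total_seconds())


def timed_out_in_queue(run, slack=60):
    """True for cancelled runs that lasted roughly the full 24 hours.

    `slack` is an assumed tolerance: a run that ends up to `slack`
    seconds before the ceiling is still flagged.
    """
    return (
        run.get("conclusion") == "cancelled"
        and duration_seconds(run) >= QUEUE_TIMEOUT_SECONDS - slack
    )


# Invented record that reproduces the 86402 s duration.
stale_run = {
    "conclusion": "cancelled",
    "createdAt": "2026-04-20T10:00:00Z",
    "updatedAt": "2026-04-21T10:00:02Z",
}

# Timestamps quoted in §2.1: cancelled by a newer push, not by the timeout.
superseded_run = {
    "conclusion": "cancelled",
    "createdAt": "2026-05-04T05:49:05Z",
    "updatedAt": "2026-05-04T11:27:16Z",
}
```

Separating the two cases matters for the diagnosis: a run cancelled after a few hours was superseded by a newer push, whereas one that lasted the full 24 hours never reached a runner at all.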
+ +### 2.5 web-app — `CI` + +Last 30: **26 success, 3 cancelled, 1 failure**. Healthy. Median duration ≈ 165s; p95 ≈ 180s. The `code-quality` job has both `npm audit` and `coverage` set `continue-on-error: true` — passes never reflect real quality state. + +### 2.6 web-app — `Deploy to Hostinger` + +Last 10: **8 success, 1 failure (PR #66 follow-on), 1 cancelled**. This is the one deploy pipeline that **actually deploys**. Median duration ~65–90s. Fast, works. + +### 2.7 web-app — `E2E Tests` (cron + dispatch) + +Cron `0 2 * * *` UTC: in last 30 schedule events, **only 2 actual runs found** — 2026-05-03 05:28Z and 2026-05-04 05:32Z (both `success`). The schedule started running on 2026-05-03; before that, every E2E run was a `push` event (and they're all `cancelled` because the workflow has `cancel-in-progress: true` and pushes pile up). Per the memory note, nightlies were wired in PR #65; that PR is recent, so a cron history of only 2 runs is consistent. **Verdict: E2E nightlies are working but very young.** No alerting on failure. + +### 2.8 web-app — `gitleaks` + +Smaller sample: last 10 runs, all `success`. Every push to `main` and every PR is scanned. No allowlist — relies entirely on default rules, which yield no positives at HEAD. + +### 2.9 fivucsas (parent) — `FIVUCSAS CI` + +Last 30: 19 success, 9 failure, 2 cancelled. Failures are old (March 2026) — submodule pointer commits that broke when `git submodule status` couldn't recurse. The 86402s line in the duration log is from a stale push that hit the 24h timeout. Steady state is currently green when triggered, which is rare (path-filter-gated). Median useful run: ~565s. + +### 2.10 fivucsas — `Deploy Landing to Hostinger` + +Last 30: **19 cancelled, 4 success, 3 failure**. **Last success: 2026-03-28** (run `23681941800`). Like api/bio, the self-hosted-runner-pinned deploy job sits queued and is killed by the next push. Landing last auto-deployed ~35 days ago at best; since then the operator has deployed it manually. 
+ +### 2.11 client-apps — `Android Build` + +Last 30: 17 success, 13 cancelled. **Success rate: 57%.** Median build ~190s. Cancellations are concurrency, not failures. Healthy. + +### 2.12 client-apps — `iOS Build` + +29/30 success despite `continue-on-error: true` — the K/N framework actually builds. Out of strategic scope per project policy. + +### 2.13 client-apps — `Desktop Installers` + +Tag-only / dispatch-only. Run history sparse. No regressions visible. + +### Top-3 Most-Flaky Jobs + +1. **api `Integration tests (Testcontainers)`** — 100% cancellation rate over last 30 runs. Never reaches a runner. +2. **bio `Lint & Type Check`** + 4 sibling jobs — 82% cancellation, 15% failure, 2% success over last 100. Effectively dead. +3. **fivucsas `Deploy Landing` `deploy`** — 63% cancellation. Last success 5+ weeks ago. + +--- + +## 3. Coverage Gaps + +### 3.1 identity-core-api + +* **Integration tests are gated on a runner that won't run them.** Right move: split this job into one half that can run on `ubuntu-latest` (Testcontainers via Docker-in-Docker is fine on hosted runners; Ryuk disabling is already in place) and a smaller smoke half on `[self-hosted]`. Or just move it all to `ubuntu-latest`. The Hetzner-runner argument was originally about needing GPU/biometric containers in tests; that's the bio job, not the api one. +* **No coverage report uploaded** anywhere (no JaCoCo, no Codecov action). +* **No spotbugs/checkstyle/PMD** — Java side has zero static analysis in CI. +* **No build of the Docker image in CI**. The deploy step builds in-place on prod. A failed Docker build only surfaces during deploy, after merge. +* **No `mvn dependency-check:check`** (OWASP). Dependabot covers some of it but not transitive runtime drift. +* **No test for Flyway migration validity** — V57's `RAISE WARNING` fail-soft path is not exercised pre-merge. + +### 3.2 biometric-processor + +* **Type check is missing.** `mypy` is `pip install`ed but never invoked. 
The job is "Lint & Type Check"; only `ruff check` + `ruff format --check … || true` run. +* **`ruff check`** runs with `--ignore E501,F401,F821,E402` — that's silencing **F821 (undefined name)** and **F401 (unused imports)**. Silencing F821 would have hidden the recent backfill-script breakage that PR #68 just fixed. +* **No model-fetch/SHA256 verification job** — bio bundles MTCNN weights and needs UniFace ONNX caching; no CI step checks the cache directory or model file integrity. +* **`pip-audit --strict || true`** swallows known-vulnerable dependency findings. Per the memory note `feedback_audit_quality.md`, this kind of `|| true` is exactly what produces hollow security signal. +* **No Dockerfile lint** (`hadolint`) despite the runtime image having recently shipped without `alembic` (PR #68's bugfix). +* **No alembic migration test** — no CI step does `alembic upgrade head` against a freshly-spun postgres. +* **No gitleaks workflow.** Memory `project_session_20260430.md` claims gitleaks shipped to "both repos" — incorrect. Bio has no `.github/workflows/gitleaks.yml`. + +### 3.3 web-app + +* **`code-quality` job is decorative.** `npm audit --audit-level=high` is `continue-on-error: true`, and so is `npm test … --coverage`. CI never blocks on either. Coverage is uploaded as an artifact but not enforced. +* **`SKIP_MODEL_FETCH=1`** in build is correct (models live on Hostinger), but nothing in CI verifies the public manifest.json SHA256s against a checked-in lock file. This is a supply-chain blind spot. +* **`build-and-test` job runs `npx tsc --noEmit`** which is good, but `npm run build` again invokes tsc — duplicate work, ~30s wasted per run. +* **Vitest doesn't fail CI on snapshot drift** — verified by reading `vitest.config`: snapshot behavior is left at the project default. +* **E2E `pull_request` trigger missing.** The workflow file has the right comment ("once verify.fivucsas.com (or hosted staging) is up, add `pull_request:` running `--project=smoke`"). 
PRs currently merge without ever running Playwright. The smoke project is, per memory, designed for non-destructive PROD use, so it's safe to flip on. +* **No accessibility audit** (axe-core, pa11y). For a 17-page admin dashboard targeting a Marmara-deployed environment, this is a gap. +* **Bundle-size budget** absent. Vite build can grow without alarm. + +### 3.4 client-apps + +* **Lint/check tasks not run.** Workflow goes straight to `assembleDebug`. There is no `:shared:check` or `ktlint`/`detekt` step. KMP code can land with style and unused-symbol warnings. +* **No unit test job.** Memory references "shared common-test" in passing but nothing in CI runs `:shared:test` or `:androidApp:testDebugUnitTest`. +* **No instrumented-test smoke** (`connectedDebugAndroidTest`) — fair, since that needs an emulator/device. +* **APK release path lacks an `apksigner verify` step** post-signing. +* **No artifact retention for lint reports** even when `assembleDebug` is the entire job. + +### 3.5 fivucsas (parent) + +* **No link-check** on Markdown docs (`lychee` or `markdown-link-check`). The repo has 30+ `*_REVIEW_*.md` files with cross-references. +* **No submodule-pointer drift check.** It would be useful to fail CI if a parent commit pins a submodule SHA that doesn't exist on the submodule's `main`. +* **No nginx config rendering test against the actual upstream containers.** `nginx -t` validates syntax only; it can't catch the sort of `proxy_pass` typo that broke prod twice in recent memory. +* **No matching gitleaks workflow at parent level** — relevant because the parent has `infra/`, `monitoring/`, `nginx/`, `scripts/` outside the submodules. + +--- + +## 4. 
Branch Protection State + +| Repo | `main` (or `master`) protection | Verified by | +|---|---|---| +| `fivucsas` | **OFF** (404 Branch not protected) | `gh api repos/.../branches/master/protection` | +| `identity-core-api` | **OFF** | same | +| `biometric-processor` | **OFF** | same | +| `web-app` | **OFF** *(memory said 1-review required — INCORRECT at HEAD)* | same | +| `client-apps` | **OFF** | same | + +**Risk surface today:** + +* Anyone with push rights (the operator + whatever GitHub App tokens are wired) can `git push origin main --force` on a repo that ships to prod. +* PRs do not require a passing CI before merge. Memory note `feedback_audit_quality.md` and `RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md` already noted this risk; the fix has been deferred indefinitely. +* This is incompatible with the security posture documented for the platform (RFC 6749 reuse-detection, pgBackRest PITR, gitleaks). A locked-down platform whose code can be force-pushed into prod is locked down only on paper. + +The web-app `code-quality` `continue-on-error` chain combines with no-protection to produce: a PR whose only blocking signal is `gitleaks` and the `build-and-test` job. If the build job is slow that day and the operator hits "Merge anyway" with admin privilege, nothing prevents it. + +--- + +## 5. Self-Hosted Runner — Task #55 Diagnosis + +### 5.1 What's registered + +```text +GET /orgs/Rollingcat-Software/actions/runners +{ + "total_count": 1, + "runners": [{ + "id": 55, "name": "hetzner-cx43", "os": "Linux", + "status": "online", "busy": false, + "labels": [self-hosted, Linux, X64, hetzner] + }] +} +``` + +GET on each repo's `/actions/runners` → `total_count: 0` for all. Only the **org-level** runner exists. The runner ID is literally **55** (probably coincidence with Task #55 numbering, but a memorable one). 
+ +### 5.2 What jobs need it + +Workflows pinning to `[self-hosted, linux, x64]`: + +* `fivucsas/.github/workflows/ci.yml` — `validate` +* `fivucsas/.github/workflows/deploy-landing.yml` — `deploy` +* `identity-core-api/.github/workflows/ci.yml` — `integration-tests` +* `identity-core-api/.github/workflows/deploy-hetzner.yml` — `deploy` +* `biometric-processor/.github/workflows/ci.yml` — **all 5 jobs** (`lint`, `test`, `integration-test`, `frontend-build`, `security`) +* `biometric-processor/.github/workflows/deploy-hetzner.yml` — `deploy` + +That's **10 jobs** vying for a single runner. + +### 5.3 What's actually happening + +At the moment of audit there were 6 queued CI runs across api + bio waiting on `[self-hosted, linux, x64]`. The runner shows `busy: false` and `online`. **The runner is online but not picking jobs.** + +Diagnostic possibilities (cannot SSH from this audit context; SSH key isn't on the audit host): + +1. **Runner registered to org-level but org Actions Runner Group is restricted from these repos.** GitHub gates org-level runners through Runner Groups. If the "Default" runner group's repository access is set to "selected repositories" and the FIVUCSAS submodules aren't in the list, jobs queue forever. Symptom matches. +2. **Runner has stale workspace** (a previous job left a directory it can't clean) and refuses to start new jobs. Less likely given `busy:false`. +3. **Label mismatch.** Workflows use lowercase `[self-hosted, linux, x64]`; runner advertises `Linux` and `X64` (uppercase). GitHub's label match is case-insensitive per docs, but if the `_work` folder cap was hit or storage-pressure detected, the runner self-quarantines. + +Per memory `feedback_audit_quality.md`, the right next step is to **verify, not speculate**. The verification surface (SSH to Hetzner, `systemctl status actions.runner.Rollingcat-Software.hetzner-cx43.service`, `journalctl -u actions.runner... -n 200`) is operator-side. 
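Hypothesis 3 in §5.3 rests on whether the lowercase `runs-on` labels are satisfied by the runner's advertised labels. A small sketch of that check follows, assuming the case-insensitive matching that the audit attributes to GitHub's docs. The function name `labels_match` is hypothetical, and the labels are modelled as plain strings copied from §5.1 and §5.2, which is a simplification of the API payload.

```python
def labels_match(required, advertised):
    """True when every `runs-on` label is advertised by the runner.

    The comparison is case-insensitive, mirroring the behaviour the
    audit attributes to GitHub's documentation.
    """
    have = {label.lower() for label in advertised}
    return all(label.lower() in have for label in required)


# Values copied from §5.1 (runner) and §5.2 (workflows).
runner_labels = ["self-hosted", "Linux", "X64", "hetzner"]
workflow_labels = ["self-hosted", "linux", "x64"]

if __name__ == "__main__":
    print("labels satisfied:", labels_match(workflow_labels, runner_labels))
```

If the labels are satisfied under case-insensitive matching, the label part of hypothesis 3 can be set aside, which leaves Runner Group access (hypothesis 1) and runner state as the things for the operator to verify.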
+ +### 5.4 Recommended remediation + +Four parallel actions, in priority order: + +* **R1 (P0, S effort):** Move api `integration-tests` to `ubuntu-latest`. Testcontainers has worked on hosted runners since GitHub started shipping Docker in the runner image. The `TESTCONTAINERS_RYUK_DISABLED: 'true'` env is already set. There is no architectural reason this must run on the Hetzner host. +* **R2 (P0, M effort):** Move all 5 bio CI jobs to `ubuntu-latest`. None of them need the self-hosted runner. The reason they are pinned is historical (early ML weights download cost; no longer relevant since most weights are bundled or cached). The integration-test step starts its own Redis container in CI; that works fine on hosted runners. ML pipelines can `pip install opencv-python-headless deepface` in <2min; bandwidth is not the bottleneck there. +* **R3 (P1, M effort):** Keep deploys on the self-hosted runner (they need access to the Docker socket on Hetzner). But add a **second runner** at minimum, and document acceptable cancel behavior. The `concurrency: deploy-identity / cancel-in-progress: true` configuration is correct — newer deploys should win — but the operator should log a runbook entry when CI hasn't deployed in N hours so manual deploys can take over. +* **R4 (P1, S effort):** Switch to repo-level runners or set the org-level Runner Group to allow each FIVUCSAS submodule explicitly, eliminating the "is the runner allowed?" guess. + +--- + +## 6. Secret Hygiene + +### 6.1 Per-repo state + +| Repo | `secret_scanning` | `push_protection` | `dependabot_security_updates` | gitleaks workflow | +|---|---|---|---|---| +| `fivucsas` | DISABLED | DISABLED | enabled | NO | +| `identity-core-api` | enabled | enabled | enabled | YES | +| `biometric-processor` | DISABLED | DISABLED | enabled | NO | +| `web-app` | enabled | enabled | enabled | YES | +| `client-apps` | DISABLED | DISABLED | enabled | NO | + +Only **2 of 5 repos** have GitHub native secret scanning + push protection. 
Memory note `project_session_20260430.md` had this as a finished bullet ("enabled across **both** repos") — correct as written, just both = api + web. Misread elsewhere as "all repos". + +### 6.2 gitleaks workflow review + +Both api and web have **identical** gitleaks workflows: + +```yaml +- run: gitleaks dir . --no-banner --redact --verbose +``` + +* **No `--config` file.** Uses gitleaks defaults. No allowlist, no custom rules. +* **No `--exit-code 0` shield** — the workflow DOES fail on findings, which is correct. +* **No baseline/report file uploaded** — findings only land in workflow logs. +* **No SARIF upload** to the GitHub Security tab (`upload-sarif` action is not invoked). +* **Working-tree scan only**, not history (`gitleaks dir`) — this is fine for the post-rewrite era but won't catch git history that contains the previously-leaked secrets unless the user has already done the `git filter-repo` operator-only step (per memory, deferred). + +### 6.3 Workflow `env:` blocks scanned for secrets + +Manual grep across all 14 workflow files: no plaintext secrets. All sensitive values come through `${{ secrets.* }}`. The `BUILD_TYPE`, `EVENT_NAME`, `TARGET` envs in client-apps android-build are non-sensitive. Web-app deploy heredoc writes a `.env.production` file containing only public `VITE_*` values — that's by design (Vite inlines them at build, secrecy through them is an anti-pattern anyway). + +One **questionable pattern** in `client-apps/android-build.yml` (lines 60–86): a dummy `google-services.json` is checked into the workflow file. It contains the literal `"AIzaSyCI-DUMMY-KEY-FOR-CI-BUILD"`. Not a real secret, but it's the kind of pattern push-protection doesn't catch and gitleaks might flag depending on rule set. Worth annotating in code or replacing with a CI-generated valid stub. + +### 6.4 GitGuardian + +No GitGuardian app/integration found in any repo's webhooks (`gh api repos/.../hooks` returns empty). 
Not in scope per the platform's "GitHub-native + gitleaks" decision tree, but the audit was asked to confirm — confirmed: not integrated. + +--- + +## 7. Deploy Pipelines + +### 7.1 web-app `Deploy to Hostinger` + +**Working as designed.** Last 10 runs: 8 success / 1 failure / 1 cancelled. Median ~65s. Triggered by every push to `main` (path-filtered). Builds, writes prod env, rsyncs to Hostinger over SSH. **This is the only deploy pipeline that meaningfully deploys.** Notable: writes `.env.production` heredoc inline rather than committing one — keeps env config in the workflow, which is fine for VITE_ vars. + +### 7.2 api `Deploy to Hetzner VPS` + +**Effectively a stub.** Last 10 runs: 9 cancelled, 1 queued. The job pattern is correct (`appleboy/ssh-action` → `infra/deploy.sh build identity` → restart → 5x `curl /actuator/health` retries). When it runs, it works (the operator-driven path is identical). **But it almost never runs in CI** because (a) the self-hosted runner doesn't pick the job, and (b) when it does pick, the next push cancels it via `concurrency: deploy-identity / cancel-in-progress: true`. The operator-driven path (via SSH from the user's laptop, running `docker compose ... build --no-cache`) is what actually deploys. **Real contract: the operator deploys; CI is theatre.** + +### 7.3 bio `Deploy to Hetzner VPS` + +Same diagnosis as api. Recent runs: all cancelled. Operator deploys via the documented `cd /opt/projects/fivucsas/biometric-processor && docker compose -f docker-compose.prod.yml ... build --no-cache && up -d` path. **Real contract: operator-only.** + +### 7.4 fivucsas `Deploy Landing to Hostinger` + +Last successful: **2026-03-28**. Five-plus weeks of "every push gets cancelled before deploy completes". Operator-only in practice. + +### 7.5 client-apps `Desktop Installers` + +Tag-driven only. Run history shows it executes when tagged. Real CI deploy pipeline. + +### 7.6 client-apps `Android Build` + +Builds APK artifacts only. 
Does not publish anywhere — release path is documented as "user uploads APK to GitHub release" per memory `reference_fivucsas_client_apps_releases.md`. No automatic publication step. + +### 7.7 The undocumented contract + +Across the platform, the implicit contract is: + +* **web-app** deploys via CI to Hostinger. +* **landing-website** deploys via operator (CI is broken, has been since 2026-03-28). +* **identity-core-api** deploys via operator (CI rarely runs to completion). +* **biometric-processor** deploys via operator (CI hasn't run any jobs to completion since 2026-04-07). +* **client-apps** ships APK/MSI/deb via operator-driven tag + manual GitHub release upload. + +Nowhere in the repo is this documented as the actual contract. CLAUDE.md and the various RUNBOOK files imply CI deploys; it doesn't, except for web. **This is the single largest documentation drift in the CI/CD layer.** + +--- + +## 8. Notable Findings — Broken or Risky + +### 8.1 BROKEN: bio CI hasn't passed in 27 days + +`Counter({'cancelled': 82, 'failure': 15, 'success': 2, '(running)': 1})` over last 100. Last green run: `24072240675` on 2026-04-07. Every push to `main` ships untested. Multiple security-relevant PRs (Fernet embedding encryption #65, UniFace warm-up #66, alembic runtime fix) merged without CI validation. + +### 8.2 BROKEN: api integration tests have not been exercised in CI on a `main` push since at least 2026-04-18 + +Sampled the 10 most recent runs that completed (`success` or `failure` conclusion) — every single one shows `Integration tests (Testcontainers)` with `conclusion: skipped` (because the unit-test gate failed) or `cancelled` (because the runner never picked it up). **Testcontainers code path is dead.** + +### 8.3 RISKY: `continue-on-error: true` defangs web-app `code-quality` job + +Both `npm audit --audit-level=high` and `npm test --coverage` are wrapped. The job always passes. The "Code Quality" check on the PR page is a meaningless green tick. 
Fix: drop `continue-on-error`, accept the audit-level=high failures (or pin allowlist), and let coverage block on regressions. + +### 8.4 RISKY: branch protection OFF everywhere + +Already covered in §4. Worth restating: this is the single highest-leverage CI hygiene fix — flip protection to "Require a PR + 1 review + status checks: build-and-test, gitleaks" on api, bio, web at minimum. + +### 8.5 RISKY: ruff ignore list silences F821 (undefined name) + +`bio/.github/workflows/ci.yml` line 47 — `ruff check app/ --ignore E501,F401,F821,E402`. F821 is "undefined name" — exactly the bug class that produces NameError at runtime. PR #68's "repair backfill script async-iter" was *that* bug class. Remove F821 from the ignore list. + +### 8.6 RISKY: `pip-audit --strict || true` + +`bio/.github/workflows/ci.yml` line 197 — comment says "Don't fail on vulnerabilities, just report" but a security workflow that doesn't fail on findings is theatre. Per memory `feedback_audit_quality.md`, this is exactly the "audit/recommendation quality" anti-pattern. Same goes for `npm audit --audit-level=high` with `continue-on-error: true`. + +### 8.7 RISKY: all 5 bio CI jobs are pinned to the self-hosted runner + +Already covered in §5. None of these jobs has a real reason to be on `[self-hosted]`. The frontend-build job certainly doesn't. The lint job certainly doesn't. + +### 8.8 RISKY: dummy google-services.json in workflow file + +`client-apps/android-build.yml` lines 60–86. Static "fake" Google Services JSON checked into the workflow with a literal `"AIzaSyCI-DUMMY-KEY-FOR-CI-BUILD"`. Not a real key, but it resembles a Google API key closely enough that secret scanners may flag it, depending on rule set. Better: produce an empty stub or use a secret-loaded valid file. + +### 8.9 RISKY: api CI cache strategy is just `cache: maven` on setup-java + +No fine-grained cache key. Maven downloads ~200MB of deps every successful run. Across the whole repo lifetime, that bandwidth taxes the runner's compute and our build minutes. 
Add a Maven-cache-action with explicit `~/.m2/repository` key on `pom.xml` hash. + +### 8.10 RISKY: web-app CI builds twice (`tsc --noEmit` + `npm run build` which also tsc) + +Two TypeScript passes per CI run. ~30s × 30 runs/wk × N weeks. Tractable fix: drop the standalone `tsc --noEmit` step, rely on `vite build` (which already tsc-checks). + +### 8.11 RISKY: gitleaks scans working tree only, not git history + +Memory notes `git filter-repo` history rewrite is operator-only, deferred. While that's pending, gitleaks-history scan would surface the not-yet-rotated leak surface every PR. Currently it only scans HEAD. + +### 8.12 RISKY: E2E nightly has no failure alerting + +If the cron fails, no one knows. The artifact upload runs `if: always()` — but no Slack / email / GitHub Issue creation on failure. With 1-2 cron runs in production history, no incidents have happened yet. Will happen. + +### 8.13 RISKY: api workflow runs on `master` AND `main` pushes (deploy-hetzner.yml line 4) + +The repo's default branch is `main` per `gh api`. The legacy `master` reference creates double-trigger risk if a stray push to `master` happens. Trim to `main`. + +### 8.14 RISKY: api ci.yml `pull_request` doesn't run integration-tests on PR + +Workflow gates `integration-tests` on `needs: test` AND triggers only on push/PR to `main`, but with the runner stalled, IT never runs. PR comment: "Integration tests skipped." Becomes muscle memory; reviewers stop expecting them. + +### 8.15 RISKY: parent repo CI has no gitleaks + +`infra/`, `nginx/`, `scripts/`, `monitoring/` all live in the parent repo. If a secret slips into `infra/observability/`, it's not scanned. + +--- + +## 9. Findings + Prioritized Recommendations + +Conventions: **P0** = ship-blocking / security-risk, **P1** = significant performance/reliability, **P2** = polish, **P3** = defer. Effort: XS (≤30min), S (≤2h), M (≤1d), L (>1d). 
+ +### P0 — Fix immediately (one-PR-per-repo sweep possible) + +| # | Finding | Where | Recommendation | Effort | One-PR? | +|---|---|---|---|---|---| +| P0.1 | bio CI hasn't passed since 2026-04-07 — every push ships untested | `biometric-processor/.github/workflows/ci.yml` lines 32, 55, 99, 148, 176 | Move all 5 jobs from `[self-hosted, linux, x64]` to `ubuntu-latest`. Verify Docker, Redis, Node 22, Python 3.12 all work on hosted runners (they do). | M | yes (single bio PR) | +| P0.2 | api Testcontainers job blocked by self-hosted-runner stall | `identity-core-api/.github/workflows/ci.yml` line 53 | Move `integration-tests` to `ubuntu-latest`. Keep `TESTCONTAINERS_RYUK_DISABLED: 'true'`. Bench cost: ~5-7min added to hosted budget per push. | S | yes (single api PR) | +| P0.3 | Branch protection OFF on all 5 repos including prod-shipping ones | All 5 repos | Set `Require PR + 1 review + status checks: build-and-test (or CI), gitleaks` on `main` (and `master` for fivucsas). Allow operator admin override. | S | no (5 separate API calls / settings click-through) | +| P0.4 | `pip-audit --strict \|\| true` defangs vulnerability check | `biometric-processor/.github/workflows/ci.yml` line 197 | Drop `\|\| true`. Add `--ignore-vuln` allowlist file if false positives surface. | XS | yes (one-line fix) | +| P0.5 | `code-quality` job in web-app passes regardless of `npm audit` and coverage | `web-app/.github/workflows/ci.yml` lines 92, 96 | Remove `continue-on-error: true` from both. If audit failures hit, allowlist via `npm audit --omit=dev` or `--audit-level=critical` first. | XS | yes | +| P0.6 | ruff F821 silenced — same bug class shipped via PR #68 | `biometric-processor/.github/workflows/ci.yml` line 47 | Remove `F821` from ignore list. Optionally remove `F401` too (unused imports are now common dead code). 
| XS | yes | +| P0.7 | Self-hosted runner is online but not pulling jobs (Task #55 root cause) | Hetzner VPS — operator-side | Operator: SSH, `systemctl status actions.runner.*`, check Runner Group repo access settings on org. **Until fixed, P0.1–P0.2 above eliminate most of the impact.** | M | no (operator) | + +### P1 — Significant reliability / performance + +| # | Finding | Where | Recommendation | Effort | One-PR? | +|---|---|---|---|---|---| +| P1.1 | Deploy pipelines (api/bio/landing) almost never run; operator deploys by hand; "CI deploys" myth in CLAUDE.md | `identity-core-api/.github/workflows/deploy-hetzner.yml`, `biometric-processor/.github/workflows/deploy-hetzner.yml`, `fivucsas/.github/workflows/deploy-landing.yml` | Pick a side: (a) run a 2nd runner so deploys actually happen, OR (b) delete the deploy workflows entirely and document operator-only. Hybrid is the worst option. | M | per-repo | +| P1.2 | Last successful Landing deploy 2026-03-28 — 5+ weeks of stale potential auto-deploy | `fivucsas/.github/workflows/deploy-landing.yml` | Move to `ubuntu-latest` + rsync over SSH (HOSTINGER_SSH_KEY already a secret) — same pattern web-app uses. Eliminates self-hosted dependency. | S | yes | +| P1.3 | E2E nightly has no failure alerting | `web-app/.github/workflows/e2e.yml` | Add a job step: if `failure()` create a GitHub Issue or post to ops-email Grafana contact point (extra step using `gh issue create`). | S | yes | +| P1.4 | E2E `pull_request` trigger missing — PRs ship without Playwright validation | `web-app/.github/workflows/e2e.yml` line 11-13 | Enable `pull_request` for `--project=smoke`. Smoke is `@readonly` per design. | XS | yes | +| P1.5 | api Maven cache miss-y | `identity-core-api/.github/workflows/ci.yml` line 35 | Replace with explicit `actions/cache@v4` keyed on `hashFiles('**/pom.xml')`. | XS | yes | +| P1.6 | Web-app double tsc | `web-app/.github/workflows/ci.yml` line 52 | Drop `npx tsc --noEmit` step. 
| XS | yes | +| P1.7 | gitleaks history scan absent | `identity-core-api/.github/workflows/gitleaks.yml`, `web-app/.github/workflows/gitleaks.yml` | Replace `gitleaks dir .` with `gitleaks detect --source . --no-banner --redact --verbose` (scans history). Optionally upload SARIF. | XS | yes (per repo) | +| P1.8 | bio missing gitleaks workflow | `biometric-processor/.github/workflows/` | Copy api's `gitleaks.yml`. | XS | yes | +| P1.9 | parent missing gitleaks workflow | `fivucsas/.github/workflows/` | Same — but scan ignores submodule subtrees (`--config` with `[allowlist]`). | S | yes | +| P1.10 | `concurrency: cancel-in-progress: true` on deploy workflows kills queued deploys before they can run | `identity-core-api/.github/workflows/deploy-hetzner.yml` line 15, `biometric-processor/.github/workflows/deploy-hetzner.yml` line 16 | Change to `cancel-in-progress: false` so a queued deploy gets to run after the current one finishes. Two pushes in quick succession = two deploys in series, not one. | XS | yes | +| P1.11 | client-apps android-build has no lint/test step | `client-apps/.github/workflows/android-build.yml` | Add `./gradlew :shared:check :androidApp:lintDebug` before `assembleDebug`. | S | yes | +| P1.12 | `master` listed alongside `main` on api deploy | `identity-core-api/.github/workflows/deploy-hetzner.yml` line 4 | Drop `master`. Default branch is `main`. | XS | yes | + +### P2 — Polish + +| # | Finding | Where | Recommendation | Effort | +|---|---|---|---|---| +| P2.1 | No coverage report uploaded for api Maven tests | `identity-core-api/.github/workflows/ci.yml` | Add JaCoCo + `actions/upload-artifact` for `target/site/jacoco/`. Optionally Codecov. | S | +| P2.2 | No spotbugs/checkstyle/PMD on Java | api ci.yml | Add `mvn spotbugs:check checkstyle:check` job. | M | +| P2.3 | No Dockerfile lint | bio + api Dockerfiles | Add `hadolint` step. 
| XS | +| P2.4 | mypy installed but not invoked in bio | bio ci.yml line 44 | Add `mypy app/ --ignore-missing-imports --no-strict-optional` step. | S | +| P2.5 | No alembic head check | bio ci.yml | Spin postgres in CI, run `alembic upgrade head`, assert clean. | M | +| P2.6 | dummy google-services.json inline | client-apps android-build.yml | Replace with a generated empty stub or a CI secret. | S | +| P2.7 | gitleaks no SARIF | api + web | Add `--report-format sarif --report-path gitleaks.sarif` and `github/codeql-action/upload-sarif`. | XS | +| P2.8 | Bundle size budget missing | web-app | Add `vite-plugin-bundle-analyzer` + threshold gate. | S | +| P2.9 | a11y audit missing | web-app | Add `@axe-core/playwright` step in nightly E2E. | M | +| P2.10 | No link-checker on parent docs | fivucsas parent | Add `lychee` action on Markdown changes. | XS | + +### P3 — Defer + +| # | Finding | Recommendation | +|---|---|---| +| P3.1 | iOS workflow is non-blocking and out of scope per project policy | Leave as-is. | +| P3.2 | Desktop installers signing | Already documented as deferred (see `client-apps/docs/SIGNING.md`). | +| P3.3 | GitGuardian integration | Not part of stated stack; gitleaks + native GitHub secret-scanning sufficient. | +| P3.4 | Reusable workflows / composite actions | Premature optimization for a 14-workflow estate. | + +--- + +## 10. One-PR-Sweep Candidates + +Findings that can land in a single PR per repo without operator coordination: + +* **bio one-PR:** P0.1 + P0.4 + P0.6 + P1.8 (move runners to `ubuntu-latest`, drop `|| true`, drop F821 ignore, add gitleaks workflow). All in `biometric-processor/.github/workflows/`. ~4 file edits. +* **api one-PR:** P0.2 + P1.5 + P1.7 + P1.10 + P1.12 (move IT to ubuntu, real Maven cache, gitleaks history, deploy concurrency, drop master). All in `identity-core-api/.github/workflows/`. ~3 file edits.
+* **web one-PR:** P0.5 + P1.3 + P1.4 + P1.6 + P1.7 (drop continue-on-error, alert on E2E failure, PR trigger, single tsc, gitleaks history). All in `web-app/.github/workflows/`. ~3 file edits. +* **fivucsas one-PR:** P1.2 + P1.9 (rewrite landing deploy to ubuntu+SSH, add gitleaks). 2 file edits. +* **client-apps one-PR:** P1.11 + P2.6 (lint/test step, replace dummy google-services.json). 1 file edit. + +Findings that need operator coordination: + +* **P0.3** (branch protection) — needs `gh api -X PUT` per repo, requires admin token. +* **P0.7** (Hetzner runner diagnosis) — needs SSH to the VPS. +* **P1.1** (deploy contract clarification) — strategic call, not a code change. + +--- + +## 11. Numbers At a Glance + +| Repo | Workflows | last-30 success rate (CI) | last-30 success rate (deploy) | branch protection | gitleaks | +|---|---|---|---|---|---| +| `fivucsas` | 2 | 63% (success), 30% (failure), 7% (cancelled) | 13% (4/30) | OFF | ❌ | +| `identity-core-api` | 3 | 0% (success), 70% (cancelled), 27% (running), 3% (failure) | 0% (last-10) | OFF | ✅ | +| `biometric-processor` | 2 | 2% (success on last 100), 82% (cancelled), 15% (failure) | 0% (last-10) | OFF | ❌ | +| `web-app` | 4 | 87% (CI), 100% (E2E nightly cron, 2 runs) | 80% (8/10 deploy) | OFF (memory said ON — incorrect) | ✅ | +| `client-apps` | 3 | 57% (Android build success), 97% (iOS, non-blocking) | n/a (tag-only) | OFF | ❌ | + +Top duration outliers: + +* api `Integration tests (Testcontainers)`: queued 5h38m–24h before cancellation (Task #55). +* bio CI: 24h-to-the-second timeouts on every job in the last 27 days. +* api Maven unit job (when it runs): 50–80s on `ubuntu-latest`, healthy. +* web-app CI: median 165s, p95 180s, healthy. +* web-app deploy: median 65s, healthy. +* client-apps Android: median 190s, healthy. + +--- + +## 12. 
Closing Notes + +The platform's **actual** CI/CD posture today: + +* web-app is the only end-to-end working pipeline (CI green-rate ~87%, deploys land on Hostinger automatically, gitleaks + native secret scanning + push protection on, E2E cron just lit up). +* identity-core-api ships untested integration code on every push and has shipped that way at least since 2026-04-18. +* biometric-processor ships completely untested code on every push and has shipped that way since 2026-04-07. +* landing-website has not auto-deployed in 5+ weeks. +* The single self-hosted Hetzner runner is online, idle, and not pulling any of the queued jobs that need it. Diagnosis requires SSH. +* Branch protection is OFF on every repo. The platform's security architecture (RFC 6749 reuse-detection, embedding encryption at rest, push-protection) is in tension with a code-merge layer where any push to `main` lands without a green CI. + +**The single highest-leverage change** is moving api `integration-tests` and bio CI off the self-hosted runner and onto `ubuntu-latest`. That eliminates ~70% of the cancellation noise, restores actual test coverage on prod-bound code, and decouples CI from the operator-side runner-fix work. **Estimated combined effort: 1 day total across both repos.** + +The single highest-leverage **operator** change is enabling branch protection on all 5 repos with `Require status checks: CI + gitleaks`. **Estimated effort: 30 min, no code changes.** + +— end of audit — + +## 2026-05-11 — Branch protection enabled + +Closes audit recommendation T3.1.d ("enable branch protection on all 5 repos"). 
+ +### Scope + +Branch protection was enabled on the following 6 branches: + +| Repo | Branch | Notes | +| --- | --- | --- | +| `Rollingcat-Software/FIVUCSAS` | `main` | parent default | +| `Rollingcat-Software/FIVUCSAS` | `master` | integration branch (per `reference_fivucsas_branch_model.md`) | +| `Rollingcat-Software/identity-core-api` | `main` | | +| `Rollingcat-Software/biometric-processor` | `main` | | +| `Rollingcat-Software/web-app` | `main` | | +| `Rollingcat-Software/client-apps` | `main` | | + +### Settings applied (identical on all 6 branches) + +* **Required pull request reviews**: `1` approving review required +* `dismiss_stale_reviews`: `false` +* `require_code_owner_reviews`: `false` +* **Required status checks**: `null` (not enforced — CI is still rolling, will be added later) +* **`enforce_admins`**: `false` — **admin bypass allowed** (see policy below) +* `restrictions`: `null` (anyone with push access can open the PR) +* `required_linear_history`: `false` (merge commits allowed — important for the `master` integration pattern) +* `allow_force_pushes`: `false` (force-push to protected branch blocked) +* `allow_deletions`: `false` (branch cannot be deleted) +* `required_conversation_resolution`: `true` (all review threads must be resolved before merge) + +### Admin-bypass policy + +`enforce_admins=false` was chosen deliberately for the solo-dev cadence (see memory note `feedback_pr_review_workflow.md`). Admin bypass is **only** to be used for: + +* **Emergency production hotfixes** when no second reviewer is available and the fix is small/auditable. +* **Operator-only mechanical chores** (submodule pointer bumps, dependency-lock updates, doc-only changes that have already been reviewed elsewhere). + +Every admin-bypass merge **must** be documented in the PR description with the reason ("emergency hotfix for X" / "submodule pointer bump only"). PRs that bypass review without justification should be flagged in the next CICD audit. 
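The settings listed above map onto GitHub's "Update branch protection" REST endpoint (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`). The following is a minimal sketch of how they could be applied, not a record of the command that was actually run. The function names `protection_payload` and `apply_protection` and the `DRY_RUN` switch are illustrative assumptions; the repository and branch names come from the Scope table.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the JSON request body. Values mirror the "Settings applied" list.
# The endpoint expects the four top-level keys required_status_checks,
# enforce_admins, required_pull_request_reviews and restrictions to be
# present even when they are null.
protection_payload() {
  cat <<'JSON'
{
  "required_status_checks": null,
  "enforce_admins": false,
  "required_pull_request_reviews": {
    "dismiss_stale_reviews": false,
    "require_code_owner_reviews": false,
    "required_approving_review_count": 1
  },
  "restrictions": null,
  "required_linear_history": false,
  "allow_force_pushes": false,
  "allow_deletions": false,
  "required_conversation_resolution": true
}
JSON
}

# Hypothetical helper: apply the payload to one repo/branch pair.
# With DRY_RUN=1 (the default in this sketch) it only prints the command.
apply_protection() {
  local repo="$1" branch="$2"
  local path="repos/Rollingcat-Software/${repo}/branches/${branch}/protection"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "gh api -X PUT ${path} --input -"
  else
    # "--input -" makes gh read the request body from stdin.
    protection_payload | gh api -X PUT "${path}" --input -
  fi
}

# The six protected branches from the Scope table, as repo:branch pairs.
TARGETS="FIVUCSAS:main FIVUCSAS:master identity-core-api:main biometric-processor:main web-app:main client-apps:main"

for target in $TARGETS; do
  apply_protection "${target%%:*}" "${target##*:}"
done
```

Running it as-is prints the six `gh api` invocations without touching GitHub; `DRY_RUN=0` would send them and requires a token with admin rights on each repo. Keeping the payload in one function means a later change, such as adding `required_status_checks`, is made in a single place.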
+ +### Verification command + +```bash +gh api repos/Rollingcat-Software/<repo>/branches/<branch>/protection +``` + +### Evidence — `Rollingcat-Software/FIVUCSAS` branch `main` + +```json +{ + "url": "https://api.github.com/repos/Rollingcat-Software/FIVUCSAS/branches/main/protection", + "required_pull_request_reviews": { + "url": "https://api.github.com/repos/Rollingcat-Software/FIVUCSAS/branches/main/protection/required_pull_request_reviews", + "dismiss_stale_reviews": false, + "require_code_owner_reviews": false, + "require_last_push_approval": false, + "required_approving_review_count": 1 + }, + "required_signatures": { + "url": "https://api.github.com/repos/Rollingcat-Software/FIVUCSAS/branches/main/protection/required_signatures", + "enabled": false + }, + "enforce_admins": { + "url": "https://api.github.com/repos/Rollingcat-Software/FIVUCSAS/branches/main/protection/enforce_admins", + "enabled": false + }, + "required_linear_history": { "enabled": false }, + "allow_force_pushes": { "enabled": false }, + "allow_deletions": { "enabled": false }, + "block_creations": { "enabled": false }, + "required_conversation_resolution": { "enabled": true }, + "lock_branch": { "enabled": false }, + "allow_fork_syncing": { "enabled": false } +} +``` + +### Verification summary (all 6 branches) + +``` +FIVUCSAS/main reviews=1 enforce_admins=false force_push=false del=false conv=true +FIVUCSAS/master reviews=1 enforce_admins=false force_push=false del=false conv=true +identity-core-api/main reviews=1 enforce_admins=false force_push=false del=false conv=true +biometric-processor/main reviews=1 enforce_admins=false force_push=false del=false conv=true +web-app/main reviews=1 enforce_admins=false force_push=false del=false conv=true +client-apps/main reviews=1 enforce_admins=false force_push=false del=false conv=true +``` + +### Follow-up + +* When CI reaches stable green-rate per repo, add `required_status_checks` (e.g.
`ci`, `gitleaks`, `e2e`) to each branch — see T3.1.e in the next CICD audit. +* Optional future tightening: flip `enforce_admins` to `true` once a second reviewer is consistently available. diff --git a/CLAUDE.md b/CLAUDE.md index 4dd5121..c1af72b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -108,6 +108,42 @@ PASSWORD | EMAIL_OTP | SMS_OTP | TOTP | FACE | VOICE | FINGERPRINT | HARDWARE_KE - My Profile page (enrollments, activity, data export, KVKK/GDPR) - Cross-device session management (view/revoke) +## Biometric Pipeline (CRITICAL — Read Before Touching biometric-processor or web-app auth) + +**Architecture decision:** The auth decision must be made on the server; the browser is untrusted. The client geometry embedding (512-dim landmark distance) is LOG-ONLY and is not used for auth (decision D2). + +### Actual Production Status (2026-04-28 afternoon, post-fix) +| Layer | Status | +|---|---| +| Client detection (auth) | ✅ MediaPipe FaceLandmarker 478pt primary, BlazeFace fallback | +| Server detection | ✅ MTCNN (bundled weights, deviation from centerface roadmap due to DeepFace bug) | +| Server embedding | ✅ Facenet512 (512-dim) | +| Server liveness (/verify) | ✅ UniFace MiniFASNet passive — `LIVENESS_BACKEND=uniface`, `LIVENESS_MODE=passive` | +| Server liveness (/enroll) | ✅ Wired | +| Server anti-spoofing | ✅ `ANTI_SPOOFING_ENABLED=true` | +| Client passive liveness | ✅ `PASSIVE_LIVENESS_THRESHOLD=0.45` gate in useFaceChallenge | +| Client quality scoring | ✅ Bbox fallback when no landmarks; weights redistribute to blur*0.55+lighting*0.45 | +| pgvector search | ✅ In production | +| Adaptive threshold | ✅ `VERIFICATION_THRESHOLD_AGED_*` for >2yr-old embeddings | + +### Rule: Embedding Dimension Consistency +`FACE_RECOGNITION_MODEL` and `EMBEDDING_DIMENSION` must always match: +- `Facenet` → `EMBEDDING_DIMENSION=128` +- `Facenet512` → `EMBEDDING_DIMENSION=512` +- Changing the model **invalidates all embeddings**; re-enrollment is mandatory + +### Rule: Models That Require a GPU
+While `ALLOW_HEAVY_ML=false` (default), these models block boot: +- `FACE_DETECTION_BACKEND`: `retinaface`, `yolov8`, `yolov11*`, `yolov12*` +- `FACE_RECOGNITION_MODEL`: `ArcFace`, `VGG-Face`, `GhostFaceNet` + +CX43 is CPU-only; no GPU need arises (the Phase 1-3 roadmap is CPU-safe). + +### Rule: Liveness Integration +The `/liveness` endpoint runs separately. `/enroll` and `/verify` do not call liveness; this is not intentional but an open gap. It will be fixed in Phase 2. + +**Detail:** `archive/2026-04-pre-roadmap-2028/BIOMETRIC_PIPELINE_AUDIT_2026-04-28.md` | **Roadmap:** `archive/2026-04-pre-roadmap-2028/BIOMETRIC_ROADMAP_2026-04-28.md` + ## Database - Flyway migrations V1-V38 (identity-core-api; V37 tenant_id index, V38 SPA public client flip) + Alembic 0001-0004 (biometric-processor) diff --git a/CLIENT_APPS_PARITY_PLAN_2026-04-28.md b/CLIENT_APPS_PARITY_PLAN_2026-04-28.md new file mode 100644 index 0000000..3b656f7 --- /dev/null +++ b/CLIENT_APPS_PARITY_PLAN_2026-04-28.md @@ -0,0 +1,144 @@ +# Client-Apps Parity & APK Release Plan — 2026-04-28 + +Research-only output from Team D. **Nothing was changed in code.** This +file captures the work for user review before implementation. + +## 1. UI Parity Plan (≈5.5h work) + +Goal: bring `client-apps` LoginScreen visually + behaviorally close to +`web-app/src/features/auth/components/LoginPage.tsx` and +`web-app/src/verify-app/HostedLoginApp.tsx`. + +### Web reference visuals +- Background gradient: `linear-gradient(135deg, #667eea → #764ba2 → #f64f59)`, animated. +- Primary/button gradient: `#6366f1 → #8b5cf6`. +- Input bg: `#f8fafc` light, focus `#fff`. Text `#1a1a2e`. Border `rgba(0,0,0,0.23)`. +- Card: glassmorphism (white 0.95, blur 20px), 24px radius. +- Logo: 80×80 gradient box, white Fingerprint icon, shadow. +- TextField/Button radius: 12px. +- Motion: framer-motion staggered entry, logo 3D rotateY. +- Floating shapes: 5 glassmorphic circles (decorative).
+ +### Client-apps current +- File: `client-apps/shared/src/commonMain/kotlin/com/fivucsas/shared/ui/screen/LoginScreen.kt` (441 lines), Material3. +- Theme: `AppColors.kt` Primary `#FF1976D2`, Secondary `#FF00ACC1`. No gradients. +- No card wrapper, no logo gradient block, no animations on form entry. + +### Phase plan +| Phase | Work | Effort | +|---|---|---| +| 1 — Colors & Shapes | Update `AppColors.kt`: Primary `#6366F1`, Secondary `#8B5CF6`, add `WebGradientBg`/`WebPrimaryGradient` Brushes. Add `AppShapes.small = RoundedCornerShape(12.dp)`. Wrap LoginScreen in Card + add gradient logo box. | 2h | +| 2 — Form styling | Custom `OutlinedTextField` defaults: bg `#F8FAFC`, text `#1A1A2E`, 12dp radius. | 1.5h | +| 3 — Animations & polish | `AnimatedVisibility` + slide animations for form fields. White spinner color. Verify dark-mode behavior (recommend light-only for login). | 2h | + +### Compose limitations / decisions needed +- **Glassmorphism** — Compose has no native `backdrop-filter: blur()`. Use solid white Card + elevation shadow. Acceptable. +- **Animated gradient** — CSS animation not portable. Use static gradient OR `animateFloat()` + offset (medium complexity). Recommend static for MVP. +- **Floating shapes** — Decorative; expensive on mobile. Defer. +- **Dark mode** — Web has no dark login. Recommend forcing light-only for `LoginScreen` (override LocalThemeMode in screen root). + +## 2. APK Release Workflow Plan + +### Current state +- `client-apps/.github/workflows/android-build.yml` (142 lines) — builds debug + release APKs but **does not sign the release** and **does not upload to GitHub Releases**. All historical APK uploads (v1.0.0–v5.2.0) were manual. +- `client-apps/androidApp/build.gradle.kts:23-108` — signing config already reads from env vars; no code change needed. +- **No GitHub repo secrets exist** for `ANDROID_KEYSTORE_BASE64`, `ANDROID_KEYSTORE_PASSWORD`, `ANDROID_KEY_ALIAS`, `ANDROID_KEY_PASSWORD`. + +### What user must do (one-time) +1. 
Generate keystore locally: + ``` + keytool -genkey -v -keystore release.jks -keyalg RSA -keysize 2048 -validity 36500 \ + -alias fivucsas \ + -dname "CN=FIVUCSAS, OU=Engineering, O=Marmara University, C=TR" \ + -storepass "<store-password>" -keypass "<key-password>" + ``` +2. Base64-encode: `base64 -i release.jks` (Linux/Mac) or PowerShell `[Convert]::ToBase64String([IO.File]::ReadAllBytes("release.jks"))`. +3. Add 4 GitHub repo secrets at `github.com/Rollingcat-Software/client-apps/settings/secrets/actions`: + - `ANDROID_KEYSTORE_BASE64` (the base64 string) + - `ANDROID_KEYSTORE_PASSWORD` + - `ANDROID_KEY_ALIAS` (= `fivucsas`) + - `ANDROID_KEY_PASSWORD` +4. **Never commit `release.jks` to git.** Store securely (e.g., `~/.android/keystore/`). + +### Workflow YAML (to be added at `.github/workflows/android-release.yml`) + +Triggers on `vX.Y.Z` tag push. Builds signed APK, uploads to GitHub Releases tagged with the same version. + +```yaml +name: Android Release APK +on: + push: + tags: ['v[0-9]+.[0-9]+.[0-9]+'] + workflow_dispatch: + inputs: + tag_name: + description: 'Release tag (e.g.
v5.2.1)' + required: true + type: string +concurrency: + group: android-release-${{ github.ref }} + cancel-in-progress: false +env: + JAVA_VERSION: '21' +jobs: + build_and_release: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-java@v4 + with: { java-version: 21, distribution: 'temurin' } + - uses: android-actions/setup-android@v3 + - uses: gradle/actions/setup-gradle@v4 + - name: Dummy google-services.json + run: | + cat > androidApp/google-services.json << 'EOF' + {"project_info":{"project_id":"fivucsas-ci-dummy"},"client":[{"client_info":{"android_client_info":{"package_name":"com.fivucsas.mobile"}}}],"configuration_version":"1"} + EOF + - name: Decode keystore + env: + ANDROID_KEYSTORE_BASE64: ${{ secrets.ANDROID_KEYSTORE_BASE64 }} + run: | + [ -z "$ANDROID_KEYSTORE_BASE64" ] && { echo "::error::keystore secret missing"; exit 1; } + mkdir -p "$RUNNER_TEMP/keystore" + printf '%s' "$ANDROID_KEYSTORE_BASE64" | base64 -d > "$RUNNER_TEMP/keystore/release.jks" + echo "ANDROID_KEYSTORE_PATH=$RUNNER_TEMP/keystore/release.jks" >> "$GITHUB_ENV" + - name: Build signed release APK + env: + ANDROID_KEYSTORE_PASSWORD: ${{ secrets.ANDROID_KEYSTORE_PASSWORD }} + ANDROID_KEY_ALIAS: ${{ secrets.ANDROID_KEY_ALIAS }} + ANDROID_KEY_PASSWORD: ${{ secrets.ANDROID_KEY_PASSWORD }} + run: ./gradlew :androidApp:assembleRelease --no-daemon + - name: Wipe keystore + if: always() + run: rm -f "$RUNNER_TEMP/keystore/release.jks" + - id: version + run: | + TAG="${{ github.ref_name }}" + echo "version_name=${TAG#v}" >> "$GITHUB_OUTPUT" + - uses: softprops/action-gh-release@v1 + env: { GITHUB_TOKEN: '${{ secrets.GITHUB_TOKEN }}' } + with: + tag_name: ${{ github.ref_name }} + files: androidApp/build/outputs/apk/release/*.apk + body: | + ## FIVUCSAS Mobile ${{ steps.version.outputs.version_name }} + Signed release APK. Suitable for direct distribution or Play submission. + Package: com.fivucsas.mobile +``` + +## 3. Open decisions (need user) + +1. 
Dark mode for login: light-only or platform-dark? **Recommend: light-only.** +2. Animated gradient background: do or skip for MVP? **Recommend: skip, static gradient.** +3. Floating glassmorphic shapes: implement or defer? **Recommend: defer.** +4. Test the workflow on `v5.2.0-test` first or go straight to `v5.2.1`? **Recommend: test tag first.** +5. Keystore rotation policy: store rotation cadence (e.g., 12 months)? Document it. + +## 4. Sequence I recommend the user follow + +1. Approve the parity color/typography choices (or push back). +2. Generate the keystore locally; do NOT share it with anyone. +3. Add 4 GitHub secrets. +4. Approve me to commit the `android-release.yml` and apply the parity changes. +5. Push a `v5.2.0-test` tag, watch the workflow, delete the test release after. +6. Push `v5.2.1` for real. diff --git a/DOC_AUDIT_2026-05-04.md b/DOC_AUDIT_2026-05-04.md new file mode 100644 index 0000000..ea0c7f4 --- /dev/null +++ b/DOC_AUDIT_2026-05-04.md @@ -0,0 +1,482 @@ +# DOCUMENTATION AUDIT — FIVUCSAS + +**Date:** 2026-05-04 +**Author:** T-DOC-AUDIT (single-pass agent run) +**Scope:** Parent monorepo + 5 submodules + `docs/` submodule + `/opt/projects/infra/` runbooks +**Method:** Filesystem inventory of every `*.md` on current `HEAD`, cross-referenced against industry checklists from GitHub, Google Cloud, AWS, Microsoft, and ADR.github.io. + +--- + +## 1. Executive Summary + +FIVUCSAS has a **lot** of documentation — north of 200 markdown files across the parent and five submodules — but the distribution is heavily skewed toward _running narrative_ (dated audits, session-status logs, roadmaps with timestamps in the filename) rather than _evergreen reference_ (CONTRIBUTING, SECURITY, ADRs, tenant onboarding, API reference). 
Open-source health files (`LICENSE`, `CONTRIBUTING.md`, `SECURITY.md`, `CODE_OF_CONDUCT.md`, `.github/ISSUE_TEMPLATE/`, `.github/PULL_REQUEST_TEMPLATE.md`, `CODEOWNERS`, `dependabot.yml`) are **completely absent from every repo** despite the README badges advertising MIT licensing. Architecture and runbook coverage is actually strong — `/opt/projects/infra/` has 8 production runbooks and `docs/` has a deep `02-architecture/` tree — but the runbooks live outside the public repo and the architecture docs predate ~30% of the production code (e.g., V50–V57 migrations, hosted-OIDC, RFC 6749 reuse-detection). The parent root has accumulated ~25 dated review/session/audit files since 2026-04-28 that should be moved into `docs/reviews/YYYY-MM-DD/` to keep the root scannable. + +**Top-three documentation gaps (P0/P1):** + +1. **No `SECURITY.md` anywhere** — biometric SaaS platforms must have a documented vulnerability disclosure channel. P0. +2. **No tenant onboarding playbook** — there is `docs/04-api/SERVICES_OVERVIEW.md` and `docs/09-auth-flows/07-TENANT_ADMIN_UX.md` but nothing that walks a new tenant from "I just got admin credentials" to "users can log in via my custom flow." P0 user-facing. +3. **No ADRs (`docs/adr/`).** Major past decisions — pgvector vs. FAISS, hosted-first OIDC, MobileFaceNet removal, Facenet512 over ArcFace, MTCNN vs. centerface, hexagonal architecture, log-only client embeddings (D2), passive UniFace liveness — live only in CHANGELOG narrative or session-memory memos. Reconstruct as backfilled ADRs. P1. + +--- + +## 2. 
Industry-Standard Checklist + +Based on [GitHub Best Practices for Repositories](https://docs.github.com/en/repositories/creating-and-managing-repositories/best-practices-for-repositories), [GitHub Special Files](https://gist.github.com/jakebrinkmann/c63eaedbe384516e4a7bc133c1e1066b), [adr.github.io](https://adr.github.io/), [Google Cloud ADR overview](https://docs.cloud.google.com/architecture/architecture-decision-records), [arc42 §9](https://docs.arc42.org/section-9/), [AWS Prescriptive Guidance — ADR process](https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/adr-process.html), [WorkOS multi-tenant SaaS guide](https://workos.com/blog/developers-guide-saas-multi-tenant-architecture), and [CTO SaaS Security Checklist](https://github.com/sqreen/CTOSecurityChecklist). Industry exemplars referenced: Auth0, Stripe, Supabase, Keycloak. + +### 2.1 Repo-root files (every repo) + +| File | Purpose | Industry expectation | +|------|---------|----------------------| +| `README.md` | What/why/how of project | Required. Project name, badge row, 30-second pitch, install, run, contribute link | +| `LICENSE` | License text (MIT/Apache/etc.) | Required. README badge alone is **not** sufficient | +| `CONTRIBUTING.md` | PR rules, branch model, coding standards | Strongly recommended. 
Linked by GitHub on New PR/Issue pages | +| `SECURITY.md` | Vulnerability disclosure policy, supported versions | **Required** for any service handling auth/biometric/PII | +| `CODE_OF_CONDUCT.md` | Community behavior rules | Strongly recommended (Contributor Covenant template) | +| `CHANGELOG.md` | Versioned change log | Required, [Keep a Changelog](https://keepachangelog.com/) format | +| `ROADMAP.md` | Forward-looking plan | Optional; can live in `docs/` | +| `RELEASING.md` | Cut-release checklist | Recommended for SaaS/SDK | +| `CLAUDE.md` / `AGENT.md` | AI-assistant onboarding | Project-specific, no industry standard yet | + +### 2.2 `.github/` files (each repo) + +| File | Purpose | +|------|---------| +| `.github/PULL_REQUEST_TEMPLATE.md` | Required PR checklist | +| `.github/ISSUE_TEMPLATE/bug_report.md` | Bug-report scaffold | +| `.github/ISSUE_TEMPLATE/feature_request.md` | Feature scaffold | +| `.github/ISSUE_TEMPLATE/config.yml` | Disable blank issues, link to support | +| `.github/CODEOWNERS` | Auto-assign reviewers | +| `.github/dependabot.yml` | Auto dependency PRs | +| `.github/SECURITY.md` | Mirrors root SECURITY.md (GitHub also surfaces this path) | +| `.github/FUNDING.yml` | Optional sponsor button | + +### 2.3 `docs/` directory layout (industry convention) + +``` +docs/ +├── README.md # Doc index +├── getting-started/ +│ ├── quickstart.md +│ ├── local-development.md +│ └── architecture-tour.md +├── architecture/ +│ ├── overview.md +│ ├── multi-tenancy.md +│ ├── data-model.md +│ └── adr/ # ← Architecture Decision Records +│ ├── 0001-hexagonal-architecture.md +│ ├── 0002-pgvector-over-faiss.md +│ ├── ... 
+│ └── README.md # ADR index, status table +├── api-reference/ # Auto-generated from OpenAPI +├── deployment/ +│ ├── docker-compose.md +│ ├── hetzner.md +│ └── hostinger.md +├── runbooks/ # Operator playbooks +│ ├── disaster-recovery.md +│ ├── secret-rotation.md +│ ├── incident-response.md +│ └── fk-cascade-recovery.md +├── security/ +│ ├── threat-model.md # STRIDE +│ ├── data-handling.md # Biometric data lifecycle +│ └── webauthn.md +├── compliance/ +│ ├── gdpr-art-17.md +│ ├── kvkk.md +│ ├── eu-ai-act.md +│ └── dpa-template.md +├── testing/ +│ ├── unit.md +│ ├── e2e-playwright.md +│ └── load.md +├── tenant/ # Tenant-facing +│ ├── onboarding.md +│ ├── auth-flow-builder.md +│ ├── sdk-integration.md +│ └── pricing.md +├── style-guides/ +│ ├── java.md +│ ├── typescript.md +│ ├── python.md +│ └── kotlin.md +├── i18n/ +│ └── contributor-guide.md +├── glossary.md +└── reviews/ # Dated audit/review docs + └── 2026-05-04/ + ├── senior-db-review.md + └── ... +``` + +### 2.4 Industry exemplars to mirror + +- **[Auth0 docs](https://auth0.com/docs)** — most comparable surface area (auth + tenant model). Tenant-onboarding flow, SDK pages per language, hosted-login pages section. **Strong target** for FIVUCSAS public docs. +- **[Stripe docs](https://stripe.com/docs)** — gold standard for developer portal: copy-pasteable curl + per-language SDK tabs, status page, changelog. Aspirational. +- **[Supabase docs repo](https://github.com/supabase/supabase)** — mid-size OSS multi-tenant SaaS. README + CONTRIBUTING + SECURITY + CODE_OF_CONDUCT + dependabot all present; uses `apps/docs` Nextra site. Realistic structural target. +- **[Keycloak docs](https://www.keycloak.org/documentation)** — another auth platform; strong on operator runbooks and threat modeling. + +--- + +## 3. 
Per-Repo Audit + +Legend: ✅ present + adequate · ⚠️ present but stale or thin · ❌ missing + +### 3.1 Parent repo `/opt/projects/fivucsas/` + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ⚠️ | 4.0.0 / "MIT" badge, mentions `ROADMAP_2026-04-28.md` as canonical (now superseded by `ROADMAP_OPTIMIZED_2026-05-04.md`). Architecture diagram still valid. Last-verified date = 2026-04-28; needs refresh. | +| `CHANGELOG.md` | ⚠️ | Long, narrative-style, not Keep-a-Changelog format. No version headings, just dated session entries. Hard to scan for "what shipped in the last release." | +| `CLAUDE.md` | ✅ | Comprehensive, current to 2026-05-04. Project-specific, not industry standard. | +| `LICENSE` | ❌ | Badge says MIT but no LICENSE file exists in any repo. | +| `CONTRIBUTING.md` | ❌ | Missing. | +| `SECURITY.md` | ❌ | Missing. The closest thing is `docs/SECURITY_INCIDENTS.md` (incident log, not disclosure policy). | +| `CODE_OF_CONDUCT.md` | ❌ | Missing. | +| `ROADMAP.md` | ⚠️ | Two competing files: `ROADMAP_2026-04-28.md` (declared canonical in README) and `ROADMAP_OPTIMIZED_2026-05-04.md` (newer). Confusing. | +| `RELEASING.md` | ❌ | Missing. | +| `.github/PULL_REQUEST_TEMPLATE.md` | ❌ | Missing. | +| `.github/ISSUE_TEMPLATE/` | ❌ | Missing. | +| `.github/CODEOWNERS` | ❌ | Missing. | +| `.github/dependabot.yml` | ❌ | Missing. | +| Dated review/audit docs at root | ⚠️ | 25+ files (`AUDIT_2026-04-28_*`, `BACKEND_REVIEW_2026-04-30.md`, `SENIOR_DB_REVIEW_2026-05-04.md`, `SESSION_STATUS_2026-05-04.md`, etc.). Should be archived to `docs/reviews/YYYY-MM-DD/`. | + +### 3.2 `identity-core-api/` (Spring Boot, port 8080) + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ✅ | Detailed, has TOC, badges, MIT ref. Up to date. | +| `CHANGELOG.md` | ✅ | Present. Verify Keep-a-Changelog conformance in cleanup pass. | +| `ROADMAP.md` | ✅ | Present. | +| `TODO.md` | ⚠️ | "49-item integration audit." Mix of done/not-done; needs sweep. 
| +| `CLAUDE.md` | ✅ | Up to date through V57 + PR #71. | +| `LICENSE` | ❌ | Missing. | +| `CONTRIBUTING.md` | ❌ | Missing. | +| `SECURITY.md` | ❌ | Missing. **Highest-priority gap** — this repo holds JWT signing, OAuth2, WebAuthn, refresh-token reuse-detection, all auth surface. | +| `CODE_OF_CONDUCT.md` | ❌ | Missing. | +| `.github/PULL_REQUEST_TEMPLATE.md` | ❌ | Missing. | +| `.github/CODEOWNERS` | ❌ | Missing. | +| `.github/dependabot.yml` | ❌ | Missing. | +| `docs/migrations/` | ⚠️ | Two files cover V7-V9 era; nothing newer. V10–V57 migration narratives live in CHANGELOG. | +| `docs/research/server-providers-comparison.md` | ✅ | One-off historical doc; fine. | +| OpenAPI spec / API reference | ⚠️ | Auto-generated at runtime via springdoc — no static export committed. Future: nightly export to `docs/api/openapi-snapshot.json` for diffability. | + +### 3.3 `biometric-processor/` (FastAPI, port 8001) + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ✅ | Concise, security-first language ("internal Docker network only"). | +| `CHANGELOG.md` | ✅ | Present. | +| `ROADMAP.md` | ✅ | Present. | +| `ARCHITECTURE.md` | ✅ | Top-level architecture file. | +| `DOCKER_SETUP.md` | ✅ | Present. | +| `CLAUDE.md` | ✅ | Current. | +| `AUDIT_2026-04-26.md` | ⚠️ | Stale dated audit at repo root — should move into `docs/audits/` or `docs/reviews/2026-04-26/`. | +| `docs/` | ✅ | Well-structured 1-getting-started / 2-api-documentation / 3-deployment / 6-architecture tree. | +| `docs/2-api-documentation/API_REFERENCE.md` | ✅ | Hand-written API reference present. Verify it matches FastAPI `/docs` (auto-generated OpenAPI). | +| `docs/CVE_AUDIT_2026-04-18.md` | ⚠️ | Date suggests stale; verify still accurate post-2026-04-30 dependabot sweep. | +| `docs/UNIFACE_BACKEND_BENCHMARKING.md` | ✅ | Good — captures liveness backend choice rationale. | +| `docs/research/PROCTORING_SERVICE_RESEARCH.md` | ✅ | Strategic doc. 
| +| `LICENSE`, `CONTRIBUTING.md`, `SECURITY.md`, `CODE_OF_CONDUCT.md` | ❌ | All missing. | +| `.github/` | ⚠️ | Only `workflows/`. No PR/issue templates, CODEOWNERS, or dependabot. | +| `docs/4-testing/` | ❌ | Directory referenced in README but **not present** (verified). README link broken. | +| `docs/5-security/` | ❌ | Referenced; not present. | + +### 3.4 `web-app/` (React + TypeScript, port 5173 dev) + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ✅ | Detailed; lists architecture, two surfaces (admin + verify). | +| `CHANGELOG.md` | ✅ | Present. | +| `ROADMAP.md` | ✅ | Present. | +| `TODO.md` | ⚠️ | Plus `TODO_USER_PROFILE_BUGS_2026-04-30.md` — second TODO with date in name suggests no consolidation discipline. | +| `CLAUDE.md` | ✅ | Current. | +| `docs/ARCHITECTURE.md` | ✅ | Present. | +| `docs/AUTHENTICATION_SYSTEM_DESIGN.md` | ✅ | Present. | +| `docs/DEVELOPER_GUIDE.md` | ✅ | Present. | +| `docs/AUDIT_REPORT_2026-04-16.md` | ⚠️ | Old dated audit in `docs/`; should move to `docs/reviews/`. | +| `docs/plans/HOSTED_LOGIN_INTEGRATION.md` | ✅ | One plan doc. | +| SDK package docs (`@fivucsas/auth-js`, `@fivucsas/auth-react`) | ❌ | No README inside the published SDK packages (verify on next release). | +| `LICENSE`, `CONTRIBUTING.md`, `SECURITY.md`, `CODE_OF_CONDUCT.md` | ❌ | All missing. | +| `.github/` | ⚠️ | Only `workflows/`. | +| Storybook / component catalog | ❌ | None. | +| i18n contributor guide | ❌ | Memory rule "no hardcoded strings" is enforced at PR review only; not documented for new contributors. | + +### 3.5 `client-apps/` (Kotlin Multiplatform Android) + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ✅ | Present. | +| `CHANGELOG.md` | ✅ | Present. | +| `ROADMAP_CLIENT_APPS.md` | ✅ | Renamed roadmap; not just `ROADMAP.md`. Convention drift across repos. | +| `docs/RELEASE.md` | ✅ | Cut-release checklist. **Best release doc in the monorepo.** | +| `docs/SIGNING.md` | ✅ | APK signing. 
| +| `docs/PERFORMANCE.md` | ✅ | Present. | +| `docs/DEPLOYMENT_CHECKLIST.md` | ✅ | Present. | +| `docs/TODO.md` | ⚠️ | Should rename `TODO.md` → align with sibling repos, or fold into top-level TODO. | +| iOS docs | n/a | iOS scope permanently OUT (memory-confirmed). Acceptable. | +| `LICENSE`, `CONTRIBUTING.md`, `SECURITY.md`, `CODE_OF_CONDUCT.md` | ❌ | All missing. | +| `.github/` | ⚠️ | Only `workflows/`. | + +### 3.6 `landing-website/` (React + Tailwind, fivucsas.com) + +| File / Topic | Status | Notes | +|---|---|---| +| `README.md` | ❌ | **Completely undocumented.** No README, no CHANGELOG, no docs of any kind. The build/deploy is ad-hoc (rsync to Hostinger). | +| `LICENSE`, `CONTRIBUTING.md`, `SECURITY.md`, `CODE_OF_CONDUCT.md` | ❌ | All missing. | +| `.github/` | ❌ | Directory absent. | +| Style guide | ❌ | Tailwind config + design tokens undocumented. | + +### 3.7 `docs/` (submodule, separate repo) + +| Section | Status | Notes | +|---|---|---| +| `README.md` (top-level index) | ✅ | Comprehensive. Numbered 0-9 + topical dirs. | +| `00-meta/` | ⚠️ | Mostly empty (only README). | +| `01-getting-started/` | ✅ | 4 docs, current. | +| `02-architecture/` | ✅ | Strong: ARCHITECTURE_ANALYSIS, MODULE_STRUCTURE, diagrams/ subdir with PlantUML. Predates V50–V57 migrations though. | +| `03-development/` | ✅ | KMP guide, technology decisions, implementation guide. | +| `04-api/` | ⚠️ | High-level overview only. No per-endpoint reference exported from OpenAPI. | +| `05-testing/` | ✅ | Multiple guides (unit, mobile, quickstart). | +| `06-deployment/` | ⚠️ | One thin file (`START_ALL_SERVICES.md`). Production deployment lives outside this repo (`scripts/deploy/DEPLOYMENT_GUIDE.md`). | +| `07-status/` | ⚠️ | Empty except for README. Was supposed to host `IMPLEMENTATION_STATUS_REPORT.md` — broken link from main `docs/README.md`. | +| `08-website/` | ⚠️ | Empty stub. | +| `09-auth-flows/` | ✅ | 10 numbered docs (capability matrix → voice recognition). 
Excellent module design. | +| `architecture/` (lowercase, parallel to `02-architecture/`) | ⚠️ | Confusing — second architecture dir with `data-flow.md`, `event-bus.md`, `security.md`, `structure.md`, `webhooks.md`. Should be merged. | +| `archive/2026-04-16/` | ✅ | Properly archived old design docs (37 files). Good pattern; not applied at parent root. | +| `audits/AUDIT_2026-04-19.md` | ⚠️ | Only one audit lives here; rest are loose at parent root. Pattern broken. | +| `guides/deployment/`, `guides/local-development.md`, `guides/quick-start.md` | ⚠️ | Yet another parallel hierarchy alongside `01-getting-started/` and `06-deployment/`. | +| `modules/biometric-processor.md`, etc. | ✅ | Per-module summary docs. | +| `plans/` | ✅ | 11 plan docs (BAAS, BYOD, NFC, SMS, Voice). | +| `presentations/` | ✅ | 4 academic-defense slide decks. | +| `project/` | ⚠️ | Three thin meta files; could merge or remove. | +| `testing/` | ⚠️ | Parallel to `05-testing/`. | +| `EMAIL_OTP_SETUP.md`, `STEP_UP_AUTH_GUIDE.md`, `EU_AI_ACT_COMPLIANCE.md`, etc. | ⚠️ | Top-level loose files that should live under `02-architecture/` or `compliance/`. | +| `SECURITY_INCIDENTS.md` | ⚠️ | Incident log — should move to `compliance/` or `security/`. **Not** a substitute for `SECURITY.md` disclosure policy. | +| `adr/` | ❌ | Missing. **Major P1 gap.** | +| `compliance/` | ❌ | Loose: `EU_AI_ACT_COMPLIANCE.md` exists but no GDPR-Art-17 doc, no DPA template, no biometric-data handling memo. | +| `runbooks/` | ❌ | None inside `docs/`. The 8 runbooks live in `/opt/projects/infra/` (private). | +| `glossary.md` | ❌ | Missing. | + +### 3.8 `/opt/projects/infra/` (operator runbooks — **private**, not in any repo) + +| File | Status | Notes | +|---|---|---| +| `RUNBOOK_DR.md` | ✅ | First DR drill executed 2026-04-30 OK. | +| `RUNBOOK_SECRET_ROTATION.md` | ✅ | Used for biometric-API-key rotation 2026-04-30. | +| `RUNBOOK_NETWORK.md` | ✅ | UFW + Hetzner FW + fail2ban. 
| +| `RUNBOOK_PITR.md` | ⚠️ | Landed but deploy DEFERRED (per recent commit). | +| `RUNBOOK_OFFSITE_RETENTION.md` | ✅ | | +| `RUNBOOK_ROLLBACK.md` | ✅ | | +| `RUNBOOK_AUDIT_LOG_PARTMAN.md` | ✅ | Matches V57. | +| `observability/RUNBOOK_OBSERVABILITY.md` | ✅ | Loki+Promtail+Grafana up. | +| **Public discoverability** | ❌ | These runbooks are not linked from any `docs/` index, not committed to a public repo, and a new contributor would never find them. | +| Incident-response runbook | ❌ | Missing — no documented severity matrix, no on-call escalation. | +| FK-cascade recovery runbook | ❌ | Missing despite memory rule `feedback_no_hard_delete_users.md` existing because of a real incident. | +| Refresh-token reuse-detection runbook | ❌ | V50 family-revoke implemented; no operator doc on what to do when a family is revoked. | + +--- + +## 4. Gap List with Priorities + +Priority key: P0 = user/integrator-blocking · P1 = contributor-blocking · P2 = operator-quality · P3 = polish. +Effort: XS (<30 min) · S (1–2 h) · M (½–1 day) · L (multi-day). + +### 4.1 P0 — Ship within the next sprint + +| # | Doc | Repo(s) | Effort | Notes | +|---|---|---|---|---| +| 1 | `SECURITY.md` | parent + 5 submodules | S each | Vulnerability disclosure policy: where to report, PGP/email, expected response window, supported versions, scope. Without this, a researcher who finds a JWT/WebAuthn flaw has nowhere to send it. **Highest-priority gap.** | +| 2 | `LICENSE` (MIT) | parent + 5 submodules | XS each | README badges advertise MIT but no `LICENSE` file. Legally weak. | +| 3 | `docs/tenant/onboarding.md` | docs | M | Walk a new tenant from credentials → SSO config → first user → custom auth flow → SDK install. Currently scattered across 4 files. | +| 4 | `docs/tenant/sdk-integration.md` | docs | S | One copy-pasteable JS + React example for `@fivucsas/auth-js` `loginRedirect`. Bring DeveloperPortalPage content into static docs. 
| +| 5 | `landing-website/README.md` | landing-website | XS | Build, run, deploy, design tokens. Currently zero documentation. | + +### 4.2 P1 — Contributor-facing, ship within 2 sprints + +| # | Doc | Repo(s) | Effort | Notes | +|---|---|---|---|---| +| 6 | `CONTRIBUTING.md` | parent + 5 submodules | S each (or M for parent shared template) | PR rules, branch model (master vs. main per [memory `reference_fivucsas_branch_model.md`](memory)), commit-message conventions, where to file issues, how to run tests locally, signing requirements. | +| 7 | `.github/PULL_REQUEST_TEMPLATE.md` | each repo | XS each | Checklist: tests, i18n keys, OpenAPI updated, CHANGELOG entry, security impact noted. | +| 8 | `.github/ISSUE_TEMPLATE/{bug,feature,security}.md` | each repo | XS each | Standard templates. | +| 9 | `.github/CODEOWNERS` | each repo | XS each | Even solo-dev, this future-proofs and forces auto-assign on PRs. | +| 10 | `.github/dependabot.yml` | each repo | XS each | Already merged dependabot PRs in past — formalize cadence per ecosystem (maven, npm, pip, gradle). | +| 11 | `CODE_OF_CONDUCT.md` | parent | XS | Copy Contributor Covenant 2.1. | +| 12 | `docs/adr/` (10–15 backfilled ADRs) | docs | L | Reconstruct from CHANGELOG + memory: hexagonal, pgvector, hosted-first OIDC, MobileFaceNet removal, Facenet512, MTCNN, log-only client embeddings (D2), passive UniFace, refresh-token family reuse-detection, soft-delete via `@SQLDelete`, JWT HS-key rotation registry. Use [Nygard format](https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions). | +| 13 | `docs/security/threat-model.md` | docs | M | STRIDE on the auth surface (login, MFA step, OAuth2 authorize, WebAuthn registration/assertion, refresh-token rotation). | +| 14 | `docs/security/data-handling.md` | docs | S | Biometric-data lifecycle: capture → quality gate → embedding (server, never raw image) → encryption-at-rest (Fernet, post-PR #65 bio side) → retention → GDPR-Art-17 purge. 
| +| 15 | `docs/i18n/contributor-guide.md` | docs | S | en.json + tr.json conventions, plural keys, `t()` mandatory. Codifies memory rule. | +| 16 | `docs/style-guides/{java,typescript,python,kotlin}.md` | docs | M total | Section in each per-repo CLAUDE.md exists; promote to public style guides. | +| 17 | Move dated review/audit/session docs out of root | parent | S | See §5 organization. | +| 18 | Repair `docs/` broken links | docs | XS | `IMPLEMENTATION_STATUS_REPORT.md` (07-status), `docs/4-testing/`, `docs/5-security/` paths advertised but missing. | + +### 4.3 P2 — Operator-facing + +| # | Doc | Repo(s) | Effort | Notes | +|---|---|---|---|---| +| 19 | `docs/runbooks/` mirror of `/opt/projects/infra/RUNBOOK_*.md` (sanitized) | docs | M | Public-facing operator runbooks (with secrets redacted). Or at minimum, an index in `docs/06-deployment/` saying "operator runbooks live in private `infra/` repo, ask maintainer." | +| 20 | `docs/runbooks/incident-response.md` | docs | M | Severity matrix (SEV-1/2/3), on-call rota (even solo), escalation, postmortem template. Missing despite multiple recent incidents. | +| 21 | `docs/runbooks/fk-cascade-recovery.md` | docs | S | Per memory `feedback_no_hard_delete_users.md`. The 13-table cascade is documented in CLAUDE memory, not anywhere a sober-3am operator can find. | +| 22 | `docs/runbooks/refresh-token-reuse.md` | docs | S | What V50 family-revoke triggers, how to investigate, how to communicate to user. | +| 23 | `RELEASING.md` | each repo | S each | Cut-release checklist. `client-apps/docs/RELEASE.md` is a model; replicate for api/web/bio. | +| 24 | `CHANGELOG.md` Keep-a-Changelog rewrite | parent + 5 submodules | M | Current format is dated narrative. Move historical entries to `docs/changelog-archive.md`; keep current `CHANGELOG.md` strict to Added/Changed/Deprecated/Removed/Fixed/Security per release. 
| +| 25 | OpenAPI spec snapshot in repo | api + bio | XS each (CI step) | Nightly export `target/openapi.yaml` → `docs/api/openapi-spring.yaml` (and FastAPI equivalent) so spec changes are reviewable in PRs. | + +### 4.4 P3 — Polish + +| # | Doc | Effort | Notes | +|---|---|---|---| +| 26 | `docs/glossary.md` | S | TC Kimlik, MRZ, FIDO2 vs WebAuthn, AMR claim, soft-delete, family-revoke, embedding, liveness mode passive vs active. | +| 27 | Architecture diagram refresh | M | `docs/02-architecture/ARCHITECTURE_DIAGRAMS.md` predates V50–V57. Add OAuth2 + hosted-login + verify.fivucsas.com surface. | +| 28 | Status page | L | Statuspage.io / instatus.com / cstate. Currently no public uptime visibility. | +| 29 | Public developer portal site | L | Static site (Docusaurus / Nextra / VitePress) wrapping `docs/` for SEO + search. | +| 30 | Cleanup duplicate hierarchies in `docs/` | M | Merge `architecture/` into `02-architecture/`, `testing/` into `05-testing/`, `guides/` into `01-getting-started/`. | + +--- + +## 5. 
Organization Recommendation (READ-ONLY proposal — no files moved in this PR) + +### 5.1 Problem + +`/opt/projects/fivucsas/` (parent root) currently has **25+ dated review/audit/roadmap/session-status files** at the top level: + +``` +ANALYSIS_2026-05-02_USER_DOMAIN_AND_JWT_ROTATION.md +AUDIT_2026-04-28_BASIC.md +AUDIT_2026-04-28_EDGE.md +AUDIT_2026-04-28_OPS.md +AUDIT_2026-04-28_SECURITY.md +AUDIT_2026-04-29_OPS_FOLLOWUP.md +BACKEND_REVIEW_2026-04-30.md +CICD_AUDIT_2026-05-04.md +CLIENT_APPS_PARITY_PLAN_2026-04-28.md +ENGINEERING_REVIEW_2026-04-30.md +FRONTEND_REVIEW_2026-04-30.md +MULTI_EMAIL_TENANT_DESIGN_2026-04-28.md +PRODUCT_REVIEW_2026-04-30.md +RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md +ROADMAP_2026-04-28.md +ROADMAP_OPTIMIZED_2026-05-02.md +ROADMAP_OPTIMIZED_2026-05-04.md +SENIOR_DB_REVIEW_2026-05-04.md +SENIOR_UIUX_REVIEW_2026-05-04.md +SESSION_STATUS_2026-05-01.md +SESSION_STATUS_2026-05-02.md +SESSION_STATUS_2026-05-04.md +TODO_POST_AUDIT_2026-04-24.md +USER_BUGS_2026-04-30.md +``` + +Plus this `DOC_AUDIT_2026-05-04.md`. Result: a new contributor cloning the repo sees a wall of timestamps, not a clean entry point. 
+ +### 5.2 Proposed layout + +``` +fivucsas/ # parent root — KEEP MINIMAL +├── README.md # stays +├── CHANGELOG.md # stays (rewrite to Keep-a-Changelog) +├── CLAUDE.md # stays (AI onboarding) +├── LICENSE # ADD +├── CONTRIBUTING.md # ADD +├── SECURITY.md # ADD +├── CODE_OF_CONDUCT.md # ADD +├── ROADMAP.md # ADD as canonical, replacing both timestamped roadmaps +├── docker-compose*.yml # stays +├── archive/ # already exists; keep +└── docs/ # submodule + ├── reviews/ + │ ├── 2026-04-28/ + │ │ ├── audit-basic.md # ← AUDIT_2026-04-28_BASIC.md + │ │ ├── audit-edge.md + │ │ ├── audit-ops.md + │ │ ├── audit-security.md + │ │ ├── client-apps-parity-plan.md + │ │ ├── multi-email-tenant-design.md + │ │ └── roadmap.md + │ ├── 2026-04-29/ + │ ├── 2026-04-30/ + │ │ ├── backend-review.md + │ │ ├── engineering-review.md + │ │ ├── frontend-review.md + │ │ ├── product-review.md + │ │ └── user-bugs.md + │ ├── 2026-05-01/ + │ ├── 2026-05-02/ + │ │ ├── analysis-user-domain-jwt-rotation.md + │ │ ├── research-proctoring-amispoof.md + │ │ ├── roadmap.md + │ │ └── session-status.md + │ └── 2026-05-04/ + │ ├── cicd-audit.md + │ ├── doc-audit.md # ← THIS DOC, ideally + │ ├── roadmap.md # ← ROADMAP_OPTIMIZED_2026-05-04.md + │ ├── senior-db-review.md + │ ├── senior-uiux-review.md + │ └── session-status.md + ├── adr/ # NEW — ADRs + ├── compliance/ # NEW + ├── runbooks/ # NEW (or symlinked from infra/) + ├── security/ # NEW + ├── style-guides/ # NEW + ├── tenant/ # NEW + └── ... (existing tree) +``` + +### 5.3 Rules + +- **Root keeps**: `README`, `CONTRIBUTING`, `SECURITY`, `CODE_OF_CONDUCT`, `LICENSE`, `CHANGELOG`, `ROADMAP`, `CLAUDE.md`. These are the only `.md` files allowed at parent root. +- **Dated review docs** → `docs/reviews/YYYY-MM-DD/.md`. The dating goes in the path, not the filename. +- **Session-status docs** → same: `docs/reviews/YYYY-MM-DD/session-status.md`. +- **Roadmaps** → exactly one canonical `ROADMAP.md` at parent root, history in `docs/reviews/.../roadmap.md`. 
+- **Audits** → `docs/reviews/YYYY-MM-DD/audit-.md` OR `docs/audits/` if the user prefers a flat audit catalog. +- **`docs/` index page** (`docs/README.md`) gets a new top-level **"Reviews & Audit Trail"** section linking to `reviews/` chronologically. + +### 5.4 NOT done in this PR + +This is a **recommendation only**. Moving 25+ files via `git mv` changes the path of every dated doc (a plain `git log` on the new path stops at the rename unless `--follow` is used) and triggers submodule pointer updates. Worth doing — but as its own approval-gated PR with a clear commit boundary (`docs(reorg): move dated review docs to docs/reviews/`). **Not** in this audit PR. + +--- + +## 6. Quick-Win Followups (≤30 min each) + +1. **Add `LICENSE` (MIT)** to all 6 repos. `cp LICENSE` × 6, one commit each. Closes the README-badge-vs-reality gap. +2. **Add `SECURITY.md`** to parent + identity-core-api + biometric-processor. Three-line email-only disclosure policy is fine for v1. +3. **Add `.github/PULL_REQUEST_TEMPLATE.md`** to all 6 repos. Single shared template with checklist. +4. **Fix broken `docs/` links** — `docs/4-testing/` and `docs/5-security/` referenced in `biometric-processor/README.md` but absent. Either create stubs or remove links. +5. **Reconcile `docs/architecture/` (lowercase) and `docs/02-architecture/`** — pick one, redirect the other. +6. **Move `SECURITY_INCIDENTS.md`** out of `docs/` root into `docs/security/incidents.md`. The new filename should stay unambiguous (it's an incident **log**, not a security policy). +7. **Add `.github/CODEOWNERS`** with `* @ahmetabdullah` to each repo. Future-proofs review routing. +8. **Add `.github/dependabot.yml`** with weekly schedule per ecosystem to each repo. (Some already had Dependabot; formalize.) +9. **Designate single canonical `ROADMAP.md`** at parent root; archive `ROADMAP_2026-04-28.md` and `ROADMAP_OPTIMIZED_2026-05-02.md`. +10. **Create `docs/adr/README.md`** with ADR template + index table — even with zero ADRs filed yet, it sets the structure. + +## 7. Multi-Day Initiatives + +1.
**Backfill 10–15 ADRs** from CHANGELOG + memory + reviews. ~½ day per ADR if rigorous. Suggested order: hexagonal architecture → pgvector → hosted-first OIDC → MobileFaceNet removal → Facenet512 → log-only client embeddings (D2) → refresh-token reuse-detection (RFC 6749 §10.4) → soft-delete via `@SQLDelete` → JWT HS-key registry → passive UniFace liveness → MTCNN over centerface → Fernet embedding encryption. +2. **Tenant onboarding playbook** (`docs/tenant/`) — onboarding.md + auth-flow-builder.md + sdk-integration.md + custom-domain.md. Content largely exists scattered; consolidate. ~2–3 days. +3. **Threat model + STRIDE doc** for the auth surface. ~1 day, plus 1 day to socialize/iterate. +4. **CHANGELOG Keep-a-Changelog rewrite** across all repos, including parent. Move narrative to `docs/changelog-archive.md`; keep canonical CHANGELOG strict. ~1 day per repo. +5. **Public developer portal** — Docusaurus/Nextra/VitePress site at `docs.fivucsas.com` consuming the `docs/` submodule. Includes search, dark mode, OpenAPI rendering. ~1 week one-shot, then ongoing. + +--- + +## 8. 
Sources + +- [GitHub — Best practices for repositories](https://docs.github.com/en/repositories/creating-and-managing-repositories/best-practices-for-repositories) +- [GitHub special files (README, LICENSE, CONTRIBUTING, CODE_OF_CONDUCT)](https://gist.github.com/jakebrinkmann/c63eaedbe384516e4a7bc133c1e1066b) +- [adr.github.io — Architectural Decision Records](https://adr.github.io/) +- [joelparkerhenderson/architecture-decision-record (templates)](https://github.com/joelparkerhenderson/architecture-decision-record) +- [AWS Prescriptive Guidance — ADR process](https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/adr-process.html) +- [Microsoft Azure Well-Architected Framework — ADRs](https://learn.microsoft.com/en-us/azure/well-architected/architect-role/architecture-decision-record) +- [Google Cloud — ADR overview](https://docs.cloud.google.com/architecture/architecture-decision-records) +- [arc42 §9 — Architecture decisions](https://docs.arc42.org/section-9/) +- [Cognitect — Documenting Architecture Decisions (Nygard, 2011)](https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions) +- [WorkOS — Developer's guide to multi-tenant SaaS architecture](https://workos.com/blog/developers-guide-saas-multi-tenant-architecture) +- [10up — Open Source Best Practices](https://10up.github.io/Open-Source-Best-Practices/community/) +- [CTOSecurityChecklist (sqreen)](https://github.com/sqreen/CTOSecurityChecklist) +- [Keep a Changelog](https://keepachangelog.com/) + +--- + +*Generated 2026-05-04 by T-DOC-AUDIT. Read-only audit; no code or doc-content changes shipped in this PR. 
All claims re-verified against current `HEAD` per memory rule `feedback_audit_quality.md`.* diff --git a/INVESTIGATION_DEV_CONSTRAINTS_2026-05-07.md b/INVESTIGATION_DEV_CONSTRAINTS_2026-05-07.md new file mode 100644 index 0000000..454503c --- /dev/null +++ b/INVESTIGATION_DEV_CONSTRAINTS_2026-05-07.md @@ -0,0 +1,96 @@ +# Developer / Tenant / Integrator Constraint Audit +**Date:** 2026-05-07 · **Scope:** read-only, HEAD of every submodule under `/opt/projects/fivucsas/`. +**Method:** verified each row by reading source on disk; no doc claims trusted (per `feedback_verify_completion_claims.md`). + +--- + +## 1. Inventory & Findings + +| # | Constraint | Defined | Server-enforced | Surfaced (API/UI) | Reasonable | Severity | Citation | +|---|---|---|---|---|---|---|---| +| 1 | OAuth client RPM cap | NO global / per-client RPM bucket. Only **PKCE-failure** bucket: 30 fails / 5 min / `clientId`. | Partial — failures only; success path unbounded. | 429 + `Retry-After` in failure path. | NO — a client minting 10k tokens/min cannot be capped. | **P1** | `RateLimitService.java:218`-`230`, `:370`-`377` | +| 2 | OAuth scopes — declared vs enforced on `/userinfo` | Stored in `allowed_scopes` (space-sep). | `/userinfo` returns **all** profile claims regardless of token's `scope` claim. | Discrepancy invisible to integrator. | NO — declared scope filtering on ID-token issuance only (`OAuth2Service.java:372`-`384`); `/userinfo` ignores scope. | **P1** | `OAuth2Service.java:445`-`474` | +| 3 | Redirect-URI allowlist | JSON-array in `oauth2_clients.redirect_uris`; exact string + RFC 8252 §7.3 loopback handling. | YES — `OAuth2Client.isRedirectUriAllowed` + `OAuth2Service.validateClient`. | 400 `invalid_request`. | YES — query-smuggling guard, IPv4-loopback only. | OK | `OAuth2Client.java:111`-`166`, `OAuth2Service.java:80`-`89` | +| 4 | Allowed origins per client | NOT per client — single global list `app.cors.allowed-origins`. | YES at CORS filter, NO per-client. | None. 
| NO — every registered client implicitly trusts every CORS origin. | **P2** | `SecurityConfig.java:48`-`49`, `:248` | +| 5 | `client_secret` strength + rotation | Generated via `SecureRandom`, 64 hex chars (256-bit). **No rotation endpoint.** Plaintext shown ONCE on create. | Strength YES; rotation NOT possible. | New client only. | NO — operators must DELETE+CREATE to rotate, breaking downstream config. | **P1** | `OAuth2ClientController.java:85`, `:139`-`155`, `:190`-`198` | +| 6 | Confidential vs public clients | Boolean `confidential` on entity, default true; V38 sets `dashboard` confidential=false. | YES — confidential clients require `client_secret`; public clients require PKCE-S256. | RFC 6749 §5.2 errors. | YES — strong, post-2026-05-02 hardening. | OK | `OAuth2Client.java:74`-`77`, `OAuth2Service.java:307`-`341`, `OAuth2Controller.java:332`-`342` | +| 7 | Token TTL per client | NOT per client. Single `JwtService.getExpirationMillis()`. | Global only. | None. | Acceptable for v1 but a gap vs Auth0/Okta. | P3 | `OAuth2Service.java:354` | +| 8 | Per-tenant allowed auth methods | `tenant_auth_methods` table (`is_enabled` + JSONB `config`). | Read-side only — entity exists but I found no enforcement gate at `/auth/login` that filters disabled methods. | Admin UI lists, but server lets disabled methods proceed in flows. | NO — partial. | **P1** | `TenantAuthMethod.java:33`-`56` | +| 9 | Per-tenant rate limits / quotas | NO per-tenant rate limit. RateLimitService keys = IP / userId / clientId / email. | NO. | None. | NO — noisy-neighbour exposure. | **P1** | `RateLimitService.java:46`-`62` | +| 10 | Max users per tenant | Column `tenants.max_users` default **100**. | **NOT enforced** — `RegisterUserService` and `ManageUserService` never read `tenant.maxUsers`. Only used for read-side reporting. | Field returned in tenant dashboard but ignored on insert. | NO — false sense of cap. 
| **P0** | `Tenant.java:86`-`88`; `RegisterUserService.java` (no maxUsers ref); `ManageUserService.java:202` (count, not enforce) | +| 11 | Max OAuth clients per tenant | NO cap. | No. | None. | NO — runaway spam vector for self-service tenants. | **P2** | `OAuth2ClientController.java:72`-`116` | +| 12 | Tenant suspension / disable | `TenantStatus` {ACTIVE, INACTIVE, SUSPENDED, TRIAL, PENDING}; `Tenant.suspend()`/`deactivate()`. | **NOT checked at JWT issuance.** `AuthenticateUserService` has no `tenant.isActive()` gate; `Tenant.canAcceptUsers()` exists but is unused outside reporting. | Admin UI sets status; runtime ignores it. | NO — suspended tenants keep authenticating. | **P0** | `Tenant.java:178`-`187`, `:249`-`251`; verified zero call-sites for `isSuspended()`/`canAcceptUsers()` outside DTO mapping. | +| 13 | Tenant deletion (GDPR) | Hibernate `@SQLDelete` + `@SQLRestriction`; V49 schema; `softDeleteTenant(UUID)`. | YES — hard delete intercepted to UPDATE. | `ManageTenantService.softDeleteTenant`. | YES. | OK | `Tenant.java:41`-`42`, `ManageTenantService.java:175`-`200` | +| 14 | Tenant cross-isolation | Application-layer: `TenantBindFromAuthFilter` overwrites a forged `X-Tenant-ID`. Postgres RLS NOT yet hardened (JPA still runs as superuser per file comment). | Partial — filter blocks header forgery. RLS bypass remains. | None visible to dev. | Application path good; DB path is the operator residual. | **P1** | `TenantBindFromAuthFilter.java:81`-`132`; comment `:54`-`58` flags Task #27 unfinished. | +| 15 | Biometric-processor `X-API-Key` | `API_KEY_SECRET` env, no length validation. Production `get_api_key_config` raises if `API_KEY_ENABLED=false`. Single shared secret across all tenants. | YES — `hmac.compare_digest` per request. | 401 + `WWW-Authenticate: ApiKey`. | NO — single secret, no per-tenant key, no rotation tooling. 
| **P1** | `biometric-processor/app/main.py:182`-`204`, `app/core/config.py:472`-`528` | +| 16 | Per-API-key rate limit (bio) | Tier table `{free, standard, premium, unlimited}` based on `api_key_context`. **A validated API key bypasses rate limiting entirely.** | Mostly NO — `dispatch:88-90` short-circuits. | None when bypassed. | NO — defeats DoS protection for the only authenticated caller (identity-core-api). | **P2** | `biometric-processor/app/api/middleware/rate_limit.py:87`-`90`, `:63`-`68` | +| 17 | Webhook signing secret | **No webhooks implemented.** Zero matches for webhook anywhere in identity-core-api java tree. | N/A | N/A | Gap — common SaaS feature missing. | P3 | `grep -rln webhook` returns no source files. | +| 18 | Admin RBAC (SUPER_ADMIN vs TENANT_ADMIN) | Hierarchical: ROOT > TENANT_ADMIN > TENANT_MEMBER > GUEST. | YES — `RbacAuthorizationService.hasPermission` + `canAccessTenant`. | 403 JSON envelope. | YES — clean. | OK | `RbacAuthorizationService.java:43`-`107` | +| 19 | Admin-only routes guarded | Mostly via `@PreAuthorize("@rbac…")` SpEL. **OAuth2ClientController uses only `isAuthenticated()`** — no role check. | Partial — any authenticated user (incl. GUEST) of a tenant can register/delete OAuth2 clients for **their** tenant. | Cross-tenant blocked by tenant filter; intra-tenant role check missing. | NO — guests/members can mint clients. | **P1** | `OAuth2ClientController.java:53`,`:73`,`:122`,`:140`,`:161` | +| 20 | Audit log retention | V57 hands `audit_logs` to pg_partman (fail-soft if extension absent). No tenant-specific retention policy. | Partition lifecycle only; no per-tenant TTL. | Operator runbook only. | OK platform-wide; gap for per-tenant SLA. | P3 | `db/migration/V57__audit_logs_pg_partman.sql:1`-`27` | +| 21 | GDPR export shape & SLA | `UserDataExportController` returns JSON bundle, `Content-Disposition: attachment`. Rate-limited 1/h/caller. SLA implicit (synchronous). | YES. | `200` + JSON. 
| OK for individual; no tenant-bulk export. | OK | `UserDataExportController.java:49`-`94`, `RateLimitService.java:354`-`360` | +| 22 | GDPR data-erasure pipeline | `PurgeAdminController` gated on `@rbac.isSuperAdmin()`. `Tenant`+`User` soft-delete via `@SQLDelete`. | YES at admin endpoint; user soft-delete has FK-cascade safeguards (V53 trigger). | 403 for non-ROOT. | OK — tenant admins cannot trigger purge directly (P1 gap if KVKK ops on their own users). | OK | `PurgeAdminController.java:37`, `Tenant.java:41`-`42`, V53 trigger comment in `identity-core-api/CLAUDE.md`. | +| 23 | WebAuthn allowed origins | `app.webauthn.allowed-origins` env, default-empty in `application.yml` warns "every assertion will be rejected". Prod default: `https://app/verify/demo.fivucsas.com`. | YES — `WebAuthnService` rejects unlisted origins. | 401-style WebAuthn errors. | YES — fail-closed. | OK | `WebAuthnService.java:42`-`47`, `application-prod.yml:69` | +| 24 | Embeddable widget — `client_id` | `verify-app` reads `client_id` from URL query (hosted-login). Iframe widget (`verify-widget/html/`) is anonymous-permitted on the SDK lifecycle but the **server still demands client_id at `/oauth2/authorize`** — verified `OAuth2Controller.java:80` `@RequestParam("client_id")` (required). | YES. | 400 if missing. | YES. | OK | `OAuth2Controller.java:79`-`97` | +| 25 | SDK CSP / iframe sandboxing | `postMessageBridge.ts:48`-`80`: `parentOrigin` stays null until config handshake; outbound dropped (NOT `'*'`). Inbound origin check exists. | YES — handshake-gated. | DEV warning only. | YES — strong. | OK | `postMessageBridge.ts:39`-`80` | + +### Anonymous-endpoint compare (per `feedback_pr_review_workflow.md`) + +`SecurityConfig.java` permitAll list (lines 75-149) cross-checked against controllers: +- `/api/v1/oauth2/authorize`, `/oauth2/authorize/complete`, `/oauth2/token`, `/.well-known/*`, `/oauth2/clients/*/public` — intentional, all RFC-mandated. 
+- `POST /auth/sessions/*/steps/*` is permitAll (line 118). Controller relies on session-token validity; no JWT required. Verified intentional (multi-step pre-JWT). +- `POST /api/v1/auth/mfa/step` permitAll (line 85) BUT bucketed at 30/min/IP via `allowMfaStepAttempt` — confirmed at `RateLimitService.java:181`-`191`. +- **No unintentional permitAll found** in current `SecurityConfig`. + +### Cross-tenant boundary check +SUPER_ADMIN of tenant A → tenant B: `RbacAuthorizationService.canAccessTenant(UUID)` lines 97-107 — `ROOT` (== platform super-admin) bypasses tenant equality check; `TENANT_ADMIN` is constrained to `currentUser.getTenant().getId().equals(tenantId)`. So **TENANT_ADMIN cannot read another tenant**; only `ROOT` (platform owner) can. Naming in the prompt was ambiguous — codebase distinguishes `ROOT` (platform) from `TENANT_ADMIN` (one tenant). Boundary is correct. + +`TenantBindFromAuthFilter.java:114`-`131` enforces JWT tenantId override for non-SUPER_ADMIN users. Verified. + +--- + +## 2. P0 / P1 / P2 / P3 + +### P0 — Production-impacting today +1. **`tenants.max_users` is decorative.** Field exists, dashboard reads it, no insert path enforces it. A self-service tenant can add unlimited users. (`Tenant.java:86`-`88` + absence at `RegisterUserService`.) +2. **Tenant suspension does not stop authentication.** `TenantStatus.SUSPENDED` is settable in admin UI but no auth-time check. A suspended tenant's users keep logging in and minting JWTs. (`AuthenticateUserService` has no `tenant.isActive()` call; `Tenant.canAcceptUsers()` zero non-DTO callers.) + +### P1 — Fix this milestone +3. **No global RPM rate limit on `/oauth2/token`** success path. Only PKCE-failure throttled. (`RateLimitService.java:218`-`230`.) +4. **`/userinfo` ignores token's scope claim.** Returns email/name/phone irrespective of `scope`. RFC 6749 violation. (`OAuth2Service.java:445`-`474` — no `scope` filtering.) +5. **No client_secret rotation** endpoint on `OAuth2ClientController`. 
Only delete+recreate. (Lines 50-186 — no `/rotate-secret`.) +6. **`OAuth2ClientController` lacks role check.** `@PreAuthorize("isAuthenticated()")` only. A `TENANT_MEMBER` (or `GUEST` with non-expired token) can register OAuth clients in their tenant. Should require `@rbac.isTenantAdmin()`. (Lines 53, 73, 122, 140, 161.) +7. **Tenant cross-isolation depends on app-layer filter alone.** Postgres RLS not yet hardened — `TenantBindFromAuthFilter.java:54`-`58` flags Task #27 unfinished; if filter is bypassed (e.g. raw SQL endpoint, native queries) tenant rows leak. +8. **Per-tenant auth-method allowlist (`tenant_auth_methods.is_enabled`) lacks runtime enforcement gate.** Disabled methods can still be selected at session start. +9. **Single biometric-processor API key shared across all tenants** with no rotation tooling and no per-tenant identity. Loss = full bypass for every tenant. (`biometric-processor/app/main.py:182`-`204`.) + +### P2 — Schedule +10. **CORS allowed-origins is global, not per OAuth2 client.** (`SecurityConfig.java:48`-`49`.) +11. **No max OAuth-clients-per-tenant cap.** +12. **Validated API-key calls bypass rate limiting in biometric-processor** (`rate_limit.py:87`-`90`). Means identity-core-api's caller has no upper bound; a runaway loop can pin the GPU/CPU container. + +### P3 — Track +13. No per-client token TTL override. +14. No webhooks subsystem. +15. No per-tenant audit-log retention policy (platform-wide pg_partman only). + +--- + +## 3. Recommendations + +1. **Wire `tenants.max_users` into `RegisterUserService` and `ManageUserService.createUser`** before the next demo. Throw a dedicated `TenantUserCapException` → 409 with `Retry-After: never` semantic. Keep `ROOT` exempt. +2. **Gate JWT issuance on `tenant.canAcceptUsers()`.** Add a single line to `AuthenticateUserService.authenticate` that 403s with `tenant_suspended` if status ∈ {SUSPENDED, INACTIVE, PENDING}. Same gate at OAuth2 `/authorize`. +3. 
**Add `OAuth2ClientController` role check.** Replace `@PreAuthorize("isAuthenticated()")` with `@PreAuthorize("@rbac.isTenantAdmin()")` on POST/DELETE/PATCH; keep GETs at `isAuthenticated()` for self-introspection. +4. **Add `/oauth2/clients/{id}/rotate-secret`** that mints a new 256-bit secret and returns it once; old secret invalidated on commit. Document grace window (no overlap by default). +5. **Filter `/userinfo` by access-token scope.** `scope.contains("email")` → include email; `"profile"` → name set; `"phone"` → phone. Mirror `OAuth2Service.java:372`-`384` pattern. +6. **Add a global `clientId`-keyed token bucket** (e.g. 600 req/min default) to `/oauth2/token` success path. Per-tenant override stored in `Tenant.rateLimitTier`. +7. **Move biometric-processor to per-tenant API keys** stored in DB, hashed at rest, rotated independently. Drop the single `API_KEY_SECRET` env to legacy fallback only. +8. **Lift `is_enabled=false` enforcement** into `AuthSessionController.startSession` — strip disabled methods from the allowed-step list before persisting. +9. **Finish Task #27 (RLS hardening)** — switch JPA datasource to a non-superuser role and enable `FORCE ROW LEVEL SECURITY` on tenant-scoped tables. The TenantBindFromAuthFilter is the application half; RLS is the DB half. +10. **Cap `oauth2_clients` per tenant** (default 25) and surface the limit on the Developer Portal UI. +11. **Stop bypassing rate limits for validated API keys** in biometric-processor; instead apply the `unlimited` tier (999 999) which is effectively the same but keeps headers + observability. +12. **Document developer-facing constraints in one place** — currently spread across `OAuth2Client` Javadoc, `Tenant` entity, `RateLimitService`. A single `DEV_LIMITS.md` table the Developer Portal can render would close the discoverability gap. + +--- + +**Word count:** ~1,690. 
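Recommendation 5 (filtering `/userinfo` by the access token's scope) can be illustrated with a small language-neutral sketch. This is not the `OAuth2Service` code: the `SCOPE_CLAIMS` table and the `filter_userinfo` helper are hypothetical names, and the claim names per scope are assumptions taken from the fields the audit lists.

```python
# Hypothetical scope-to-claim table. The three scopes come from recommendation 5;
# the claim names attached to each scope are assumptions for illustration.
SCOPE_CLAIMS = {
    "email": ("email",),
    "profile": ("name", "given_name", "family_name"),
    "phone": ("phone",),
}


def filter_userinfo(claims: dict, scope: str) -> dict:
    """Return only the claims that the space-delimited scope string allows."""
    granted = set(scope.split())
    allowed = {"sub"}  # the subject identifier is always returned
    for scope_name, claim_names in SCOPE_CLAIMS.items():
        if scope_name in granted:
            allowed.update(claim_names)
    return {key: value for key, value in claims.items() if key in allowed}
```

With this shape, a token carrying only `openid email` would yield `sub` and `email`, and nothing else.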
diff --git a/INVESTIGATION_FAILOPEN_2026-05-07.md b/INVESTIGATION_FAILOPEN_2026-05-07.md new file mode 100644 index 0000000..f257757 --- /dev/null +++ b/INVESTIGATION_FAILOPEN_2026-05-07.md @@ -0,0 +1,319 @@ +# Silent-Success / Fail-Open Audit — 2026-05-07 + +Repo HEAD verified at investigation time: parent `5096e8d`, identity-core-api `5096e8d`, +biometric-processor `d91760a`, web-app `d8e18b8`. All findings cite live `main`/`master` files; +the four 2026-04-30 review docs are taken as historical, not current truth. + +## Methodology + +1. Listed every Java auth handler under `application/service/handler/` and + `application/service/mfa/handler/`, read full source, traced every `catch`, + every `return StepResult.success(...)`, and every `yield` in switch + expressions. Confirmed each terminal branch maps directly to a + true/false verification outcome. +2. Enumerated `catch (Exception` (191 hits in main java) and grepped for + adjacency to `ResponseEntity.ok` / `success(` / `return true` / + `Optional.empty`. Read 20-line context windows on each suspicious match. +3. In biometric-processor: grepped `except` + `pass` / `continue` / + `return JSONResponse`, plus `return_exceptions=True`. Read source for each. +4. Cross-checked the four most-recent 2026-04-30 review docs against current + HEAD (per `feedback_verify_completion_claims.md`) — the documented + "MFA fail-OPEN exception swallow" has been remediated: every + `*VerifyMfaStepHandler` now catches and returns `MfaStepResult.fail()` + (`WebAuthnVerifySupport.java:113-116`). +5. Frontend: read `core/api/AxiosClient.ts` interceptors and + `core/api/errorMapper.ts` end-to-end; grepped TS for empty `catch {}`. + +Severity floor: only patterns where a non-test caller can reach the branch +are reported. Worktrees under `.claude/worktrees/` were excluded — they are +detached snapshots, not deployable. 
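The adjacency search in methodology step 2 could be approximated with a script along these lines. It is a rough sketch, not the tooling actually used for the audit: the function name is hypothetical, and the patterns are simply the strings listed in step 2 with the same 20-line window.

```python
import re

# Patterns taken from methodology step 2; the regex spelling is an assumption.
CATCH_RE = re.compile(r"catch\s*\(\s*Exception")
SUCCESS_RE = re.compile(r"ResponseEntity\.ok|success\(|return true|Optional\.empty")


def suspicious_catches(source: str, window: int = 20) -> list[int]:
    """Return 1-based line numbers of catch-all blocks followed by a success-like return."""
    lines = source.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if CATCH_RE.search(line):
            # Look at the catch line plus the lines that follow it.
            context = lines[index : index + window]
            if any(SUCCESS_RE.search(candidate) for candidate in context):
                hits.append(index + 1)
    return hits
```

A scan like this only nominates candidates. Each hit still needs the manual read described in the methodology, because a success-like token inside the window may belong to an unrelated branch.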
+ +## P0 — security boundary fail-open + +### F1 — `/2fa/verify-method` accepts FINGERPRINT/HARDWARE_KEY with no signature check + +`identity-core-api/src/main/java/com/fivucsas/identity/controller/AuthController.java:526-532` +```java +case FINGERPRINT, HARDWARE_KEY -> { + String assertion = (String) data.get("assertion"); + yield assertion != null && !assertion.isBlank(); + // WebAuthn verification would be done client-side via navigator.credentials.get() + // The fact that we received a valid assertion means the browser verified it +} +``` +Any non-empty string in `data.assertion` is treated as a passing 2FA factor. +There is no `webAuthnService.verifyAssertion(...)` call, no credentialId +lookup, no public-key check, no sign-counter validation. Every other case +in the same switch (TOTP, SMS_OTP, FACE, VOICE, QR_CODE, EMAIL_OTP) +performs server-side validation; FINGERPRINT/HARDWARE_KEY does not. + +The endpoint is post-login (requires JWT), so this is not anonymous account +takeover, but it bypasses the 2nd factor entirely for any authenticated user +whose flow lands on `/2fa/verify-method` (admin step-up, sensitive action +re-auth). The two surrounding branches log to `auditLogPort.logTwoFactorVerified`, +so the audit trail will *also* lie ("2FA verified by FINGERPRINT"). + +Note: the parallel newer N-step pipeline (`VerifyMfaStepService` → +`*VerifyMfaStepHandler` + `WebAuthnVerifySupport.java`) does the right thing. +This is residual code on the legacy `/2fa/verify-method` route. + +Suggested fix: dispatch to the same `WebAuthnVerifySupport.verifyAssertion` +the N-step path uses; remove the misleading comment. + +### F2 — verify_puzzle spot-check converts decode/detector errors into "passed" + +`biometric-processor/app/application/use_cases/verify_puzzle.py:171-196` +```python +for i, frame_b64 in enumerate(spot_frames[:3]): + try: + frame_bytes = base64.b64decode(frame_b64) + ... 
+ if frame is None: + logger.warning(...); continue + result = await self._spot_check_detector.check_liveness(frame) + if not result.is_live: + failed_count += 1 + except Exception as e: + logger.warning(f"Spot-check frame {i} error: {e}") + continue + +if failed_count >= 2: + return False, "SPOT_CHECK_FAILED" +return True, "" +``` +`failed_count` only increments on a *successful* liveness call returning +`is_live=False`. Frames that fail to decode (`continue`) or that throw inside +the detector (`except: continue`) are silently skipped, so a client that +submits 3 corrupt JPEGs receives `failed_count == 0` and the spot-check +returns `True`. Defeats the anti-replay spot-check entirely. + +Suggested fix: treat `continue` paths as failures +(`failed_count += 1`) or require N successful evaluations rather than +counting failures. + +### F3 — FaceAuthHandler confidence fallback can flip a `verified=false` to true + +`identity-core-api/src/main/java/com/fivucsas/identity/application/service/handler/FaceAuthHandler.java:65-75` +```java +Object verified = result.get("verified"); +boolean isVerified = Boolean.TRUE.equals(verified) + || "true".equalsIgnoreCase(String.valueOf(verified)); + +if (!isVerified) { + Object confidence = result.get("confidence"); + if (confidence instanceof Number num) { + isVerified = num.doubleValue() >= DEFAULT_CONFIDENCE_THRESHOLD; + } +} +``` +The biometric processor's contract (`api/schemas/verification.py:15-16`) +returns `verified` as the authoritative server-side decision (already +threshold-applied with adaptive aging). The client overrides that bit on +the `confidence` field with a *fixed* threshold of 0.7. Two failure modes: + +- The processor's adaptive threshold is *more lenient* than 0.7 for aged + embeddings (`VERIFICATION_THRESHOLD_AGED_*`) — this fallback IGNORES the + adaptive logic. Currently no impact (server returns `verified=true` first). 
+- For Facenet512 + cosine, similarity ≥ 0.7 with a non-matching face is + not impossible (genuine impostor pairs sit around 0.4–0.6, but quality + degradation pushes the tail). If the processor ever changes shape and + starts returning `verified=false` while emitting `confidence` as cosine + similarity (1 − distance), this branch silently overrides the rejection. + +Same pattern duplicated in `AuthController.java:509-518` for the legacy +`/2fa/verify-method` FACE branch. + +Severity P0 because the failure mode is one upstream contract change away, +and the fallback masks the rejection without logging. + +Suggested fix: delete the fallback; trust the processor's `verified` field +(or alternatively log a P0 alert when the fallback fires). + +## P1 — data-integrity fail-open + +### F4 — Audit log writes silently swallow exceptions + +`identity-core-api/src/main/java/com/fivucsas/identity/infrastructure/audit/AuditEventPublisher.java:65-84` +```java +@Async +public void publish(AuditLog auditLog, UUID tenantId) { + ... + try { + ... + auditLogRepository.save(auditLog); + } catch (Exception e) { + log.error("Failed to save audit log: {}", e.getMessage(), e); + } finally { ... } +} +``` +By design fire-and-forget, but documented compliance requires every +auth/MFA event to land in `audit_logs`. RLS rejection (TenantContext +mismatch), partition-not-yet-created (V57 pg_partman), or a constraint +violation are all swallowed. There is no metric counter, no alert hook, +no dead-letter queue, no degraded-mode flag. + +Suggested fix: Micrometer counter `audit_log_drops_total` + Grafana alert, +and a fallback append to the local `backups/` log so the row is at least +forensics-recoverable. 
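The F4 suggestion (count every dropped row and keep a recoverable copy) can be sketched independently of Spring and Micrometer. Everything here is illustrative: `AuditPublisher`, the `save` callable, and the JSON-lines spool file are hypothetical stand-ins, and `drops_total` merely plays the role of the proposed `audit_log_drops_total` counter.

```python
import json
from pathlib import Path


class AuditPublisher:
    """Saves audit rows; counts and spools any row the primary store rejects."""

    def __init__(self, save, fallback_path: Path):
        self._save = save                    # hypothetical callable that persists one row
        self._fallback_path = fallback_path  # local file used when the save fails
        self.drops_total = 0                 # stand-in for an audit_log_drops_total metric

    def publish(self, row: dict) -> bool:
        """Return True when the row reached the primary store, False when it was spooled."""
        try:
            self._save(row)
            return True
        except Exception:
            # The failure is no longer silent: it is counted and the row is kept.
            self.drops_total += 1
            with self._fallback_path.open("a", encoding="utf-8") as spool:
                spool.write(json.dumps(row) + "\n")
            return False
```

The publisher still never raises into the caller, so the fire-and-forget behaviour is preserved while the drop becomes observable.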
+ +### F5 — EnrollmentHealthService fails open on biometric service down + +`identity-core-api/src/main/java/com/fivucsas/identity/application/service/EnrollmentHealthService.java:193-208` +```java +private boolean hasBiometricData(UUID userId, AuthMethodType biometricType) { + try { + Map health = biometricServicePort.checkHealth(); + ... + return true; // happy path + } catch (Exception e) { + log.warn(...); + return true; // fail open: don't revoke + } +} +``` +Comment is honest ("Fail open: don't revoke when service is unreachable"), +but the impact is that an enrolled user whose biometric embedding has been +purged from `biometric_db` (FK-cascade incident, deletion request, manual +ops) still passes `hasBiometricData()` and the API will offer FACE/VOICE +as a working method on the login page. The actual `/verify` call later +returns 404, so this is UX rot rather than auth bypass — flagged P1 +because it can mislead the dashboard's "enrolled methods" tile. + +Suggested fix: probe the actual count endpoint +(biometric-processor exposes `/api/v1/face/enrollments/{userId}` etc.) +rather than just `checkHealth`. + +### F6 — RedisCacheAdapter swallows exceptions and returns empty Optional + +`identity-core-api/src/main/java/com/fivucsas/identity/infrastructure/adapter/RedisCacheAdapter.java:48-55` +```java +} catch (Exception e) { + log.error("Failed to retrieve cache for key: {}", key, e); + return Optional.empty(); +} +``` +Compare with the *fail-closed* pattern explicitly added in +`JwtAuthenticationFilter.java:64-74` (uses `cachePort.existsFailClosed`, +catches `CacheUnavailableException`, clears SecurityContext). The general +adapter still has the silent-empty path — anyone calling it for a +"is this token blacklisted?" check on a different code path inherits a +fail-open behaviour. Today only the JWT filter uses the fail-closed +helper; rotation/idempotency use the silent-empty path. Severity P1 +defense-in-depth. 
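The contrast drawn in F6 between the silent-empty lookup and the fail-closed helper can be shown with a minimal sketch. The names below (`CacheUnavailableError`, `get_fail_open`, `exists_fail_closed`) are hypothetical and only mirror the behaviour the finding describes; they are not the adapter's real API.

```python
from typing import Callable, Optional


class CacheUnavailableError(Exception):
    """Raised when the backing cache cannot answer (hypothetical name)."""


def get_fail_open(fetch: Callable[[str], Optional[str]], key: str) -> Optional[str]:
    """Treat a cache outage as a miss. Only safe when a miss is harmless."""
    try:
        return fetch(key)
    except Exception:
        return None


def exists_fail_closed(fetch: Callable[[str], Optional[str]], key: str) -> bool:
    """Answer the membership question, or raise so the caller must reject the request."""
    try:
        return fetch(key) is not None
    except Exception as error:
        raise CacheUnavailableError(key) from error
```

For a question such as "is this token blacklisted?", the fail-open variant turns an outage into "no", which is the risk F6 points at; the fail-closed variant forces the caller to handle the outage explicitly.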
+ +Suggested fix: deprecate the silent-empty `get`; require callers to +opt into either the fail-closed or fail-open variant explicitly. + +## P2 — UX fail-open / cosmetic + +### F7 — `/2fa/verify` and `/2fa/verify-method` return HTTP 200 with `success:false` + +`identity-core-api/src/main/java/com/fivucsas/identity/controller/AuthController.java:350,400,445,557,565` +```java +return ResponseEntity.ok(Map.of("success", false, "message", "...")); +``` +Five rejection paths return status 200. Frontend reads `success` so the +auth check is enforced — but this confuses any non-frontend consumer +(curl-based test, OAuth library, monitoring) into recording a 200 on a +failed 2FA attempt. Severity P2: not a security boundary fail-open; rather +a contract violation that suppresses 4xx alerts in observability. + +Suggested fix: switch to 401 or 400 (RFC 6749 §5.2-style error body). +The N-step `/auth/mfa/step` already does this correctly via `MfaStatus`. + +### F8 — AxiosClient request interceptor swallows proactive-refresh failure + +`web-app/src/core/api/AxiosClient.ts:142-148` +```ts +try { + await this.refreshTokenProactively() + accessToken = await this.tokenService.getAccessToken() +} catch { + // Fall back to existing token if proactive refresh fails +} +``` +Empty `catch` is intentional (fall back to the still-valid token). But it +also masks repeated network failures, audit-log loss, and persistent +401-loop scenarios. There is no counter, no UI breadcrumb, no Sentry +hook. Severity P2 cosmetic. + +Suggested fix: emit a `logger.warn` and a Sentry breadcrumb so that +"refresh keeps failing but user keeps using stale token" becomes visible +in observability. 
+ +### F9 — `_to_json_safe` swallows numpy `.item()` exceptions + +`biometric-processor/app/application/use_cases/check_liveness.py:36-43` +```python +item = getattr(value, "item", None) +if callable(item): + try: + return _to_json_safe(item()) + except Exception: + pass +return str(value) +``` +Falls through to `str(value)` on any failure — benign, but the value is +returned to the API client where downstream parsing may break silently +if a numpy scalar serializes as `""`. Severity P3 +cosmetic. + +## P3 — defense-in-depth nits + +### F10 — In-memory rate limit not coordinated across replicas +`identity-core-api/src/main/java/com/fivucsas/identity/security/RateLimitService.java:46-56` +Bucket4j with `ConcurrentHashMap`. Per-replica only. Today the API runs +as a single replica, so this is moot, but a horizontal scale-out +(2026-04-28 ROADMAP mentions BE-M5 multi-instance) silently halves the +effective rate per replica. Not fail-open but worth a Redis-backed +upgrade. + +### F11 — `general_exception_handler` returns 500 with generic message +`biometric-processor/app/api/middleware/error_handler.py:108-132` Correct +behaviour (rejects the request), but the catch-all may hide downstream +contract changes from the API consumer. Add a metric counter so a regression +of `EmbeddingNotFoundError` mishandling becomes visible. + +### F12 — `parallel_frame_analyzer._verify_face` returns `FaceVerificationResult()` on detector exception +`biometric-processor/app/application/use_cases/proctor/parallel_frame_analyzer.py:402-406, 425-427` +Defaults are `detected=False, matched=False` — fail-closed. Verified safe. +Listed here only because the pattern is repeated across 5+ proctor +methods and would be P0 if any default flipped to `True`. Add a unit +test pinning the defaults. 
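A test that pins the fail-closed defaults mentioned in F12 might look like the sketch below. `FaceCheckResult` and `verify_face` are hypothetical stand-ins written for this example; the real `FaceVerificationResult` class and its fields should be substituted when the test is written against the actual module.

```python
from dataclasses import dataclass


@dataclass
class FaceCheckResult:
    """Hypothetical stand-in for a proctoring result object."""
    detected: bool = False
    matched: bool = False


def verify_face(detector, frame) -> FaceCheckResult:
    """Return the detector's verdict, or the fail-closed defaults on any error."""
    try:
        return detector(frame)
    except Exception:
        return FaceCheckResult()


def test_defaults_are_fail_closed():
    # If either default ever flips to True, the error path becomes fail-open.
    result = FaceCheckResult()
    assert result.detected is False
    assert result.matched is False
```

The value of such a test is that it fails loudly on a one-word change to a default, which is exactly the regression F12 warns about.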
+ +### F13 — JWT filter `catch (Exception)` clears context but does not 401 +`identity-core-api/src/main/java/com/fivucsas/identity/security/JwtAuthenticationFilter.java:93-97` +On any unexpected exception the request continues unauthenticated, +relying on downstream `@PreAuthorize` to 403. Correct, but a single +malformed JWT blasting 1000 RPS of NPEs would produce 1000 ERROR log +lines. Severity P3. + +### F14 — `parseInputData` swallows JSON parse error +`identity-core-api/src/main/java/com/fivucsas/identity/application/service/ManageVerificationService.java:494-506` +Falls back to `{"raw": resultData}`. Verification step handlers see the +unparsed string and reject it cleanly downstream. Verified safe but +ideally raise a `400 Bad Request` upstream so the client knows. + +## Recommendation list (ordered by leverage) + +1. **F1 (P0)** Remove the legacy FINGERPRINT/HARDWARE_KEY shortcut in + `AuthController.verify2FAMethod`; route to `WebAuthnVerifySupport.verifyAssertion`. + 1 hour. **HIGHEST PRIORITY** — this is a complete 2FA bypass for a + logged-in user. +2. **F2 (P0)** `verify_puzzle.py` spot-check: count `continue` paths as + failure. 30 min. +3. **F3 (P0)** Delete the FaceAuthHandler + AuthController confidence + fallback. Trust the server-side `verified` field. 30 min. +4. **F4 (P1)** Add `audit_log_drops_total` counter + Grafana alert; cap + risk that V57 pg_partman edge cases silently drop audit rows. +5. **F7 (P2)** Switch the five `/2fa/verify*` 200-on-failure responses to + real 4xx codes. 1 hour incl. frontend update. +6. **F5 (P1)** Probe biometric-processor count endpoint instead of just + `/health` in `EnrollmentHealthService.hasBiometricData`. 1 hour. +7. **F6 (P1)** Deprecate silent-empty `RedisCacheAdapter.get`; require + explicit fail-mode opt-in. 2 hours including caller migration. +8. **F8 (P2)** AxiosClient: log + Sentry breadcrumb on proactive-refresh + failure. 15 min. +9. **F10–F14 (P3)** Defense-in-depth nits — schedule next sprint. 
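Recommendation 2 (the F2 spot-check fix) can be sketched as follows. This is an assumption-laden illustration rather than the real `verify_puzzle.py` code: `spot_check` and the `is_live` callable are hypothetical, decoding is folded into `is_live`, and rejecting an empty sample is an extra assumption that the audit does not state.

```python
from typing import Callable, Sequence, Tuple


def spot_check(
    frames: Sequence[bytes],
    is_live: Callable[[bytes], bool],
    max_frames: int = 3,
    max_failures: int = 1,
) -> Tuple[bool, str]:
    """Pass only if at most max_failures sampled frames fail or error out."""
    sample = frames[:max_frames]
    if not sample:
        # Assumption: nothing to evaluate is treated as a failure, not a pass.
        return False, "SPOT_CHECK_FAILED"
    failed_count = 0
    for frame in sample:
        try:
            if not is_live(frame):
                failed_count += 1
        except Exception:
            # A frame that cannot be decoded or evaluated counts against the
            # client instead of being skipped.
            failed_count += 1
    if failed_count > max_failures:
        return False, "SPOT_CHECK_FAILED"
    return True, ""
```

With `max_failures=1` the threshold matches the original `failed_count >= 2` rule, so the only behavioural change is that corrupt or unevaluable frames now count toward it.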
+ +Word count: ~1,250. diff --git a/INVESTIGATION_MASTER_2026-05-07.md b/INVESTIGATION_MASTER_2026-05-07.md new file mode 100644 index 0000000..3f4827e --- /dev/null +++ b/INVESTIGATION_MASTER_2026-05-07.md @@ -0,0 +1,128 @@ +# Master Investigation Synthesis — 2026-05-07 + +Six parallel read-only audits across `/opt/projects/fivucsas/` (parent + 5 submodules). No code changed. Source docs: + +- `INVESTIGATION_MOCKS_2026-05-07.md` — mock/fake/placeholder hunter +- `INVESTIGATION_USER_CONSTRAINTS_2026-05-07.md` — user-facing limits/quotas +- `INVESTIGATION_DEV_CONSTRAINTS_2026-05-07.md` — tenant/integrator limits +- `INVESTIGATION_PIPELINES_2026-05-07.md` — production-claimed feature E2E verification +- `INVESTIGATION_WIRES_2026-05-07.md` — frontend↔backend payload contract diff +- `INVESTIGATION_FAILOPEN_2026-05-07.md` — silent-success / fail-open hunter + +Total findings: ~120 across 6 lenses. This doc consolidates the **P0/P1** items into a single ranked dispatch plan. + +--- + +## P0 — security/correctness boundary breaks + +| # | Finding | File:line | Source | +|---|---|---|---| +| 1 | `/2fa/verify-method` accepts ANY non-empty `assertion` string for FINGERPRINT/HARDWARE_KEY — full 2FA bypass; audit log records "success" | `identity-core-api/.../AuthController.java:526-532` | failopen F1 | +| 2 | Embedding "encryption-at-rest" is theater — ciphertext column written, plaintext `embedding` column read; `decrypt_vector` defined but never called | `biometric-processor/.../pgvector_embedding_repository.py:227,304,766` + `embedding_cipher.py:75` | pipelines | +| 3 | `WatchlistCheckHandler` is a live `@Component` hardcoding `cleared=true, match_count=0` — KYC/AML claim broken on every flow including `WATCHLIST_CHECK` | `identity-core-api/.../verification/handlers/WatchlistCheckHandler.java:14-50` | mocks | +| 4 | `live_camera_analysis.py` returns `is_live=True` when detector is None — boot DI fail = silent fail-open | 
`biometric-processor/.../live_camera_analysis.py:184-193` | mocks | +| 5 | Account-lockout exception never thrown — locked users see generic "invalid credentials"; full i18n keys for `ACCOUNT_LOCKED` exist as dead code | `AuthenticateUserService.java:79,127` + `AccountLockedException.java:7` | user constraints | +| 6 | OTP has no per-code attempt counter; only 30/min IP throttle. ~150 guesses/code against 10⁶ space (NIST 800-63B violates) | `OtpService.java:29-43` | user constraints | +| 7 | `tenants.max_users` field exists, default 100, surfaced in admin UI — never read by any insert path | `Tenant.java:86-88`, `RegisterUserService`, `ManageUserService.java:202` | dev constraints | +| 8 | `TenantStatus.SUSPENDED` not gated in auth path — `Tenant.canAcceptUsers()` exists, zero non-DTO callers; suspended tenants keep minting JWTs | `AuthenticateUserService` (no tenant check), `Tenant.java:249-251` | dev constraints | +| 9 | Anti-replay spot-check defeated by corrupt frames — `continue` on decode error doesn't count as failure; 3 corrupt JPEGs = `failed_count=0` | `biometric-processor/.../verify_puzzle.py:171-196` | failopen F2 | +| 10 | Face confidence fallback can override server `verified=false` with hardcoded 0.7 threshold; never logs | `FaceAuthHandler.java:65-75` + `AuthController.java:509-518` | failopen F3 | + +--- + +## P1 — high-priority hardening + +### Mock/stub residuals +- `AddressProofHandler` always returns success without storage or validation (`AddressProofHandler.java:11-43`) +- `LoggerService.sendToLogService` + `sendToErrorTracking` are silent no-ops in prod — every browser-side error dropped (`web-app/src/core/services/LoggerService.ts:130-145`) +- `analyze_quality.py` hardcodes `occlusion=0.0` — cannot reject sunglasses/mask/hand-over-mouth (`biometric-processor/.../analyze_quality.py:136-137`) +- NFC auth is serial-only stub: no MRZ/DG1/DG2/checksum/challenge-response. 
Bio `mrz_parser.py` exists but only wired to manual-KYC pipeline, not NFC auth method (`NfcController.java:35-108`) + +### User constraints +- Access-token TTL = **24h** (`application.yml:81`, `JWT_EXPIRATION=86400000`). Recommend 15 min in `application-prod.yml`. Refresh-rotation is solid; this is just a giant blast radius. +- Voice clip uncapped beyond 10MB upload (~5min PCM @16k); device count per user unbounded → bloated WebAuthn allowList +- Password-policy errors return concatenated English from `PasswordPolicy.java:69`; bio 413s return raw English from `main.py:247-256` — Turkish-locale users see English + +### Developer/tenant constraints +- `OAuth2ClientController` is role-blind: all 5 endpoints `@PreAuthorize("isAuthenticated()")`. Any TENANT_MEMBER (incl. unexpired GUEST) can create/delete/disable OAuth2 clients (`OAuth2ClientController.java:53,73,122,140,161`) +- `/userinfo` ignores OAuth2 scope — returns email/name/given_name/family_name/phone unconditionally. Token-issuance path correctly filters by scope; userinfo doesn't. OIDC §5.4 + RFC 6749 §3.3 violation. (`OAuth2Service.java:445-474`) +- No `client_secret` rotation endpoint — operators must delete+recreate clients, breaking active integrations (`OAuth2ClientController.java:50-186`) +- No per-tenant rate-limit bucket — only per-IP/userId/clientId; `/oauth2/token` success path unbounded; only PKCE-failures throttled (30/5min, `RateLimitService.java:218-230`) + +### Wire contracts +- `BiometricService.searchFace`/`enrollFace` accept `_tenantId`/`_maxResults`/`_clientEmbeddings` and silently drop them; backend supports them (`BiometricService.ts:105,201,229` vs `BiometricServiceAdapter.java:290`) +- 3 distinct error envelopes (`ErrorResponse`, RFC-6749 OAuth2, MFA-step mixed-status). Frontend `formatApiError` can't reliably distinguish. 
+- Face-verify response missing `distance`/`threshold` — frontend defaults `distance=1, threshold=0.4` as fake sentinels (`BiometricService.ts:218-219`) +- OAuth2 `redirectUri` regex `^https?://` rejects RFC 8252 custom schemes (`com.acme://auth`) advertised in docs (`OAuth2Controller.java:430-433`) +- `AuthSessionRepository.startSession` payload completely mismatched (`tenantId/userId` sent vs `tenantSlug/email` required) — currently zero callers, would 400 100% if wired (`AuthSessionRepository.ts:128-139`) + +### Fail-open +- `AuditEventPublisher.publish` `@Async` catches all exceptions with no counter/DLQ/alert (`AuditEventPublisher.java:65-84`) +- 3 audit-log blind spots: `ChangePasswordService` has zero `auditLogPort` reference; `ManageUserService` + `ManageTenantService` import only the read port → user delete + tenant create/delete write no audit row +- 5 `/2fa/verify*` rejection paths return HTTP 200 with `success:false` — observability never sees 4xx; non-frontend consumers misinterpret (`AuthController.java:350,400,445,557,565`) + +### Pipeline gaps +- Anti-spoof verdict ignores DeepFace `spoof` label when UniFace confidence ≥0.85 (`check_liveness.py:157`) — high-conf UniFace + DeepFace-spoof contradiction silently resolves in favor of UniFace +- `SoftDeletePurgeJob` default-off (`APP_PURGE_SOFT_DELETE_ENABLED=false`, `application.yml:26-27`) — implementation correct; if flag off in prod, GDPR Art. 
17 unfulfilled + +--- + +## What's working as advertised (sanity floor) + +Per pipeline auditor, these features pass end-to-end at HEAD: +- Voice auth (Resemblyzer 256-dim + quality-weighted centroid) +- UniFace passive liveness (load-bearing in `/verify`) +- Refresh-token family-revoke (V50 RFC 6749 §10.4) +- GDPR data export (correctly excludes secrets/embeddings/refresh tokens) +- WebAuthn full ceremony (sig verify + sign-count check + origin allowlist) +- OAuth2 `/authorize` tenant guard (both single-step and step-up) +- Hosted-login MFA full flow (state echo + single-use code + redirect_uri re-validation) +- Cross-tenant boundary (`RbacAuthorizationService.canAccessTenant` correct; `TenantBindFromAuthFilter` enforces JWT-tenant override on header forgery) +- `SecurityConfig.java:75-149` — no unintentional permitAll +- Memory cross-checks: SHA-256 fingerprint placeholder genuinely removed; Fernet code shipped (but not invoked — see P0 #2); `BiometricServiceAdapter` is real + +--- + +## Recommended dispatch plan + +**Round 1 — P0 batch (parallel-agent, ~8 hours wall-clock)**: +1. P0-#1 fail-open in legacy `/2fa/verify-method` — short fix: remove the legacy route OR delegate to `WebAuthnVerifySupport` like the N-step path +2. P0-#2 wire `decrypt_vector` into `find_similar` + `find_by_user_id` — embedding encryption rescue +3. P0-#5 throw `AccountLockedException` from auth path with `remainingLockTimeSeconds` — frontend already has the i18n keys +4. P0-#6 add per-code OTP attempt counter (3-5 max) → invalidate code; new error code +5. P0-#10 remove the 0.7 fallback in `FaceAuthHandler` + `AuthController` — trust server `verified` field, log explicit reasons + +**Round 2 — P0 product/data (parallel)**: +6. P0-#3 `WatchlistCheckHandler` — either wire to a real provider OR move to `@Profile("dev")` and document +7. P0-#4 `live_camera_analysis.py` fail-closed when detector None +8. P0-#7 enforce `tenants.max_users` in `RegisterUserService` +9. 
P0-#8 add `tenant.isActive()` gate in `AuthenticateUserService` +10. P0-#9 anti-replay spot-check: count corrupt frames as failures, not skips + +**Round 3 — P1 hardening, dev constraints**: OAuth2ClientController RBAC + scope-filter `/userinfo` + client_secret rotation + per-tenant rate-limit bucket + RFC 8252 redirect_uri scheme support + +**Round 4 — P1 wires/UX**: error-shape unification + face-verify response shape fix + BiometricService underscore-param surfacing + AuthSessionRepository contract fix or deletion + +**Round 5 — P1 fail-open + audit gaps**: AuditEventPublisher exception counter + 3 audit-log blind spots filled + HTTP status code corrections on `/2fa/verify*` + +**Round 6 — P1 product**: AddressProofHandler real impl OR profile-gate + LoggerService prod wiring + occlusion implementation + NFC MRZ wiring + +**Operator-only follow-ups**: +- Confirm `APP_PURGE_SOFT_DELETE_ENABLED=true` on prod +- Decide watchlist provider (Refinitiv/Dow Jones/etc.) before flipping flag +- Decide OAuth2 redirect_uri policy (loosen regex vs. document HTTPS-only) + +--- + +## Headline numbers + +- **2 P0 + 1 latent P0** straight mock/fake (mocks audit) +- **2 P0** silent success in security-relevant code paths (failopen) +- **2 P0** missing user-facing constraints (lockout + OTP attempts) +- **2 P0** missing tenant-facing constraints (max_users, suspended) +- **1 P0** broken encryption-at-rest (pipelines) +- **1 P0** complete 2FA bypass on legacy route (failopen) +- **~25 P1** across the 6 lenses +- **~50 P2/P3** cosmetic/defense-in-depth + +This is a substantial backlog but the P0 set is **bounded** (10 items, all narrowly scoped). diff --git a/INVESTIGATION_MOCKS_2026-05-07.md b/INVESTIGATION_MOCKS_2026-05-07.md new file mode 100644 index 0000000..c2391b7 --- /dev/null +++ b/INVESTIGATION_MOCKS_2026-05-07.md @@ -0,0 +1,246 @@ +# Mocks/Fakes/Placeholders Audit — 2026-05-07 + +Read-only investigation across the FIVUCSAS monorepo (parent + 5 submodules). 
Scope: production-path stubs, fake returns, hard-coded fixtures, TODO/FIXME, and silent contract bugs. Verification was done against current HEAD with grep + file reads — no memory or audit-doc claim was trusted on its own. + +## Methodology + +1. Ripgrep across `.java`/`.py`/`.ts`/`.tsx` for: `mock|fake|stub|dummy|placeholder|TODO|FIXME|XXX|HACK|notimplemented|UnsupportedOperation`, then for `class .*\(Stub|Mock|Fake|NoOp\)`, then for `return True`/`return null`, then for sentinel UUIDs. +2. For every Spring `@Component`/`@Service` ending in `Stub|Mock|Fake|NoOp` — checked the `@ConditionalOnProperty` predicate and the prod env var to see whether it is wired live. +3. For every Python `Stub*`/`InMemory*` — traced container.py wiring + `if settings.TESTING` / `if APP_ENV in (...)` gates. +4. Test files (`/test/`, `/__tests__/`, `*.test.ts`, `*.spec.ts`, `test_*.py`) excluded unless the test was the only thing importing the production stub. +5. Severity: + - **P0**: production endpoint returns fake data or grants auth on a stub. + - **P1**: feature claimed working but actually a placeholder; stale TODOs in load-bearing paths. + - **P2**: silent contract bugs, ignored params with real callers, future-work TODOs. + - **P3**: cosmetic (UI placeholders, doc-only). + +## Critical (P0) — production endpoints returning fake data + +### F-1. `WatchlistCheckHandler` always clears every sanctions check +**File**: `identity-core-api/src/main/java/com/fivucsas/identity/application/service/verification/handlers/WatchlistCheckHandler.java:14-50` +```java + * Mock implementation that always clears the check. + * TODO: Integrate with real sanctions/watchlist APIs (OFAC, EU sanctions list, UN Security Council) + ... 
+ // TODO: Replace with real sanctions API call + return VerificationStepResult.success(1.0, Map.of( + "cleared", true, "checked_lists", List.of("OFAC","EU","UN"), "match_count", 0)); +``` +`@Component`, no profile gate — wired live in any tenant verification flow that includes a `WATCHLIST_CHECK` step. Returns `cleared=true, match_count=0` for every input. The auth/verification pipeline will accept a literal sanctioned name. **P0** — KYC/AML compliance promise not delivered. (Mitigant: most tenant flows likely don't enable this step today; risk realises the moment one does.) + +### F-2. `live_camera_analysis` returns `is_live=True` when no detector is wired +**File**: `biometric-processor/app/application/use_cases/live_camera_analysis.py:184-193` +```python +if self._liveness_detector: + liveness = await self._analyze_liveness(face_region) +else: + response.liveness = LivenessResult( + is_live=True, # Default to true if no detector + confidence=0.5, method="none", checks={"passive": True}) +``` +The `LIVE_ANALYSIS` route (`/api/v1/live-analysis/...`) silently returns `is_live=True` if the DI container fails to inject a detector — there is no fail-closed guard. If a deployment misconfigures `LIVENESS_BACKEND` or the detector errors at boot, every live-camera frame is auto-approved. **P0**. + +### F-3. `StubLivenessDetector` exists but is gated to non-prod APP_ENV +**File**: `biometric-processor/app/infrastructure/ml/liveness/stub_liveness_detector.py` (entire file) + `factories/liveness_factory.py:144-163` +```python +def _create_stub(liveness_threshold, **kw): + env = os.getenv("APP_ENV", "production").lower() + if env not in ("development","test","testing","ci"): + logger.error("StubLivenessDetector requested in non-test environment...") + return EnhancedLivenessDetector(...) # safe fallback + return StubLivenessDetector(default_score=85.0) +``` +**Verdict: safe today** because the factory falls back to `EnhancedLivenessDetector` in prod. 
But `StubLivenessDetector.check_liveness` returns `is_live=True, score=85.0` for ANY image, and `LIVENESS_MODE` legacy alias `"stub"` is still in the `Literal[...]` type. If a future operator sets `APP_ENV=development` on prod by mistake, liveness silently green-lights spoof attempts. **P0 latent**, P2 today. Recommend deleting the stub class entirely now that prod is on UniFace. + +## High (P1) — features claimed working but stub-implemented + +### F-4. `AddressProofHandler` does no validation, just stores + flags +**File**: `identity-core-api/.../verification/handlers/AddressProofHandler.java:11-43` +```java +// TODO: Integrate with OCR/address validation service +// TODO: Add address matching against government databases +... +log.info("Address proof document received ... Flagged for manual review."); +return VerificationStepResult.success(Map.of("status", "PENDING_REVIEW", "document_stored", true)); +``` +Returns `success` regardless of image content (`document_stored=true` is asserted with no storage call — the comment says "would be stored via a media storage service in production"). UX-wise the user passes the step, downstream sees `PENDING_REVIEW`; no human review queue is wired. **P1** — KYC step that always succeeds. + +### F-5. `iosMain HmacPlatform.ios.kt` throws `TODO()` for HMAC-SHA1/256/512 +**File**: `client-apps/shared/src/iosMain/kotlin/com/fivucsas/authenticator/totp/HmacPlatform.ios.kt` +```kotlin +actual fun hmacSha1(...): ByteArray = TODO("iOS HMAC via CommonCrypto — tracked in CLIENT_APPS_PARITY.md") +actual fun hmacSha256(...) = TODO(...) +actual fun hmacSha512(...) = TODO(...) +``` +Kotlin's `TODO()` throws `NotImplementedError` at runtime. Per memory `iosMain` is permanently OUT-OF-SCOPE (no Apple hardware), but file ships in the KMP shared module. If the iOS target is ever built and run, TOTP enrollment crashes. **P1** for build hygiene; **P3** for runtime risk (no iOS app shipping). + +### F-6. 
`LoggerService.sendToLogService` / `sendToErrorTracking` are silent no-ops +**File**: `web-app/src/core/services/LoggerService.ts:130-145` +```ts +private sendToLogService(_level, _message, _meta?: unknown): void { + // Implementation would go here + // Example: CloudWatch.putLogEvents(...) + // For now, no console logging in production for security +} +private sendToErrorTracking(_message, _error?: unknown): void { + // Implementation would go here ; Example: Sentry.captureException(...) +} +``` +Underscore-prefixed params confirm they are ignored. Production browser errors are dropped on the floor. With `console.log` also disabled in prod (per the comment), the dashboard has effectively zero client-side error telemetry. **P1** — operability gap, every user-side bug is invisible. + +### F-7. `analyze_quality.py` occlusion is hardcoded to 0.0 +**File**: `biometric-processor/app/application/use_cases/analyze_quality.py:136-145` +```python +# Occlusion (placeholder - normalize to 0-100) +occlusion = 0.0 # 0% occlusion = good +return QualityMetrics(blur_score, brightness, face_size, face_angle, occlusion) +``` +Quality metrics report "no occlusion" for every face (sunglasses, mask, hand-over-mouth). Feeds into enrollment readiness gating. **P1** — `quality_threshold` cannot reject occluded captures. + +### F-8. `multi_face._find_additional_faces` returns `[]` — single-face fallback only +**File**: `biometric-processor/app/application/use_cases/detect_multi_face.py:120-129` +```python +def _find_additional_faces(self, image, excluded_regions): + """This is a placeholder for more sophisticated multi-face detection. + In production, use a detector that natively supports multiple faces.""" + return [] +``` +The `multi_face` endpoint advertises multi-face detection but only ever returns the primary face from MTCNN (whose API gives one). Used in proctoring / "detect strangers in frame" — silently misses the second person. **P1** — proctoring regressions. + +### F-9. 
`DeepFace Facenet512` weight integrity check is opt-in (no pinned hash) +**File**: `biometric-processor/app/infrastructure/ml/extractors/deepface_extractor.py:90-99` + `app/core/config.py:671` +```python +if not expected: + # TODO: pin DEEPFACE_FACENET512_SHA256 in config.py once known-good hash recorded + logger.warning("DeepFace model integrity check skipped (no pinned hash)") + return +``` +SHA-256 verifier is wired but `settings.DEEPFACE_FACENET512_SHA256` is empty/unpinned in prod, so a tampered weight file passes silently. Promised SHA-pin (memory `feedback_bcrypt_verify_first` and `D-tasks: SHA256 model delivery`) not delivered for the production model. **P1** — supply-chain control absent. + +### F-10. `OAuth2Service` legacy pipe-format auth-code parser still present +**File**: `identity-core-api/.../service/OAuth2Service.java:66-71` +```java +// BE-M1 (2026-04-19): Redis auth-code metadata is now JSON. The legacy pipe +// format is still tolerated on read for in-flight codes written before deploy ... +// TODO(2026-04-19 +15m / 2026-04-19 03:15Z): delete legacy pipe parser below. +``` +Self-imposed cleanup deadline elapsed 18 days ago. Dead-code surface left in OAuth token endpoint. **P1** — code smell; ensure the legacy parser doesn't accept malformed inputs. + +## Medium (P2) — TODOs in non-trivial code paths + +### F-11. `EventPublisherAdapter` / `AuditLogPort` doc says "placeholder for Phase 4" +**File**: `identity-core-api/.../application/port/output/AuditLogPort.java:7,14` and `EventPublisherPort.java:8,15` +```java + * Currently a placeholder for future implementation. + * NOTE: This is a placeholder for Phase 4 implementation. +``` +JavaDoc is **stale** — `AuditLogAdapter` and `EventPublisherAdapter` are real implementations (Spring `ApplicationEventPublisher`, JPA persistence). Risk: a future engineer reads "placeholder" and replaces the working class. **P2** — doc lie. + +### F-12. 
`NoOpSmsService` defaults on missing config (`matchIfMissing=true`) +**File**: `identity-core-api/.../infrastructure/sms/NoOpSmsService.java:7-8` +```java +@ConditionalOnProperty(name = "sms.provider", havingValue = "noop", matchIfMissing = true) +public class NoOpSmsService implements SmsService { + public void sendOtp(String phoneNumber, String code) { + log.info("SMS disabled - OTP for {}: {}", phoneNumber, code); } +``` +Prod is safe (`SMS_PROVIDER=twilio-verify` in `.env.prod`). But `matchIfMissing=true` means any deploy that drops the env var silently logs OTP codes to `stdout` instead of sending — and the user is locked out (no SMS arrives). The OTP is also **leaked to logs in plaintext**. **P2** — fragile default; logging the code is itself a finding. + +### F-13. `NoOpEmailService` same pattern, also leaks the OTP to logs +**File**: `identity-core-api/.../infrastructure/email/NoOpEmailService.java:7-15` +```java +@ConditionalOnProperty(name = "mail.enabled", havingValue = "false", matchIfMissing = true) +public void sendOtp(String to, String code) { log.info("Mail disabled - OTP for {}: {}", to, code); } +``` +Same `matchIfMissing=true` foot-gun; same plaintext OTP-in-logs hazard. **P2**. + +### F-14. `client-apps` Android i18n TODOs in shipping screens +**Files**: `client-apps/androidApp/.../MrzInputDialog.kt:63,193,219,225,332,358`, `ExportDataRow.kt:56,145,154`, `DataExportViewModel.kt:84,94`, `ProfileScreen.kt:217`, `SettingsScreen.kt:176`, `AuthenticatorScreen.kt:111` +Multiple hardcoded English strings in NFC + data-export + theme settings flows, every one tagged `// TODO(i18n): ... in /tmp/i18n_agent_20*.txt`. Violates rule `feedback_no_hardcode`. The `/tmp/...` reference suggests an unfinished agent run. **P2** — UX defect on Android. + +### F-15. `IdInfoStep.tsx` — hardcoded English placeholders + Latin name +**File**: `web-app/src/features/userEnrollment/components/steps/IdInfoStep.tsx:50-72` +```tsx +label="Full Name" ... 
placeholder="John Doe" +label="National ID" ... placeholder="ABC-123456" +helperText='Format: YYYY-MM-DD' +``` +Three labels and one helperText hardcoded English; placeholders Latin not Turkish. Violates `feedback_no_hardcode`. **P2** for i18n rule, **P3** for UX (Turkish-first product showing "John Doe"). + +### F-16. `MockWebhookSender` reachable via `WebhookSenderFactory` with no env gate +**File**: `biometric-processor/app/infrastructure/webhooks/webhook_factory.py:18-47` +```python +WebhookTransport = Literal["http", "mock"] +@staticmethod +def create(transport="http", ...) -> IWebhookSender: + if transport == "mock": + return MockWebhookSender() +``` +Factory accepts `"mock"` from any caller; no `APP_ENV` guard. Today the factory is **never imported by production code paths** (verified via grep — only the MockWebhookSender file references it). Dead-code island, but the dead code is a faux-success webhook ("Mock failure" string in line 56 suggests it was once an integration toggle). **P2** — delete or lock to test fixtures. + +### F-17. `OptionalThreadPoolExecutor` "fake" comment mis-describes implementation +**File**: `biometric-processor/app/domain/interfaces/thread_pool_executor.py:7-8` (doc only — interface itself is sound) +**P3** — comment hygiene. + +### F-18. `UserDomainRepositoryAdapter` throws UnsupportedOperationException on a delete-by-email path +**File**: `identity-core-api/.../infrastructure/adapter/UserDomainRepositoryAdapter.java:48-52` +Per `feedback_no_hard_delete_users` and V53 trigger this is intentional — `delete(User)` blocks raw deletion. The throw is the safety guard. **Not a bug.** Documenting only because the grep flagged it. + +### F-19. `AuthMethodHandlerRegistry` / `VerificationStepHandlerRegistry` throw `UnsupportedOperationException` on missing handler type +**Files**: `AuthMethodHandlerRegistry.java:34`, `VerificationStepHandlerRegistry.java:36` +Legit defensive programming for `getHandler(unknown)`. 
Caller paths are `Optional<>`-style guarded. **Not a bug.** + +### F-20. `FaceAuthHandler.transferTo` throws `UnsupportedOperationException` +**File**: `FaceAuthHandler.java:124-127` — narrow override on an in-memory `MultipartFile` adapter; `transferTo(File)` is unreachable from the upload path. **Not a bug.** + +## Low (P3) — cosmetic / documentation only + +### F-21. Stale `Mock implementation that always clears the check` JavaDoc lines +Multiple occurrences (e.g. `WatchlistCheckHandler.java:14`). Surface them as part of the F-1 fix. + +### F-22. `verify-widget/html/` contains only built JS bundles +Source-of-truth lives in `web-app` `verify-app/`; the bundle artifacts are not usefully searchable. **P3** — no source mocks here, but the dist directory should probably be `.gitignore`d. + +### F-23. `archived` and `practice-and-test/` submodules carry experimental stubs +Not user-facing, not deployed. **P3** — out of scope per "test files unless they reveal a production stub" rule. + +### F-24. `auth-methods-testing` page intentionally exposes "stub" mode +`web-app/src/features/auth-methods-testing/AuthMethodModeContext.ts:23` — `AuthMethodModeKind = 'real' | 'test' | 'stub'`. Contracted UI feature on an admin-only page. **Not a bug.** + +### F-25. `NodDetector` / `TurnLeftDetector` etc. have `_metrics` underscored params +`web-app/src/lib/biometric-engine/core/challenges/*.ts` — these motion-aware detectors only need `headPose` + motion history; `_metrics` underscore is the standard ts-eslint "intentionally unused" convention, **not** a contract bug. **Not a bug.** + +### F-26. `live_camera_analysis` API route is reachable only from the internal network +The bio container is internal-only per CLAUDE.md, which mitigates the F-2 blast radius but does not erase it. **P3 mitigation note.** + +### F-27. `EnrollmentController` literal "85.0/1.0" comment about V47-superseded hardcode +`EnrollmentController.java:244` says "captured by V47 instead of the previous hard-coded 85.0/1.0" — historical, fixed.
**Not a bug.** + +### F-28. `hosted-first` placeholder text "Phase 4" in port docs +Same as F-11; dual-tagged for visibility. + +### F-29. `WebhookSenderFactory` `WebhookTransport = Literal["http","mock"]` allows `mock` +Same as F-16, dead surface. + +### F-30. `TODO(2026-04-19 +15m)` self-imposed deadline in OAuth2Service +Same as F-10; documenting the elapsed-deadline flavour. + +## Summary table — findings by repo + +| Repo | P0 | P1 | P2 | P3 | Notable | +|---------------------|----|----|----|----|---------| +| identity-core-api | 1 (F-1) | 2 (F-4, F-10) | 4 (F-11, F-12, F-13, F-19/F-20 OK) | 2 | WatchlistCheck mock + NoOp{Sms,Email} matchIfMissing | +| biometric-processor | 1 (F-2) + 1 latent (F-3) | 3 (F-7, F-8, F-9) | 2 (F-16, F-17) | 1 (F-26) | live_camera_analysis fail-open + occlusion=0 + multi-face stub | +| web-app | 0 | 1 (F-6) | 2 (F-15, F-25 OK) | 2 (F-22, F-24, F-25) | LoggerService is silent no-op in prod | +| client-apps | 0 | 1 (F-5) | 1 (F-14) | 0 | iOS HMAC unimplemented (out of scope per memory) | +| docs | 0 | 0 | 0 | 0 | n/a | +| verify-widget | 0 | 0 | 0 | 1 (F-22) | dist artifacts only | + +**Top concerns to address first** (concrete production damage potential): +1. **F-1** WatchlistCheckHandler — KYC compliance. +2. **F-2** live_camera_analysis fail-open `is_live=True`. +3. **F-7** Occlusion always 0 — quality gate is a lie. +4. **F-9** Pin `DEEPFACE_FACENET512_SHA256`. +5. **F-12 + F-13** Flip `matchIfMissing` to `false` and stop logging plaintext OTPs. + +End of report. diff --git a/INVESTIGATION_PIPELINES_2026-05-07.md b/INVESTIGATION_PIPELINES_2026-05-07.md new file mode 100644 index 0000000..24fbe65 --- /dev/null +++ b/INVESTIGATION_PIPELINES_2026-05-07.md @@ -0,0 +1,285 @@ +# Pipeline Completeness Audit — 2026-05-07 + +Read-only end-to-end verification of 12 production-claimed features at HEAD. +No CLAUDE.md / roadmap claims trusted; verdicts derived from code inspection +only. 
Submodule HEADs as of run: bio + api + web (master-tracking, dirty +index entries listed in run-time `git status`). + +## Methodology + +For each feature I traced the request path end-to-end (UI step → web service +→ Java controller → service/handler → infrastructure adapter → bio FastAPI +route → repository → DB) and read the load-bearing decision lines. Verdicts +are anchored to file:line. Every "NOT WIRED" / "STUB" claim is corroborated +by a contradiction between the claim and the actual control flow. Severity +labels follow the brief's rubric (P0 = wrong-result-success, P1 = +known-bypass, P2 = missing defense-in-depth, P3 = doc drift). + +## Verdict Table + +| # | Feature | Verdict | Severity | +|---|---|---|---| +| 1 | Voice auth (Resemblyzer 256-dim) | WORKING | — | +| 2 | NFC document auth | STUB (serial-only; no MRZ/DG1/DG2) | P1 | +| 3 | Anti-spoof verdict in /verify | PARTIAL (vetoed only when liveness conf < 0.85) | P2 | +| 4 | UniFace passive liveness in /verify | WORKING | — | +| 5 | Refresh-token reuse → family revoke | WORKING | — | +| 6 | GDPR data export | WORKING | — | +| 7 | SoftDeletePurgeJob hard-delete | NOT WIRED in prod (feature flag default false) | P2 | +| 8 | WebAuthn registration + assertion | WORKING | — | +| 9 | Audit-log persistence (5 ops) | PARTIAL (password change + user/tenant delete = silent) | P2 | +| 10 | Embedding encryption (Fernet) | BROKEN (write-only — never read back) | P0 | +| 11 | OAuth2 /authorize tenant guard | WORKING | — | +| 12 | Hosted-login MFA full flow | WORKING | — | + +--- + +## Detailed Findings (severity-ordered) + +### #10 — Embedding Encryption (Fernet) — **BROKEN, P0** + +`pgvector_embedding_repository.py:227,304,766` write `embedding_ciphertext` +in lockstep with the plaintext `embedding` column on every save/centroid +update. `pgvector_voice_repository.py:110,155` do the same for voice. The +ciphertext column is therefore populated. 
+ +But on every read path the plaintext column is used: + +- `pgvector_embedding_repository.py:382-403` — `find_by_user_id` selects + `embedding` (plaintext) and converts to numpy. +- `pgvector_embedding_repository.py:418+` — `find_similar` runs `embedding + <=> $1::vector` cosine search against plaintext. +- `pgvector_voice_repository.py:222-243` and `:313-338` — same pattern. + +`grep -rn "decrypt_vector\|cipher.decrypt" biometric-processor/app` returns +exactly one hit: the function definition at +`embedding_cipher.py:75`. **Zero call sites.** The Fernet ciphertext column +is write-only / read-never; an operator who dumped the DB still gets a +fully-functional plaintext recognition store. Encryption-at-rest is a +deception, not a defense. Severity P0 because the feature is *claimed* +to protect biometric-class personal data (GDPR Art. 9) and does not. + +**Fix path**: ANN search must continue to use plaintext (pgvector has no +ciphertext-aware operator); but `find_by_user_id` (the 1:1 verify path) +should re-derive the vector from `decrypt_vector(embedding_ciphertext)`, +and `find_similar` should optionally cross-check the closest-match row's +ciphertext on a hit. + +### #2 — NFC Document Auth — **STUB, P1** + +`NfcController.java:35-71` (enroll) and `:73-108` (verify) accept only a +`cardSerial` string from the client. The frontend `NfcStep.tsx` reads the +NFC chip via Web NFC API and forwards just the serial number +(`NfcEnrollment.tsx:124`, posts `{userId, cardSerial}`). No MRZ. No DG1. +No DG2. No checksum validation. The server stores `cardSerial` and on +verify performs a row lookup. + +A bio service `mrz_parser.py` does exist +(`biometric-processor/app/domain/services/mrz_parser.py`) and is wired +into `verification_pipeline.py` — but that is a separate manual-KYC flow, +not the NFC auth method. The NFC AuthMethod's verify-step has zero +contact with the parser. 
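To make the gap concrete: the serial-only lookup above proves knowledge of a public identifier, not possession of the card. The following is a minimal sketch of the kind of challenge-response check that is absent, assuming a hypothetical per-card symmetric key and HMAC-SHA256. The names `CARD_KEYS`, `issue_challenge`, `card_respond`, and `verify_response` are illustrative and do not exist in the codebase; real eMRTD chips use BAC/PACE and chip authentication rather than this simplified scheme.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-card secrets, keyed by card serial. On real hardware the key
# stays inside the chip; this dict only simulates the server's enrolled copy.
CARD_KEYS = {"04A1B2C3D4": secrets.token_bytes(32)}


def issue_challenge() -> bytes:
    """Server side: a fresh random nonce for each verification attempt."""
    return secrets.token_bytes(16)


def card_respond(card_key: bytes, challenge: bytes) -> bytes:
    """Card side (simulated): prove possession of the key without revealing it."""
    return hmac.new(card_key, challenge, hashlib.sha256).digest()


def verify_response(card_serial: str, challenge: bytes, response: bytes) -> bool:
    """Server side: recompute the MAC and compare in constant time."""
    key = CARD_KEYS.get(card_serial)
    if key is None:
        return False  # unknown serial: fail closed
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

With a check of this shape, a tag that copies only the serial cannot produce a valid response, and a captured response cannot be replayed against a fresh challenge.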
+ +**Practical impact**: an attacker with a writable NFC card (Mifare +Classic / NTAG215) clones the published serial and is authenticated as +the victim. There is no cryptographic challenge–response with the card +chip, no BAC/PACE handshake, no chip authentication. Severity P1 because +"NFC verified" is presented to relying parties at face value while it +proves nothing about the card holder. + +### #3 — Face Anti-spoof in /verify — **PARTIAL, P2** + +`check_liveness.py:148-175` reads `detection.antispoof_label / +antispoof_score` from DeepFace, but the spoof verdict is *only* applied +as a veto when the liveness confidence is **below 0.85**: +`if deepface_spoof_detected and liveness_result.confidence < +DEEPFACE_VETO_CONFIDENCE_THRESHOLD` (line 157). If UniFace says the +face is real with confidence ≥ 0.85, an explicit DeepFace `spoof` label +is **ignored** and `is_live` stays True. + +`ANTI_SPOOFING_ENABLED=true` is honored (line 151), so the gate exists, +but its effective scope is narrow. UniFace MiniFASNetV2 is competent but +still has FPR > 0 on screen-replay attacks; pairing it with a +high-confidence rejection of a contradicting DeepFace verdict would be +strictly safer. + +### #7 — SoftDeletePurgeJob — **NOT WIRED in prod, P2** + +`SoftDeletePurgeJob.java:74-90` carries `@Scheduled(cron = "0 30 3 * * *")` +and `@SchedulerLock`. It calls `userRepository.hardDeleteById(userId)` +and `flush()` so FK cascades (V11/V16/V18/V19/V22/V30/V6) execute, and +emits a `USER_HARD_PURGED` audit event. `purgeBatch` issues +`SET LOCAL app.allow_hard_delete = 'on'` to bypass V53's BEFORE-DELETE +trigger (line 147). **The implementation is correct.** + +However: `application.yml:26-27` defaults the gate to +`APP_PURGE_SOFT_DELETE_ENABLED:false`, and on every invocation the job +short-circuits at line 84-87 (or line 99 if `purge()` is called +directly). 
Unless the operator has explicitly set +`APP_PURGE_SOFT_DELETE_ENABLED=true` in `.env.prod` (cannot be verified +from this thread — no SSH key access), the GDPR Art. 17 / KVKK +right-to-erasure obligation is unfulfilled in production. Documentation +implies the job is doing work; runtime evidence in the codebase is that +it is idle by default. + +### #9 — Audit-log persistence — **PARTIAL, P2** + +5 spot-checks: + +1. **Login success / fail** — `AuthenticateUserService.java:78,108,128,131,161,247` + — emits `logAuthenticationFailed` and `logUserAuthenticated`. WORKING. +2. **MFA step pass / fail** — `VerifyMfaStepService.java:187,206,246,261,294,338,383` + — emits `logMfaStepFailed`, `logMfaStepCompleted`, `logMfaComplete`. WORKING. +3. **Password change** — `ChangePasswordService.java` has no `auditLogPort` + field at all. `grep -n "audit" ChangePasswordService.java` returns + nothing. **Silently NOT logged.** A successful password rotation + leaves no audit trace. (Reset flow `ResetPasswordService.java:41-94` + does emit; only the in-session change path is silent.) +4. **Token refresh / revoke** — `RefreshTokenService.java:117-121` emits + `REFRESH_TOKEN_REUSE_DETECTED` only on the reuse-detection path; a + normal mint or normal revoke does not write an audit row. The reuse- + detect line is the WORKING path; routine mint/revoke is silent. +5. **User delete / Tenant create / Tenant delete** — + `ManageUserService.java` and `ManageTenantService.java` import + `AuditLogQueryPort` (read) but neither has an `AuditLogPort` field + (write). User soft-delete and tenant create/delete actions + **do not emit audit rows**. Forensics on a tenant-removed-by-mistake + incident would have to fall back to DB triggers (none exist for + `tenants`) or container logs. + +Severity P2 because the audit-log-as-compliance-evidence claim +(SOC2 / ISO 27001 / KVKK 7-year retention rationale, cited at +`SoftDeletePurgeJob.java:39`) leaks holes for three high-stakes +operation classes. 
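The write-side gap in items 3 and 5 can be illustrated with a small sketch. It is written in Python rather than Java for brevity, and every name in it (`InMemoryAuditLog`, `log_security_event`, the `PASSWORD_CHANGED` action) is a hypothetical stand-in, not the project's actual `AuditLogPort` API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    action: str
    actor_id: str
    occurred_at: datetime


@dataclass
class InMemoryAuditLog:
    """Hypothetical write port; stands in for a persistent audit sink."""
    events: list = field(default_factory=list)

    def log_security_event(self, action: str, actor_id: str) -> None:
        self.events.append(AuditEvent(action, actor_id, datetime.now(timezone.utc)))


class ChangePasswordService:
    """Illustrative only: the credential store is faked with a dict."""

    def __init__(self, audit_log: InMemoryAuditLog) -> None:
        self._audit_log = audit_log
        self._hashes = {}

    def change_password(self, user_id: str, new_hash: str) -> None:
        self._hashes[user_id] = new_hash
        # The step the audit found missing: record the change once it succeeds.
        self._audit_log.log_security_event("PASSWORD_CHANGED", user_id)
```

The point of the sketch is the dependency shape: a service that holds only a read port has no way to emit a row, so the write port has to be injected before any logging call can be added.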
+ +### #1 — Voice auth (Resemblyzer 256-dim) — **WORKING** + +- `speaker_embedder.py:51-58` instantiates `resemblyzer.VoiceEncoder` — + the real GE2E pretrained model, not a stub. +- `speaker_embedder.py:97-106` runs `preprocess_wav` and + `embed_utterance` to produce a 256-dim L2-normalized vector + (`VOICE_EMBEDDING_DIM = 256` at line 27). +- `voice.py:114-167` /verify: cosine similarity at line 148, threshold + 0.65 at line 120; clamps `[0,1]` and decides `verified = similarity + >= VERIFY_THRESHOLD`. +- Centroid weighting: `pgvector_voice_repository.py:134-186` writes + individual rows + computes `AVG(embedding)::vector(256)` centroid; the + centroid is read by `find_by_user_id` (line 219-243), with INDIVIDUAL + fallback if no CENTROID exists yet. Quality-weighted is *advertised* + but the actual SQL `AVG(embedding)` is unweighted (each enrollment + contributes equally regardless of `quality_score`). Minor P3 doc + drift; not flagged as a verdict change. + +### #4 — UniFace passive liveness — **WORKING** + +`verification.py:104-124` calls `liveness_use_case.execute()`, then +*rejects* the entire request with HTTP 400 LIVENESS_FAILED if either +`is_live == False` or `score < 0.4`. The check runs **before** the +embedding extract / similarity step — so a spoof never reaches the +1:1 matcher. The verdict is genuinely load-bearing. Backend resolves +to UniFace MiniFASNetV2 via `LIVENESS_BACKEND=uniface + +LIVENESS_MODE=passive` per the deployed config. + +### #5 — Refresh-token family-revoke (V50) — **WORKING** + +`RefreshTokenService.java:107-123` — when a presented refresh token is +already revoked, `revokeFamily(token.getFamilyId(), Instant.now())` runs +and the count is logged to audit +(`REFRESH_TOKEN_REUSE_DETECTED`). Rotation +(`RefreshTokenService.java:213-220`) revokes the parent and mints a +sibling sharing `familyId`, so a single compromised token blows the +entire chain on re-presentation. 
Behavior conforms to RFC 6749 §10.4 +and OAuth 2.0 Security BCP §4.13. + +### #6 — GDPR data export — **WORKING** + +`UserDataExportService.java:67-86` returns a bundle including: +user core fields, enrollments (metadata only — `enrollmentData` +deliberately stripped at line 129), authFlows, audit logs (max 10k), +verificationSessions, oauth2Clients (only for tenant admins — line +202). + +Excluded by design: +- `password_hash`, `two_factor_secret`, backup codes (line 41-44) +- raw biometric vectors (line 80-84 — empty lists for + `voiceEnrollments` and `biometricEnrollments`) +- session tokens (line 153) +- client secrets (line 218) +- WebAuthn private material (handled at registration time) + +Refresh tokens are not enumerated because they are not exposed via this +service at all — the refresh-token table is not a serialize source. + +### #8 — WebAuthn registration + assertion — **WORKING** + +- Registration: `WebAuthnService.java:70-117` validates clientDataJSON + type (`webauthn.create`), challenge match, origin allowlist (RFC 6454 + §4 exact-match), then consumes the Redis challenge. P1-3 fix at line + 82-85 prevents null/empty `clientDataJSON` from passing. +- Assertion: `WebAuthnService.java:130-179` validates clientData → + authenticatorData → presence of `credentialId` + `signature` → + ECDSA SHA256 signature verify (line 184-215, real `Signature` + cryptographic verify, not a string compare). +- Sign-counter monotonic check: `validateSignCount` (line 247-258) + enforces `newCount > storedCount` unless both are zero (spec-permitted + for privacy-preserving authenticators). +- Origin allowlist: requires `app.webauthn.allowed-origins` env; + startup logs warn (line 47) if empty, in which case all assertions + fail-closed. + +### #11 — OAuth2 /authorize tenant guard — **WORKING** + +`OAuth2Controller.java:142,254` both call `validateAuthorizeRequest(...)` +before code minting (single-step authenticated branch + post-MFA +hosted-complete branch). 
The shared method at line 321-369 performs: +PKCE S256 enforcement for public clients (line 332-342), user lookup, +and exact tenant-id equality on `user.getTenant().getId() == +client.getTenant().getId()` (line 359-366). On mismatch: +HTTP 400 `invalid_request` with state echo (RFC 6749 §5.2 shape). + +### #12 — Hosted-login MFA full flow — **WORKING** + +- `HostedLoginApp.tsx:83-97` parses URL params (`state` is read at + line 89; not validated client-side — that's the relying party's job + per RFC 6749, correct). +- After password + MFA, posts to `/oauth2/authorize/complete` + (`HostedLoginApp.tsx:291-304`) with `mfaSessionToken` + `state`. +- `OAuth2Controller.java:213-241` validates the MFA session: existence, + not-expired, not-consumed (anti-replay at line 226-230), and + client-id binding (line 236-241). `OAuth2Service.java:172-200` + consumes + mints + deletes the session row inside a single + `@Transactional`, so a crash leaves the session burned. +- Code single-use: `OAuth2Service.java:228-229` deletes the Redis key + immediately on first /token call. Second presentation hits line 217 + (`stored == null`) and throws `CODE_NOT_FOUND`. +- `redirect_uri` re-validated at exchange: `OAuth2Service.java:279-281` + exact-match against the `storedRedirectUri` recorded at code mint + (RFC 6749 §4.1.3). Front-end `assertSafeRedirectScheme` + (`HostedLoginApp.tsx:347`) adds a defense-in-depth scheme allowlist. + +--- + +## Top Recommendations + +1. **Wire the Fernet ciphertext into read paths** in + `pgvector_embedding_repository.find_by_user_id` and + `pgvector_voice_repository.find_by_user_id`. Today the encryption is + security theater — the plaintext column is the source of truth for + every read. (P0) +2. **Replace NFC serial-only auth with chip authentication.** + At minimum require a challenge–response signed by the chip's + per-card key, or move NFC to a verification-pipeline-only feature + and remove it from the auth methods enum. 
Cloned-tag attack is + trivial today. (P1) +3. **Lift the DeepFace anti-spoof veto threshold** from + `liveness_result.confidence < 0.85` to "always-veto-when-spoof-label- + set". Honoring the explicit spoof verdict regardless of UniFace's + confidence is a single-line change in `check_liveness.py:157`. (P2) +4. **Confirm `APP_PURGE_SOFT_DELETE_ENABLED=true` in prod `.env.prod`**. + The job is wired but feature-flag-default-disabled; without it + GDPR Art. 17 is unfulfilled. Add a startup banner that logs the + flag's value at WARN when disabled. (P2) +5. **Emit audit rows from ChangePasswordService, ManageUserService + (delete), and ManageTenantService (create/delete).** Each is a + 2-line addition (`auditLogPort.logSecurityEvent(...)`) and removes + three blind spots in the compliance trail. (P2) diff --git a/INVESTIGATION_USER_CONSTRAINTS_2026-05-07.md b/INVESTIGATION_USER_CONSTRAINTS_2026-05-07.md new file mode 100644 index 0000000..aaac58a --- /dev/null +++ b/INVESTIGATION_USER_CONSTRAINTS_2026-05-07.md @@ -0,0 +1,213 @@ +# User Constraints Audit — 2026-05-07 + +## Methodology + +Read-only audit of `/opt/projects/fivucsas/` HEAD against the constraint +inventory in the brief. For each constraint I answered four questions: + +1. **Defined?** — Is the value present in code/config (file:line)? +2. **Enforced server-side?** — Not only client-side? +3. **Surfaced?** — Does the user get a clear, i18n'd message? +4. **Reasonable?** — Is the value defensible for the use case? + +Pattern: `grep` on per-domain anchor terms (`MAX_*`, `Duration.of*`, `*_TTL`, +`RATE_LIMIT_*`, `RETENTION`), then file reads to confirm enforcement and +exception/error-code routing. i18n confirmed by matching error codes +between Java exceptions, `GlobalExceptionHandler`, and +`web-app/src/i18n/locales/{en,tr}.json`. No code edited; no DB read. + +Per `feedback_verify_completion_claims.md` I trusted only HEAD source, not +status memos. + +## Findings table + +| Constraint | Defined? 
| Enforced server-side? | Surfaced (i18n)? | Value | Severity | File:line | +|---|---|---|---|---|---|---| +| Password min length | Yes | Yes | EN + TR | 8 | OK | `identity-core-api/.../PasswordPolicy.java:23` / `web-app/.../PasswordStep.tsx:35` | +| Password max length | Yes | Yes | No (validation lumped) | 128 | P3 | `PasswordPolicy.java:24` | +| Password complexity (4 classes) | Yes | Yes | EN-only inside back-end string | upper/lower/digit/special | P2 | `PasswordPolicy.java:52-65` | +| Frontend complexity validation | No | n/a | EN+TR helper text only | min 8 only | P2 | `PasswordStep.tsx:35`, `en.json:222` | +| Max login attempts | Yes | Yes | i18n key exists but never fires | 5 | **P0** | `AuthenticateUserService.java:57`, `:79`, `en.json:1576` | +| Lockout duration | Yes | Yes | Same — wrapped as INVALID_CREDENTIALS | 15 min | **P0** | `AuthenticateUserService.java:58`, `:79`, `:126` | +| Email-OTP TTL | Yes | Yes | Generic "Invalid or expired" | 5 min | P1 | `OtpService.java:19`, `EmailOtpAuthHandler.java:49` | +| Email-OTP max attempts | **No** | **No** | n/a | unlimited until TTL | **P0** | `OtpService.java` (no counter) | +| SMS-OTP TTL | Yes | Yes | Generic | 5 min (Redis OtpService) or Twilio Verify default | P1 | `OtpService.java:19`, `SmsOtpAuthHandler.java:51` | +| SMS-OTP max attempts | **No** | **No** | n/a | unlimited until TTL (Twilio Verify enforces its own; in-house path does not) | **P0** | same | +| SMS cost guard | No | No | n/a | none beyond rate limit | P1 | `SmsOtpAuthHandler.java`, no per-user/day cap | +| TOTP window tolerance | Yes (library default) | Yes | Generic | 0 (strict, single 30s window) | P3 | `TotpService.java:22` | +| TOTP digits / period | Yes | Yes | n/a | 6 / 30s SHA1 | OK | `TotpService.java:21,32` | +| MFA session TTL | Yes | Yes | "session expired" generic | 10 min | OK | `AuthenticateUserService.java:59,222` | +| Auth-session step TTL | Yes | Yes | Generic | 10 min | OK | `ExecuteAuthSessionService.java:55,98` | +| 
Access token TTL | Yes (config) | Yes | n/a (silent refresh) | 24h | P1 (long for SaaS auth) | `application.yml:81`, `JwtService.java:53` | +| Refresh token TTL | Yes (config) | Yes | n/a | 7d | OK | `application.yml:82`, `RefreshTokenService.java:41` | +| Refresh token rotation + reuse-detect | Yes | Yes | TokenRevokedException → 401 | family revoke on reuse | OK | `RefreshTokenService.java:108-121` | +| WebAuthn challenge TTL | Yes | Yes | Surfaced as generic | 5 min | OK | `WebAuthnService.java:24` | +| Step-up challenge TTL | Yes | Yes | Generic | 5 min | OK | `StepUpChallengeService.java:17` | +| QR code (auth) TTL | Yes | Yes | "Invalid or expired QR token" generic | 5 min | P2 | `QrCodeService.java:20`, `QrSessionService.java:34`, `QrCodeAuthHandler.java:44` | +| Face image upload max size (bio) | Yes | Yes (middleware before auth) | 413 PAYLOAD_TOO_LARGE; message NOT i18n'd, surfaced raw | 10 MB | P2 | `biometric-processor/app/core/config.py:85`, `app/main.py:229-256` | +| Face capture frame rate (client) | Implicit | n/a | n/a | `requestAnimationFrame` (~60fps) | P2 (battery/CPU) | `useFaceDetection.ts:182,266,327,406,418` | +| Voice clip max duration | **No explicit cap** | Only by MAX_UPLOAD_SIZE (10 MB) | n/a | unlimited within 10 MB ≈ 5 min @ 16 kHz 16-bit mono wav | **P1** | `app/api/routes/voice.py` (no duration check) | +| Voice clip min duration | **No** | **No** | n/a | none | P1 | same | +| Devices per user | **No cap** | **No** | n/a | unlimited | P1 | `DeviceController` / `WebAuthnCredential` no count check | +| Biometric enrollments per method | **No cap** (1 active per method by upsert) | Implicitly via upsert pattern | n/a | 1 active | OK | `ManageEnrollmentService.java` | +| Rate limit /auth/login | Yes | Yes | EN+TR | 10 per 5 min per IP | OK | `RateLimitService.java:71,316` | +| Rate limit register | Yes | Yes | EN+TR | 5 / hour / IP | OK | `RateLimitService.java:87,323` | +| Rate limit password reset | Yes | Yes | EN+TR | 5 / hour / IP | OK | 
`RateLimitService.java:104,330` | +| Rate limit /auth/mfa/step | Yes | Yes | "Too many attempts" + Retry-After | 30/min/IP | OK | `RateLimitService.java:181,362` | +| Rate limit biometric verify | Yes | Yes (api side) | EN+TR | 20/min/user | OK | `RateLimitService.java:121,338` | +| Rate limit API generic | Yes | Yes | EN+TR | 100/min/user | OK | `RateLimitService.java:138,346` | +| Rate limit GDPR export | Yes | Yes | EN+TR (`exportRateLimit`) | 1/hour/user | OK | `RateLimitService.java:157,354` | +| Rate limit PKCE failures | Yes | Yes | OAuth2 error response | 30/5min/clientId | OK | `RateLimitService.java:219,370` | +| Rate limit (bio) generic | Yes | Yes | 429 raw | 60/min default | OK | `biometric-processor/app/core/config.py:343` | +| Rate limit bio per endpoint | Yes | Yes | 429 raw | enroll 10, verify 30, search 20, liveness 15, batch 5 / min | OK | `config.py:350-379` | +| Anti-spoof confidence threshold | Yes | Yes | None — fails as generic verify-fail | 0.5 | OK | `config.py:131` | +| Liveness confidence (server) | Yes | Yes | Surfaced via score in response | 70 (0-100 scale) | OK | `config.py:140` | +| Passive liveness threshold (client) | Yes | n/a (client only) | None | 0.45 (0-1 scale) | P2 | `web-app/.../useFaceChallenge.ts:13` | +| Face match threshold | Yes | Yes | None — surface as is_match=false | 0.45 | OK | `config.py:139` | +| Face match aged threshold | Yes | Yes | None | 0.38 after 2 years | OK | `config.py:146,154` | +| Quality gate threshold | Yes | Yes | "Image quality low" | 70 (0-100) | OK | `config.py:141` | +| GDPR purge retention window | Yes | Yes (job, default OFF) | n/a (admin) | 30 days | OK | `SoftDeletePurgeJob.java:56`, `application.yml:24-27` | +| Audit log retention | Yes (V57) | Yes (pg_partman, fail-soft) | n/a | 24 months | OK | `V57__audit_logs_pg_partman.sql:265-286` | + +## P0 narrative + +### P0-1 — Account-lockout error code never reaches the user + +`AuthenticateUserService.java:79` and `:127` both throw 
`InvalidCredentialsException` +when an account is locked or just got locked. The dedicated +`AccountLockedException` class exists at +`identity-core-api/.../domain/exception/AccountLockedException.java:7-25` +with error code `ACCOUNT_LOCKED` and a `remainingLockTimeSeconds` payload +field, but is **not used by the auth path** (only referenced in +`ActivityLogResponse.java:51` for past-tense audit rendering). + +Consequence: the frontend's localized message +`en.json:1576` / `tr.json:1576` (`ACCOUNT_LOCKED`: "Your account is +temporarily locked. Try again in {{minutes}} minutes.") is dead. Users get +the generic "Invalid email or password" instead — they cannot tell the +account was locked, nor when to retry. This breaks the explicit UX promise +the i18n catalogue makes. + +The lockout itself does work server-side (5 attempts → 15 min lock, +auto-unlock check at `:71-75`). It is the surfacing that is broken. + +### P0-2 — Email/SMS OTP have no per-code attempt counter + +`OtpService.java:29-43` validates an OTP by comparing the submitted code +to the stored Redis value. On mismatch the code is **kept** (no delete, +no counter increment). An attacker holding a valid MFA session token can +issue up to 30 attempts/min/IP (the `mfaStepBuckets` +limit, `RateLimitService.java:181,362`) for the full 5-min TTL = ~150 +guesses against a 10⁶ space ≈ 0.015% per code. Acceptable today, but the +defense relies entirely on the IP rate limit; an attacker rotating IPs +or spreading across 10 minutes can substantially raise the success +probability. Standard practice (NIST SP 800-63B §5.1.3.2, RFC 6238 §5.2) +is to invalidate after 3-5 failures and force a regenerate. + +The Twilio Verify provider path (`SmsOtpAuthHandler` when +`SMS_PROVIDER=twilio`) inherits Twilio's own attempt budget (5) — but the +in-house Redis-backed path used for email and the noop SMS provider has +no such limit. 
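The per-code attempt budget that P0-2 calls for could look roughly like the sketch below. It is written in TypeScript with an in-memory `Map` standing in for Redis, so it is not the real `OtpService.java`; `OtpStore`, `MAX_ATTEMPTS`, and the result strings are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch of a per-code OTP attempt counter.
// An in-memory Map stands in for Redis; none of these names exist in the codebase.
type OtpResult = "OK" | "INVALID" | "LOCKED" | "EXPIRED";

interface OtpEntry {
  code: string;
  expiresAt: number; // epoch milliseconds
  failures: number;  // wrong guesses against this specific code
}

const MAX_ATTEMPTS = 5; // assumed budget; the audit suggests 3-5

class OtpStore {
  private entries = new Map<string, OtpEntry>();

  // Store a freshly generated code with a zeroed failure counter.
  issue(key: string, code: string, ttlMs: number, now: number): void {
    this.entries.set(key, { code, expiresAt: now + ttlMs, failures: 0 });
  }

  validate(key: string, submitted: string, now: number): OtpResult {
    const entry = this.entries.get(key);
    if (!entry || now >= entry.expiresAt) {
      this.entries.delete(key);
      return "EXPIRED";
    }
    if (entry.code === submitted) {
      this.entries.delete(key); // codes are single use
      return "OK";
    }
    entry.failures += 1;
    if (entry.failures >= MAX_ATTEMPTS) {
      this.entries.delete(key); // burn the code so the user must request a new one
      return "LOCKED";
    }
    return "INVALID";
  }
}
```

With a budget of 5, the success chance per issued code drops from roughly 150 in 10⁶ to 5 in 10⁶, independent of how many IPs the attacker rotates through. A Redis-backed version would need the increment and the delete to be atomic, and a production comparison should be constant-time; both are left out of this sketch.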
+ +## P1 narrative + +### P1-1 — Access-token lifetime is 24 h + +`application.yml:81` sets `JWT_EXPIRATION=86400000` (24 h). Industry norm +for refresh-token-bearing systems is 5-15 min. A stolen access token is +usable for an entire day. Refresh-token rotation + family revoke +(`RefreshTokenService.java:108-121`) mitigates *long-term* compromise but +does nothing for the active stolen access token. Recommend cut to 15 min +in `application-prod.yml`. + +### P1-2 — Voice clip has no duration cap + +`biometric-processor/app/api/routes/voice.py` accepts whatever the +upstream sends, gated only by the 10 MB `MAX_UPLOAD_SIZE`. At 16 kHz +mono PCM that is ~5 minutes. Adversary uploading near-cap voice files +times out the embedder (`ML_MODEL_TIMEOUT_SECONDS=30`, +`config.py:166`) and consumes a verify rate-limit slot per request. +Recommend a hard 10s cap (Resemblyzer typical enrolment 3-5s). + +### P1-3 — No cap on devices per user + +`WebAuthnCredential` and `user_devices` allow unbounded growth. A single +user can register thousands of devices, bloating the user_devices table +and the WebAuthn allowList payload at challenge time (which is sent to +the browser, increasing latency for everyone on that account). + +### P1-4 — SMS cost guard is per-IP only + +`RateLimitService` rate-limits login per IP and registration per IP. +There is no per-user-per-day SMS cost cap. An attacker who creates many +accounts (registration is 5/hour/IP — easily multi-IP'd) can trigger +thousands of SMS sends to victim numbers. Cost can spiral on Twilio. +Recommend a per-`phoneNumber` and per-tenant-per-day SMS quota. + +## P2 narrative + +- **OTP failure messages are generic** ("Invalid or expired …"). User + cannot tell whether to retry or to request a fresh code, and cannot + see remaining attempts (which today is "infinite", see P0-2). + `EmailOtpAuthHandler.java:49`, `SmsOtpAuthHandler.java:51,63`, + `QrCodeAuthHandler.java:44`. 
+- **Frontend register form does not preview password complexity** + (`PasswordStep.tsx:35` only validates `min(8)`). Backend + rejects with a long English string concatenation + (`PasswordPolicy.java:69`) that is not i18n'd. A user on the `tr` + locale sees an English error. +- **Bio 413 response body is not i18n'd** + (`biometric-processor/app/main.py:247-256` returns raw English). +- **Client passive liveness threshold** is hard-coded + (`useFaceChallenge.ts:13` `0.45`). Server-side has the same value as + config but the client has no way to read the operator's choice, so the + two drift if the operator tunes only one side. +- **Face capture loop is RAF-bound** + (`useFaceDetection.ts`). On a high-refresh display (e.g. 144 Hz) the + loop runs more often than detection needs, and a low-end phone burns + battery for nothing. Throttle to ≥ 33 ms per detection (about 30 fps). + +## P3 narrative + +- **Password max length 128** is reasonable but never surfaced in helper + text (`en.json:222` only mentions min). Edge case for passphrase users. +- **TOTP strict (`allowedTimePeriodDiscrepancy=0`, + `TotpService.java:22`)** rejects clock-skewed codes. This is the + dev.samstevens library default. Reasonable, but it causes user-facing + failures when the device clock drifts > 30 s. Setting tolerance to 1 + (allow ±30 s) is the typical middle ground. + +## Recommendation list + +1. **P0**: Throw `AccountLockedException` (with `remainingLockTimeSeconds`) + from `AuthenticateUserService.java:79,127` instead of + `InvalidCredentialsException`. Add a handler to + `GlobalExceptionHandler.java` that maps it to HTTP 423 (or 429) with + error code `ACCOUNT_LOCKED` and a `retryAfterSeconds` field. The i18n + catalogue is already there. +2. **P0**: Add per-OTP attempt counter in `OtpService.java`. After 5 + wrong attempts, delete the code, force regenerate, and surface + `OTP_LOCKED` with retry-after. Apply to email + SMS (in-house path) + + QR code. +3. **P1**: Cut `JWT_EXPIRATION` default to 900 000 ms (15 min) in + `application-prod.yml`. Audit any frontend code that assumes 24 h. +4. 
**P1**: Add `MAX_VOICE_DURATION_SECONDS` in `biometric-processor` and + reject in `voice.py` before embedding. Default 10 s. +5. **P1**: Add `MAX_DEVICES_PER_USER` (suggest 20). Enforce in + `DeviceController` create/register paths and in `WebAuthnCredential` + registration. +6. **P1**: Add per-phone and per-tenant-per-day SMS quotas; track in + Redis with daily key. +7. **P2**: i18n the password-complexity violation list — return error + code `PASSWORD_POLICY_VIOLATION` plus a structured + `requirements: string[]` in the response so the frontend can render + each missing rule against `i18n` keys. +8. **P2**: Read passive liveness threshold from server config rather + than hard-coding in `useFaceChallenge.ts:13`. +9. **P2**: Add i18n for biometric-processor 413 bodies (or have + identity-core-api translate them when proxying). +10. **P2**: Throttle the face-detect RAF loop to ≥ 33 ms. +11. **P2**: Surface remaining attempts on OTP/login failure responses + (`attemptsRemaining: n`) so the frontend can render a counter. +12. **P3**: Document password max length (128) in `passwordHelper`. +13. **P3**: Set TOTP `allowedTimePeriodDiscrepancy=1` for clock-skew + grace. diff --git a/INVESTIGATION_VERIFY_FLOW_2026-05-07.md b/INVESTIGATION_VERIFY_FLOW_2026-05-07.md new file mode 100644 index 0000000..100d332 --- /dev/null +++ b/INVESTIGATION_VERIFY_FLOW_2026-05-07.md @@ -0,0 +1,207 @@ +# verify.fivucsas E2E Flow Audit — 2026-05-07 + +Read-only walkthrough of the hosted-login surface at `verify.fivucsas.com/login` +(also reached as `demo.fivucsas.com` with `client_id=marmara-bys-demo`). Code at +HEAD: `web-app/master`, `identity-core-api/master`. No code edited. + +## Methodology + +1. Walked every render branch of `HostedLoginApp.tsx`. +2. Walked the state machine in `LoginMfaFlow.tsx` (phases: `password` → + `method-picker` → `mfa-step` → `complete`). +3. 
Read each step component: Password / EmailOtpMfa / SmsOtp / Totp / Face / + Voice / Fingerprint / HardwareKey / NFC / QrCode. +4. Cross-referenced backend `OAuth2Controller.java` `/authorize`, + `/authorize/complete`, `/clients/{id}/public`. +5. Compared findings against `INVESTIGATION_WIRES_2026-05-07.md` and + `INVESTIGATION_PIPELINES_2026-05-07.md` to focus on NEW issues. +6. Verified the user's three reported bugs (PASSWORD-wrong-tenant, + NFC double-message, SMS dark-input) and the two coordinator add-ons + (step-counter visibility, generic-error reappearance). + +## Step-by-step walkthrough + +### 1. URL-param parsing (`HostedLoginApp.tsx:83-97`) +`client_id`, `redirect_uri`, `state`, `nonce`, `code_challenge`, +`code_challenge_method`, `scope`, `ui_locales`, `theme`, `api_base_url`. +Defaults are forgiving (`scope` defaults to `openid profile email`, +`code_challenge_method` to `S256`). The `state` is echoed back on success +(line 356) — **but not on the `setFinalError` paths** (lines 332/338/350/381), +so a tenant that was watching for `state` to reappear on cancel/error sees +nothing. The user is left on `verify.fivucsas.com/login?...`. + +### 2. Tenant-meta fetch (`HostedLoginApp.tsx:179-258`) +Calls `/oauth2/clients/{client_id}/public`. 404/400 → `paramError = +'invalidClient'`. Other failures → `metaLoadFailed` (retry button). +Timeout 10s. + +**Gap**: when `client_id` resolves but the client is **`disabled`** at the +backend, `OAuth2Controller` returns 200 + meta with no `disabled` field; +the FE proceeds and only fails at `/authorize/complete` with a generic +exchange error. No early disabled-client UX. + +### 3. Password step (`LoginMfaFlow.tsx:107-155`, `PasswordStep.tsx:1-161`) +Calls `authRepository.login({email, password})` — `AuthRepository.ts:54-83` +posts `/auth/login` with **email + password only**. There is **no `client_id`, +no tenant slug, no tenant_hint** sent. 
The user authenticates against the +*global* user table; tenant binding is checked only at +`/oauth2/authorize/complete` (`OAuth2Controller.java:359-366`) at the very end +of the flow, after the user has finished MFA. + +**This is the user-reported P0 #1**: ahabgu@gmail.com (system tenant) on +`client_id=marmara-bys-demo` passes the password step because login does not +care which client they came from. + +### 4. Method picker / MFA dispatch (`LoginMfaFlow.tsx:195-249`) +Step components emit a single `onSubmit` callback. `verifyStep` +(line 181) posts `/auth/mfa/step` with `{sessionToken, method, data}`. +Result branches: `AUTHENTICATED` → `onComplete`; `STEP_COMPLETED` → next step; +**any other status falls through to a generic +`t('widget.verificationFailed')`** (line 247). Backend can return a +`MfaStepResponse` with status `INVALID_CODE`, `RATE_LIMITED`, `LOCKED_OUT`, +`METHOD_ALREADY_USED` (per `identity-core-api/CLAUDE.md` PR #65) — none of these +are surfaced specifically; the user always sees "Verification failed." + +### 5. NFC step (`NfcStep.tsx:30-261`) +`onSubmit` is fired inside the `reading` event handler at line 81-83 *after* +`setScanResult(serialNumber)` at line 79. The success Alert at line 191-194 +renders as soon as `scanResult` is non-null. The parent +(`LoginMfaFlow.verifyStep`) then posts `/auth/mfa/step` and on backend +rejection passes `error` back through the `error` prop. Both `error` (line 168) +**and** `scanResult`-Alert (line 191) render simultaneously. + +**This is the user-reported P0 #2**: "NFC belge başarıyla okundu!" + "Doğrulama +yapılamadı". + +### 6. Hosted-login completion (`HostedLoginApp.tsx:263-397`) +Two paths: with `mfaSessionToken` → POST `/oauth2/authorize/complete`; without +→ GET `/oauth2/authorize` carrying `Authorization: Bearer {accessToken}`. +Both are caught at line 358 → only one specific error message +(`tenantMismatch`, lines 370-379). 
Every other failure mode (PKCE mismatch, +scope mismatch, redirect-URI exact-match miss, MFA session expired, MFA session +already consumed, code_challenge missing for public clients, etc.) collapses to +`hosted.exchangeFailed`. + +## Findings table + +| Sev | File:line | Issue | Fix sketch | +|---|---|---|---| +| P0 | `LoginMfaFlow.tsx:114` + `AuthRepository.ts:59` | `/auth/login` carries no `client_id`/tenant, lets cross-tenant users pass step 1 (user-reported #1) | Forward `_clientId` (already prop, currently `_`-prefixed unused — line 51) into `login(...)` and a new `tenantHint` field on `/auth/login`; backend rejects mismatch early with 403 + `tenantMismatch` error code | +| P0 | `NfcStep.tsx:79-83 + 168 + 191` | Success Alert (`scanResult`) and error prop render together — "successfully read" + "verification failed" (user-reported #2) | Don't render success Alert until parent confirms; alternatively, when `error` becomes truthy, also clear `scanResult` | +| P1 | `HostedLoginApp.tsx:332,338,350,381` | All non-tenant errors collapse to `hosted.exchangeFailed`. PKCE failure, expired/consumed MFA session, scope mismatch, redirect-URI mismatch all look identical to the user (coordinator-reported #2) | Inspect `error_description` for known substrings: `"MFA session expired"`, `"MFA session already used"`, `"MFA not completed"`, `"code_challenge"`, `"Invalid scope"`, `"redirect_uri"`. 
Map each to a dedicated `t()` key and a recovery action (re-login, contact-support, switch-account) | +| P1 | `HostedLoginApp.tsx:332,338,350,381` | `state` parameter is not echoed on error paths; RFC 6749 requires error to be returned to `redirect_uri` with `state`, but the hosted page just sets `finalError` and stays | When `redirect_uri` is known-safe, optionally redirect-back with `error=...&state=...` via a "Return to app" CTA; or document that this hosted page is the terminal failure surface | +| P1 | `LoginMfaFlow.tsx:247` | All non-`AUTHENTICATED`/`STEP_COMPLETED` MFA results collapse to `widget.verificationFailed`; no rate-limit / lockout / replay specific messages | Switch on `res.status`/`res.errorCode` and map each to its own i18n key (re-uses keys already in `tr.json`/`en.json` such as `auth.errors.locked`, `auth.errors.tooManyAttempts`) | +| P1 | `LoginMfaFlow.tsx:343` (`userId="mfa-session"`) | Literal-string sentinel passed as a userId; the QR token is generated via session token but the prop name lies | Either remove the prop (refactor `QrCodeStep` to never need `userId` in MFA-mode) or pass `mfaSessionToken` and rename | +| P1 | `HostedLoginApp.tsx:418-420` | If framed, app renders `null` and *only* attempts top-window navigation in an effect. Result: blank tab if the frame-bust is blocked. No "click to open in new tab" fallback | Render a minimal "open in new window" CTA so non-script-driven framing isn't a dead end | +| P1 | `HostedLoginApp.tsx` (no disabled-client branch) | A revoked/disabled OAuth2 client returns meta successfully but fails at `/authorize/complete`. 
The user only learns at the very end | Add `disabled`/`revoked` field to `/oauth2/clients/{id}/public` response and render an upfront "this app is no longer enabled" panel | +| P1 | Step counter `MultiStepAuthFlow.tsx:552-565` (cross-cutting; mirrored in `LoginMfaFlow.tsx:421-430` which is the verify-app path) | `LoginMfaFlow` already renders `StepProgress` at the *top* (line 426/429) — but `MultiStepAuthFlow` (the in-app Settings re-auth flow) renders it at the BOTTOM. Coordinator-reported user bug refers to the in-app flow, not verify; verify is correct | Move `MultiStepAuthFlow` step counter to top of CardContent above the step content (mirrors `LoginMfaFlow.tsx:421`) | +| P1 | `LoginMfaFlow.tsx:135` fallback `'EMAIL_OTP'` | When `enrolledMethods` is empty (no enrolled MFA methods) and backend returns no `twoFactorMethod`, defaults silently to `EMAIL_OTP` and renders `EmailOtpMfaStep`. If user has no email enrollment either, they hit a dead-end after the empty-OTP send | When `enrolledMethods.length === 0` and no `twoFactorMethod`, show an explicit "no MFA method available — contact support" UI | +| P2 | `PasswordStep.tsx:95-97`, `QrCodeStep.tsx:46-49` (line numbers approximate to grep), `TotpStep.tsx` | Hardcoded `#f8fafc` / `#f1f5f9` / `#fff` for input backgrounds — light-only. With `theme=dark` URL param these inputs become bright-white islands inside the dark card | Replace with `(th) => alpha(th.palette.action.hover, 0.4)` and `'background.paper'` etc. 
| +| P2 | `EmailOtpMfaStep.tsx:181,188,227,242` | Hardcoded `color: '#1a1a2e'` (dark text on white), `'rgba(0,0,0,0.4)'`, `'rgba(0,0,0,0.6)'`, `#6366f1` — entire step ignores theme mode | Replace literal hex with `'text.primary'` / `'text.secondary'` / `'primary.main'` | +| P2 | `SmsOtpStep.tsx:122-126` | Background hover/focus all collapse to `'background.default'`; user-reported "blackness" theme bug — in dark theme the input border is invisible against the same-colored background | Use distinct `background.paper` for normal vs `action.hover` for hover; verify against light AND dark themes | +| P2 | `EmailOtpMfaStep.tsx:54-56` | `useEffect(() => { sendOtp() }, [])` with eslint-disable. In React strict-mode dev, fires twice; in prod the user can hit a `/auth/mfa/send-otp` rate limit if they refresh fast | Guard with a ref the same way `QrCodeStep` uses `didInitialGenerateRef` | +| P2 | `SmsOtpStep.tsx:45-53` | No `submitted` flag — if the user pastes a 6-digit code AND the resend timer hits zero AND they click resend, two requests can race | Mirror `EmailOtpMfaStep.tsx:84-87` (`submitted` ref) | +| P2 | `HostedLoginApp.tsx:122-123, 418` | `isFramed` is computed once at module load; if the page is opened in a new window from a framed parent, evaluation may stale on hot-reload (dev-only) | Move `window.top !== window.self` into the effect and recompute on focus | +| P2 | `LoginMfaFlow.tsx:51` | `clientId` is destructured as `_clientId` (unused) — the prop exists but is dead. 
Dev-readability + masks the missing tenant-context wiring (the unused prop *should* be the fix surface for the P0 above) | Wire `_clientId` through to `authRepository.login` and `verifyStep` payloads | +| P2 | `HostedLoginApp.tsx:662-664` | `appOrigin` hardcoded to `https://app.fivucsas.com` — staging/preview environments ship a broken "Open Developer Portal" CTA | Read from `envConfig.appOrigin` (mirrors how `apiBaseUrl` is handled at line 95) | +| P2 | `HostedLoginApp.tsx:399-411` (`handleCancel`) | If `redirect_uri` is missing the user gets `window.history.back()` or `window.close()` — `close()` is no-op for windows the script didn't open | Route to a friendly "/" landing on the same origin | +| P3 | `HostedLoginApp.tsx:200-215` | DEV-only `console.warn` for malformed `redirect_uri`. Prod operators cannot triage without dev tools open | Surface a non-blocking `Alert severity="info"` in dev-build (`import.meta.env.DEV`) AND log to `LoggerService` so prod is observable | +| P3 | `LoginMfaFlow.tsx:74-92` | `BiometricEngine.initialize()` warm-up runs on EVERY hosted-login page-load even when MFA may never call FACE — burns ~3-5MB WASM/model cache for users who only have password+SMS | Warm-up only once `availableMethods` is known to include FACE | +| P3 | `LoginMfaFlow.tsx:480` | `key={phase + selectedMethod}` causes a remount whenever method changes — fine for state hygiene but kills any in-progress camera stream on FACE → method-picker → FACE oscillation | Track per-method state externally | + +## P0 narrative — security/correctness boundary + +**1. Cross-tenant password leakage** (user-reported, confirmed). The hosted +login is advertised as "tenant-scoped sign-in" (the page renders +`signingInTo: { tenant: clientLabel }` — `HostedLoginApp.tsx:538`). A user who +belongs to a *different* tenant can fully type a password and pass step 1 +because `/auth/login` is tenant-blind. 
They will only be rejected at the very +final exchange step (`OAuth2Controller.java:359-366`). This is bad UX, but +also a **timing-oracle**: the time-to-rejection differs depending on whether +`(email, password)` was valid (full MFA latency vs immediate +`invalid_credentials`), so an attacker can enumerate which emails are valid +across the whole platform. + +**2. NFC contradictory state.** Frontend success races backend rejection. The +"successfully read" Alert is technically truthful (the chip *was* read) but in +the user's mental model "success" should mean "MFA passed." Fix: either reword +the Alert to "NFC chip read — verifying…" and clear it on parent error, or +delay rendering the success state until parent confirms `STEP_COMPLETED`. + +## P1 narrative — UX-breaking + +**Generic "Login could not be completed."** The coordinator-reported +re-occurrence is real: of the 6 distinct backend rejection paths in +`OAuth2Controller.java:215-249, 262, 364`, only one (`tenant`-substring) is +mapped on the FE (`HostedLoginApp.tsx:370-379`). All five others — *Unknown +MFA session, MFA session expired, MFA not completed, MFA session already used, +client_id mismatch* — fall through to `hosted.exchangeFailed`. Users see the +same useless message regardless of root cause, which means they cannot +self-recover (e.g. an "MFA session expired" should restart the flow, not +"return to app"). + +**OAuth-2 error-on-redirect missing.** The hosted page is the terminal +surface — when something fails after MFA, the user is stranded on +`verify.fivucsas.com` and the relying party never sees an `error=...` callback. +RFC 6749 §4.1.2.1 expects errors *to be redirected back* to the registered +`redirect_uri` with `error` and `state`. Right now we eat the error. + +## P2 narrative — UX-rough + +The dark-theme drift is widespread: 4 step components hardcode +light-only colors. 
`HostedLoginApp` honours `theme=light|dark` and forces +`document.documentElement.lang` (line 167), but theme is *not* propagated as an +override — only `theme=dark` URL param triggers it. Any tenant who omits the +param gets light mode regardless of `prefers-color-scheme`. The "blackness" +the user reported on the SMS step is the dark-theme surface meeting a +light-mode-hardcoded input. + +The auto-OTP-send race (EmailOtp on mount) hasn't bitten yet because OTP +backend is rate-limited per session, but it will fire two audit-log rows for +every fresh page load, doubling observability noise. + +## P3 narrative — cosmetic + +Dev console warnings for `redirect_uri` shape are invisible in prod. The +QR-step `userId="mfa-session"` literal is misleading. The +`BiometricEngine.initialize()` warm-up is unconditional. None block release. + +## Top recommendations to ship next + +1. **Fix the user-reported NFC double-message** — single-line edit in + `NfcStep.tsx`: clear `scanResult` when `error` becomes truthy. Smallest + blast-radius, ships immediately. +2. **Map all `/oauth2/authorize/complete` errors to specific i18n keys** — + regex-match `error_description` against the 5 known substrings, add + `hosted.mfaExpired`, `hosted.mfaConsumed`, `hosted.pkceMismatch`, + `hosted.redirectMismatch`, `hosted.scopeMismatch`. Same pattern as the + tenant-mismatch fix in web #78. +3. **Tenant-bind the password step** — extend `/auth/login` request DTO with + optional `tenantHint`/`clientId`, backend short-circuits with a 403 if the + user's tenant doesn't own the client. Closes the cross-tenant timing-oracle + AND fixes the user-reported wrong-tenant pass. +4. **Theme-correct the 4 step components** — replace hardcoded hex with + theme-aware tokens. Mechanical change, biggest visual win in dark mode. +5. **Surface MFA-step rejection codes** — replace + `LoginMfaFlow.tsx:247` generic message with a switch on `res.status`/ + `res.errorCode`. Reuses existing i18n keys. 
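Recommendation 2 could be sketched as an ordered substring table like the one below. The substrings and the `hosted.*` keys are taken from the findings in this document; the function name `mapExchangeError`, the case-insensitive matching, and the choice to leave "MFA not completed" on the generic fallback (no dedicated key is proposed for it here) are assumptions of this sketch, not existing code in `HostedLoginApp.tsx`.

```typescript
// Hypothetical helper: map an OAuth2 `error_description` to an i18n key.
// Rule order matters only if a description could contain several substrings.
const ERROR_KEY_RULES: ReadonlyArray<readonly [string, string]> = [
  ["MFA session expired", "hosted.mfaExpired"],
  ["MFA session already used", "hosted.mfaConsumed"],
  ["code_challenge", "hosted.pkceMismatch"],
  ["Invalid scope", "hosted.scopeMismatch"],
  ["redirect_uri", "hosted.redirectMismatch"],
];

const FALLBACK_KEY = "hosted.exchangeFailed";

function mapExchangeError(errorDescription: string | null | undefined): string {
  // Treat a missing description as empty so it falls through to the fallback.
  const haystack = (errorDescription ?? "").toLowerCase();
  for (const [needle, key] of ERROR_KEY_RULES) {
    if (haystack.indexOf(needle.toLowerCase()) !== -1) {
      return key;
    }
  }
  return FALLBACK_KEY;
}
```

Substring matching on a human-readable description is brittle; if the backend later exposes a stable error code, switching the lookup to that code would be the sturdier design.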
+ +## Constraints honoured + +- Read-only investigation. No code edited. +- ≤2500 words (currently ~1450 prose + table). +- Every finding has a file:line citation. +- User's three reported bugs cross-checked: PASSWORD wrong-tenant + (`AuthRepository.ts:59`, `LoginMfaFlow.tsx:114` — confirmed); NFC double + message (`NfcStep.tsx:79-83 + 168 + 191` — confirmed); SMS dark theme + (`SmsOtpStep.tsx:122-126` — confirmed, broader pattern across 4 components). +- Coordinator add-ons cross-checked: step-counter visibility — verify-app's + `LoginMfaFlow` renders counter at TOP (line 421-430), so the user-reported + bug is for `MultiStepAuthFlow.tsx:552-565` (in-app re-auth), not the verify + flow — flagged for completeness; generic "Login could not be completed" — + confirmed five missing error mappings. +- Cross-checked against `INVESTIGATION_WIRES_2026-05-07.md` and + `INVESTIGATION_PIPELINES_2026-05-07.md` — those covered the OAuth2 wire + contract and PKCE/state echo on success, but did NOT cover step-component + state contradictions, theme drift, or error-mapping gaps. Those are this + doc's contribution. diff --git a/INVESTIGATION_WIRES_2026-05-07.md b/INVESTIGATION_WIRES_2026-05-07.md new file mode 100644 index 0000000..7f5a546 --- /dev/null +++ b/INVESTIGATION_WIRES_2026-05-07.md @@ -0,0 +1,128 @@ +# Wire Contract Audit — 2026-05-07 + +Read-only audit of frontend ↔ backend payload contracts across 10 user flows. +File:line citations are literal at HEAD; verified per memory rule +`feedback_verify_completion_claims.md` by re-reading code, not docs. + +## Methodology + +For each flow: +1. Read frontend method (request body keys + types). +2. Read backend `@PostMapping` / DTO declaration (required vs optional). +3. Diff: missing required, extra-dropped, naming, types, error shape. +4. Cross-check downstream proxy (BiometricServiceAdapter) when relevant. 
+ +## Per-flow contract table + +| # | Flow | Path | Req shape match | Resp shape match | Error shape | Severity | +|---|------|------|-----------------|------------------|-------------|----------| +| 1 | Login | `POST /api/v1/auth/login` | OK | drops `tokenType`, `completedMethods` | `ErrorResponse` envelope | P3 | +| 2 | MFA step | `POST /api/v1/auth/mfa/step` | OK | extra `error`/`expectedMethods` keys not on FE type | mixed (200+`status:ERROR` vs 400/401/409) | **P1** | +| 3 | Refresh | `POST /api/v1/auth/refresh` | OK (`refreshToken`) | drops `tokenType` | OK | P3 | +| 4 | Face verify | `POST /api/v1/biometric/verify/{userId}` | tenant_id never sent | `distance`, `threshold` never returned | OK | **P1** | +| 5 | Face enroll | `POST /api/v1/biometric/enroll/{userId}` | tenant_id, client_embedding(s) never sent | OK | OK | **P1** | +| 6 | Face search | `POST /api/v1/biometric/search` | tenant_id, max_results never sent | mixed `matches`/`results`/`best_match` | OK | **P1** | +| 7 | OAuth2 authorize / complete | GET/POST `/api/v1/oauth2/authorize[/complete]` | OK | OK | RFC 6749 `{error,error_description}` (different from rest) | P1 | +| 8 | User register (open) | `POST /api/v1/auth/register` | OK | OK | OK | — | +| 8b | User create (admin) | `POST /api/v1/users` | OK | OK | OK | — | +| 9 | Auth-session step | `POST /api/v1/auth/sessions/{id}/steps/{order}` | path mismatch + DTO mismatch | OK | OK | **P0** (dead path) | +| 10 | Voice verify | `POST /api/v1/biometric/voice/verify/{userId}` | OK | OK (proxy translates) | OK | — | + +## Cited evidence + +### 1. POST /auth/login +- FE: `web-app/src/core/repositories/AuthRepository.ts:59-62` sends `{email, password}`. +- BE: `identity-core-api/.../controller/AuthController.java:132-150` accepts `LoginRequest`. +- BE DTO: `identity-core-api/.../dto/LoginRequest.java:9-26` — fields `email`, `password`, optional `clientId`. 
FE never sends `clientId` (memory hint: OAuth widget call would benefit from it; today it is silently null in audit logs). +- Response: `dto/AuthResponse.java:17-48` carries `tokenType`, `completedMethods`. FE `AuthApiResponse` (`AuthRepository.ts:18-37`) does not declare either field — silently dropped. Type-confusion-safe but loses the post-password completed-methods hint that informs MFA UI. + +### 2. POST /auth/mfa/step +- FE: `AuthRepository.ts:137-141` sends `{sessionToken, method, data}`. Type `MfaStepResponse` (`domain/interfaces/IAuthRepository.ts:28-45`) expects `status` ∈ `STEP_COMPLETED|AUTHENTICATED|FAILED|ERROR|CHALLENGE`. +- BE controller: `AuthController.java:580-604` reads `Map` keys `sessionToken`, `method`, `data`. No DTO; missing key is null and surfaces inside `VerifyMfaStepService`. +- BE response shape: `application/service/mfa/VerifyMfaStepResponse.java:28-97`. + - `status: "FAILED"` (line 28) — returns HTTP 200 with body `{status:"FAILED", message, currentStep, totalSteps, expectedMethod, completedMethods, nextAction}`. + - `status: "ERROR"` (line 54) — returns HTTP 200. + - `status: "ERROR"` + `error: "METHOD_ALREADY_USED"` (line 82-97) — returns HTTP **409** with extra fields `error`, `expectedMethods`, `nextAction` that are NOT in the FE TS type. + - `badRequest`/`unauthorized` (lines 59,64) return HTTP 400/401 with `{status:"ERROR", message}` — same body shape as 200-ERROR; FE distinguishes only via HTTP status, never via body. +- **P1**: success uses 200 always; failure uses a *mix* of 200+`status:"FAILED"` AND 400/401/409. FE error mapping (`utils/formatApiError.ts`) only sees the HTTP code, so a `200/FAILED` is treated as success unless every caller also reads `data.status`. The two MFA callers (`TwoFactorDispatcher.tsx:82`, `LoginMfaFlow.tsx:186`) do read `.status`, so live-correct, but the contract is fragile. + +### 3. POST /auth/refresh +- FE: `AuthRepository.ts:110-112` sends `{refreshToken}`. 
+- BE: `AuthController.java:153-169` + `dto/RefreshTokenRequest.java:11-15` accept `{refreshToken}` (NotBlank). Match. P3: response `tokenType` again silently dropped. + +### 4. POST /biometric/verify/{userId} +- FE: `BiometricService.ts:201-211` sends multipart `image` field. **Does NOT send** `tenant_id` (signature uses `_tenantId`). +- BE: `BiometricController.java:120-128` accepts `image`, `tenant_id` (snake_case), `client_embedding`, `client_embeddings`. +- BE forwarder: `BiometricServiceAdapter.java:127-139` sets `user_id`, `tenant_id` parts and POSTs to `/verify` on bio-processor. +- Bio: `biometric-processor/app/api/routes/verification.py:32-43` requires `user_id`+`file`, optional `tenant_id`. +- **P1 silent-drop**: FE never sends `tenant_id` so multi-tenant scoping is lost at the proxy boundary. Today the backend derives tenant from JWT principal, so this is masked, but the comment on `BiometricService.ts:103-106` explicitly says "currently dropped at the proxy boundary" — that comment is outdated. The adapter at line 290 actually does forward `tenant_id` if present. +- **P1 response gap**: FE expects `{verified, confidence, distance, threshold, message}` (`BiometricService.ts:14-20`). BE returns `BiometricVerificationResponse` with only `{verified, confidence, message}` — `distance`/`threshold` default to sentinel `1` / `0.4` (line 218-219). Decisions cannot be re-tuned client-side. + +### 5. POST /biometric/enroll/{userId} +- FE: `BiometricService.ts:115-124`, multipart only `image`. Multi: `enrollFaceMulti` sends `files` only (line 146). +- BE: `BiometricController.java:80-95` accepts `image`, `tenant_id`, `client_embedding`, `client_embeddings`. Multi: `:104-110` accepts `files`, same optional 3. +- **P1**: `_tenantId`+`_clientEmbeddings` parameters on FE are accepted but not forwarded — comment line 96-99 admits this. D2 telemetry captures nothing client-side because the proxy sees no payload. + +### 6. 
POST /biometric/search +- FE: `BiometricService.ts:229-240` multipart only `file`. Underscore params `_tenantId`, `_maxResults`. +- BE: `BiometricController.java:249-256` accepts `file`, `tenant_id`, `client_embedding`, `client_embeddings`. +- **P1**: confirmed memory note. `_tenantId` and `_maxResults` silently dropped. Cross-tenant hits possible if RBAC misconfigured (mitigated by Sec-P0 #54 cross-tenant guard, but defense-in-depth `tenant_id` form field never reaches bio-processor). +- Response shape (`BiometricService.ts:242-253`) reads three different keys: `data.matches`, `data.results`, `data.best_match`. The fact the FE tries all three is itself evidence the contract is undefined — bio-processor `search.py` returns one canonical shape; the proxy may not pass through unchanged. + +### 7. OAuth2 authorize / complete +- FE GET: `verify-app/HostedLoginApp.tsx:323` — query params `client_id`, `redirect_uri`, `response_type`, `scope`, optional `state`/`nonce`/`code_challenge`/`code_challenge_method`. +- BE GET: `controller/OAuth2Controller.java:77-99` matches. +- FE POST: `HostedLoginApp.tsx:291-303` sends `{mfaSessionToken, clientId, redirectUri, scope, state, nonce, codeChallenge, codeChallengeMethod}` with **null** for missing optional fields. +- BE POST: `OAuth2Controller.java:420-451` (`HostedAuthorizeCompleteRequest`) — required `mfaSessionToken`, `clientId`, `redirectUri`. Optional `state`/`nonce`/`codeChallenge`/`codeChallengeMethod` use `@Size` only — null is accepted (Bean Validation `@Size` does not trigger on null). Match. +- **P1 redirectUri regex pitfall**: `OAuth2Controller.java:430-433` — `^https?://[\w.-]+(:\d+)?(/[\w./?%&=#:+~,@!$'()*;\[\]-]*)?$`. Custom-scheme tenants (e.g. `com.acme://auth`) **WILL be rejected by Bean Validation** with HTTP 400 `invalid_request` even though `web-app/CLAUDE.md` advertises "custom schemes (com.acme://auth) and loopback per RFC 8252" support. This is a documented-vs-enforced mismatch. 
+- Error envelope: `OAuth2Controller.java:463-478` — uses RFC-6749 `{error, error_description, state?}` for this controller, but the rest of the API uses `dto/ErrorResponse.java:16-43` `{timestamp, status, error, message, path, errors[]}`. **P1 inconsistent error shape across endpoints** — `formatApiError` must branch. + +### 8. POST /api/v1/users (admin create) and POST /auth/register (open) +- FE admin: `core/repositories/UserRepository.ts:160` sends `CreateUserData` body verbatim. Type `domain/interfaces/IUserRepository.CreateUserData` matches `CreateUserRequest` fields. +- BE admin: `controller/UserController.java:131-149` + `dto/CreateUserRequest.java:16-56`. Required `firstName`, `lastName`, `email`, `password`. `phoneNumber` optional but if present must match strict E.164 (`^\+[1-9]\d{9,14}$`). FE phone input does not enforce — backend will 400 with `phone.e164` code (`formatApiError.ts` does map this — confirmed in `feedback_no_hardcode.md` follow-up). OK. +- FE open register: `features/auth/components/RegisterPage.tsx:199-204` sends `{firstName, lastName, email, password}`. +- BE open register: `AuthController.java:110-130` + `dto/RegisterRequest.java:9-27` — exact match. OK. + +### 9. POST /auth/sessions/{id}/steps/{order} — auth-session step +- FE: `core/repositories/AuthSessionRepository.ts:178-181` sends `{data}` to `/auth/sessions/${sessionId}/steps/${stepOrder}`. +- BE: `controller/AuthSessionController.java:151-157` matches path; `CompleteAuthStepCommand.java:7-9` accepts `{data: Map}`. **Match — for the step.** +- **P0 mismatch on session START**: `AuthSessionRepository.ts:128-139` sends `{tenantId, userId, operationType}` (defined in `StartSessionCommand` lines 9-15). BE `StartAuthSessionCommand.java:7-15` expects `{tenantSlug, operationType, platform, deviceFingerprint, email, ipAddress, userAgent}`. **No field overlap except `operationType`**; backend would reject as `tenantSlug` `@NotBlank`. 
The FE-`AuthSessionService.startSession` call is currently used by no production caller (verified by grep of `web-app/src` excluding tests — sole import is from the service file itself). So this dead surface is a **P0 latent bug** if anyone wires it up. Recommend either deleting the FE method or fixing the field names. + +### 10. POST /biometric/voice/verify/{userId} +- FE: `features/auth/components/VoiceEnrollmentFlow.tsx:281-282` and `WidgetAuthPage.tsx:388` send `{voiceData}` JSON. +- BE: `controller/BiometricController.java:183-209` reads `request.get("voiceData")`, then forwards to bio-processor as `voice_data` (`BiometricServiceAdapter.java:178-179`). Bio: `voice.py:35-37` (`VoiceRequest.voice_data`). Match through translation. OK. + +## P0/P1 findings (consolidated) + +**P0** — `AuthSessionRepository.startSession` is broken end-to-end (different field names, missing `tenantSlug` `@NotBlank`). Today it has no live callers, so no live damage. Risk: silent failure when someone wires it up. Action: delete or repair. + +**P1.1** — `BiometricService` underscore-ignored params (`_tenantId`, `_maxResults`, `_clientEmbeddings`) are accepted at the public method signature but never reach the wire. The verbal contract suggests tenant scoping; the runtime behavior depends entirely on JWT principal. D2 client-embedding telemetry never lands. Confirms memory note `project_biometric_pipeline.md`. + +**P1.2** — Error-shape inconsistency across the API: +- Standard endpoints: `ErrorResponse` `{timestamp, status, error, message, path, errors[]}` (`dto/ErrorResponse.java:16-43`). +- MFA-step: HTTP 200 + `{status, message, ...}` for "soft" failures; HTTP 400/401/409 + `{status:"ERROR", ...}` for "hard" (`VerifyMfaStepResponse.java`). +- OAuth2: RFC-6749 `{error, error_description, state?}` (`OAuth2Controller.java:463-478`). +- Bio: `{detail}` (FastAPI default) when proxy passes through; `BiometricService.ts:173-191` already special-cases this. 
+Frontend `formatApiError` cannot reliably map all four shapes listed above. Recommend adopting one envelope. + +**P1.3** — `tokenType: "Bearer"` and `completedMethods` on `AuthResponse` (login/refresh) silently dropped by FE — the latter loses MFA-flow continuity. + +**P1.4** — `BiometricVerificationResponse` lacks `distance` and `threshold`; `BiometricService.ts:218-219` substitutes hardcoded `1` and `0.4`. Any client-side decision is bogus. + +**P1.5** — OAuth2 `redirectUri` Bean Validation regex blocks custom-scheme RFC 8252 redirects despite SDK marketing. Tenants integrating from native apps will see HTTP 400 `invalid_request`. + +**P1.6** — Auth-session `StartAuthSessionCommand` expects `email` to identify user; FE sends `userId` (UUID). If this path is ever wired up, identifying the user by email cannot work, because the FE never sends one. + +## Recommendations + +1. **Delete dead surface or repair contract**: `AuthSessionRepository.startSession` and the Java `StartAuthSessionCommand` must agree. Either FE migrates to `{tenantSlug, email, operationType, platform, deviceFingerprint}` or BE adds `tenantId`/`userId` aliases. +2. **Forward `tenantId` everywhere**: drop the `_tenantId` underscore convention in `BiometricService` — either send it (preferred for defense-in-depth) or remove from public signatures. +3. **Unify error envelope**: pick one of `ErrorResponse`, RFC-6749, or `{detail}`. OAuth2 must stay RFC-compliant, so isolate that and standardize the rest. Document at `docs/04-api/error-shapes.md`. +4. **Add `distance` + `threshold` to `BiometricVerificationResponse`**: trivially proxy-able from bio-processor's `VerificationResponse`. Removes hardcoded sentinels. +5. **Fix OAuth2 `redirectUri` regex** to accept custom schemes (`^([a-z][a-z0-9+.-]*://|http://127\.0\.0\.1:\d+).*$`) or document the limitation. +6. **Promote `completedMethods` from /auth/login response** to `AuthApiResponse` so the MFA UI doesn't have to re-derive after step 1. +7.
**MFA-step: collapse 200+`status:"FAILED"` to HTTP 4xx** (or document the mixed convention loudly). At minimum, add `error` field to FE `MfaStepResponse` type so `METHOD_ALREADY_USED` discriminator is type-safe. +8. **Add explicit FE `tokenType` field or strip it from BE** — currently inert but invites future bearer-vs-mac confusion. +9. **Document the BiometricController/bio-processor field-name translation** (`voiceData`↔`voice_data`, `tenant_id` form vs JSON) in a single integration note. Today it's distributed across three files and a spec table. + +— end — diff --git a/LIVENESS_ANTISPOOF_INVESTIGATION_2026-05-09.md b/LIVENESS_ANTISPOOF_INVESTIGATION_2026-05-09.md new file mode 100644 index 0000000..7d4251a --- /dev/null +++ b/LIVENESS_ANTISPOOF_INVESTIGATION_2026-05-09.md @@ -0,0 +1,584 @@ +# Liveness & Anti-Spoofing — Comprehensive Investigation 2026-05-09 + +> Read-only audit of every liveness / anti-spoof artifact across `biometric-processor` +> branches and `practice-and-test` submodules. Cross-referenced against the prod +> binary `b670f218` (api `b670f218` + bio `a0a763b5`, 2026-05-07). + +## Executive summary + +- **There is *substantially more* anti-spoof code in side branches than in prod.** + `origin/working_spoof_detection` (the most advanced of Aysenur's lines) adds + ~9.5k LoC across 74 app files vs `main`, including Gabor/FFT moire analysis, + flash-color challenges, focal-blur/cutout anomaly detection, face-usability + gates and a 4 803-line `live_liveness_preview.py` desktop tuner. Prod ships + only the UniFace MiniFASNet ONNX backend gated by DeepFace's anti-spoof flag. +- **The strongest standalone work is the user's own `practice-and-test/spoof-detector/`** + — a session-engine architecture, 14 analyzers, calibrated 7-class fusion, 60+ + unit tests, and a near-publishable paper outline (BIOSIG/IJCB 2026 target). 
+ It is the only artifact in the inventory with measured ISO 30107-3 numbers + (`BPCER 0.00% / APCER 30% / ACER 15%`, Grade C, 4 scenarios). +- **Aysenur's branches are not mergeable as-is.** Every branch reverts the + Dependabot security pins in `requirements.txt`, deletes shipped tests + (`test_embedding_cipher`, `test_request_size_limit`, gesture liveness), + drops `embedding_cipher.py`, drops migrations work, and bundles a 6.5 MB + `yolov8n.pt` binary. The signal is real but the deliverable is unreviewably + large and security-regressing. +- **Prod gap is narrow but real.** UniFace MiniFASNet alone covers print and + some screen replay; it has no replay-burst aggregation, no flash challenge, + no rPPG, no moire/Gabor screen check, no device-bezel detection, and no + session-level verdict. `ANTI_SPOOFING_ENABLED=true` only toggles + *DeepFace's* per-frame veto (`app/application/use_cases/check_liveness.py:155-186`), + not Aysenur's pipeline. +- **Recommendation.** Ship the user's `spoof-detector` as the basis: extract + it as a sidecar microservice (or library), integrate the MiniFASNet+device-boundary + layer into `biometric-processor` first, write the paper around the + session-engine + calibrated fusion novelty. Treat Aysenur's `working_spoof_detection` + as a *donor branch* — cherry-pick `screen_replay_anti_spoof.py`, + `moire_pattern_analysis.py`, `flash_spoof_analyzer.py`, + `device_spoof_risk_evaluator.py`, `cutout_anomaly_detector.py`, + `face_usability_gate.py`, `critical_region_visibility_gate.py`, and + `light_challenge_service.py` into focused PRs against `main` (each with + its own tests and zero `requirements.txt` regressions). + +--- + +## Inventory + +### Aysenur's branches (biometric-processor) + +| Branch | Tip | Unique commits vs main | Authors | Key techniques | Wired into prod? | Tests? 
| +|---|---|---:|---|---|---|---| +| `origin/liveness_capture` | `504067e` Color Shaded Screen | 6 | Ayşe Gülsüm EREN | enhanced+UniFace baseline, face bbox, "color-shaded screen" | No | New tests added but several existing tests deleted | +| `origin/liveness_capture2` | `504067e` | 6 | Ayşe Gülsüm EREN | identical commit set to `liveness_capture` (no divergence found in tip log) | No | same | +| `origin/working_spoof_detection` | `cbdbe0b` Spoof Detection | 27 | Ayşe Gülsüm EREN + Aysenur15 | superset of `liveness_capture` + Gabor/FFT moire + flash challenge + cutout anomaly + face-usability gate + critical-region visibility gate + reaction baseline + sklearn `train_spoof_classifier.py` + `test_data_collector.py` + 4 803-line tuner | No | adds `test_hybrid_fusion_evaluator.py`, `test_critical_region_visibility_gate.py`, `test_face_quality_illumination_gate.py`, `test_face_usability_gate.py`, `test_live_liveness_preview.py` (2 314 lines); deletes ~10 prod tests | +| `origin/Spoof-Detection` | `0685f05` No Face Update | 9 | Aysenur15 | subset of `working_spoof_detection` (no-face handling fixes, color-shaded screen, face bbox, liveness score) | No | same | +| `origin/fix/liveness-cascade-frr-reduction` | `b730a6d` | 27 | Ahmet + Aysenur | hijab/head-turn FRR fix (nose-alone occlusion no longer critical), built on `working_spoof_detection` | No | shares branch tests | +| `origin/fix/liveness-p0-frr-reduction` | `00bf4d7` | 28 | Ahmet | EMA freeze on skipped frames + decision-guard re-enable + multiple revert/iter cycles (P1, P2, cascade-guard) | No | shares branch tests | +| `origin/fix/liveness-p3-frr-reduction` | `1229f48` | 24 | Aysenur15 | "P3 phase" + previous P0 work | No | shares branch tests | +| `feat/anti-spoof-pipeline` (local-only) | `9ca51a2` | 6 | Aysenur15 + Ahmet | cleaned-up squash of the anti-spoof integration: moire + device-spoof + reaction + baseline + flash-spoof + cutout + screen-replay veto + hybrid backend + strict-profile config | No (local 
branch, never pushed to PR) | yes (subset) | + +#### Per-branch notes + +**`origin/liveness_capture` / `liveness_capture2`** — Identical history. Promotes +`enhanced` to default backend (`set enhanced as default liveness baseline`) +and adjusts `EnhancedLivenessDetector` confidence/passive scoring. Adds +`color-shaded-screen` heuristic (likely a macro flicker check on display +characteristic colors). Limited scope. Memory's claim that this branch already +has rPPG + screen-replay + MRZ does **not** match the actual file diff — +those features land in `working_spoof_detection` and never include MRZ in +`biometric-processor` at all (MRZ work is in `practice-and-test/`). + +**`origin/working_spoof_detection`** — The flagship branch. Net change: +`+45 249 / −23 327` across **624 files**. Substantively adds: + +- `app/infrastructure/ml/liveness/critical_region_visibility_gate.py` (901 lines): + per-region (left/right eye, nose, mouth, lower-face) pixel-based occlusion gate + with hijab-aware token exclusions (`mouth_roi_color_invalid`, + `mouth_chrominance_anomaly` excluded — see comments at lines 41-50 of that + file). Returns `CriticalRegionVisibilityResult` with blocking_regions, + suspicious_regions, occlusion_score. Threshold-based, no learned model. +- `app/infrastructure/ml/liveness/face_quality_illumination_gate.py` (241 lines): + brightness uniformity + shadow-asymmetry + over/underexposed-region detection. +- `app/infrastructure/ml/liveness/face_usability_gate.py` (341 lines): + composes the above two gates with frame-confirmation streaks + (LOW_QUALITY_CONFIRM_FRAMES=2, OCCLUSION_CONFIRM_FRAMES=2, + NO_FACE_CONFIRM_FRAMES=6). Outputs a `FaceUsabilityResult` that can `block` + liveness scoring entirely when the face is unusable. +- `app/infrastructure/ml/liveness/moire_pattern_analysis.py`: Gabor bank + (4 orientations, ksize 21, sigma 5, lambda 10) + FFT periodicity, returns + `moire_risk` ∈ [0,1] and `moire_score` ∈ [0,100] (also lives in main today). 
+- `app/infrastructure/ml/liveness/screen_replay_anti_spoof.py`: 5-signal cheap + layered fusion (FFT, Gabor, Laplacian, skin-coverage, specular). Hard veto + triggers when ≥ 2 sub-signals fall below 30. Already shipped in main and + consumed by `EnhancedLivenessDetector` (`app/infrastructure/ml/liveness/enhanced_liveness_detector.py:31,152,192-201`). +- `app/application/services/flash_spoof_analyzer.py`: per-region (forehead, + cheeks, nose) BGR delta analysis between baseline and flash frames; classifies + diffuse 3D skin response vs planar replay-media response. +- `app/application/services/device_spoof_risk_evaluator.py`: aggregates + moire_risk, reflection_risk, flicker_risk, flash_response_score, + hole_cutout_risk, focal_blur_anomaly_risk, screen_frame_risk into a + `device_replay_risk` with hard-coded weights (moire 0.28, reflection 0.20, + flicker 0.14, flash 0.28, screen-frame 0.10). Already in main but not wired + to `/verify`. +- `app/application/services/cutout_anomaly_detector.py`: per-region + (eyes, mouth) hole-detection + boundary-edge + sharpness-ratio + focus-jump + heuristics. Already in main. +- `app/application/services/light_challenge_service.py`: random color flash + challenge (red/green/blue/white/yellow), verifies BGR shift in expected + channel within `[minimum_delay_ms=50, expected_response_window_ms=500]` ms. + Already in main but no public route exposes it. +- `app/application/services/hybrid_fusion_evaluator.py` (190 lines): fuses + pretrained MiniFASNet score with flash/moire/device signals. Weights: + pretrained 0.30, flash 0.30, moire 0.20, device 0.20. Hard-veto if + flicker > 0.85 or (flicker > 0.75 AND device_replay > 0.55). **Not in main.** +- `app/tools/live_liveness_preview.py` (4 803 lines): standalone OpenCV + desktop tuner with frame metrics, temporal aggregator, baseline calibrator, + background reaction evaluator. Effectively a research workbench; replaces + the deleted `live_liveness_preview.py` paths. 
+- `app/tools/test_data_collector.py` (789 lines): interactive CV2 capture + tool that saves `(frame, metrics_json, label)` triples into + `data/test_frames/`. Used to generate ground truth for the sklearn + classifier below. +- `app/tools/train_spoof_classifier.py` (268 lines): sklearn + `GradientBoostingClassifier` / `RandomForest` / `LogisticRegression` / + `SVC` 5-fold CV trainer that pickles `models/spoof_classifier.pkl`. + + Authorship: 6 commits by **Ayşe Gülsüm EREN**, 21 by **Aysenur15** + (aysenurarici@hotmail.com). Latest commit 2026-05-06 20:57. The branch + reverts `requirements.txt` from `tensorflow-cpu==2.21.0`+pinned-transitives + back to `tensorflow-cpu==2.15.0` and removes Dependabot security pins — + this alone makes it un-mergeable without significant cleanup. + +**`origin/Spoof-Detection`** — A subset of `working_spoof_detection`. Same +authorship pattern (Aysenur15) but stops earlier in the iteration. Effectively +superseded. + +**`origin/fix/liveness-cascade-frr-reduction` / `p0-frr-reduction` / `p3-frr-reduction`** — +Three sibling FRR-tuning branches built on top of `working_spoof_detection`'s +post-Spoof-Detection commits. P0 branch is the most disciplined: each fix +(`fix(liveness): P0 FRR reduction — freeze EMA on skipped frames + re-enable decision guards`) +is followed by a revert if the user pushed back, suggesting Ahmet was +shepherding the FRR knob. Cascade branch adds the hijab fix +(`fix(liveness): nose-alone physical block is not critical occlusion (head-turn FRR)`). +None of these can land without first landing the parent `working_spoof_detection`. + +**`feat/anti-spoof-pipeline` (local)** — 6 commits, presents as a clean +"squashed-history" view of the same work but **never pushed**. Net diff +vs main: `+5 248 / −7 102` across 80 files — much more contained than the +huge upstream branches because it's restricted to just the anti-spoof +modules and drops the test-deletion noise.
**This is the most +review-friendly version of Aysenur's contribution and is the right +starting point if we want to upstream her work.** + +### `practice-and-test/` work (the user's R&D) + +| Directory | Owner | Models / techniques | Standalone? | Quality | +|---|---|---|---|---| +| `spoof-detector/` | Ahmet (user) | MiniFASNet ONNX, MediaPipe FaceLandmarker (478pt), IoU tracking, 14 analyzers in 3 layers, calibrated 7-class fusion, session engine with peak-sensitive verdict, blink (EAR), rPPG, screen-flicker, micro-tremor, landmark-variance, background-grid, AR-filter (heuristic), texture, moire, device-boundary, screen-replay, temporal | Yes (own `requirements.txt`, own `main.py`, own `tests/`) | High — best-organized artifact in the inventory; has paper outline, ROADMAP, ISO 30107-3 metrics, structured logging | +| `biometric-demo-optimized/` | Ahmet | MediaPipe Tasks + 468pt landmarks + Facenet512 + Hexagonal Architecture demo; threaded camera; vectorized cosine search; YOLO card detector | Yes | Moderate — used as the visual reference for the web `FacePuzzle` overlay (per `CLAUDE.md`) | +| `DeepFace_InsightFace_Pipeline/` | Ahmet | Side-by-side DeepFace vs InsightFace comparison scripts, FaceNet baseline, plain ID-pipeline | Yes | Low — pure exploration scripts, no paper-grade artifacts | +| `GestureAnalysis/` | Ahmet | MediaPipe HandLandmarker (`hand_landmarker.task` shipped), `anti_spoof.py`, motion analyzer, `liveness_session.py`, math/shape/sequential challenge sessions, finger-touch detector | Yes | Moderate — the gesture/active-liveness counterpart; some overlap with `biometric-processor`'s active liveness flow | +| `archive/`, `optimization-experiments/` | Ahmet | mostly stale, archived | Yes | n/a | + +#### Per-directory notes + +**`practice-and-test/spoof-detector/`** — The only artifact in the entire +audit with explicit ISO 30107-3 measurements +(`README.md:134`: `BPCER 0.00% | APCER 30% | ACER 15% | Grade C`). 
+3 274 LoC of actual source code, 60 unit tests, structured taxonomy +(`src/domain/taxonomy.py`), incident detection, peak-sensitive verdicts. +The novelty claim is sound: per-frame FAS is well-trodden; *session-based* +FAS with multi-timescale signal accumulation (per-frame → 1-5s → 5-30s → +30s-3hr) is genuinely under-explored. Paper outline targets BIOSIG 2026 / +IJCB 2026 — both legitimate venues. Phase 4-7 (AR-filter dataset +collection via amispoof.com) is unfinished. `data/captures/` and +`data/annotations/` are present but **empty** at HEAD (`ls` returns 0 +entries) — the calibration numbers in the README came from sessions that +weren't committed. This is the largest publishability gap. + +**`practice-and-test/biometric-demo-optimized/`** — Production-style hex-arch +demo with `presentation/ui/drawing.py` that the web-app `FacePuzzle` +component mirrors (`CLAUDE.md` references it explicitly). Useful as a +reference, not as a source of anti-spoof signals. + +**`practice-and-test/GestureAnalysis/`** — Active-liveness companion. Has its +own `anti_spoof.py` and `liveness_session.py`. Not yet examined in depth, +but worth keeping on the radar as the integration layer if the project +adds back challenge-response active liveness (which `biometric-processor`'s +`light_challenge_service.py` would also benefit from). + +### Current production state (`main` branch) + +What's *actually* compiled into the running prod image (`a0a763b5`, +2026-05-07 07:28 UTC): + +- **Backend selection** (`.env.prod`): `LIVENESS_BACKEND=hybrid`, + `LIVENESS_UNIFACE_DEFAULT_ENABLED=True`, `ANTI_SPOOFING_ENABLED=true`, + `ANTI_SPOOFING_THRESHOLD=0.5`. (Note: `CLAUDE.md` claim of + `LIVENESS_MODE=passive` + `LIVENESS_BACKEND=uniface` is now stale.) 
+- **Resolved detector**: `HybridLivenessDetector` + (`app/infrastructure/ml/liveness/hybrid_liveness_detector.py`) = + `EnhancedLivenessDetector` first → UniFace as second opinion → fall + back to enhanced verdict if UniFace returns indeterminate. +- **EnhancedLivenessDetector** (`enhanced_liveness_detector.py:31-201`) + *does* invoke `ScreenReplayAntiSpoof` (5-signal Gabor/FFT/Laplacian/skin/specular) + on every frame and keeps a 3-frame veto streak. So screen-replay + hard-veto **is shipping in prod today** — this contradicts the audit + memory saying screen-replay landed only in Aysenur's branches. +- **Anti-spoof gate**: DeepFace's built-in anti-spoof model runs in + `extract_face_with_detection` (called from `check_liveness.py:153`), + vetoes via `LIVENESS_VERDICT_POLICY=conservative` + (`app/application/use_cases/check_liveness.py:175-186`). +- **rPPG**: `RPPGAnalyzer` is *in main* and is wired *only* into + `LiveCameraAnalysisUseCase` (`app/application/use_cases/live_camera_analysis.py:32,134-147,515`) + which is exposed via `app/api/routes/live_analysis.py:50,119`. rPPG + is **not** invoked from `/verify` or `/enroll` paths. +- **Device spoof / cutout / flash / moire / hybrid-fusion**: present in + main as standalone modules but **not invoked** by `check_liveness` or + `live_camera_analysis`. They're dead code outside `live_liveness_preview.py` + (developer tuner). This is the largest "shipped but unused" surface. +- **Active liveness / gesture liveness / biometric-puzzle**: still in + main, used by `/verify_puzzle` and biometric-puzzle web flow. + +#### Gap list (prod vs. Aysenur's branches vs. spoof-detector) + +| Capability | In prod `main`? | In Aysenur's branches? | In `spoof-detector/`? 
| +|---|:-:|:-:|:-:| +| MiniFASNet ONNX | yes (UniFace) | yes | yes | +| Texture (LBP) | yes | yes | yes | +| Screen-replay (Gabor+FFT+skin+specular) | **yes (active veto)** | yes | yes | +| Moire (Gabor bank) | yes (module exists) | yes (wired into device-spoof) | yes | +| Device boundary (phone bezel) | no | yes | yes | +| Flash/color challenge | no (module unused) | yes (light_challenge_service wired) | no | +| Cutout anomaly | no (module unused) | yes (wired) | no | +| Face usability gate (occlusion+illumination) | no | yes | no | +| Critical region visibility | no | yes (901 LoC, hijab-aware) | no | +| Hybrid fusion evaluator (weighted multi-signal) | no | yes | yes (multi-class) | +| rPPG | yes (only in `/live_analysis`) | no (deferred) | yes (disabled — false pulse on screens) | +| Blink (EAR) | yes (in EnhancedLivenessDetector) | yes | yes | +| Smile / mouth movement | yes | yes | no | +| Micro-tremor (8-12Hz oscillation) | no | no | yes | +| Landmark variance (478pt) | no | no | yes | +| Screen flicker (50/60Hz aliasing) | no | yes (flicker_risk in device-spoof) | yes | +| Background-grid stability | no | no | yes | +| AR filter detector | no | no | yes (heuristic; ONNX planned) | +| Session engine (peak-sensitive verdict) | no | no | yes | +| Liveness prover (active challenges) | active-liveness/puzzle exists | flash challenge added | yes (blinks/motion/rotation/expression) | +| Calibrated fusion weights from ground truth | no | no | yes | +| ISO 30107-3 metrics measured | no | no | yes (Grade C) | +| 7-category spoof taxonomy | no | no | yes | +| Sklearn meta-classifier | no | yes (Aysenur, GradientBoosting/RF/LR/SVC) | no | + +--- + +## Technical deep-dive + +### 1. UniFace MiniFASNet (prod baseline) + +**Algorithm.** Binary real/spoof CNN distilled to ONNX via UniFace 3.0+. +Single forward pass on a face crop; no temporal accumulation. 
Per +`spoof-detector/README.md:142-147`, the user measured a **+94.7 +discrimination gap** on real vs spoof in their own ground-truth set — +this is the strongest single signal available. + +**Files.** `app/infrastructure/ml/liveness/uniface_liveness_detector.py`, +gated by `LIVENESS_UNIFACE_DEFAULT_ENABLED`, with cache pinned to +`/app/uniface-cache` (named volume, uid 100, see +`/opt/projects/fivucsas/CLAUDE.md`). + +**Limitation.** Per-frame only. No replay-burst aggregation. Confused +by high-quality screen replays where MiniFASNet alone votes "live" because +the screen renders convincing skin texture (the user's `data/test_protocol` +phase notes: video-replay session reads as LIVE 60% — see +`spoof-detector/ROADMAP.md`). + +### 2. Screen-replay anti-spoof (shipping in prod, not headlined) + +**Algorithm.** Five cheap signals fused into a [0, 100] live-likeness +score with hard-veto policy: +- FFT periodicity (radial energy ratio in a band centered on + `fft_ratio_center=0.85`, width `0.20`) +- Gabor bank `analyze_moire_pattern()` (4 orientations, sigma 5, lambda 10) +- Laplacian variance (blur-vs-sharp screen artifacts) +- Skin coverage (HSV mask, expected `[0.20, 0.95]`) +- Specular coverage (luminance percentile, warn=0.020, fail=0.060) + +Hard veto trips when `low_signal_count >= 2` and signals < 30, capped at +`veto_score_cap=35`. Veto streak of 3 needed inside +`EnhancedLivenessDetector` before it propagates to the verdict +(`enhanced_liveness_detector.py:153,154,198`). + +**Files.** `app/infrastructure/ml/liveness/screen_replay_anti_spoof.py:78-128`, +`app/infrastructure/ml/liveness/moire_pattern_analysis.py:1-130`. + +**Limitation.** Tuned by hand. No per-attack-class weighting. False +positives on macro-prints; false negatives on high-DPI screens at +distance. The `_blur_floor=25.0` short-circuit +(`screen_replay_anti_spoof.py:90-105`) silently abstains on out-of-focus +frames, which an attacker can exploit by deliberately defocusing. + +### 3. 
Aysenur's hybrid-fusion / device-spoof pipeline (working_spoof_detection only) + +**Algorithm.** Linear weighted sum +`w_pre · pretrained + w_flash · flash + w_moire · moire + w_dev · device` +with weights `(0.30, 0.30, 0.20, 0.20)` and threshold 0.45 +(`app/application/services/hybrid_fusion_evaluator.py:14-32,52-95`). +Hard veto if `flicker > 0.85` or (`flicker > 0.75` AND +`device_replay > 0.55`) — bypasses the linear sum entirely. + +**Limitation.** Weights are hand-coded, not learned from data. The +sklearn `train_spoof_classifier.py` exists but its data dir is empty at +HEAD (`data/training_data.csv` referenced but never committed). Treat +the weights as priors, not calibration. + +### 4. User's session engine (spoof-detector) + +**Algorithm.** Per-frame +`pipeline.process(frame)` → `engine.ingest(analysis, frame)` accumulates +into `SessionState`. Multi-timescale signals collected at frame, 1-5s, +5-30s, 30s-3hr horizons. Verdict = +`0.5 * average_p_real + 0.5 * worst_window_p_real` ("peak-sensitive" — +single sustained spoof burst permanently degrades the session even if +mostly real). Liveness prover (separate from category fusion) accumulates +gold proofs: blinks 25, motion 20, rotation 15, expression 15 → max 75 +(`spoof-detector/src/application/session_engine.py:1-60`, +`spoof-detector/src/application/liveness_prover.py:1-338`). + +**Limitation.** Single-author, untested at scale. Calibration data +(`data/captures/`, `data/annotations/`) not in git. Phase 3.7 ("connect +fusion ↔ liveness prover, fix video replay") and Phase 4-7 (AR-filter +dataset, MobileNetV3 training, paper) are unfinished +(`spoof-detector/README.md:160-173`). + +### 5. rPPG (prod, but only in `/live_analysis`) + +**Algorithm.** Sliding 5-second window of mean green-channel intensity, +detrend, Butterworth bandpass [0.83, 2.5] Hz (50-150 BPM), FFT, dominant +frequency → BPM. Score = `min(signal_strength * 2, 1)` if +`signal_strength > 0.3` else 0.2 (`rppg_analyzer.py:43-95`). 
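The rPPG steps described above can be illustrated with a small, dependency-free sketch. It is not the code in `rppg_analyzer.py`: the function names are hypothetical, detrending is reduced to mean removal, and the Butterworth bandpass is replaced by a direct DFT scan restricted to the 0.83-2.5 Hz band. The definition of `signal_strength` (peak power divided by total in-band power) is also an assumption, since the report does not define it; only the score formula is taken from the text.

```python
import math


def estimate_bpm(green_means, fps, f_lo=0.83, f_hi=2.5, step_hz=0.01):
    """Return (bpm, signal_strength) for a window of mean green-channel values."""
    n = len(green_means)
    mean = sum(green_means) / n
    # Detrend: this sketch only removes the window mean (an assumption).
    centered = [v - mean for v in green_means]

    best_freq, best_power, band_power = f_lo, 0.0, 0.0
    steps = int(round((f_hi - f_lo) / step_hz))
    for k in range(steps + 1):
        freq = f_lo + k * step_hz
        w = 2.0 * math.pi * freq / fps
        # DFT evaluated only at in-band frequencies (stands in for bandpass + FFT).
        re = sum(x * math.cos(w * i) for i, x in enumerate(centered))
        im = sum(x * math.sin(w * i) for i, x in enumerate(centered))
        power = re * re + im * im
        band_power += power
        if power > best_power:
            best_freq, best_power = freq, power

    # Assumed definition: share of in-band power held by the dominant frequency.
    strength = best_power / band_power if band_power > 0 else 0.0
    return best_freq * 60.0, strength


def rppg_score(signal_strength):
    """Score formula as quoted in the report."""
    return min(signal_strength * 2.0, 1.0) if signal_strength > 0.3 else 0.2


# Synthetic 5-second window at 30 fps carrying a 1.2 Hz (72 BPM) pulse.
fps = 30.0
window = [0.5 + 0.01 * math.sin(2.0 * math.pi * 1.2 * i / fps) for i in range(150)]
bpm, strength = estimate_bpm(window, fps)
print(round(bpm), rppg_score(strength))
```

Nothing in this procedure checks where the periodicity comes from, so any sufficiently periodic brightness change inside the band is reported as a pulse.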
+ +**Limitation.** Confirmed false-pulse on screens +(`spoof-detector/README.md:144` — "rPPG: anti-correlated, detects screen +flicker as false pulse, disabled"). Currently weighted at 0.15 in +`live_camera_analysis.py:37`. **This is a known issue and rPPG should +be either removed from `/live_analysis` or only weighted when fused with +device-replay-low-risk gating.** + +### 6. Active light challenge (prod modules, no route) + +**Algorithm.** `LightChallengeService.generate_challenge()` returns a +random color from {red, green, blue, white, yellow}, expects screen flash +within `[50, 500] ms`, verifies BGR mean shift in expected channel +exceeds `min_color_shift=0.05` +(`light_challenge_service.py:1-90`). Aysenur's `flash_spoof_analyzer.py` +adds spatial verification: per-region (forehead, cheeks, nose) diffuse-vs-specular +response classifies skin (3D, region-correlated) vs replay media (planar). + +**Limitation.** Browser flash requires viewport overlay control. The +`web-app` widget needs a corresponding step component, which doesn't +exist in main. This is plumbing-blocked, not algorithmically blocked. 
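
As a rough illustration of the challenge/verify contract described in this section, the sketch below issues a colour and checks the response. It is not the code in `light_challenge_service.py`. The colour-to-channel mapping, the normalisation of channel means to [0, 1], and the reading of `[50, 500] ms` as an allowed response delay are all assumptions. The function names are hypothetical.

```python
import random

COLORS = ("red", "green", "blue", "white", "yellow")

# Assumed mapping: which BGR channel indices should brighten per flash colour.
EXPECTED_CHANNELS = {
    "blue": (0,),
    "green": (1,),
    "red": (2,),
    "yellow": (1, 2),
    "white": (0, 1, 2),
}

MIN_COLOR_SHIFT = 0.05          # from the text above
RESPONSE_WINDOW_MS = (50, 500)  # from the text above; interpretation assumed


def generate_challenge(rng=random):
    """Pick one flash colour at random."""
    return rng.choice(COLORS)


def verify_response(color, before_bgr, after_bgr, delay_ms):
    """Check timing, then require every expected channel to brighten enough.

    before_bgr / after_bgr: mean face-region (B, G, R), normalised to [0, 1].
    """
    lo, hi = RESPONSE_WINDOW_MS
    if not lo <= delay_ms <= hi:
        return False
    return all(
        after_bgr[ch] - before_bgr[ch] > MIN_COLOR_SHIFT
        for ch in EXPECTED_CHANNELS[color]
    )
```

A check like this only covers the global colour shift. The per-region diffuse-vs-specular analysis attributed to `flash_spoof_analyzer.py` would sit on top of it.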
+ +--- + +## Convergence map + +Spoof-attack coverage matrix (✅ covered, ⚠ partial, ❌ uncovered): + +| Attack class | UniFace MiniFASNet (prod) | Screen-replay 5-signal (prod) | Aysenur hybrid (branches) | spoof-detector session engine | +|---|:-:|:-:|:-:|:-:| +| Printed photo | ✅ | ⚠ | ✅ | ✅ | +| Static digital photo on screen | ⚠ | ✅ | ✅ | ✅ | +| Video replay (screen) | ⚠ | ⚠ | ✅ (flash + flicker) | ⚠ (acknowledged FAIL — README target Phase 3.7) | +| 3D mask (silicone/latex) | ⚠ | ❌ | ⚠ (flash specular helps) | ⚠ | +| Heavy makeup | ❌ | ❌ | ❌ | ❌ (Phase 5 planned) | +| AR filter (Snapchat/IG/FaceApp) | ❌ | ❌ | ❌ | ⚠ (heuristic; Phase 5 ONNX planned) | +| Deepfake injection (virtual cam) | ❌ | ❌ | ❌ | ⚠ (active illumination Phase 5) | +| Cutout / hole-mask | ❌ | ❌ | ✅ (cutout_anomaly_detector) | ❌ | +| Hijab/headscarf occlusion (legitimate) | n/a | n/a | ✅ (face_usability_gate hijab fix) | n/a | + +The composite "coverage if we ship everything" leaves only AR filter, +deepfake injection, and heavy makeup uncovered. AR filter is the user's +chosen paper novelty (Phase 5). + +--- + +## Extraction proposal + +### Subproject A: Academic paper + +**Working title.** "Session-Based Multi-Method Face Presentation Attack +Detection with Calibrated Multi-Class Fusion" — already drafted at +`practice-and-test/spoof-detector/paper/outline.md`. + +**Likely contribution / novelty.** +1. Session-based verdict engine (vs per-frame classification — most FAS + literature is single-frame). +2. Peak-sensitive verdict computation that prevents spoof dilution in + mixed sessions (concretely: 10 % cheating = SPOOF, not LIVE). +3. Calibrated fusion weights derived from ground-truth testing showing + that texture and moire are *anti-correlated* with screen attacks + (a non-obvious empirical finding contradicting LBP-based FAS papers). +4. AR-filter detection dataset (Phase 5, via amispoof.com). 
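
To make novelty item 2 concrete, here is a small sketch of the peak-sensitive verdict formula quoted earlier (`0.5 * average_p_real + 0.5 * worst_window_p_real`). The 30-frame non-overlapping window, the 0.5 decision threshold, and the function names are assumptions for illustration. The real values live in `session_engine.py` and are not shown in this document.

```python
def windowed_means(p_real, window):
    """Mean p_real over each non-overlapping window of `window` frames."""
    means = []
    for start in range(0, len(p_real), window):
        chunk = p_real[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def session_verdict(p_real, window=30, threshold=0.5):
    """Blend the session average with the worst window (peak-sensitive)."""
    average = sum(p_real) / len(p_real)
    worst = min(windowed_means(p_real, window))
    score = 0.5 * average + 0.5 * worst
    return score, ("LIVE" if score >= threshold else "SPOOF")


# 300 frames: 90% clearly live, then a 10% spoof burst at the end.
mixed_session = [0.95] * 270 + [0.05] * 30
plain_average = sum(mixed_session) / len(mixed_session)  # about 0.86
score, label = session_verdict(mixed_session)            # about 0.455
```

Under these assumed parameters the plain average (about 0.86) would pass the session as live, while the blended score (about 0.455) falls below the threshold. That is the dilution effect the paper claims to prevent.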
+ +**Baseline comparisons available.** OULU-NPU, SiW, CASIA-SURF, +CelebA-Spoof are namechecked but **not yet run**. The current numbers +(BPCER 0.00 / APCER 30 / ACER 15) are on the user's own 4-scenario set, +not a public benchmark. This is the single biggest publishability gap. + +**What's missing for first submission.** +- Run on at least one public benchmark (OULU-NPU is the cheapest start — + 4 protocols, ~2k videos, free academic license). +- Collect ≥ 500 AR-filter samples (Phase 5). +- Ablation: session engine vs averaged per-frame; calibrated vs equal + weights; with/without peak-sensitive verdict. +- Cross-validation: ≥ 100 samples per scenario, not 4 sessions. +- Session-engine throughput numbers on CX43 CPU (the user's only + available hardware). + +**Estimated effort to first draft.** 6–10 weeks if user does it solo, +3–5 weeks with a co-author handling OULU-NPU evaluation. The text +scaffolding is already in `paper/outline.md`. The blocker is **data** +(public benchmark + AR-filter set). + +### Subproject B: Professional working module + +**Architecture.** Two-tier extraction: + +1. **Library tier**: lift `practice-and-test/spoof-detector/src/` into a + pip-installable package `fivucsas-antispoof` (single namespace + `fas/`) with the public API: + ```python + from fas import SpoofDetectionPipeline, SessionEngine + pipeline = SpoofDetectionPipeline.from_config("config.yaml") + engine = SessionEngine() + engine.start() + while frame := camera.read(): + analysis = pipeline.process(frame) + engine.ingest(analysis, frame) + verdict = engine.conclude() # SessionVerdict + ``` + Already structured this way — minimal refactor needed. Tests already + pass (60 unit tests). + +2. 
**Sidecar microservice tier**: wrap as `antispoof-processor` FastAPI + service exposing: + - `POST /sessions` → `{session_id, expires_at}` + - `POST /sessions/{id}/frames` (multipart frame + face_bbox) → + `{frame_index, p_real, classification, incidents[]}` + - `POST /sessions/{id}/conclude` → `SessionVerdict` JSON + - `GET /sessions/{id}/verdict` (poll) + - X-API-Key auth (mirror `biometric-processor` pattern) + +**Plug-back into FIVUCSAS prod.** Two integration points: +- `biometric-processor` `/verify` and `/enroll` open a session, push the + single submitted frame, conclude, take verdict — degrades gracefully + to per-frame mode for one-shot APIs. +- `verify-app` web widget opens a session, streams frames over the 5s + enrollment window, surfaces incidents in real-time UI, blocks the + flow if SPOOF verdict. + +**Dependencies it'd need.** Same as `spoof-detector/requirements.txt`: +`opencv-python`, `mediapipe>=0.10.9`, `uniface>=3.0`, `numpy`, +`scipy`, `onnxruntime>=1.18`. All already present in `biometric-processor` +prod image — zero new system deps. + +**Estimated effort to MVP.** 2–3 weeks for library tier + 1 week for +microservice wrapper + 2 weeks for FIVUCSAS integration (api-side +client + web-side step component) = ~5–6 weeks. + +### Subproject C: Surgical donor-branch upstream + +Independent of A/B, the highest-ROI immediate work is upstreaming five +files from Aysenur's `feat/anti-spoof-pipeline` (the cleanest of her +branches): + +1. **`face_usability_gate.py` + `critical_region_visibility_gate.py` + + `face_quality_illumination_gate.py`** as one PR — adds + pre-liveness occlusion/illumination gating. Hijab-aware (already + tested by Aysenur). Reduces FRR for Marmara users. +2. **`flash_spoof_analyzer.py` + `light_challenge_service.py` route + exposure** — `/liveness/challenge` endpoint that issues a color flash + and verifies response. Web-app widget needs a corresponding step + component, but the backend is ready. +3. 
**`hybrid_fusion_evaluator.py`** — once 1 and 2 land, wire it into + `check_liveness.py:130-200` as the new fusion layer behind a + `LIVENESS_FUSION_ENABLED` feature flag. +4. **`device_spoof_risk_evaluator.py` invocation** — already in main but + dead code; light wiring into `check_liveness.py`. +5. **`cutout_anomaly_detector.py` invocation** — same: live in main, no + call site. + +Each can ship as its own PR with its own tests, against `main`, with +**zero `requirements.txt` changes** (avoiding the regression issue). + +--- + +## Risks & open questions + +- **Authorship / attribution.** `working_spoof_detection` has 6 commits + by **Ayşe Gülsüm EREN** and 21 by **Aysenur15** (hotmail email). Are + these the same person under different commit identities, or two + collaborators? If the latter, both deserve paper authorship. + `practice-and-test/spoof-detector/` is 100 % the user (Ahmet) per + `git log --pretty='%an'` — 42 commits with no other contributors to + that subtree. The paper should credit Aysenur(s) only if their + branches' techniques are integrated. +- **Memory overstated `liveness_capture` content.** The audit memory + (project_aysenur_liveness_branch.md) claims `liveness_capture` has + rPPG + screen-replay + MRZ. The actual diff shows: enhanced/passive + scoring + face bbox + color-shaded screen + liveness score. The rPPG + analyzer is in `main`, not in `liveness_capture`. Screen-replay + veto is in `main`. MRZ work is in `practice-and-test/`, not in + `biometric-processor` at all. Memory needs an update. +- **License compatibility.** UniFace is Apache-2.0. MediaPipe is + Apache-2.0. DeepFace is MIT. Resemblyzer is BSD-3. scikit-image is + BSD-3. scipy is BSD-3. sklearn is BSD-3. ONNX Runtime is MIT. + **No GPL / AGPL surface in the inventory** — paper + commercial + productization are both safe. Verify Aysenur's branch licenses + before merge if she pulled in any external snippets. 
+- **Performance budget on CX43 (no GPU).** UniFace MiniFASNet ONNX is + ~30-50 ms per frame on CPU. Aysenur's hybrid fusion is ~+30-60 ms + (Gabor + flash + cutout). Session-engine adds ~5 ms per frame (mostly + bookkeeping). At 30 fps live, the budget is 33 ms/frame — **the + full pipeline cannot run synchronously on CX43 at 30 fps**. Either + (a) downsample to 10-15 fps, (b) async pipeline (acceptable since + session engine doesn't need strict ordering), or (c) tier analyzers + by cost (MiniFASNet every frame, Gabor every 3rd, rPPG every 5th). +- **Aysenur's `requirements.txt` regression.** Every push of her branches + reverts security pins. This is the single biggest review blocker. If + upstreaming her work, **rebase commits onto `main`'s `requirements.txt` + before merging** — easy mechanically, but worth flagging on each PR. +- **Bundle bloat.** `working_spoof_detection` ships a 6.5 MB + `yolov8n.pt` binary at root. Strip before merge. +- **Empty datasets.** `practice-and-test/spoof-detector/data/captures/` + and `data/annotations/` are empty in git. The README's measured + numbers are not reproducible from HEAD. Either commit the dataset + (with KVKK/GDPR consent paperwork) or document the protocol + + publish via amispoof.com. +- **Branch hygiene.** Seven liveness-related remote branches with + significant overlap. Once a path forward is chosen, fold the + superseded branches into `archived/` namespace and delete the + duplicates (`liveness_capture` ≡ `liveness_capture2`). 
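
The performance-budget bullet above can be checked with a few lines of arithmetic. The sketch uses only the per-frame costs quoted in that bullet. The cadence table mirrors option (c), and the names are hypothetical.

```python
FRAME_BUDGET_MS = 1000 / 30  # about 33 ms per frame at 30 fps

# (low, high) per-frame CPU cost in ms, as quoted in the bullet above.
COST_MS = {
    "minifasnet": (30, 50),
    "hybrid_fusion": (30, 60),
    "session_engine": (5, 5),
}


def sync_cost_ms():
    """Total per-frame cost if every stage runs synchronously on each frame."""
    low = sum(cost[0] for cost in COST_MS.values())
    high = sum(cost[1] for cost in COST_MS.values())
    return low, high


def max_sync_fps():
    """(worst-case, best-case) frame rate the synchronous pipeline sustains."""
    low, high = sync_cost_ms()
    return 1000 / high, 1000 / low


# Option (c): run cheaper or slower-changing analyzers on a sparser cadence.
CADENCE = {"minifasnet": 1, "gabor": 3, "rppg": 5}


def analyzers_for_frame(index):
    """Names of the analyzers scheduled to run on frame `index`."""
    return [name for name, every in CADENCE.items() if index % every == 0]
```

With these figures the synchronous cost is 65 to 115 ms per frame, or roughly 8.7 to 15.4 fps. That is about two to three and a half times the 33 ms budget, and it lines up with the 10-15 fps downsampling suggested in option (a).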
+ +--- + +## Appendix: file-path quick reference + +Prod-active liveness/anti-spoof code (`biometric-processor` `main`): + +- `app/application/use_cases/check_liveness.py:130-200` — verdict policy + DeepFace veto +- `app/application/use_cases/live_camera_analysis.py:32,134-147,515` — rPPG wiring +- `app/infrastructure/ml/liveness/uniface_liveness_detector.py` — MiniFASNet ONNX +- `app/infrastructure/ml/liveness/enhanced_liveness_detector.py:31,152-201` — screen-replay veto +- `app/infrastructure/ml/liveness/hybrid_liveness_detector.py` — enhanced+UniFace fusion +- `app/infrastructure/ml/liveness/screen_replay_anti_spoof.py` — 5-signal layered detector +- `app/infrastructure/ml/liveness/moire_pattern_analysis.py` — Gabor + FFT +- `app/infrastructure/ml/liveness/rppg_analyzer.py` — pulse detection +- `app/application/services/light_challenge_service.py` — flash challenge (no route) +- `app/application/services/flash_spoof_analyzer.py` — flash response analysis (orphan) +- `app/application/services/device_spoof_risk_evaluator.py` — multi-signal fusion (orphan) +- `app/application/services/cutout_anomaly_detector.py` — cutout/focal-blur (orphan) + +Aysenur's branches (additions on top of `main`): + +- `app/infrastructure/ml/liveness/critical_region_visibility_gate.py` (901 LoC) +- `app/infrastructure/ml/liveness/face_quality_illumination_gate.py` (241 LoC) +- `app/infrastructure/ml/liveness/face_usability_gate.py` (341 LoC) +- `app/application/services/hybrid_fusion_evaluator.py` (190 LoC) +- `app/application/services/preview_biometric_puzzle.py` (218 LoC) +- `app/tools/live_liveness_preview.py` (4 803 LoC, dev tuner) +- `app/tools/test_data_collector.py` (789 LoC) +- `app/tools/train_spoof_classifier.py` (268 LoC, sklearn) +- `app/tools/export_training_data.py` (148 LoC) + +User's standalone work (`practice-and-test/spoof-detector/`): + +- `src/domain/{models,session,interfaces,taxonomy}.py` — 7-class spoof taxonomy +- `src/application/session_engine.py` (494 LoC) — 
session verdict engine +- `src/application/liveness_prover.py` (338 LoC) — guilty-until-proven prover +- `src/application/pipeline.py` (118 LoC) — per-frame orchestrator +- `src/infrastructure/analyzers/{minifasnet,device_boundary,blink,rppg,screen_replay,screen_flicker,moire,texture,temporal,landmark_variance,micro_tremor,background_grid,ar_filter}_analyzer.py` (~3 000 LoC, 14 analyzers) +- `src/infrastructure/fusion/multi_class_fuser.py` — calibrated 7-class fusion +- `tests/test_{analyzers,domain,session}.py` — 60 unit tests +- `paper/outline.md` — BIOSIG/IJCB 2026 paper draft +- `ROADMAP.md` — Phase 1–8 plan, current state v1 diff --git a/RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md b/RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md new file mode 100644 index 0000000..e464100 --- /dev/null +++ b/RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md @@ -0,0 +1,254 @@ +# Research & Design — Proctoring submodule + amispoof.com demo + +**Date:** 2026-05-02 +**Author:** Claude (Opus 4.7) on behalf of Ahmet +**Status:** Design-stage; no code written. Awaiting user direction before any repo split or DNS purchase. + +--- + +## 1. What the user is asking + +Two related but separable questions: + +1. **Repo strategy.** Should proctoring (test integrity, KYC video, fraud-monitoring, live-stream anti-spoof) live in its own repo/submodule, or stay folded into `biometric-processor`? +2. **Public demo + research surface.** Should we stand up a separate site — provisional name `amispoof.com` — that lets visitors record a short clip (or live-stream) and get back a probability breakdown: + - % static image / printed photo + - % pre-recorded video replay + - % live-stream with sub-categorisation: + - 3D realistic mask (silicone/latex) + - Hard makeup / contouring + - AR filter / live overlay app + - Genuine live capture + +The user wants this for two reasons: (a) **academic paper** material, (b) **product advancement** for FIVUCSAS as a whole. + +--- + +## 2. 
What we already have in the codebase (verified 2026-05-02) + +This is not greenfield. Verified via `git branch -a` + `git log` on `biometric-processor`: + +- `feat/anti-spoof-pipeline` — moire pattern detection, device-spoof signature, reaction-based liveness, baseline pipeline, **flash-spoof analyzer**, cutout anomaly detection, strict-profile config, screen-replay veto. Multi-stage active+passive pipeline. +- `liveness_capture` / `liveness_capture2` — color-shaded-screen challenge, face-bbox refinements, liveness scoring, passive scoring improvements, enhanced-liveness as default baseline. Per memory: rPPG (remote photoplethysmography — pulse-from-skin-colour-changes) + screen-replay detection + MRZ already delivered. +- `test_proctoring_workflow.sh` at repo root — full proctoring session flow against `/api/v1` endpoints with real fixture images (which itself is a P0 GDPR violation per `TEST_REVIEW_2026-05-01.md` F1). + +So the engineering work is partially done in feature branches that **never landed on `main`**. That is actually the central design pressure: we have ~1 person-month of proctoring-specific work sitting in branches because it doesn't fit the auth-flow product story. + +That alone is the strongest argument that proctoring is a different product, not a feature. + +--- + +## 3. Recommendation in two sentences + +**Split proctoring into its own submodule (`fivucsas-proctoring` or `proctoring-engine`) and rebase the existing `feat/anti-spoof-pipeline` + `liveness_capture` branches into its `main`.** Stand up `amispoof.com` as a **hybrid client+server demo**, sharing the proctoring submodule's WASM/ONNX-exported detector for cheap classifications and calling the server only for the ambiguous cases — that gives a viable cost profile and protects the academic-paper data pipeline. 
+ +The main tradeoff: a third backend submodule increases ops surface (one more container, one more set of releases, one more CI matrix) at a moment where we have a known operator-stuck CI runner (Task #55) and an unrebuilt prod (Task #25). The split should land *after* those operator items resolve, not before. + +--- + +## 4. Why proctoring is a separate product + +It's tempting to keep one biometric backend. Walking through it carefully: + +| Axis | FIVUCSAS auth | Proctoring | +|---|---|---| +| Caller | Tenant SDK / OAuth widget | LMS, exam platform, KYC flow, contact-centre app | +| Decision unit | Single transient verify (200–600 ms) | Long session (30 s – 4 h) | +| Data model | Per-user enrolment templates, 1:1 match | Per-session event log, behaviour timeline | +| Compliance posture | KVKK / GDPR Art. 9 (auth purpose) | KVKK / GDPR Art. 9 + Art. 22 (automated decisions about exam outcomes), often + ETSI TS 119 461 (KYC/video-ID) | +| Storage horizon | Embeddings (Fernet-encrypted as of PR #65) | Hours of raw video/audio in many jurisdictions; **bigger DPIA** | +| Failure cost | One bad login retry | Wrongly-flagged exam → academic appeal | +| SLA | p99 < 600 ms | Streaming p99 < 200 ms per frame, 99.9% session uptime | +| Test fixture | One enrolment, one verify | Long video corpus, adversarial sample bank, mask dataset | +| ML stack stability | Mature (FaceNet, MiniFASNet) | Active research (rPPG, mask, AR-filter detection) | + +Different decision unit, different compliance, different SLA, different test data, different model release cadence. That's textbook separate-product. The fact that they share `face-detection` is not an architectural argument for monorepo — it's an argument for a **shared-detection library** (more on this below). + +Counter-argument the user might raise: "But shared infra (Postgres, Traefik, observability, keys) is a hassle to duplicate." Reply: it isn't duplicated. The split is about source repo + container + release cycle. 
The proctoring container reuses the same `shared-postgres`, the same Traefik, the same Loki/Promtail/Grafana stack, the same `infra/` runbooks. We already do this for `biometric-processor` ↔ `identity-core-api`. + +--- + +## 5. Repo / submodule layout proposal + +``` +FIVUCSAS/ # parent (this repo) +├── identity-core-api/ # auth backend (existing) +├── biometric-processor/ # face/voice auth ML (existing) +├── proctoring-engine/ # NEW — extracted from feat/anti-spoof-pipeline +│ ├── app/ +│ │ ├── api/routes/proctoring_session.py +│ │ ├── api/routes/spoof_classify.py # serves amispoof.com +│ │ ├── ml/spoof/ # moire, replay, mask, AR-filter +│ │ ├── ml/liveness/ # rPPG, blink, reaction (from Aysenur) +│ │ └── persistence/ +│ ├── docker/ +│ ├── models/ # Versioned ONNX/TFLite weights +│ └── alembic/ +├── shared-detection/ # NEW — pure-library, no service +│ ├── face_detect_wasm/ # MediaPipe-WASM build, used by web-app, verify-app, amispoof.com +│ └── face_detect_python/ # cv2/MediaPipe Python wrappers, used by both backends +├── web-app/ # admin / tenant dashboard +├── verify-app/ # OAuth widget (already separate) +├── client-apps/ # mobile (Kotlin Multiplatform) +├── landing-website/ # FIVUCSAS pricing / hero +├── amispoof-website/ # NEW — separate static + serverless functions +└── infra/ +``` + +Three new submodules, one of which (`shared-detection`) is a library — that tames the duplication concern at the file level rather than the repo level. + +`proctoring-engine` is the **only** new long-running container. It exposes a stable API with three families of endpoints: + +- `/api/v1/proctoring/sessions/...` — long-running session lifecycle (LMS / KYC integrators). +- `/api/v1/spoof/classify` — single-shot multi-class spoof classification for `amispoof.com` and ad-hoc probes. 
+- `/api/v1/liveness/...` — re-export of the active+passive liveness work from `liveness_capture` for `biometric-processor` to call when it needs deep liveness (instead of duplicating rPPG inside biometric-processor). + +That last bullet is critical: **after the split, `biometric-processor` calls `proctoring-engine` for deep liveness in the `verify` path**. This inverts the current direction (where liveness work was attempted inside biometric-processor and got partially blocked), and aligns with the existing memory note that "Liveness Priority 1: active-illumination challenge — defeats deepfake injection via virtual camera, 4–6 weeks" can be the proctoring team's first ship after the extraction. + +--- + +## 6. amispoof.com — design + +### 6.1 Product framing + +amispoof.com is two things at once: + +1. **Marketing surface.** "Are you spoofing yourself?" is a memorable hook. It generates social shares, which generate traffic, which generates trial signups for FIVUCSAS proper. It is far better at this than a feature page on the corporate site — single-purpose pages convert. +2. **Research data flywheel.** Every clip submitted, **with explicit informed consent and clear donation banner**, can become labeled training data. That is the closed-loop advantage open-source benchmarks don't have: visitors believe they're "real" and try to fool us; the fooling attempts are the most valuable adversarial samples. + +A single page with one big record button and one big "What did we find?" reveal is the entire UX. + +### 6.2 Architecture: hybrid, not pure-serverless + +The user mentioned "serverless / client-side only." That's tempting (zero compute cost) but wrong for this problem. Three reasons: + +- **rPPG needs ≥1.0–1.5 s of pulsation samples.** That's fine in-browser via MediaPipe Tasks, but the strongest signal needs FFT over 1–3 s of cheek-region pixel differences. Doable in WASM but at 30 fps it's tight. 
+- **MiniFASNet (anti-spoof) is ~70 MB ONNX.** First-load cost on a 4G mobile is ~7 s — visitors leave. +- **AR-filter classification is the differentiator.** It's the academic gap. Filter detectors are not yet small enough to ship to the browser; current SOTA models are 200+ MB. + +So: **hybrid**. + +``` +Visitor browser + ├── MediaPipe Tasks (face landmarker, ~6 MB) for capture + ROI + ├── Tiny WASM "is-this-clearly-a-still-image" detector (~2 MB) ← first-pass, cheap + │ If confidence > 0.95: return verdict in browser, no upload, no PII transit. + │ Else: submit clip (1–3 s, 480p, with EXPLICIT consent) → + │ + └── proctoring-engine /api/v1/spoof/classify + ├── moire / replay detector + ├── rPPG analyzer + ├── MiniFASNet 3D-mask + ├── AR-filter detector (ours, novel) + └── return {static: 0.02, replay: 0.31, live_genuine: 0.62, live_with_filter: 0.05, ...} +``` + +Server-side budget per ambiguous clip: ~120 ms inference on CX43 + ~80 ms IO. That's well within "feel snappy" on a single-page demo. Cost target: ≤ €0.001 per submission, budgetable up to ~50k visitors/month before we'd want to throttle. + +### 6.3 Categories and what we can actually detect + +The user listed three buckets. 
Sharpening them with what's literature-supported: + +**Bucket 1 — Static image attack** (printed photo, screen still) +- Detectable by: complete absence of micro-motion, eye-blink absence, pixel-noise pattern matching screen/print, MediaPipe landmark stability over 0 motion +- Confidence after 1 s: > 95% +- Maturity: solved + +**Bucket 2 — Pre-recorded video replay** (the most common attack) +- Detectable by: moire interference (camera shooting screen), reduced colour gamut, refresh-rate aliasing, no rPPG signal even though motion exists, screen-bezel artifacts in periphery +- Confidence after 2 s: > 85% (drops on high-end OLED replays) +- Maturity: well-studied; we have moire + screen-replay code on `feat/anti-spoof-pipeline` + +**Bucket 3 — Live-stream attacks** (the hard category, biggest paper opportunity) + +Sub-categories with decreasing detection confidence: + +- **3D realistic mask (silicone/latex):** detectable via rPPG absence (no pulse beneath the mask) + thermal-edge cues if camera supports it (most don't) + landmark warping at expression changes. Confidence after 3 s: ~75%. Datasets: 3DMAD, HKBU-MARs. +- **Hard makeup / heavy contouring:** weakest detection signal — makeup generally preserves pulse and motion. Best signals are landmark-set drift relative to identity-baseline (assumes enrolled user) and unusual specularity. **Honest answer: we cannot detect strong contouring at single-encounter accuracy >60%.** This should be reported as low-confidence ("possible") rather than overclaimed. +- **Live AR filter / overlay app** (Snap, Instagram, FaceApp Live, custom OBS plugins): detectable via temporal coherence breaks at landmark borders, GPU-rendering signature artifacts, predictable filter-library fingerprints (specific points where Snap's filters all distort identically). Confidence after 2 s: ~70% with current research, **not benchmarked publicly** — this is the gap. 
+- **Deepfake live re-render** (state-of-the-art injection via virtual webcam — DeepFaceLive et al.): the hardest. Active-illumination challenge (random colour flash on screen) is the strongest defense — the deepfake pipeline can't react to a screen colour change in real-time. Confidence with active-illumination after 1 s: > 90%. This matches the existing `liveness_capture` "Color Shaded Screen" commit. + +### 6.4 The academic paper angle + +The publishable contribution is **not** "we built another anti-spoof system." It is one of: + +1. **Labeled benchmark + detector for live AR-filter spoofing.** No major public dataset isolates this category. Current proctoring research lumps it into "video replay" or assumes it's out of scope. A 5–10k clip dataset (collected via amispoof.com with consent) plus a ResNet-class detector that outperforms generic anti-spoof on this category is a strong CVPR Workshop / IJCB / BIOSIG paper. +2. **Browser-deployable lightweight liveness.** Most published liveness models are 100+ MB. A WASM-deployable model achieving competitive AUC at <5 MB has genuine product impact and a clean ablation story. +3. **Active-illumination-as-a-service.** The existing `Color Shaded Screen` commit can be the seed of a paper that quantifies how much active illumination buys you against deepfake injection (the deepfake pipeline's reaction-time limit is concrete and measurable). + +(1) is the highest-leverage of the three because it uses the data flywheel that amispoof.com creates. + +### 6.5 Naming, domain, ethics + +`amispoof.com` is fine but a bit uninviting (suggests the user is the spoof). Alternatives: + +- `amispoof.com` (current pick, available — operator should verify) +- `arelive.app` — "are we live?" 
— friendlier; positions as positive ("verifying I'm real") rather than accusatory +- `notabot.live` — already taken, skip +- `realityattest.com` — sounds enterprise-y, less viral +- `ispoof.me` — short, viral; available (operator verifies) + +The strongest ethics rule is **opt-in research donation**. Every submission is destroyed within 30 minutes unless the visitor explicitly clicks "Donate this clip to FIVUCSAS research" with a separate checkbox for "make it public for academic benchmarks." Keep the deletion job auditable. This is the difference between a paper that gets accepted and a paper that gets retracted. + +The site must publish: +- A `security.txt` (RFC 9116) — already a Phase 4 plan item. +- A separate DPIA for amispoof.com (this is *not* the FIVUCSAS DPIA; the data subjects are not customers and the legal basis is consent, not contract). +- A clear "Data we keep / Data we throw away" page above the fold. + +--- + +## 7. Build sequence (only if approved) + +This sequence assumes the user says "yes do it." It is not the act of saying yes. + +**Phase 0 — Prep (operator + 1 day me).** Resolve Task #25 (container rebuild) and Task #55 (CI runner stall) first. Don't add a third backend container while two are unrebuilt and one runner is stuck. + +**Phase 1 — Library extraction (3 days me).** Create `shared-detection` submodule. Move common face-detection wrappers out of `biometric-processor` and `web-app`. Stays as a pure library — no service. Verify both existing services still build against the extracted lib. No behaviour change. + +**Phase 2 — Proctoring extraction (5 days me).** Create `proctoring-engine` submodule. Rebase `feat/anti-spoof-pipeline` onto its new `main`. Lift the proctoring routes (`/api/v1/proctoring/sessions/...`) out of `biometric-processor` into `proctoring-engine`. Add docker-compose entry. Wire Traefik. Keep on internal-network only initially. Add Loki/Promtail labels. 
CI on the same self-hosted runner pool (one more reason to fix that stall first). + +**Phase 3 — `amispoof.com` static site (2 days me).** A single Vite-built page in a new `amispoof-website` submodule. Pure static + a single serverless function (or just a `/spoof/classify` proxy via Traefik to `proctoring-engine`). Hostinger DNS, GitHub Actions deploy, same pattern as `landing-website`. Put up a "Coming soon — research preview" banner first; flip to live demo once Phase 4 lands. + +**Phase 4 — Spoof-classify endpoint hardening (5 days me).** Lift the multi-class detector heads from `feat/anti-spoof-pipeline` into a single classification endpoint. Add explicit per-attack-type confidence outputs. Write the consent + ephemeral-storage logic. Deploy DPIA. + +**Phase 5 — Active illumination + AR-filter detector (research, 4–6 weeks me).** This is the paper work. New model training. Dataset collection via amispoof.com (with consent). Ablation study. Submit to nearest deadline (BIOSIG / IJCB). + +Total time-to-paper-draft assuming a reasonable cadence: ~3 months of solo time. Time-to-public-amispoof.com-demo: ~3 weeks. + +--- + +## 8. Risks I want the user to be aware of + +1. **Operator surface is already stretched** (CI runner stall, container rebuild gate, JWT_SECRET rotation, GDPR fixture removal). Adding a third backend before those land creates compounding ops debt. Phase 0 above is mandatory. +2. **Proctoring's strict GDPR Art. 22 posture.** Automated decisions about exam outcomes are special-category processing. We'll need a human-review fallback path baked into the proctoring API from day one. That's a constraint, not a blocker. +3. **AR-filter detection is research-stage.** We may publish a paper but find the model is not robust enough to ship in product. That's fine — the paper is value on its own. Don't tie product roadmap to the research outcome. +4. 
**amispoof.com data collection has a regulatory tail.** TR data-protection law (KVKK) requires explicit consent for biometric processing; visitors can withdraw consent and demand deletion. Build the deletion job before the form goes live, not after. +5. **Brand confusion.** `amispoof.com` sitting outside `fivucsas.com` should still link back to FIVUCSAS prominently in the footer. Don't accidentally build a brand silo that doesn't drive customer acquisition. + +--- + +## 9. Decision the user needs to make + +Three branches: + +A) **"Do it now."** Enter Phase 0 → Phase 1 → … as above. Three new submodules over ~3 weeks; paper work begins ~Phase 5. + +B) **"Do the extraction now, defer amispoof.com."** Phase 1 + 2 only. Proctoring becomes shippable as an enterprise-only feature (LMS / KYC integrators). Skip the public demo until enterprise pipeline validates the value. Lower risk, slower paper path. + +C) **"Don't split — fold proctoring into biometric-processor permanently."** Cheapest option, but I'd argue against it for the reasons in §4. The branches sitting unmerged for months are the strongest evidence the current monorepo shape is fighting the work. + +I'd choose **B** if forced to pick. It validates the architecture without taking on the consent + DPIA work for amispoof.com upfront. amispoof.com can come in Phase 4–5 once the engine is live and we have a few real LMS / KYC pilots informing what the demo should actually demonstrate. But A is also defensible if the user's primary motivation is the academic paper — the paper needs the data flywheel, and the data flywheel needs amispoof.com early. + +C should be ruled out. + +--- + +## 10. What I'm NOT doing without the user's go-ahead + +- Splitting any repo. +- Buying any domain. +- Touching `feat/anti-spoof-pipeline` or `liveness_capture` branches. +- Writing code in this direction. +- Adding entries to `docker-compose.prod.yml` for a proctoring service. + +This memo is research. 
The user picks A / B / C and tells me; then I plan the chosen path concretely. diff --git a/ROADMAP_2026-04-28.md b/ROADMAP_2026-04-28.md new file mode 100644 index 0000000..44a610b --- /dev/null +++ b/ROADMAP_2026-04-28.md @@ -0,0 +1,195 @@ +# FIVUCSAS Roadmap — 2026-04-28 (afternoon update) + +This file supersedes the morning `BIOMETRIC_ROADMAP_2026-04-28.md` and +the open-item list in `web-app/TODO.md` Phase A. It captures the current +production-verified state plus the new bugs the user found in the +2026-04-28 afternoon test pass. + +## A. Verified Done (today, 2026-04-28) + +Source + JAR + deployed JS bundle + DB row + container env all checked. + +| # | Fix | Where | +|---|---|---| +| A1 | LoginPage 401 → `auth.invalidCredentials` (was generic "unauthorized") | `web-app LoginPage.tsx:308` + `LoginPage-Bl36gQe5.js` bundle | +| A2 | `UserRepository.findByEmail/existsByEmail` filter `deletedAt IS NULL` | `identity-core-api UserRepository.java:29-33` + JAR strings | +| A3 | `AuthenticateUserService` skips optional MFA steps with no biometric enrollment | `AuthenticateUserService.java:150-152, 254` | +| A4 | Fivucsas tenant `contact_email` populated (was NPE on every user fetch) | `tenants` row | +| A5 | UniFace passive liveness — writable model cache + `LIVENESS_MODE=passive` (was `hybrid` requiring blink+smile that /verify never asks for) | `biometric-processor docker-compose.prod.yml`, env, `/app/uniface-cache/minifasnet_v2.onnx` | +| A6 | Quality scoring no longer caps at 70% — bbox fallback when no 478-pt landmarks | `useQualityAssessment.ts`, `FaceCaptureStep.tsx:72`, deployed bundle | + +## B. 
Stale doc claims — already Done in code, doc says Open + +| Old doc item | Reality | +|---|---| +| BIOMETRIC_ROADMAP F1-1 Facenet512 | DONE — `EMBEDDING_DIMENSION=512`, all embeddings 512-dim | +| F1-3 Anti-spoofing enable | DONE — `ANTI_SPOOFING_ENABLED=true` | +| F1-4 UniFace liveness | DONE — passive-only deviation (justified) | +| F2-1, F2-2 Liveness on /enroll + /verify | DONE — commit 3606064 + today's logs | +| F2-4 Passive liveness client | DONE — commit ce20c59 | +| F2-5 MediaPipe FaceLandmarker | DONE — verified by audit | +| F3-1 Adaptive threshold | DONE — `VERIFICATION_THRESHOLD_AGED_*` | +| web-app/TODO Phase A1-A4 lint | DONE — PR #47 (`eslint src/` shows 0 errors, 1 warning) | +| INFRA_REVIEW biometric_db backup gap | DONE — daily GPG backup includes `biometric_db.sql.gz.gpg` | + +## C. New bugs found 2026-04-28 afternoon test pass + +User-reported, in order found, with hypothesized root cause. + +| # | Symptom | Severity | Hypothesized root cause | +|---|---|---|---| +| C1 | Sidebar: tapping "Kayıtlar" highlights both "Kayıtlar" and "Kimlik Doğrulama Yöntemleri" | P2 | NavLink `isActive` matches via `startsWith` instead of exact path; one path is a prefix of the other | +| C2 | TOTP — re-enrolled but page still says "not enrolled" | P0 | `EnrollmentHealthService.validateEnrollments` checks for actual backing data (e.g., decrypted secret) and disagrees with the row in `user_enrollments` (status=ENROLLED). Or: enrollment write path doesn't persist the secret. | +| C3 | Email — page said "not enrolled", tap to enroll → page refreshes → "enrolled" with no verification step | P1 | Email is not a real enrollment method (no secret to bind), but UI treats it as one. Either drop the enroll button entirely or wire it to a real one-time-code verification. | +| C4 | SMS OTP — entered phone number → "success" without OTP verification | P0 | Enrollment endpoint accepts phone number alone and marks ENROLLED. 
Should require: send OTP → user enters code → verify code → only then mark ENROLLED. | +| C5 | QR Code — "not enrolled" → tap enroll → instantly enrolled with no QR scan | P1 | Same pattern: enrollment endpoint flips state without doing the work. | +| C6 | Fingerprint login step ran twice in one login | P1 | `MultiStepAuthFlow` doesn't check `completedMethods` before re-running the same step. Or: server returns the same step twice in `availableMethods`. | +| C7 | Client-apps not visually identical to `app.fivucsas.com` / `verify.fivucsas.com` | P1 | Out of scope of today's auth fixes — needs Compose Multiplatform UI parity sweep. | +| C8 | APK not released to GitHub | P2 | `client-apps` workflow doesn't publish a Release with attached APK on `main` push. | +| C9 | Biometric tools page throws "network error" | P0 | Likely: wrong baseURL / missing tenantId / 401 redirect / endpoint moved. Diagnose first. | +| C10 | Biometric puzzles look low quality, need pro polish | P1 | Visual + content quality lift — typography, spacing, illustrations, real challenges, smooth transitions. | +| C11 | Need a public face-features demo page (showcases all face capabilities) | P1 | New page that walks through detection, landmarks, head pose, liveness, anti-spoof, quality assessment, embedding. Marketing surface. | + +## D. 
Real prod-impacting issues from morning audit (still open) + +P0 / blocking +- **V42** (TOTP strict) and **V43** (drop `biometric_data` PII) Flyway migrations missing from prod (PR #32 audit, confirmed via `flyway_schema_history`) +- `identity-core-api` test suite does not compile (`OperationType.LOGIN` → `OperationType.APP_LOGIN`) +- Secrets never rotated since deploy — JWT_SECRET, postgres, Redis, Twilio, biometric API key, SMTP (TODO Phase C1a-f) +- `.env.prod` in git history of multiple repos — needs `git filter-repo` + push-protection + gitleaks +- Today's session: ahabgu lost TOTP / fingerprint / NFC / device enrollments via FK cascade when his old soft-deleted Marmara row was hard-deleted. Face + voice survived (separate biometric_db). User must re-enroll. Lesson: never hard-delete a `users` row with active FK chains; always rely on `deletedAt IS NULL` filters. + +P1 +- OIDC conformance suite never run (TODO D4) +- Backup-restore verification cron (backups happen daily, restore is untested) +- Traefik bio.fivucsas.com rate-limit + admin allowlist (TODO C4) +- mizan-api CPU bottleneck (324% / 5892 ms) — outside FIVUCSAS auth path but same VPS +- TOTP secret encryption uses per-call-site cipher rather than `@Convert` JPA converter (commit f1ea4b0). Defense-in-depth gap. +- Pose control gate (yaw/pitch hard-reject) — code exists, not enforced +- Audit DRAFT PRs **#45 (web-app)** + **#32 (api)** still open — land or close + +P2 +- 135 `Map.of` → typed record DTOs +- `@Transactional(readOnly=true)` sweep +- Recharts route-level `React.lazy()` +- size-limit CI gate +- Loki + Grafana log shipping +- pg_stat_statements + shared_buffers tuning +- biometric-api memory limit 3 GB → 512 MB + +## E. 
2026-04-28 afternoon — final state + +| Team | Issue | Status | Branch | +|---|---|---|---| +| A | C1 sidebar dual-highlight | ✅ Done | `web fix/sidebar-dual-highlight` (d217c64) | +| B | C2-C5 enrollment correctness | ✅ Done | `web + api fix/enrollment-correctness` (4 web + 3 api commits) | +| C | C6 fingerprint twice | ✅ Done | `api fix/mfa-step-no-double` (8d36c7d) | +| D | C7-C8 client-apps parity + APK | 📋 Plan saved | `CLIENT_APPS_PARITY_PLAN_2026-04-28.md` (research only) | +| E | C9 biometric tools network error | 🟡 Team E3 in flight | `web fix/biometric-tools-network` | +| F | C10 puzzles polish | 🟡 Team F3 in flight | `web polish/biometric-puzzles` | +| G | C11 face demo page | 🟡 Team G3 in flight | `web feat/face-demo-page` | + +11 agent runs today exhausted the 5-hour Anthropic quota at 14:08 UTC. +Reset 18:30 UTC. C/E/F/G first-wave + redeploy stalled / rate-limited; +B was 75% done locally and resumed from worktree state. C done in main +thread. E/F/G running on quota reset (3rd dispatch). + +### Team scopes (historical reference) + +- **Team A — Sidebar dual-highlight (C1)** + scope: `web-app/src/layouts/components/Sidebar.tsx` (or wherever the + NavLink active-match lives). 1 file. ~5 min. + +- **Team B — Enrollment correctness sweep (C2, C3, C4, C5)** + scope: identify root causes for TOTP / Email / SMS / QR enrollment-vs- + health mismatch. Repos: `web-app` (enrollment dialogs) + + `identity-core-api` (enrollment endpoints + `EnrollmentHealthService`). + Diagnose first, then fix each. May land 4 separate commits. + +- **Team C — Fingerprint duplicate step (C6)** + scope: `web-app MultiStepAuthFlow.tsx` + step components + + `identity-core-api MfaSession.completedMethods` flow. Find why the + same step runs twice. Diagnose and fix. + +- **Team D — Client-apps parity + APK release (C7, C8)** — diagnose only + scope: read-only audit of `client-apps` (Compose Multiplatform). 
+ Compare login screens against `app.fivucsas.com` and + `verify.fivucsas.com`. Propose a parity work plan + GH Actions APK + release workflow. Do **not** ship code yet — APK signing + release + publishing is too consequential to delegate without user review. + +- **Team E — Biometric tools network error (C9)** + scope: `web-app` biometric-tools surfaces. Diagnose + fix the network + error (wrong base URL, missing tenant header, auth scope, etc.). + +- **Team F — Biometric puzzles quality lift (C10)** + scope: `web-app/src/pages/BiometricPuzzles*.tsx` (or wherever the page + lives). Visual + content polish to "professional" bar. Typography, + spacing, motion, copy, illustrations, real challenge variety. + +- **Team G — Face-features demo page (C11)** — new file, no conflict + scope: build a new `/demo/face` page (or similar route) that walks a + visitor through every face capability we ship: detection, 478-pt + landmarks, head pose, passive liveness, anti-spoof, quality scoring, + embedding visualization. Public/no-auth section, marketing-grade + visuals. + +## E.bis Round 2 + Round 3 (afternoon → evening) + +After the original 11-task wave landed, two more rounds shipped on +the user's bug reports. + +**Round 2** (TOTP / Users-list / 10-method flow) +- TOTP "Beklemede" — repaired ahabgu's PENDING row in DB; the secret + was already encrypted-at-rest, only the bookkeeping flip was + missing. New enrollments use the deployed b87593c path. +- Users page "Last Login: Never" — `enrichWithLoginInfo` queried + audit_logs for action `'USER_AUTHENTICATED'` but the API actually + emits `'USER_LOGIN'`. Always returned null. Fixed + added fallback + to entity `lastLoginAt`. Commit c25b731. +- Login flow now exposes all 10 methods — added VOICE + + NFC_DOCUMENT to all 3 Fivucsas auth-flow CHOICE steps via + `auth_flow_step_methods` insert. 
+ +**Round 3** (V42 + tenant-lock + multi-email design) +- V42 TOTP encrypted-at-rest CHECK constraint restored from the + unmerged `security/phase-1-auth-hardening` branch (commit 3eb0161 + was never landed; audit DRAFT PR #32 was right). Applied manually + to prod since Flyway has out-of-order=false. Commit bad7262. +- V43 already shipped as V48 (`drop_biometric_data`) — no separate + restore. +- Test suite compile error already fixed; PR #32 finding stale. +- **OAuth tenant-lock**: when login carries a `clientId` and the + client is bound to a non-system tenant, refuse the login if the + user's tenant doesn't match. Closes the demo.fivucsas + cross-tenant login that the user reported. Smoke test confirms + ahabgu+marmara-bys-demo → 401, ahabgu w/o clientId → MFA + proceeds. Commit 5446d57. +- `MULTI_EMAIL_TENANT_DESIGN_2026-04-28.md` — design note for + multi-email per identity. Today the schema supports two distinct + rows for one human in different tenants (and `tenant_email_domains` + V44 auto-routes by domain). Three options laid out for the + future: A status quo, B identities + memberships, C aliases. A + recommended for now. + +## E.tris Doc archive (2026-04-28 evening) + +16 superseded reports moved to +`archive/2026-04-pre-roadmap-2028/` with a README index. Top +level now carries only the 6 canonical docs (CHANGELOG, CLAUDE, +README, ROADMAP_2026-04-28, CLIENT_APPS_PARITY_PLAN, MULTI_EMAIL). + +## F. Session memory — lessons saved + +- Hard-delete on `users` with active FK chains cascades through + `webauthn_credentials`, `nfc_cards`, `user_devices`, + `user_enrollments`, plus inline TOTP secret. Always patch the query + (`deletedAt IS NULL`) instead of hard-delete. +- Read-only Docker rootfs needs `HOME=/tmp` + explicit cache volume for + any library that writes under `~/`. UniFace was the third such case + this month (after DeepFace, Numba). +- "hybrid" liveness backend ANDs passive deep-learning with active + challenge response. 
The /verify UI captures one still frame with no + active prompts — `challenge_completed` will always be False, vetoing + every login. Use `LIVENESS_BACKEND=uniface` (passive only) for + /verify-style flows. diff --git a/ROADMAP_OPTIMIZED_2026-05-04.md b/ROADMAP_OPTIMIZED_2026-05-04.md new file mode 100644 index 0000000..700024e --- /dev/null +++ b/ROADMAP_OPTIMIZED_2026-05-04.md @@ -0,0 +1,361 @@ +# FIVUCSAS Optimized Roadmap — 2026-05-04 (post-Wave-2 + late-day P0 deploy) + +**Supersedes:** `archive/2026-05/roadmaps/ROADMAP_OPTIMIZED_2026-05-02.md` (kept for history). + +**Authoritative source-of-truth review docs (all under `/opt/projects/fivucsas/` unless noted):** +- 2026-05-04 — `SENIOR_DB_REVIEW_2026-05-04.md`, `SENIOR_UIUX_REVIEW_2026-05-04.md`, `CICD_AUDIT_2026-05-04.md` +- 2026-05-01 — `/opt/projects/SECURITY_REVIEW_2026-05-01.md`, `TEST_REVIEW_2026-05-01.md`, `QUALITY_REVIEW_2026-05-01.md`, `FRONTEND_REVIEW_2026-05-01.md` +- 2026-04-30 — DevOps / DB / Performance / Architecture / Principal reviews + +This document is the **single source of truth for what is open**. Anything not listed here is either closed (see `CHANGELOG.md` 2026-05-04 entry) or out of scope. + +--- + +## Headline — what changed in this session + +- **16 PRs squashed into `main` today** across api (#63–#73), web (#67–#73), and bio (#69 docs). +- **2 senior reviews + 1 CI/CD audit landed** as authoritative read-only docs. +- **P0-PROD refresh-token mint bug DIAGNOSED, FIXED, AND DEPLOYED** (api PR #71 → image `e9a33cef`, recreated 2026-05-04 12:01 UTC). Closed the 6 audit-log `MFA_STEP_FAILED` rows for `ahabgu@gmail.com` between 06:34–06:38 UTC. No new orchestration errors since. +- **web auto-deployed to Hostinger** on every push — confirmed working (Hostinger CI run `25317948466` SUCCESS). +- **SENIOR_DB Appendix C 7 prod queries** — answered in §A below; carryover items added to Tier 4. 
+- **Concurrent-agent collision pattern reconfirmed** — agents now use `/tmp//` worktrees when ≥3 share a submodule. + +--- + +## A. SENIOR_DB Appendix C — prod-query results (run 2026-05-04 12:30 UTC) + +| Q | Finding | State | Action | +|---|---|---|---| +| 1 | Flyway schema_history NULL checksums on V40, V41, V42, V43, V49, V50 | Drift carrying forward from baseline-skip + emergency rebuilds | T6.1 — `flyway repair` after next rebuild lands | +| 2 | `archive_mode = off`, `archive_command = (disabled)`, `wal_level = replica` | WAL archiving NOT live despite parent commit `1ab95e9` | Commit body declared "deploy DEFERRED" — confirmed | +| 3 | `audit_logs.tenant_id IS NULL` count = **140 / 1107 = 12.6%** | Drifted up from 12.4% on 2026-04-30 | T4.7 — investigate which paths still write null tenant; backfill | +| 4 | 25+ `pg_stat_user_indexes` rows with `idx_scan = 0` (active_sessions, api_keys, audit_logs subset, auth_flow_*) | Some are unused; some are tables not exercised yet | T4.8 — careful audit; do NOT just drop | +| 5 | `webauthn_credentials` dead/live = **9.00** (27 dead / 3 live, never autovacuumed) | Bloat | T4.9 — `VACUUM ANALYZE` + investigate why autovacuum hasn't fired | +| 6 | `alembic_version` table EXISTS in `biometric_db`, version_num = `0005_embedding_ciphertext` | Closed — Senior DB's "missing entirely" claim was stale from 04-30 | — | +| 7 | `face_embeddings` = 19, `voice_enrollments` = 35 | Same as 2026-05-02 (no new biometric enrollments this week) | Indicator-only, no action | + +--- + +## Tier 1 — Operator-only, cannot be done from this session + +These require an external system the host cannot reach (registrar console, upstream SaaS console, hardware), or are destructive enough to require explicit user authorization beyond the standing "fix everything" mandate. 
+ +| ID | Item | Why operator | Priority | +|---|---|---|---| +| T1.A | DNS A record `grafana.fivucsas.com → 116.203.222.213` | TurkTicaret nameserver registrar console | P2 (Grafana works on direct IP; cosmetic) | +| T1.B | GDPR fixture force-push (`git filter-repo --path tests/fixtures/images --invert-paths` on bio repo) | History-rewrite + force-push to public; needs explicit user OK because every collaborator's clone diverges irreversibly | P3 (cosmetic — leaked secret was already rotated 2026-04-30) | +| T1.C | Twilio + SMTP credential rotation | Upstream provider consoles | P2 hygiene (was NOT part of any leak; routine rotation only) | +| T1.D | iBeta certification submission for Phase 5 | Legal / vendor process | Phase 5 trigger | +| T1.E | Stripe / payment gateway provisioning for Phase 4 | Vendor account setup | Phase 4 trigger | +| T1.F | APK signing keystore generation + 4 GitHub secrets | User keystore + secret-manager UI | Phase 5 trigger | +| T1.G | Custom postgres image with `postgresql-16-partman` + `postgresql-16-cron` (V57 Option A — full pg_partman) | Maintenance-window decision; `RUNBOOK_AUDIT_LOG_PARTMAN.md` Option B (skip GUC) is in effect today | P2 (V57 already fail-soft live) | +| T1.H | amispoof.com domain purchase + setup | Domain registration; Option B in `RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md` defers this | Phase 4+ | +| T1.I | Hetzner self-hosted GitHub Actions runner re-pairing (Task #55) | Runner shows online but does not pull jobs; needs SSH-level inspection of `_work/` quarantine OR Runner-Group repo-access scope adjustment in GitHub org settings | P0 per CICD_AUDIT but cosmetic-only because we already have ubuntu-latest fallbacks queued (T4.A) | +| T1.J | Hetzner CX43 vertical scale or runner pool expansion | Hetzner console | Decide once T4.A lands | + +--- + +## Tier 2 — Decisions awaiting the user + +| ID | Decision | Recommendation | Blocking | +|---|---|---|---| +| T2.1 | Proctoring direction A/B/C | **B** (extract 
submodule, defer amispoof.com) — confirmed in `RESEARCH_PROCTORING_AMISPOOF_2026-05-02.md` | T5.4 | +| T2.2 | Bifurcated `User` domain | **Keep both + ArchUnit guard** — shipped via PR #63 | (closed) | +| T2.3 | Web `IFoo` interface convention (Quality P0-Q3) | **Keep** — matches Inversify DI pattern, solo-dev preference | T4.2 cleanup scope | +| T2.4 | V56-real — drop `refresh_tokens.token` plaintext column | Schedule for **2026-05-09** (T+7d soak from 2026-05-02 rebuild). Now safer because PR #71 fix is live and refresh-token mint flow is healthy. | T5.6 | +| T2.5 | Developer portal / widget demo gating | **Already public** — verified at HEAD (`App.tsx:223-226` in ``); SENIOR_UIUX P0-1 closed by inspection | (closed) | +| T2.6 | "Fix all" mandate scope on prod-side actions | **Standing approval until otherwise told** — destructive actions still ask, audit-driven rebuilds proceed | (process) | + +--- + +## Tier 0 — INVESTIGATION 2026-05-07 P0 batch (highest priority, dispatched 2026-05-07 06:00 UTC) + +Six-lens audit on 2026-05-07 surfaced 10 P0 items. Source: `INVESTIGATION_MASTER_2026-05-07.md` (synthesis) + 6 sibling docs. Plain-language description for each. + +### T0.1 — Legacy `/2fa/verify-method` 2FA bypass [P0, agent-actionable] +`AuthController.java:526-532` accepts ANY non-empty `assertion` string for FINGERPRINT/HARDWARE_KEY without signature/public-key/sign-counter checks; audit log records "success." N-step `WebAuthnVerifySupport` does this correctly — legacy route was missed. **Fix**: delegate to `WebAuthnVerifySupport.verifyAssertion(...)` or remove the legacy route if no client uses it. + +### T0.2 — Embedding encryption never invoked [P0, agent-actionable] +`pgvector_embedding_repository.py` and `pgvector_voice_repository.py` write `embedding_ciphertext` on save but `find_similar` / `find_by_user_id` read the plaintext `embedding` column. `decrypt_vector` defined at `embedding_cipher.py:75` and **called nowhere**. 
**Fix**: replace plaintext column reads with decryption from ciphertext, then drop plaintext column once verified. Embedding-at-rest "encryption" is theater until this lands. + +### T0.3 — `WatchlistCheckHandler` is a hardcoded mock in production [P0, decision-then-agent] +Live `@Component` returning `cleared=true, match_count=0` for every input (`WatchlistCheckHandler.java:14-50`). KYC/AML claim broken on any flow including `WATCHLIST_CHECK`. **Decision needed**: profile-gate to `@Profile("dev")` for now (default-safe — flow misses are explicit) OR commit to a real provider (Refinitiv/Dow Jones/etc.). **Default**: profile-gate this round, treat real-provider as a separate Tier-1 backlog. + +### T0.4 — `live_camera_analysis.py` boot fail-open [P0, agent-actionable] +Returns `is_live=True` when `self._liveness_detector` is None (`live_camera_analysis.py:184-193`). DI failure at boot = silent fail-open. **Fix**: fail-closed (`is_live=False, reason="liveness_detector_unavailable"`) and add a startup health-check that aborts boot if detector is None. + +### T0.5 — Account-lockout error never surfaces to user [P0, agent-actionable] +`AuthenticateUserService.java:79,127` throws `InvalidCredentialsException` for locked accounts. The dedicated `AccountLockedException` exists at `AccountLockedException.java:7` with `remainingLockTimeSeconds` but is never thrown. Frontend has full i18n keys for `ACCOUNT_LOCKED` (`web-app/.../tr.json:1576`) — dead code. Server lockout works (5 attempts → 15 min); only the surfacing is broken. **Fix**: throw `AccountLockedException` with remaining seconds when lockout fires; map in `GlobalExceptionHandler` to a structured 423 LOCKED response carrying the seconds. + +### T0.6 — OTP no per-code attempt counter [P0, agent-actionable] +`OtpService.java:29-43` keeps the OTP on mismatch (no counter, no delete). Only defense is 30/min/IP `mfa-step` rate-limit. ~150 guesses/code against 10⁶ space; rotating IPs improves attacker odds. 
NIST 800-63B requires 3-5 failures then invalidate. **Fix**: add `attempts INTEGER NOT NULL DEFAULT 0` to OTP entity (or Redis key), increment on mismatch, invalidate at 5, surface remaining attempts in error response. + +### T0.7 — `tenants.max_users` never enforced [P0, agent-actionable] +Field exists at `Tenant.java:86-88` (default 100), surfaced in admin UI; zero insert-path readers. **Fix**: in `RegisterUserService` (and `ManageUserService.create`) check `userRepository.countByTenantId(tenantId) < tenant.getMaxUsers()` before insert; new error code `TENANT_USER_QUOTA_EXCEEDED`. Decision needed on default cap policy for unmigrated tenants — sensible default is 1000 with admin-overridable. + +### T0.8 — Suspended tenants keep minting JWTs [P0, agent-actionable] +`Tenant.canAcceptUsers()` exists (`Tenant.java:249-251`) with zero non-DTO callers. `AuthenticateUserService` has no `tenant.isActive()` gate. **Fix**: gate auth path with `if (tenant.getStatus() != ACTIVE) throw TenantSuspendedException` mapping to 423; same gate in token-refresh path. + +### T0.9 — Anti-replay spot-check defeated by corrupt frames [P0, agent-actionable] +`verify_puzzle.py:171-196` only counts `is_live=False` outcomes as failures; any `continue` (decode error, detector exception) skips silently. 3 corrupt JPEGs → `failed_count=0` → spot-check pass. **Fix**: count exceptions/decode-errors as failures, raise `failed_count` and abort spot-check on threshold. + +### T0.10 — Face confidence fallback override [P0, agent-actionable] +`FaceAuthHandler.java:65-75` and `AuthController.java:509-518` fall back to hardcoded 0.7 cosine threshold when processor's `verified` field is missing/false, ignoring the adaptive aging threshold. Override never logs. **Fix**: trust the processor `verified` field; on missing field log + reject; remove the duplicated fallback in both call sites. 
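As a rough illustration of the T0.9 fix described above, the sketch below counts decode errors and detector exceptions as spot-check failures instead of skipping them. It is not the code in `verify_puzzle.py`: `spot_check_frames`, `decode`, `detect_live`, and `max_failures` are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, Tuple


def spot_check_frames(
    frames: Iterable[bytes],
    decode: Callable[[bytes], object],
    detect_live: Callable[[object], bool],
    max_failures: int = 0,
) -> Tuple[bool, int]:
    """Return (passed, failed_count); a frame that cannot be judged live is a failure."""
    failed_count = 0
    for raw in frames:
        try:
            image = decode(raw)           # may raise on a corrupt JPEG
            is_live = detect_live(image)  # may raise inside the detector
        except Exception:
            # Fail closed: an unreadable frame is not evidence of liveness.
            is_live = False
        if not is_live:
            failed_count += 1
            if failed_count > max_failures:
                # Abort the spot-check as soon as the threshold is crossed.
                return False, failed_count
    return True, failed_count
```

With `max_failures=0`, three corrupt JPEGs fail on the first frame rather than passing with `failed_count=0`. Whether the real threshold should tolerate any failures at all is a policy choice this sketch leaves open.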
+ +### T0.11 — INVESTIGATION P1 hardening backlog [staged after P0 batch] +Round 2-6 in `INVESTIGATION_MASTER_2026-05-07.md`: AddressProofHandler real impl or profile-gate, LoggerService prod wiring, occlusion implementation, NFC MRZ wiring, access-token TTL = 15 min, voice/device caps, OAuth2 client RBAC, `/userinfo` scope filter, client_secret rotation, per-tenant rate limit, BiometricService underscore-prefix surfacing, error-shape unification, face-verify response shape, RFC 8252 redirect_uri schemes, AuthSessionRepository contract, AuditEventPublisher exception counter, 3 audit-log blind spots, `/2fa/verify*` HTTP status corrections, anti-spoof contradiction policy, SoftDeletePurgeJob default-on confirmation. + +--- + +## Tier 3 — Next active wave (agent-actionable, dispatch when quota window opens) + +These are the highest-leverage open items, ranked. Each is scoped tight enough for a single sub-agent to ship in one PR. + +### T3.1 — CI/CD pipeline P0 fixes (CICD_AUDIT) +- **T3.1.a** — Move bio CI 5 jobs (`Lint & Type Check`, `Unit Tests`, `Security Scan`, `Integration Tests`, `Build Frontend`) from `[self-hosted, linux, x64]` to `ubuntu-latest`. Bio CI dead 27 days; 82/100 cancelled, 2/100 success. +- **T3.1.b** — Move api `Integration tests (Testcontainers)` job to `ubuntu-latest`. Currently queues 5h+ then cancels; 0/30 success on `main`. +- **T3.1.c** — Fix `.github/workflows/deploy-landing.yml` — last successful run 2026-03-28 (5+ weeks). Switch to `ubuntu-latest` + rsync (web-app already does this with `HOSTINGER_SSH_KEY`). +- **T3.1.d** — Decide branch protection: enable on `main` for api/bio/web with 1-review requirement (admin bypass allowed for emergency hotfixes), OR explicitly document accept-no-protection rationale in `CICD_AUDIT_2026-05-04.md`. Currently OFF on all 5 repos. + +### T3.2 — TEST_REVIEW deferred items +- **T3.2.a** — F6: 11 controller slice tests using `addFilters=false`. 
Switch to filter-chain-enabled `@WebMvcTest` so SecurityConfig regressions surface. (Memory rule `feedback_pr_review_workflow.md` directly applies.) +- **T3.2.b** — F8: Vitest "e2e" specs are unit-shaped — rename or relocate to `src/__tests__/`. +- **T3.2.c** — NEW: smoke test that loads `application-prod.yml` in CI to catch YAML duplicate-key bugs (would have prevented PR #62 emergency hotfix on 2026-05-02). + +### T3.3 — SENIOR_UIUX P1-3 + P2 batch (deferred from today's UIUX-P1) +- **T3.3.a** — P1-3 sidebar dev-tools collapse (M-effort, single PR scope). +- **T3.3.b** — P2 batch (20 items): component decomposition (3 files >900 LOC remaining), zod parsing at API boundary, exhaustive-deps disabled in 28 spots, `any` casts in critical paths, microcopy consistency sweep. + +### T3.4 — Login edge-case carryover +- **T3.4.a** — Item #10: Session/flow tight binding — DB migration + entity column for flow snapshot so admin can mid-flight reassign user without invalidating session. (Adds V58.) +- **T3.4.b** — Item #7: "Adaptive" flow naming — UI hint not migration (per PR #65 deferral note). + +### T3.5 — SENIOR_DB Appendix C derived actions +- **T3.5.a** — `flyway repair` on prod for V40/V41/V42/V43/V49/V50 NULL-checksum rows; then flip `SPRING_FLYWAY_VALIDATE_ON_MIGRATE=true`. Removes Task #80 emergency override. +- **T3.5.b** — Investigate the 12.6% `audit_logs.tenant_id IS NULL` paths; ship a one-shot backfill + tighten anonymous-endpoint audit-emission. +- **T3.5.c** — `VACUUM (FULL, ANALYZE) webauthn_credentials;` + check why autovacuum has never fired. (3 live / 27 dead is a local outlier.) +- **T3.5.d** — Unused-index audit: reset `pg_stat_user_indexes`, monitor for 7 days, then `DROP INDEX` for confirmed-zero-scan indexes. Caution: do NOT drop indexes on `webauthn_credentials`, `oauth2_clients`, `refresh_tokens`, `audit_logs` until traffic patterns settle post-rebuild. 
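The guard rails in T3.5.d can be expressed as a small filter over `pg_stat_user_indexes` rows. This is a sketch under assumptions: rows are taken to be dicts carrying the view's `relname`, `indexrelname`, and `idx_scan` columns, the caller supplies the number of days since the stats reset, and `drop_candidates` is a hypothetical helper rather than an existing script.

```python
from typing import Dict, Iterable, List

# Tables T3.5.d says not to touch until traffic patterns settle post-rebuild.
PROTECTED_TABLES = frozenset(
    {"webauthn_credentials", "oauth2_clients", "refresh_tokens", "audit_logs"}
)
MIN_OBSERVATION_DAYS = 7  # monitoring window named in T3.5.d


def drop_candidates(rows: Iterable[Dict[str, object]], days_since_reset: int) -> List[str]:
    """Return index names that are zero-scan, unprotected, and observed long enough."""
    if days_since_reset < MIN_OBSERVATION_DAYS:
        return []  # too early to trust idx_scan = 0
    return sorted(
        str(row["indexrelname"])
        for row in rows
        if row["idx_scan"] == 0 and row["relname"] not in PROTECTED_TABLES
    )
```

The output is only a review list. The sketch does not check whether an index backs a primary key or unique constraint, so each candidate still needs a manual look before any `DROP INDEX`.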
+ +### T3.6 — 2026-05-04 deploy follow-up +- **T3.6.a** — Rebuild api with PR #73 included (today's image is from PR #71). Deferred because #73's contents (SoftDeletePurgeJob `hardDeleteById`, WebAuthn `deleteByCredentialId` enrollment-revoke, `Locale.ROOT`, OAuth2 `invalid_token`) are not user-visible blockers. +- **T3.6.b** — Confirm post-#71 prod stability over 24h. If clean, close P0-PROD definitively in CHANGELOG. + +--- + +## Tier 4 — Open backlog (leverage-ranked) + +### T4.1 Test infrastructure (carryover from `TEST_REVIEW_2026-05-01.md`) +- F7 — JaCoCo + 70% gate on `mvn verify` (P1) +- F11 — Java test factory builders (`aUser().withEmail("x").build()`) (P2) +- F12 — MSW + API contract tests on web side (P2) +- F13 — k6 load tests not invoked from CI (P2) +- F14 — Audit-log emission asserted at port-mock layer, not row count (P2) +- F17 — No CodeQL / Semgrep / DAST despite RS256 + TOTP + WebAuthn surface (P1, ops-coordination) + +### T4.2 Quality cleanup (carryover from `QUALITY_REVIEW_2026-05-01.md`) +- P1-Q8 — Service-class naming inconsistency (P1) +- P2-Q11..Q16 — bio `container.py` 1133 LOC, dead `@deprecated` aliases, dup script trees, 16 `@SuppressWarnings`, MDC under-used, audit-log prefix inconsistency +- P3-Q17..Q22 — minor hygiene + +### T4.3 Frontend P3 (carryover from `FRONTEND_REVIEW_2026-05-01.md` + UI/UX 2026-05-04) +- 11 P3 items in `SENIOR_UIUX_REVIEW_2026-05-04.md` + +### T4.4 Security tail +- See `SECURITY_REVIEW_2026-05-01.md` deferred P2/P3 items. +- ✅ DeviceController WebAuthn boundary — closed by PR #66 today. +- ✅ `/oauth2/userinfo` type-claim check — closed by PR #67 today. + +### T4.5 RLS hardening (Task #27, multi-day) +Application DB role + `FORCE ROW LEVEL SECURITY` + per-table policies. Currently RLS is opt-in via app-level `SET app.current_tenant_id`; hardening makes it impossible for any session to bypass. Multi-day scope; needs dedicated session. 
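To make the T4.5 scope concrete, the sketch below renders the per-table statements such a hardening would need. The statements are standard PostgreSQL, but the policy name `tenant_isolation`, the assumption that `tenant_id` is a `uuid` column, and the helper `rls_statements` are illustrative only and not the project's migration.

```python
import re
from typing import List

# Only plain lowercase identifiers are accepted, since names are interpolated into SQL.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def rls_statements(table: str, policy: str = "tenant_isolation") -> List[str]:
    """Render ENABLE/FORCE RLS plus one tenant policy for a single table."""
    if not _IDENT.match(table) or not _IDENT.match(policy):
        raise ValueError(f"unsafe identifier: {table!r} / {policy!r}")
    # Assumed predicate: rows are visible only when tenant_id matches the session setting.
    predicate = "tenant_id = current_setting('app.current_tenant_id')::uuid"
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        # FORCE applies the policies to the table owner as well.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY;",
        f"CREATE POLICY {policy} ON {table} USING ({predicate}) WITH CHECK ({predicate});",
    ]
```

Superusers and roles with `BYPASSRLS` still skip these policies, which is why the dedicated application DB role is part of the same work item.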
+ +### T4.6 Architecture (carryover) +- DTO triplication — Performance + Architecture review item +- ✅ Audit-log partitioning advanced — closed by PR #68 today (V57 pg_partman, fail-soft) + +### T4.7 Audit-log tenant_id NULL backfill (NEW from Appendix C) +12.6% (140/1107) audit-log rows still have `tenant_id IS NULL`. Drift up from 12.4% on 2026-04-30 means new rows are still being written without it. Two work items: +- One-shot backfill via `audit_logs.user_id → users.tenant_id` JOIN (mirrors V46) +- Audit emission tightening: anonymous endpoints (login attempt before auth, oauth2 token endpoint, etc.) need a deliberate decision to write `tenant_id` (e.g. tenant from email-domain lookup, or explicit "system" tenant UUID) + +### T4.8 Unused-index audit (NEW from Appendix C) +25+ indexes with `idx_scan = 0`. Need a 7-day post-stats-reset re-audit before any `DROP INDEX`. Some are likely fine to keep (FK-covering indexes that just haven't seen traffic yet); some are likely droppable (e.g. duplicate composite indexes). + +### T4.9 webauthn_credentials VACUUM (NEW from Appendix C) +27 dead / 3 live, never autovacuumed. ✅ **`VACUUM ANALYZE` run 2026-05-04 12:27 UTC — ratio 9.00 → 0.00, 14 dead tuples reclaimed.** Follow-up: tune `autovacuum_vacuum_scale_factor` for low-row-count tables so this doesn't recur. + +### T4.12 Documentation gaps (NEW from `DOC_AUDIT_2026-05-04.md` commit `f2efeac`) + +**P0 — quick wins, ≤2 hours each:** +- **T4.12.a** — Add `SECURITY.md` to all 6 repos (parent + 5 submodules). Vulnerability-disclosure policy is the single highest-priority gap on an auth/biometric platform with JWT + OAuth2 + WebAuthn + RFC 6749 §10.4 family-revoke. Use the GitHub-recommended template; route to `info@app.fivucsas.com`. +- **T4.12.b** — Add `LICENSE` file to every repo. All READMEs display MIT badges but no `LICENSE` file exists anywhere. Legally weak. XS effort. +- **T4.12.c** — Add `landing-website/README.md`. The repo is currently completely undocumented. 
+ +**P1 — multi-day initiatives:** +- **T4.12.d** — Tenant onboarding playbook in `docs/01-getting-started/tenant-onboarding.md`. Covers OIDC client provisioning, redirect-URI allowlist, SDK install, first-signup → first-MFA flow. +- **T4.12.e** — ADR (Architecture Decision Records) directory `docs/adr/`. Backfill the major decisions that currently live only in CHANGELOG narrative + session memos: hosted-first OIDC, pgvector, MobileFaceNet removal, Facenet512, log-only client embeddings, RFC 6749 §10.4 family-revoke, V53 BEFORE-DELETE trigger pattern, Persistable wire-format trade-off. + +**P2 — cleanup:** +- **T4.12.f** — `docs/` submodule has duplicate hierarchies (`02-architecture/` vs `architecture/`, `05-testing/` vs `testing/`, `01-getting-started/` vs `guides/`). Broken links to `docs/4-testing/`, `docs/5-security/`, `07-status/IMPLEMENTATION_STATUS_REPORT.md`. Consolidate into the numbered tree. +- **T4.12.g** — `/opt/projects/infra/` has 8 strong runbooks but no public docs-index entry points to them. Cross-link from `docs/06-operations/` (or wherever the right index lives). +- **T4.12.h** — Dated-doc reorganization: move 25+ `AUDIT_*` / `*REVIEW_*` / `SESSION_STATUS_*` / `ROADMAP_*` / `ANALYSIS_*` / `RESEARCH_*` files out of parent root into `docs/reviews/YYYY-MM-DD/.md`. **Hold for explicit user OK** — moves git history for many files. + +### T4.11 User-reported bugs 2026-05-04 afternoon (NEW) + +User testing surfaced 6 issues. Each entry below shows status as of 12:45 UTC. + +- **USER-BUG-1 — Documentation gap audit (META).** Research what a professional production-grade SaaS platform should document, audit current state, recommend reorganization. ⏳ T-DOC-AUDIT in flight. 
+- **USER-BUG-2 — Guest invitation creation crashed with `column "metadata" is of type jsonb but expression is of type character varying`.** Root cause: `GuestInvitation.metadata` had `@Column(columnDefinition = "jsonb")` but no `@JdbcTypeCode(SqlTypes.JSON)`, so Hibernate bound the String as varchar at runtime. ✅ Closed by api PR #74 (squash `5096e8d`); api container rebuilt + recreated 2026-05-04 12:39 UTC with image `0fd02c48`. Operator-side: try `POST /api/v1/guests/invite` again — should now return 201. +- **USER-BUG-3 — SMS step has black/wrong colors for code label and resend button in dark mode.** `SmsOtpStep.tsx` lines 102-131 (TextField label) + 167-199 (resend Button). Likely missing theme-aware overrides on `MuiInputLabel-root` color and on the outlined Button text color. ⏳ T-WEB-USERBUGS in flight. +- **USER-BUG-4 — Biometric Tools → Face Search returns "Eşleşme Bulunamadı" for a face that successfully logs in via face-verify.** Same face, different result. Likely tenant-scope mismatch in the search endpoint, OR a stricter threshold than verify, OR an image-encoding mismatch. ⏳ T-FACE-SEARCH in flight. +- **USER-BUG-5 — Auth Methods Testing page mostly broken.** Most stubbed cards don't let the user click through. Possibly stale imports after PR web#69 EnrollmentPage decomposition relocated method-flows. ⏳ T-WEB-USERBUGS in flight. +- **USER-BUG-6 — Settings page "Two-Factor Authentication" section is misleading.** Header says "Required by your organization / Managed by your organization's admin via Auth Flows" but then shows 3 buttons (Setup TOTP, Register Passkey, Register Hardware Key) which are actually for *device registration*, not method enablement. Rename + reword + i18n. ⏳ T-WEB-USERBUGS in flight. + +### T4.10 Copilot-deferred items from today's review (NEW from T-COPILOT-DEEP report) +Issues raised by Copilot that the post-merge follow-up agent deliberately deferred — each has a real fix but is out of scope for a single PR. 
+ +- **T4.10.a (api PR #66 follow-up)** — `WebAuthnCredentialService.saveCredential` lacks `completeEnrollment` rollback on partial failure. Needs proxy-level integration test + transaction redesign. **P2 — security-positive but not user-blocking.** +- **T4.10.b (api PR #68 follow-up)** — V57 pg_partman migration has 3 secondary issues: RLS predicate not yet plumbed into the partitioned hierarchy, missing FK on partition key, oversized legacy 2026-01..2026-06 static partition. V57 is fail-soft and not yet deployed via Option A; redesign needs prod-data testing in a dedicated PR. **P1 — must-fix before T1.G Option A (custom postgres image with pg_partman + pg_cron).** +- **T4.10.c (api PR #67 follow-up)** — `/oauth2/userinfo` parses the JWT twice (once for type-claim extract, once for email). Minor perf. Would require widening `JwtService.extractAllClaims` visibility from package-private to public, OR extracting both claims in a single parse via `extractClaim`-with-tuple. **P3.** +- **T4.10.d (api PR #65 follow-up)** — `verifyUserCanCompleteFlow` duplicated across `AuthenticateUserService` and `ExecuteAuthSessionService`. Pure refactor; pull into a shared service. **P2 — Quality-cleanup tier.** +- **T4.10.e (api PR #69 follow-up)** — `isTokenValid` is not exercised in the expired-token test path. Minor coverage gap; ship a single test case calling `jwtService.isTokenValid(alreadyExpired)` and asserting false. **P3 — Test-infra tier.** +- **T4.10.f (web PR #67 + PR #68 follow-ups)** — Cosmetic: `vite.config.ts` has a duplicate CSP block (one in JS, one in injected meta — only one path is hot); PR #68 ratchet's doc-name nit. 
**P3 — single-PR cleanup.** + +--- + +## Tier 5 — Long-running, scheduled + +| ID | Item | Trigger | Phase | +|---|---|---|---| +| T5.1 | Phase 4 productization — self-serve signup, Stripe, tenant-branded hosted login, status page | T1.E (Stripe) ready | Phase 4 | +| T5.2 | Phase 5 mobile parity (KMP Compose UI) + iBeta Level-1 prep | T1.F (keystore) + T1.D (iBeta) | Phase 5 | +| T5.3 | Aysenur rPPG integration | Post-rebase, into proctoring submodule | Phase 4+ | +| T5.4 | Proctoring submodule extraction (Option B) — Phase 1 `shared-detection`, Phase 2 `proctoring-engine` | Gated on T1.I CI runner stall (operator) | Phase 4 | +| T5.5 | JWT soak end (Task #82) — flip `ALLOW_HS512=false`, set issuer/audience env, restart api | ~2026-06-01 (env-only, no rebuild) | Scheduled | +| T5.6 | V56-real — drop `refresh_tokens.token` plaintext column | T2.4 confirmed; ~2026-05-09 | Scheduled | + +--- + +## Tier 6 — Today's prod state and verification cadence + +| ID | Item | Cadence | +|---|---|---| +| T6.1 | Flyway repair (Task #80) | Within next 7 days | +| T6.2 | DR drill cadence (after the 2026-04-30 one-off success) | Weekly | +| T6.3 | Rebuild for PR #73 (T3.6.a) | When #73 contents become user-visible OR next batch lands | +| T6.4 | 24h post-#71 prod-stability confirmation | 2026-05-05 by ~12:30 UTC | +| T6.5 | Smoke-test live login from a real user device | Ongoing (user-driven) | + +--- + +## Tier 7 — Next-phase preparation (foundation for "complete project perfectly") + +These are not blockers for current work but should be set up now so Phase 4+ rolls smoothly. + +### T7.1 — Branch protection on `main` +Per CICD_AUDIT P0: enable 1-review requirement on api/bio/web `main`. Admin bypass allowed for emergency hotfixes. Disclosed in repo settings + CHANGELOG so collaborators understand the discipline shift. + +### T7.2 — Senior-reviewer doc archive +The growing collection of dated review docs (`SENIOR_DB_REVIEW_*`, `SENIOR_UIUX_REVIEW_*`, etc.) 
should move to `/opt/projects/fivucsas/docs/reviews/` rather than the project root. One commit, no behavior change. + +### T7.3 — RUNBOOK_AUDIT_LOG_PARTMAN.md operator decision Option A vs B +T1.G is the long-form Option A; today's V57 fail-soft is Option B. Document the decision criteria explicitly so the next operator knows what cadence to monitor. + +### T7.4 — `application-prod.yml` content-test (NEW T3.2.c) + add similar config-validation tests +After T3.2.c lands, generalize the pattern: a per-profile config sanity test that loads each YAML and asserts (a) no duplicate keys, (b) all `${ENV_VAR}` placeholders are reachable from a known list. Closes a class of boot-time failures. + +### T7.5 — Memory hygiene +The auto-memory `MEMORY.md` index has accumulated 13 session notes. Consolidate sessions older than 7 days into a single archive entry; keep recent sessions verbatim for resume context. Manual one-shot. + +### T7.6 — Secret-rotation calendar +Document a 90-day rotation cadence for: `JWT_SECRET` (kid-based, no logout), Twilio API token, SMTP password, biometric API key, refresh-token plaintext column drop (one-shot). Calendar lives in `infra/RUNBOOK_SECRET_ROTATION.md`. + +### T7.7 — `docs/architecture/` synthesis +The three senior reviews (Architecture, Principal, DB) plus today's CICD_AUDIT contain the full picture of the platform. A Phase-4 readiness doc should distill them into one architecture overview that a new hire can read in 30 minutes. + +### T7.8 — Status page + uptime tracking (Phase 4 gate) +Pick a vendor (Statuspage, BetterUptime, Instatus) and stand up a basic page showing api / bio / web / verify / demo / landing health. Operator-driven choice; product-side prerequisite for self-serve Phase 4. + +--- + +## Closed since 2026-05-02 — do not re-list + +- 9 backend PRs `#63–#71` — see `identity-core-api/CHANGELOG.md` 2026-05-04 entry; #72 docs sweep + #73 Copilot follow-ups also landed. 
+- 4 web-app PRs `#67–#70` — see `web-app/CHANGELOG.md` 2026-05-04 entry; #71 docs sweep + #72 UIUX P1 + #73 Copilot follow-ups also landed. +- 1 bio PR `#69` (docs). +- T-MERGE / T-FRONTEND-HYGIENE / T-SEC-TAIL / T-ARCH / T-LOGIN-EDGE / T-TEST-INFRA-F15 / T-DB-P0 / T-QUALITY / T-UIUX-P1 / T-CICD-AUDIT / T-COPILOT-DEEP / T-DOC-SWEEP — all dispatched and reported. +- DB-P0-1 (chain contiguity) — pre-empted by T-ARCH's V56 placeholder +- DB-P0-2 + DB-P0-3 — closed by PR api#70 +- F15 deterministic clock — closed by PR api#69 +- Q1-Q7 lint ratchet, Q1-Q7 EnrollmentPage decomposition — closed by PR web#68 + #69 +- DeviceController WebAuthn boundary — closed by PR api#66 +- `/oauth2/userinfo` type-claim — closed by PR api#67 +- pg_partman audit-log advanced — closed by PR api#68 (fail-soft) +- **P0-PROD refresh-token mint** — closed by PR api#71, **DEPLOYED 12:01 UTC** +- SENIOR_UIUX P0-1 + P1-1 + P1-2 + P1-4 — closed by PR web#72 +- T-DOC-SWEEP per-submodule docs — closed by api#72 + web#71 + bio#69 + parent `28f2b33` +- T-CICD-AUDIT — closed by parent `ac0b78d` (CICD_AUDIT_2026-05-04.md) + +--- + +## What I won't do without explicit user go-ahead + +- Touch the prod live `JWT_SECRET` on Hetzner (kid registry is the no-logout path; rotation per T5.5) +- Force-push `git filter-repo` on any submodule history (T1.B) +- Buy domain names or stand up new public services +- Rotate Twilio / SMTP credentials (T1.C) +- Touch `feat/anti-spoof-pipeline` or `liveness_capture` branches before T2.1 confirms direction +- Hard-delete user rows outside the `SoftDeletePurgeJob` 30-day path (memory `feedback_no_hard_delete_users.md`) +- Drop refresh-token plaintext column before T+7d soak (T5.6 / 2026-05-09) +- Schedule a Hetzner downtime window (T1.G Option A custom postgres image) +- Change branch-protection settings without confirmation (T7.1) + +--- + +## Snapshot questions for the user (decision queue) + +1. 
**T2.4 / T5.6** — green-light V56-real (drop `refresh_tokens.token` plaintext column) on or after **2026-05-09**? +2. **T1.B** — proceed with the GDPR fixture history rewrite, or accept the cosmetic-only carry-forward (leaked secret already rotated 2026-04-30)? +3. **T7.1** — enable branch protection on api/bio/web `main` with 1-review (admin bypass)? +4. **T1.G** — Option A (custom postgres image with pg_partman + pg_cron, ~30 min maintenance window) or stay on Option B (V57 fail-soft GUC, zero downtime)? +5. **T1.I** — diagnose Hetzner self-hosted runner stall (Task #55), or fully commit to ubuntu-latest fallbacks per T3.1? +6. **T5.5** — `/schedule` an agent for 2026-06-01 to flip the JWT soak vars? +7. **T7.5** — consolidate older session memories into an archive entry now, or wait for the auto-memory system to handle it? + +--- + +## Appendix — Today's PR ledger (all 16 squashed to `main`) + +### identity-core-api +| PR | Squash | Scope | +|---|---|---| +| #63 | `432b4d3` | ArchUnit `entity.User` import boundary | +| #64 | `2d958c5` | JWT kid-based key registry | +| #65 | `d224ad1` | Login edge cases #1/#3/#4/#5/#6/#9 | +| #66 | `e986609` | DeviceController WebAuthn service boundary + ArchUnit guard | +| #67 | `2b49bd5` | `/oauth2/userinfo` type-claim | +| #68 | `d95425c` | V57 pg_partman + V56 placeholder + Testcontainers IT | +| #69 | `70036a5` | F15 deterministic clock | +| #70 | `1e23ef0` | User `@SQLDelete` + `@SQLRestriction` | +| **#71** | **`a77c844`** | **P0-PROD refresh-token `Persistable` (DEPLOYED)** | +| #72 | `eaf8111` | docs sweep | +| #73 | `1c9e9be` | Copilot follow-ups (Locale.ROOT, OAuth2 invalid_token, SoftDeletePurgeJob hardDelete, WebAuthn enrollment-revoke) | + +### web-app +| PR | Squash | Scope | +|---|---|---| +| #67 | `319b457` | P3 hygiene (title, setTimeout, CSP, NotificationPanel pause-on-hidden) | +| #68 | `386b904` | Lint ratchet 90 → 2 | +| #69 | `35c116c` | EnrollmentPage decomposition by biometric method | +| #70 | `9bcf16a` 
| NfcStep timeout copilot nit | +| #71 | `120c35b` | docs sweep | +| #72 | `bfb31c7` | SENIOR_UIUX P1 batch (IntegratorLandingCard + 11 aria-labels + nav rename) | +| #73 | `e47d464` | Copilot follow-ups | + +### biometric-processor +| PR | Squash | Scope | +|---|---|---| +| #69 | `d91760a` | docs sweep (alembic in runtime image) | + +--- + +*Last updated: 2026-05-04 12:50 UTC — added §T4.11 (6 user-reported bugs from afternoon testing). USER-BUG-2 (guest invitation jsonb) closed in-session: api PR #74 merged + rebuilt. T-DOC-AUDIT, T-WEB-USERBUGS, T-FACE-SEARCH dispatched in parallel for the remaining 5.* diff --git a/SENIOR_DB_REVIEW_2026-05-04.md b/SENIOR_DB_REVIEW_2026-05-04.md new file mode 100644 index 0000000..2c8e8fa --- /dev/null +++ b/SENIOR_DB_REVIEW_2026-05-04.md @@ -0,0 +1,610 @@ +# Senior Database Engineer — Deep Review + +**Date:** 2026-05-04 +**Reviewer:** Senior Database Engineer (independent) +**Scope:** identity-core-api (PostgreSQL + Flyway V0–V57) and biometric-processor (PostgreSQL + pgvector + Alembic 0001–0005). Storage tier only — no application code changes proposed. +**Repos / SHAs read at HEAD:** +- parent fivucsas: `e0e87b5` (2026-05-04) +- identity-core-api submodule: `2d958c5` (2026-05-04) +- biometric-processor submodule: `22bd33c` (2026-05-04) + +**Prior reviews consulted, not summarised:** +- `/opt/projects/DB_REVIEW_2026-04-30.md` (the predecessor of this document) +- `/opt/projects/ARCHITECTURE_REVIEW_2026-04-30.md` (audit-log partitioning + multi-DB cohabitation context) +- `/opt/projects/PERF_REVIEW_2026-04-30.md` (DTO triplication + Hikari pool sizing) + +**Read-only constraint:** SSH from this sandbox to the Hetzner VPS is not available (`/home/deploy/.ssh/id_ed25519` is not authorised for `root@116.203.222.213`). Live state claims that depend on prod were therefore re-verified from `flyway_schema_history` evidence in `/opt/projects/DB_REVIEW_2026-04-30.md` plus the current submodule HEAD diff. 
Where the predecessor's claim was already stale, I flag it explicitly. Items requiring a fresh `psql` snapshot (table sizes, `pg_stat_user_indexes.idx_scan`, vacuum bloat ratios, `pg_stat_archiver`) are marked **[NEEDS PROD VERIFY]**. + +--- + +## Executive summary + +1. **`User` entity STILL has no `@SQLDelete`.** `ManageUserService.java:288` still issues `userRepository.delete(user)`. V53 added a DB-level `BEFORE DELETE` trigger which closes the worst case (raw `psql` typo, a runaway migration), but the application code path now produces a *5xx error to the admin user* rather than a soft-delete — the trigger raises `restrict_violation`. The contract is **half-implemented**. This is the same finding as 2026-04-30 §1; the trigger landed but the entity-level fix did not. +2. **RLS is still theatre on prod.** `application-prod.yml:18` still resolves `${DATABASE_USERNAME}` from the env (the `.env.prod` value is `postgres` per the predecessor review). Postgres exempts owners + superusers unless `ALTER TABLE … FORCE ROW LEVEL SECURITY`. V25 left `FORCE` commented out (lines 147–149) and no later migration enables it. `current_tenant_id() IS NULL` policies still **fail-open** on every async / scheduled / unguarded path. Multi-tenant isolation in production today rests entirely on the Hibernate `@Filter("tenantFilter")` + `TenantHibernateAspect`. Any raw native query, `@Async` task, or `SoftDeletePurgeJob` runs **with no tenant fence at all**. +3. **Embedding encryption shipped as code (Fernet), but the alembic 0005 schema must be applied operator-side before the next bio container boot.** `20260502_0005_embedding_ciphertext.py` adds `embedding_ciphertext BYTEA` + `key_version SMALLINT NOT NULL DEFAULT 1` to `face_embeddings` and `voice_enrollments`. PR #65 (`611a3cc`) wires the writer; PR #68 (`22bd33c`) adds `alembic upgrade head` to the runtime image. 
The combination only protects new enrollments — **existing rows must be backfilled manually** via `app.infrastructure.persistence.scripts.backfill_embedding_ciphertext` after operator sets `FIVUCSAS_EMBEDDING_KEY`. Without that backfill, the column lights up green but historical templates remain plaintext-resident in pgvector. Memory note `feedback_audit_delta_before_rebuild` applies — diff ..HEAD before rebuilding the bio container. +4. **The audit-log partition story is now a three-way drift between disk, history, and the V57 in-flight migration.** V40/V41 are recorded in `flyway_schema_history` with NULL checksum (BASELINE SKIP markers) — meaning no SQL ever ran. V53 (forbid hard delete) was renumbered around V51/V52 (ShedLock) and is consistent. V57 (`audit_logs_pg_partman.sql`) is on disk but still gated by `app.skip_partman_v57` and requires `pg_partman` to be installed at the OS level — neither has been done on prod. The deployed `audit_logs` is therefore still a plain heap; volume is small (~1k rows in the 04-30 snapshot) so the gap remains advisory. +5. **Schema drift between the two databases is widening, not narrowing.** `users.id UUID` ↔ `face_embeddings.user_id VARCHAR(255)` was already noted in 2026-04-30 §16. Alembic 0005 added `key_version SMALLINT` consistently in both biometric tables but did **not** enforce a CHECK that ciphertext is non-null when the row is canonical, which means a half-encrypted row is indistinguishable from a half-failed migration. Forward-looking: the moment we let any cross-DB FDW or analytical join touch the boundary, the type mismatch will force string casts everywhere. + +--- + +## 1. Schema design + +### 1.1 Multi-tenancy model + +`identity_core` is a **column-discriminated multi-tenant** schema; every operational table carries a `tenant_id UUID REFERENCES tenants(id)`. 
The current attack surface for tenant bleed-through has three layers, and only the application layer is functional today: + +| Layer | Mechanism | Status | +|------|-----------|--------| +| Application — Hibernate filter | `User.java:40-41` `@FilterDef tenantFilter` + `TenantHibernateAspect.enableFilter(tenantId)` | Working, but only on JPA reads; native queries + `@Async` paths bypass it | +| Application — `TenantContext` thread-local | `TenantContext.setCurrentTenant()` stamps `app.current_tenant_id` GUC in `TenantHibernateAspect.java:34-58` | Plain `ThreadLocal` — does not propagate to `@Async` / `@Scheduled` (see `AuditLoggingAspect` Copilot finding on PR #38, still unfixed) | +| Database — RLS policies (V25) | `users_tenant_isolation USING (tenant_id = current_tenant_id() OR current_tenant_id() IS NULL)` | Theatre; app connects as table owner so RLS is bypassed even when policies fire | + +**`tenant_id` column coverage** (Flyway V1–V55, my read): + +| Has `tenant_id` | Lacks `tenant_id` (intentional) | **Lacks `tenant_id` (bug)** | +|---|---|---| +| `users` (V2:14), `roles` (V3), `auth_flows` (V16:43), `auth_flow_steps`, `auth_sessions` (V16:60), `user_devices` (V17), `user_enrollments` (V16:189), `audit_logs` (V5:15), `refresh_tokens` (V5:92), `active_sessions` (V5:160), `security_events` (V5:240), `nfc_cards` (V22:7), `oauth2_clients` (V24), `tenant_email_domains` (V44, PK is `(tenant_id, email_domain)`), `mfa_sessions` (V16/V36), `tenant_auth_methods` | `permissions` (V3, global), `auth_methods` (V16, global registry), `shedlock` (V51, distributed lock), `flyway_schema_history` (Flyway internal), `rate_limit_buckets` (V9, may be tenant-keyed by `bucket_key` string) | **`webauthn_credentials` (V18)** — derives tenant via `users` join; should have its own `tenant_id` for RLS / partition pruning. **`password_history`** — same. **`api_keys` (V19)** — has `tenant_id` per V2's earlier review §2 but I could not re-verify in V19 file (file not opened). 
**`user_settings` (V11/V14)** — derives via user, no own column. | + +**Missing `tenant_id` on RLS-sensitive tables** is a long-standing finding from `DB_REVIEW_2026-04-30.md §2` and is unchanged. Of the tables in question, `webauthn_credentials` and `password_history` lack the column, while `mfa_sessions` (V36) and `nfc_cards` (V22:7) have it. The worry is the *un-FORCEd* RLS on the ones that have the column, plus the *missing column* on the ones that don't. + +`biometric_db` is **also column-discriminated** but with `tenant_id VARCHAR(255)` rather than `UUID` (Alembic 0001 line 65 + initial CREATE TABLE in `pgvector_voice_repository.py`). There is no FK to `identity_core.tenants(id)` — this is a hard PostgreSQL constraint (cross-database FK is unsupported), so the only enforcement is application-side. The bio service does not run any RLS at all. This is acceptable given the X-API-Key gate, but means a compromised biometric API key = unrestricted cross-tenant template read. + +### 1.2 Constraint hygiene + +Spot-checks against the migrations and entities: + +- **PK / NOT NULL** — sound on every operational table. `users.email` UNIQUE NOT NULL; `users.password_hash` NOT NULL; `audit_logs.action / resource_type / success` NOT NULL. +- **CHECK constraints** — present where it counts. `users_phone_e164` (V54), `chk_two_factor_secret_encrypted` (V42, `LIKE 'enc:v1:%'`), `valid_email` (V2:113), `chk_tenant_email_domains_lowercase` (V44:34), `enrollment_scores ∈ [0,1]` (V47:11). Notable absences: no CHECK on `face_embeddings.embedding_ciphertext IS NOT NULL` post-backfill (Alembic 0005 left both columns nullable — see §3.3); no CHECK that `key_version SMALLINT NOT NULL DEFAULT 1` is ≥ 1; no CHECK that `users.user_type` and `users.expires_at` are consistent (a `GUEST` should always have `expires_at`). +- **UNIQUE indexes** — `unique_tenant_email UNIQUE(tenant_id, email)` on V2:108 protects multi-tenant email collision.
`ux_tenant_email_domains_one_primary ON tenant_email_domains (tenant_id) WHERE is_primary = true` (V44:51) elegantly enforces "one primary domain per tenant" via a partial unique index — this is the right pattern, copy-worthy elsewhere. +- **FK ON DELETE behaviours** — heavily inconsistent. `webauthn_credentials.user_id ON DELETE CASCADE` (V18:4); `nfc_cards.user_id ON DELETE CASCADE` AND `nfc_cards.tenant_id ON DELETE CASCADE` (V22:6-7); `auth_sessions.tenant_id` has **no** ON DELETE clause (defaults to NO ACTION) per `DB_REVIEW_2026-04-30.md §10`. The 2026-04-28 ahabgu cascade incident lives on as a documented memory rule (`feedback_no_hard_delete_users.md`) — V53's BEFORE DELETE trigger is the partial mitigation; entity-level `@SQLDelete` is the missing other half. + +### 1.3 Soft-delete consistency + +`Tenant` is the gold standard: V49 column comment + `@SQLDelete` + `@SQLRestriction("deleted_at IS NULL")` on `Tenant.java:41-42`. Find queries skip tombstoned rows by default; the partial index `idx_tenants_deleted_at WHERE deleted_at IS NOT NULL` (V49:27) covers admin-restore lookups. + +`User` is the **broken case**: + +- Column exists since V2:105. +- `User.softDelete()` exists on `User.java:517-521`. +- **Entity has neither `@SQLDelete` nor `@SQLRestriction`.** +- `UserRepository.findByEmail` (line 30) DOES filter `deletedAt IS NULL` — good. +- `findByPasswordResetToken` (line 95), `findByEmailVerificationToken` (line 98) DO filter — good. +- **`findByStatus` (line 49), `findExpiredGuests` (line 70), `findByTenantIdAndUserType` (line 77), `countByTenantIdAndUserType` (line 85), `countByTenantId` (line 92), `searchUsers` (line 62), `searchUsersByTenant` (line 116), `findAllWithRoles` (line 106), `findAllByTenantIdWithRoles` (line 110)** all **lack the `deleted_at IS NULL` predicate**. A soft-deleted user is therefore visible in the admin user list, in tenant counts, in search results, and in `findExpiredGuests` cron output. 
This is a **data-leakage bug** for any GDPR hard-delete window (the user appears "alive" in admin views during the 30-day grace period). +- `findPurgeCandidates` (line 126) deliberately filters `deletedAt IS NOT NULL` — that one is correct. + +**Other tables with `deleted_at`:** `users.deleted_at`, `tenants.deleted_at`, `face_embeddings.deleted_at` (Alembic 0001:71 — this is on the dropped `biometric_data`, the surviving `face_embeddings` does NOT have it). + +`refresh_tokens.is_revoked + revokedAt` (V50, RefreshToken.java:88-93) is a **functional soft-delete in disguise**. The repository hot path `WHERE is_revoked = false` is partially indexed (`idx_refresh_tokens_user_expiry` per 2026-04-30 §11). This is fine. + +### 1.4 Naming consistency + +Column naming is overwhelmingly snake_case at the DB layer (`tenant_id`, `created_at`, `last_login_at`). Java entities use camelCase (`tenantId`, `createdAt`) and rely on Hibernate's default naming strategy. Two outliers worth calling out: + +- `audit_logs.user_agent_v2` (V8:14) — a versioned column name suggesting a one-off "redo" of `user_agent` that was never cleaned up. `AuditLog.java:99-100` exposes both, and `getEffectiveUserAgent()` (line 173) returns whichever is non-null. Tech debt: once no rows rely on the v1 column alone, we should pick one and `ALTER TABLE … DROP COLUMN` the other. +- `idNumber` (Java) → `id_number` (DB) — fine, but the column is stored as an 11-character VARCHAR with `UNIQUE` semantics implied by the value object, yet I see no UNIQUE index defined on it. **[NEEDS PROD VERIFY]**: `\d users` will tell us. + +Table names are plural everywhere (`users`, `tenants`, `audit_logs`) — consistent.
+ +### 1.5 Encrypted-at-rest columns + +| Column | Encryption | Where | +|---|---|---| +| `users.two_factor_secret` | AES-GCM-256 via `TotpSecretAttributeConverter`; envelope `enc:v1:`; CHECK constraint enforces format (V42) | identity_core | +| `refresh_tokens.token_secret_hash` | SHA-256 of secret-half of `.` token (V55) | identity_core; **NB: plaintext `token` column still kept for backwards-compat dual-read — must be dropped in V56+ once soak is complete** | +| `face_embeddings.embedding_ciphertext` | Fernet (AES-128-CBC + HMAC-SHA-256) via `FIVUCSAS_EMBEDDING_KEY` (Alembic 0005 + PR #65) | biometric_db; **NB: nullable; plaintext `embedding vector(512)` is the index surface and stays** | +| `voice_enrollments.embedding_ciphertext` | Same | biometric_db; same caveat | + +**PII columns that should arguably be encrypted but aren't:** + +- `users.id_number` (Turkish TC Kimlik / national ID — KVKK Art 6 special category data when carried as a public-key personal identifier) — stored plaintext. Strong candidate for AttributeConverter encryption similar to `TotpSecretAttributeConverter`. +- `users.phone_number` (E.164, V54) — generally not classified as sensitive but the combination with `id_number` is. +- `users.address` (`Address.java` value object) — same combinatorial concern. +- `verification_documents.*` (V26 verification pipeline) — MRZ fields, document images. **[NEEDS PROD VERIFY]** but I would expect bytea blobs here. +- `nfc_cards.card_serial` (V22) — quasi-PII; UNIQUE per tenant. + +These four are not on fire — they are KVKK-relevant but not directly biometric — but the embedding-encryption work establishes the pattern and the next step is to extend it to TC Kimlik. + +--- + +## 2. Indexes + +### 2.1 Foreign-key indexes + +PostgreSQL does not auto-index FKs. Missing FK indexes block efficient cascade deletes and can cause sequential scans on parent-side updates. 
Inventory of FK columns vs index coverage (from migration files; **prod `\d ` confirmation needed**): + +| FK column | Has index? | Source | +|---|---|---| +| `users.tenant_id` | yes | V2:118 (`WHERE deleted_at IS NULL` partial) | +| `users.invited_by` | **no** | V10 added the column; no index | +| `audit_logs.user_id` | yes | V5 path; `idx_audit_user` | +| `audit_logs.tenant_id` | yes | V5:290 | +| `refresh_tokens.user_id` | yes | V5 + V50:37 indexes `family_id` separately | +| `refresh_tokens.tenant_id` | yes | V5:299 | +| `webauthn_credentials.user_id` | yes via dup index from 2026-04-30 §11 (drop one) | V18 | +| `nfc_cards.user_id` | yes | V22:22 | +| `nfc_cards.tenant_id` | yes | V22:22 | +| `mfa_sessions.user_id` | yes | V16 | +| `password_history.user_id` | likely yes | not re-verified | +| `auth_sessions.user_id` | yes | V16 | +| `auth_sessions.tenant_id` | yes | V16 | +| `auth_flow_steps.auth_flow_id` | likely yes | not re-verified | +| `oauth2_clients.tenant_id` | yes (V37 added explicit safety-net) | V24 + V37 | +| `user_enrollments.tenant_id` | yes | V25 RLS pre-req | +| `user_enrollments.user_id` | yes | V16:189 path | +| `user_devices.user_id` | yes | V17 path | +| `user_devices.tenant_id` | yes | V17 path | +| `audit_logs.resource_id` | **no covering index** | only `idx_audit_resource (resource_type, resource_id)` (V5) | + +**Net:** the gap at `users.invited_by` matters when a tenant admin is deleted and Postgres has to scan `users` to enforce FK referential integrity. At 27 rows it's invisible; at 100k rows it's a 200ms unindexed scan in the soft-delete cron. 
Add one-line partial: `CREATE INDEX idx_users_invited_by ON users(invited_by) WHERE invited_by IS NOT NULL;` + +### 2.2 pgvector indexes on `face_embeddings`, `voice_enrollments` + +`face_embeddings.embedding vector(512)` carries TWO indexes per the predecessor review: +- `idx_embeddings_vector_ivfflat USING ivfflat (lists=100)` (Alembic 0003:57) +- `idx_face_embeddings_embedding_hnsw` (created later by raw `CREATE TABLE` from `postgres_embedding_repository.py:27-37`; not in any alembic migration — schema drift) + +Two index strategies on the same column waste write bandwidth on every insert. At ~19 rows in prod neither is doing useful work; the planner does brute-force scan. **Decision required:** pick one. Industry guidance: HNSW for `<10M`, IVFFlat for `>10M` with low recall tolerance. With the projected 50k–500k enrolment ceiling for FIVUCSAS, **HNSW with `m=16, ef_construction=64` is the right choice** — drop the IVFFlat in a follow-up Alembic revision once the read path uses HNSW. + +`voice_enrollments` has the matching mismatch — `idx_voice_enrollments_embedding_hnsw` (raw CREATE TABLE) vs `idx_voice_embeddings_ivfflat` on the orphan `identity_core.voice_enrollments` table from V33 (DB_REVIEW_2026-04-30 §7 — orphan table is dead and should be dropped). 
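
The HNSW consolidation recommended in §2.2 could be expressed roughly as the two statements generated below. This is a minimal sketch, not the project's migration: the index names and the `m` / `ef_construction` values come from this review, while the `vector_cosine_ops` operator class and the helper name `hnsw_consolidation_ddl` are assumptions chosen for illustration. The operator class must match whatever distance operator the read path actually uses.

```python
"""Sketch: DDL for keeping one HNSW index on face_embeddings.embedding."""

HNSW_INDEX = "idx_face_embeddings_embedding_hnsw"  # name quoted in this review
IVFFLAT_INDEX = "idx_embeddings_vector_ivfflat"    # name quoted in this review


def hnsw_consolidation_ddl(
    table: str = "face_embeddings",
    column: str = "embedding",
    opclass: str = "vector_cosine_ops",  # ASSUMPTION: cosine distance on the read path
    m: int = 16,
    ef_construction: int = 64,
) -> list[str]:
    """Return the statements in the order they should be applied.

    CONCURRENTLY avoids long write locks, but PostgreSQL refuses to run it
    inside a transaction block, so a migration tool has to execute these
    statements in autocommit mode.
    """
    create = (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {HNSW_INDEX} "
        f"ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )
    # Drop the IVFFlat index only after the HNSW index exists and is valid.
    drop = f"DROP INDEX CONCURRENTLY IF EXISTS {IVFFLAT_INDEX}"
    return [create, drop]


if __name__ == "__main__":
    for statement in hnsw_consolidation_ddl():
        print(statement + ";")
```

Creating the HNSW index first and dropping IVFFlat second keeps one vector index in place throughout the change. That ordering matters little at the current row count, but it is the safer habit as enrolment volume grows.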
+ +### 2.3 Partial indexes + +The schema makes **good use** of partial indexes: +- `idx_users_tenant_id ON users(tenant_id) WHERE deleted_at IS NULL` (V2:118) — soft-delete-aware +- `idx_users_email_verification_token … WHERE email_verification_token IS NOT NULL` (V2:122) — sparse +- `idx_audit_request_id … WHERE request_id IS NOT NULL` (V8:43) — sparse +- `idx_audit_duration_slow … WHERE duration_ms > 1000` (V8:52) — heat-only +- `idx_audit_enhanced_metadata_gin … WHERE enhanced_metadata IS NOT NULL AND enhanced_metadata != '{}'::jsonb` (V8:69) — sparse GIN +- `idx_tenants_deleted_at … WHERE deleted_at IS NOT NULL` (V49:27) — restore lookups +- `ux_tenant_email_domains_one_primary … WHERE is_primary = true` (V44:51) — one-of constraint +- `idx_mfa_session_expiry … WHERE completed_at IS NULL` (V16) — pending sessions only +- Alembic 0003: `idx_embeddings_tenant_user … WHERE is_active = true` (line 71); `idx_embeddings_tenant_active`, `idx_embeddings_quality`, `idx_embeddings_created_at` (lines 83/95/107) all on `WHERE is_active = true`. + +**Missing partial-index opportunities** (would shrink the index by 50–80%): +- `idx_users_phone_number(phone_number) WHERE phone_number IS NOT NULL AND deleted_at IS NULL` — already exists (V2:124). +- `idx_refresh_tokens_user_id WHERE is_revoked = false` — partial would beat the full per-FK index. **[NEEDS PROD VERIFY]**. +- `idx_webauthn_credentials_user_id WHERE revoked_at IS NULL` — assumes the table has revocation column. + +### 2.4 Unused indexes + +`pg_stat_user_indexes.idx_scan = 0` audit per 2026-04-30 §11 is the canonical baseline. Headline drops still recommended: +- `idx_api_keys_key_hash` — exact duplicate of the UNIQUE constraint +- `idx_webauthn_credentials_credential_id` — exact duplicate of the UNIQUE constraint +- `idx_audit_resource`, `idx_audit_failed_operations`, `idx_audit_request_timing` — 0 scans after 58+ days + +These survive into 2026-05-04 because no V53–V55 touched them. 
Bundle into a single V56 cleanup migration alongside the missing-index adds in §2.1. + +### 2.5 Audit-log index strategy + +The audit-log workload divides into three query shapes: +1. Admin tenant view: `WHERE tenant_id = ? ORDER BY created_at DESC LIMIT 50` +2. User audit timeline: `WHERE user_id = ? ORDER BY created_at DESC` +3. Distributed-trace dive: `WHERE request_id = ?` + +The **right** indexes for these are `idx_audit_tenant_created (tenant_id, created_at DESC)`, `idx_audit_user_action_created (user_id, action, created_at DESC)`, and `idx_audit_request_id`. The migration history confirms 1+3 land in V5/V8 and the user variant exists. The common admin variant (`?action_filter=NONE`) is also covered. + +What is **not** covered: a `WHERE success = false AND created_at > NOW() - INTERVAL '1 day'` index for security-event detection. The current `idx_audit_action(action)` does **not** include `success`, so failed-login spike detection issues a heap re-check. Add `CREATE INDEX idx_audit_failed_recent ON audit_logs(created_at DESC) WHERE success = false`. + +--- + +## 3.
Relations / referential integrity + +### 3.1 ER diagram (text) + +```mermaid +erDiagram + tenants ||--o{ users : "tenant_id" + tenants ||--o{ tenant_email_domains : "tenant_id" + tenants ||--o{ auth_flows : "tenant_id" + tenants ||--o{ auth_sessions : "tenant_id" + tenants ||--o{ user_devices : "tenant_id" + tenants ||--o{ user_enrollments : "tenant_id" + tenants ||--o{ oauth2_clients : "tenant_id" + tenants ||--o{ api_keys : "tenant_id" + tenants ||--o{ nfc_cards : "tenant_id" + tenants ||--o{ refresh_tokens : "tenant_id" + tenants ||--o{ active_sessions : "tenant_id" + tenants ||--o{ security_events : "tenant_id" + tenants ||--o{ audit_logs : "tenant_id (NULLABLE — sentinel)" + tenants ||--o{ tenant_auth_methods : "tenant_id" + tenants ||--o{ mfa_sessions : "tenant_id" + users ||--o{ user_roles : "user_id" + users ||--o{ refresh_tokens : "user_id" + users ||--o{ webauthn_credentials : "user_id" + users ||--o{ nfc_cards : "user_id" + users ||--o{ user_devices : "user_id" + users ||--o{ user_enrollments : "user_id" + users ||--o{ user_settings : "user_id" + users ||--o{ password_history : "user_id" + users ||--o{ mfa_sessions : "user_id" + users ||--o{ active_sessions : "user_id" + users ||--o{ audit_logs : "user_id (NULLABLE)" + users ||--o{ guest_invitations : "invited_by" + users }o--|| users : "invited_by (self-FK)" + roles ||--o{ user_roles : "role_id" + roles ||--o{ role_permissions : "role_id" + permissions ||--o{ role_permissions : "permission_id" + auth_methods ||--o{ tenant_auth_methods : "auth_method_id" + auth_methods ||--o{ auth_flow_steps : "auth_method_id" + auth_flows ||--o{ auth_flow_steps : "auth_flow_id" + auth_sessions ||--o{ auth_session_steps : "session_id" + %% Cross-DB boundary (no FK, only application contract; dotted = non-identifying): + users ||..o{ face_embeddings : "user_id (VARCHAR cast)" + users ||..o{ voice_enrollments : "user_id (VARCHAR cast)" + tenants ||..o{ face_embeddings : "tenant_id (VARCHAR cast)" +``` + +### 3.2 ON DELETE behaviour audit + +Inventory
based on migration grep. Behaviours are **CASCADE** (denoted ↘) or **NO ACTION** (default, denoted ⊥). I do not see any `SET NULL` in the migrations grep output but `DB_REVIEW_2026-04-30.md §10` lists 5 NO-ACTION FKs against `tenants`. Verified subset: + +| Child table | FK column | Target | Behaviour | +|---|---|---|---| +| `users` | `tenant_id` | `tenants(id)` | ↘ CASCADE | +| `users` | `invited_by` | `users(id)` | ⊥ NO ACTION | +| `webauthn_credentials` | `user_id` | `users(id)` | ↘ CASCADE | +| `nfc_cards` | `user_id` | `users(id)` | ↘ CASCADE | +| `nfc_cards` | `tenant_id` | `tenants(id)` | ↘ CASCADE | +| `auth_methods` | (no tenant FK; global) | — | — | +| `tenant_auth_methods` | `tenant_id` | `tenants(id)` | ↘ CASCADE | +| `tenant_auth_methods` | `auth_method_id` | `auth_methods(id)` | ↘ CASCADE | +| `auth_flows` | `tenant_id` | `tenants(id)` | ↘ CASCADE | +| `auth_flow_steps` | `auth_flow_id` | `auth_flows(id)` | ↘ CASCADE | +| `auth_sessions` | `session_id` (in steps table) | `auth_sessions(id)` | ↘ CASCADE | +| `auth_sessions` | `tenant_id` | `tenants(id)` | **⊥ NO ACTION** per predecessor review | +| `user_devices` | `tenant_id` | `tenants(id)` | **⊥ NO ACTION** per predecessor review | +| `user_enrollments` | `user_id` | `users(id)` | ↘ CASCADE | +| `user_enrollments` | `tenant_id` | `tenants(id)` | **⊥ NO ACTION** per predecessor review | + +The dual-CASCADE on `nfc_cards` (both user and tenant) means soft-deleting a tenant via `Tenant.softDelete()` is safe (no row removed → no cascade). But hard-deleting a tenant — even via the legitimate purge job that issues `SET LOCAL app.allow_hard_delete='on'` — would FAIL on the `auth_sessions.tenant_id NO ACTION` FK before any other table can react. The purge job hasn't tried this on prod yet, so the failure is latent. 
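
The inventory above comes from grepping migration files, so it is worth confirming against the live catalogue before any purge is attempted. A minimal sketch, assuming the parent table is named `tenants` and is visible on the current `search_path`; the query only reads `pg_constraint`, a standard PostgreSQL system catalogue:

```sql
-- List every foreign key that references tenants, with its ON DELETE action.
-- confdeltype codes: a = NO ACTION, r = RESTRICT, c = CASCADE,
--                    n = SET NULL,  d = SET DEFAULT.
SELECT c.conrelid::regclass::text AS child_table,
       c.conname                  AS constraint_name,
       CASE c.confdeltype
           WHEN 'a' THEN 'NO ACTION'
           WHEN 'r' THEN 'RESTRICT'
           WHEN 'c' THEN 'CASCADE'
           WHEN 'n' THEN 'SET NULL'
           WHEN 'd' THEN 'SET DEFAULT'
       END                        AS on_delete
FROM pg_constraint c
WHERE c.contype = 'f'                        -- foreign keys only
  AND c.confrelid = 'tenants'::regclass      -- assumption: table name as used in this review
ORDER BY on_delete, child_table;
```

Running the same query with `'users'::regclass` would cover the user-side FKs, including whichever delete action `audit_logs.user_id` actually carries.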
+ +**Recommendation:** the purge migration plan (the missing V56-or-later) should rebuild the five NO-ACTION tenant FKs with an explicit `ALTER TABLE … DROP CONSTRAINT …` followed by `ADD CONSTRAINT … FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE`, matching the pattern `users.tenant_id ON DELETE CASCADE`. PostgreSQL's `ALTER TABLE … ALTER CONSTRAINT` cannot change a foreign key's delete action, so each constraint has to be dropped and re-added. + +### 3.3 Orphan-row risk + +- `audit_logs.user_id` is an FK with **ON DELETE SET NULL** according to the predecessor's analysis of V5. The migration grep does not confirm this: V5's `tenant_id UUID REFERENCES tenants` carries no `ON DELETE` clause, which is NO ACTION rather than SET NULL. If SET NULL is in effect, orphan audit rows survive a hard-purge but with NULL `user_id`. Combined with the V46 backfill, orphan rows are also at risk of NULL tenant_id (12.4% of audit rows per `DB_REVIEW_2026-04-30.md §6`), making them invisible to *both* user-scoped and tenant-scoped admin queries. Audit rows without a `tenant_id` cannot be attributed to a customer in a KVKK audit response. +- Cross-DB orphans: `face_embeddings.user_id VARCHAR(255)` has no FK at all (Postgres can't FK across DBs). When a user is hard-purged from `identity_core`, their embeddings in `biometric_db` linger forever unless the application emits an explicit `DELETE`. **[NEEDS APPLICATION-SIDE VERIFY]:** does `SoftDeletePurgeJob.purgeBatch` call into `BiometricProcessorClient.deleteEnrollment(userId)` before issuing the SQL DELETE? If not, every hard-purge leaks 1+ biometric template per user. This is a **silent KVKK Art. 17 violation** if true. + +### 3.4 Cyclic dependencies + +`User → User (invited_by)` is a self-loop, fine. +`User → Tenant → AuthFlow → AuthFlowStep → AuthMethod` is a chain. +No cycles among the surviving entities post-V48 (`biometric_data` drop). + +--- + +## 4. Views + +### 4.1 Existing views + +- `v_recent_audit_logs` — created in V8:143, recreated in V40:227, in V57:215 (depending on path taken). Joins `audit_logs` to `users` for the `user_email` denormalisation. Non-materialised. Safe to keep — it's a presentation-layer convenience.
+- `v_slow_operations` — V8:171, V40:239, V57:225 — `WHERE duration_ms > 1000` aggregation. Useful for ops dashboards. +- `mv_audit_statistics` — **MATERIALIZED** view. V8:190, recreated V40:249, V57:233. No automatic refresh strategy in any migration; it must be `REFRESH MATERIALIZED VIEW mv_audit_statistics;` invoked from cron / `@Scheduled`. **[NEEDS PROD VERIFY]:** is there a cron invocation? If not, the view is stale since 2026-04-19 (when V40/41 were first stamped, even as BASELINE SKIP). +- `v_rate_limit_monitoring` — V9:313. Simple aggregation. + +### 4.2 Recommended new views + +A handful of admin queries currently DTO-triplicate (per `PERF_REVIEW_2026-04-30.md`) and could be views: + +- `v_user_signin_stats` — denormalised join of `users + audit_logs(action='USER_LOGIN', success=true)` aggregating count + last login per user. Replaces N+1 in `enrichWithLoginInfo` (`ManageUserService.java:296`). +- `v_tenant_health` — single row per tenant with `users_count`, `active_users_24h`, `mfa_enrolled_pct`, `webauthn_credentials_count`, `failed_logins_24h`. Today these are five separate queries from the dashboard. +- `v_enrollment_summary` — joins `user_enrollments + face_embeddings + voice_enrollments` (cross-DB) — feasible only via FDW or by surfacing biometric-side counts through the API. Probably not worth the FDW cost. + +Refresh-materialised candidates (heavy queries running ≥1/min): +- None at current load (1k audit_logs in 2 months). Premature. + +--- + +## 5. Stored procedures / functions / triggers + +### 5.1 Existing functions / triggers (from migration files) + +| Object | Source | Purpose | +|---|---|---| +| `update_updated_at_column()` (function) + `update_
_updated_at` (trigger) | V1, V2, V3, V4, V8, V9, V10 | Auto-`updated_at` on UPDATE; pattern repeated per table | +| `current_tenant_id() RETURNS UUID STABLE` | V25:23 | Reads `app.current_tenant_id` GUC for RLS; fail-safe returns NULL | +| `populate_audit_request_id() / trg_populate_audit_request_id` | V8:77 | Pulls `request_id` from JSON `metadata` if not set explicitly | +| `forbid_hard_delete()` (function) + `tg_users_forbid_hard_delete`, `tg_tenants_forbid_hard_delete` (triggers) | V53:36-64 | BEFORE DELETE guard; raises `restrict_violation` unless `app.allow_hard_delete='on'` | +| `ensure_audit_logs_partition(target_month date) RETURNS boolean` | V41:18 | Idempotent monthly-partition creator; would be invoked from cron if V40/V41 were live | +| `partman.create_parent / partman.run_maintenance_proc` | V57 (in flight) | Replaces V41 once `pg_partman` is installed | +| `mv_audit_statistics` REFRESH | (none) | No scheduled refresh exists | + +### 5.2 Audit logic — DB vs application + +Audit logs are written from application code, not DB triggers. `AuditLogAdapter.saveAuditLog` is the single writer. Trade-off: + +- **Pro app-side:** richer context (request ID, user agent, JWT claims) is available pre-commit; can be batched. +- **Con app-side:** if the app forgets to call it (the historical bug fixed by V46) data is lost forever; `@Async` thread-local leak (12.4% NULL `tenant_id`). +- **Pro DB-side trigger:** every write is captured atomically; no application-bug bypass. +- **Con DB-side:** can't see HTTP-layer context. + +A **hybrid** is the right move: keep the application-side AuditLogPort for richness, AND add lightweight `AFTER INSERT/UPDATE/DELETE` triggers on `users`, `tenants`, `roles`, `webauthn_credentials`, `nfc_cards`, `oauth2_clients`, `api_keys` writing a row to `audit_logs` with `action = 'DB_TRIGGER_*'`. The trigger row is the safety-net; the application row is the canonical one. Reconciliation later via the `request_id` field. 
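
The safety-net half of the hybrid could look roughly like the following. This is a sketch under stated assumptions, not the project's migration: the function and trigger names are hypothetical, and it assumes `audit_logs` accepts a row with only `action`, `tenant_id`, `success` and `created_at` populated (any other NOT NULL column without a default would need a value here).

```sql
-- Hypothetical generic safety-net writer; one function shared by all guarded tables.
CREATE OR REPLACE FUNCTION audit_safety_net() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_logs (action, tenant_id, success, created_at)
    VALUES (
        -- e.g. DB_TRIGGER_UPDATE_USERS; TG_OP and TG_TABLE_NAME are built-in trigger variables
        'DB_TRIGGER_' || TG_OP || '_' || upper(TG_TABLE_NAME),
        current_tenant_id(),   -- V25 helper; NULL when the GUC is unset
        true,
        now()
    );
    RETURN NULL;               -- the return value of an AFTER ROW trigger is ignored
END;
$$ LANGUAGE plpgsql;

-- One trigger per guarded table; shown for users only.
CREATE TRIGGER trg_users_audit_safety_net
    AFTER INSERT OR UPDATE OR DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION audit_safety_net();
```

Because the trigger cannot see HTTP-layer context, reconciling its rows with the application rows by `request_id` would additionally require the application to expose the request ID to the session (for example through a custom GUC), which this sketch leaves out.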
+ +### 5.3 Tenant context in functions / triggers + +`current_tenant_id()` (V25:23) is the only consumer of `app.current_tenant_id` GUC. The GUC is set by: +- `TenantHibernateAspect.java:34-58` — only on application threads with a non-null `TenantContext` +- Flyway / migration runs — superuser, GUC unset → NULL → `OR current_tenant_id() IS NULL` clause makes RLS fail-open + +The `forbid_hard_delete` trigger (V53) uses a **different** GUC `app.allow_hard_delete` and `current_setting(name, missing_ok := true)` — correct pattern. Recommend rewriting `current_tenant_id()` similarly: + +```sql +CREATE OR REPLACE FUNCTION current_tenant_id() RETURNS UUID AS $$ +DECLARE v TEXT; +BEGIN + v := current_setting('app.current_tenant_id', true); -- missing_ok + RETURN NULLIF(v, '')::UUID; +EXCEPTION WHEN others THEN RETURN NULL; +END; +$$ LANGUAGE plpgsql STABLE; +``` + +…and tighten the policy (see §8.1). + +--- + +## 6. Performance / sizing + +### 6.1 Table bloat / hot tables [NEEDS PROD VERIFY] + +Per `DB_REVIEW_2026-04-30.md §13`: +- `webauthn_credentials` has `n_dead_tup / n_live_tup = 8.66` (26 dead vs 3 live). Autovacuum has never run because absolute thresholds (`autovacuum_vacuum_threshold = 50`) win out over scale_factor at <100 rows. +- `user_roles` 2.33; `users` 1.55; `user_enrollments` 1.34. +- These are **micro-bloat issues** that would self-resolve at scale, but matter operationally because the FK-cascade incident on 2026-04-28 left dead tuples that are never reclaimed. + +Per-table autovacuum tuning was recommended in 2026-04-30 §13 and has not been applied. Re-recommended verbatim. + +### 6.2 Audit-log growth rate + +Predecessor measured 1082 rows in 58 days = **~19 rows/day**. Combined with the AuditLoggingAspect fix (now writing tenant_id), every authenticated request writes 1–3 rows. At a modest 10 RPS sustained, that's 864 k requests/day, so at least 864 k rows/day → **~ 26 M rows/month** at one row per request, and up to three times that at three rows per request. The current heap-table model breaks down around 50–100 M rows.
Partitioning is not advisory at projected scale — it's mandatory on a 12-month horizon. + +V57 (`audit_logs_pg_partman.sql`) handles this correctly **once pg_partman is installed at the OS level on the shared-postgres container**. Since the current image is `pgvector/pgvector:pg17`, `pg_partman` is NOT bundled — V57 has the operator-bypass `app.skip_partman_v57=on` for this exact reason. **Operator action item:** swap to a custom image bundling `postgresql-17-partman` (`apt install postgresql-17-partman` on the Debian base) before V57 can run. + +### 6.3 Hikari pool config + +Per `application-prod.yml:20-29`: +```yaml +hikari: + maximum-pool-size: 20 + minimum-idle: 5 + connection-timeout: 30000 + idle-timeout: 600000 + max-lifetime: 1800000 + connection-init-sql: ${DB_CONNECTION_INIT_SQL:SET statement_timeout = 30000} +``` + +`connection-init-sql` was added since the predecessor review — good. But: +- `leak-detection-threshold` **still missing**. A handler bug holding a connection >60s should produce a stacktrace warning. Add `leak-detection-threshold: 60000`. +- `idle_in_transaction_session_timeout` is NOT set in init-sql. A `BEGIN;` without `COMMIT;` (rare but seen during the FK-cascade post-mortem) blocks autovacuum. Either chain into init-sql (`SET statement_timeout = 30000; SET idle_in_transaction_session_timeout = 600000`) or set at PG level via the compose `command:`. + +PostgreSQL `max_connections=100` (from `infra/shared-db/docker-compose.yml` per predecessor §18) divided across 5 apps × 20 Hikari connections each = ceiling at app #5. Either raise PG to `200` (CX43 has the RAM) or introduce **pgbouncer** in transaction mode before app #6 lands. Architecturally, pgbouncer is the better answer because it also cushions Spring Boot warm-up burst (initial 20 connections per app × 5 apps = 100 simultaneously requested at Hetzner reboot). + +--- + +## 7. 
Backup / recovery + +### 7.1 PITR / WAL archiving status + +Predecessor §9 found `archive_mode = off` live despite compose claiming `-c archive_mode=on`. The recent parent commit `1ab95e9 infra(shared-db): land pgBackRest WAL archiving + PITR (P6.8) — deploy DEFERRED` confirms PR-style work has been done at `infra/shared-db/` to wire pgBackRest, but **deploy is explicitly deferred**. Re-verify: + +``` +[NEEDS PROD VERIFY] +SHOW archive_mode; +SHOW archive_command; +SELECT * FROM pg_stat_archiver; +``` + +Until `archived_count > 0`, **PITR is not actually working** and the "RUNBOOK_PITR.md" claim is aspirational. The 2026-04-30 DR drill was a `pg_dump` round-trip, not a WAL-replay restore — the latter has never been tested. + +### 7.2 Backup cadence + +`/opt/projects/backups/` shows daily dumps (per parent `git status` showing many backups deleted from 2026-03-21..2026-03-29 — these would be GPG-encrypted snapshots). The cadence works for RPO ~24h. PITR closes that to ~5 min if and only if archiving is actually on. + +Off-site: `mirror.log` and `offsite.log` exist in `/opt/projects/backups/` — assume hetzner storage box mirroring is wired. **[NEEDS LOG VERIFY]** the most recent successful run. + +### 7.3 DR drill cadence + +Last drill: 2026-04-30 04:54 UTC — 25 users / 19 tenants / 279 refresh_tokens restored OK. Quarterly cadence is industry-acceptable for a small-team SaaS, but for a regulated KVKK service the recommended cadence is **monthly with rotating scenarios** (full restore, point-in-time to T-1h, single-table restore). The runbook is in `/opt/projects/infra/RUNBOOK_DR.md`. + +--- + +## 8. Security + +### 8.1 Row-level security + +Status (verified from migrations + predecessor review; not re-verified live): + +- 9 tables have `ALTER TABLE … ENABLE ROW LEVEL SECURITY` (V25:9-17): `users, roles, user_roles, auth_flows, auth_flow_steps, auth_sessions, user_devices, user_enrollments, audit_logs`. 
+- **0 tables have `FORCE ROW LEVEL SECURITY`** (V25:147-149 commented out, no later migration enables it). +- 13+ tenant-keyed tables have **no RLS at all**: `mfa_sessions, nfc_cards, webauthn_credentials, refresh_tokens, security_events, tenant_auth_methods, tenant_email_domains, oauth2_clients, api_keys, active_sessions, password_history, voice_enrollments` (orphan), `liveness_attempts` (in biometric_db). + +The application connects as a Postgres role that owns the tables (`postgres` per predecessor + `application-prod.yml:18` env-variable indirection). Owners bypass RLS unless `FORCE` is set. **The RLS protection surface today is zero.** + +The fix is a four-step migration, in this order: +1. Create `app_identity` non-superuser role with table-level SELECT/INSERT/UPDATE/DELETE grants. +2. Switch `DATABASE_USERNAME` env to `app_identity`. Migrations continue to run as `postgres` on container startup. +3. `ALTER TABLE … FORCE ROW LEVEL SECURITY` on every RLS table. +4. Tighten the `current_tenant_id() IS NULL` fail-open clause to a deny clause + add a separate `… TO postgres USING (true)` admin-bypass policy. + +Step 1+2+3 is the same recommendation as 2026-04-30 §2 and is the **single highest-leverage change in the whole storage layer**. This is the canonical "why we passed compliance audit" story. + +### 8.2 Database role separation + +Today: one role (`postgres`, superuser) is used by: +- Flyway migrations (correct — needs DDL) +- Application runtime (incorrect — should be a non-superuser app role) +- pgBackRest / backup user (correct — needs `pg_read_all_settings` + `pg_read_all_data`) +- DBA shell access (`docker exec … psql -U postgres …`) — fine + +The recommended split: 5 roles (`postgres` for ops, `flyway_migrate`, `app_identity`, `app_biometric`, `backup_reader`). At current scale this is over-engineering; the minimum-viable split is `postgres` (DBA + Flyway) + `app_identity` (runtime). 
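
The minimum-viable split could be sketched as follows. Treat it as an illustration under assumptions rather than the migration to ship: it assumes the application tables live in the `public` schema of `identity_core`, that Flyway keeps creating objects as `postgres`, and the password shown is a placeholder.

```sql
-- Runtime role: may log in, is not a superuser, cannot bypass RLS, owns no tables.
CREATE ROLE app_identity LOGIN NOSUPERUSER NOBYPASSRLS PASSWORD 'change-me';

GRANT CONNECT ON DATABASE identity_core TO app_identity;
GRANT USAGE ON SCHEMA public TO app_identity;

-- DML only: no DDL, no TRUNCATE.
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_identity;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_identity;

-- Objects created by later migrations (run as postgres) inherit the same grants.
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_identity;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT USAGE, SELECT ON SEQUENCES TO app_identity;
```

Since `app_identity` does not own the tables, the existing RLS policies start applying to it as soon as the application connects with this role; `FORCE ROW LEVEL SECURITY` then additionally covers the owner.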
+ +### 8.3 Connection encryption + +[NEEDS PROD VERIFY] `SHOW ssl;` — predecessor review didn't capture this. Within the Docker network the connection is plaintext to `shared-postgres:5432`. That is acceptable on Hetzner because the bridge network is unreachable from outside. Public SSL is unnecessary as long as the port is never exposed (verified: `infra/shared-db/docker-compose.yml` should bind only the internal network; no `5432:5432` mapping in production compose). + +### 8.4 pgcrypto / pgsodium + +V0 enables `uuid-ossp`, `pgcrypto`, `pg_trgm`, `vector`. `pgsodium` is NOT installed. Application-side AES-GCM via `TotpSecretCipher` is the right primitive for transparent column encryption — no need for pgsodium until / unless we move to row-level encryption with key separation per tenant. The Fernet primitive used in biometric-processor (`cryptography` Python package) is functionally equivalent. + +--- + +## 9. Migration history hygiene + +### 9.1 V1 → V55 audit + +Pacing is healthy: V1–V15 in 3 weeks (Dec 2025), V16–V32 over 3 months (Jan–Feb 2026), V33–V55 over 6 weeks (Mar–May 2026). No mega-migrations. A few smell points: + +- **V15 seed data** — `seed_realistic_sample_data.sql` mixes DDL (none) with INSERTs that reference test users by hard-coded UUIDs (`11111111-1111-1111-1111-111111111111` etc). This is fine for dev but writes prod artefacts on first deploy. Recommendation: gate behind `pg_environment` (`SELECT current_database()` checks) or move to a separate `seed-dev/` folder consumed only by docker-compose dev profile. +- **V29 + V32 + V35 + V36** — incremental column additions to `mfa_sessions`. Healthy. +- **V40 / V41 / V42 / V43** — partition + maintenance + check + no-op. The NULL-checksum BASELINE SKIP markers from `DB_REVIEW_2026-04-30.md §5` are recorded but the SQL never ran. Need cleanup. +- **V43** is annotated `noop_reserved_v43_ships_as_V48` — **do not** keep no-op slot migrations in the chain. 
Either remove the file (after `flyway repair`) or replace with the actual migration that owns the slot. +- **V44 + V45 + V46** — multi-domain tenants + admin permissions baseline + audit-tenant backfill. Three small migrations, each idempotent. +- **V51 / V52 / V53 numbering collision** — V51 is ShedLock per the file content at `V51__shedlock.sql:13-17`, which explicitly states *"Numbering note: this is V51 (renumbered from V53). The feat/v51-forbid-hard-delete-p1-7 branch carries a separate 'V51 BEFORE DELETE trigger' migration that never reached main; when that branch merges it must renumber to V52 or later"*. The forbid-hard-delete migration ultimately landed as V53. **Current state is consistent**, but this kind of rename creates a forensic risk (a V51 PR landing today would silently overwrite ShedLock if Flyway weren't strict about checksums). +- **V55 plaintext-token retention** — `V55__refresh_token_hash.sql:6-7` keeps the plaintext `token` column for backwards-compat. Schedule the **V56 drop column** for 30 days post-soak. +- **No V56 yet on disk**, but **V57 IS on disk**. Flyway tolerates gaps in version numbers, so V57 will apply on its own; the hazard is ordering. Once V57 is recorded in `flyway_schema_history`, any V56 added afterwards (such as the V55 column drop above) sits below the highest applied version, and with `out-of-order=false` (`application-prod.yml:36-41` does NOT set `out-of-order: true`) Flyway fails validation and REFUSES to start. **The deploy that ships that V56 will brick prod.** Either: + - (a) set `spring.flyway.out-of-order=true` (acceptable trade-off if all migrations are idempotent), OR + - (b) rename V57 → V56 before merging the partman branch, OR + - (c) introduce an empty `V56__placeholder.sql` (smelly but Flyway-safe). + +### 9.2 `validate-on-migrate=true` + +`application-prod.yml:41` reads `validate-on-migrate: ${SPRING_FLYWAY_VALIDATE_ON_MIGRATE:true}` — flipped to **true** by default since the predecessor review. Good. **But** the BASELINE SKIP rows for V40/V41/V42/V43 still have `checksum IS NULL`, and `validate-on-migrate=true` will refuse to start as soon as Flyway re-encounters them.
**[NEEDS PROD VERIFY]** that the env var was overridden to `false` in `.env.prod` to keep prod alive — if so, the toggle is paper-only. + +The fix is `flyway repair` to recompute checksums for V40–V43 + delete the V40/V41 rows entirely (since the SQL never ran), then re-apply via V57 path. This is operator-only and 30 minutes of work. + +### 9.3 Alembic 0001 → 0005 + +5 revisions, monotone numbering, no rename collisions. **But** per predecessor §3 the live `biometric_db` had **no `alembic_version` table at all** as of 2026-04-30 — Alembic had never run against prod, schema came from raw `CREATE TABLE` statements in the repository code. PR #68 (`22bd33c chore(bio): add alembic to runtime image`) closes this by adding `alembic upgrade head` to the entrypoint. Operator must: + +1. **Stop** the bio container. +2. Manually create the `alembic_version` table with version `0001_initial` (since the schema matches that point). +3. Run `alembic upgrade head` *inside* the container — should advance through 0002 → 0005 cleanly. +4. Validate `alembic_version.version_num = '0005_embedding_ciphertext'`. +5. **Set `FIVUCSAS_EMBEDDING_KEY` env** before restart, otherwise PR #65 fails fast on boot (intentional). +6. Run `app.infrastructure.persistence.scripts.backfill_embedding_ciphertext` to populate `embedding_ciphertext` for existing rows. +7. Confirm `key_version=1` everywhere; absence of any plaintext-only rows. + +**This is operator-only because automation can't infer the right Alembic stamp without prod-state read access.** + +--- + +## 10. Findings + prioritised recommendations + +### P0 — urgent, security or correctness + +| # | Finding | Action | Effort | Type | +|---|---|---|---|---| +| P0-1 | RLS still bypassed (app connects as table owner; no FORCE; fail-open policies). 13 tenant-keyed tables have no RLS at all. 
| Create `app_identity` non-superuser role; switch `DATABASE_USERNAME`; `ALTER TABLE … FORCE ROW LEVEL SECURITY` on the 9 RLS tables; extend RLS to the 13 missing tables; tighten policy from "fail-open on NULL GUC" to deny + admin-bypass policy. Migration V58 (after V56/V57 land). | 12–16 h | operator + agent | +| P0-2 | `User` entity has no `@SQLDelete` / `@SQLRestriction`; `ManageUserService.deleteUser` calls `userRepository.delete(user)` (line 288), which V53 trigger now blocks with `restrict_violation`. End user gets 5xx. The hard-delete trigger is the *backup*, not the contract. | Add `@SQLDelete(sql="UPDATE users SET deleted_at=NOW(), status='INACTIVE', is_active=false WHERE id=?")` + `@SQLRestriction("deleted_at IS NULL")` on `User.java`. Replace line 288 with `user.softDelete(); userRepository.save(user);`. Add V58 column comment matching V49. | 2 h | agent | +| P0-3 | 9 `UserRepository` `findBy*` methods do NOT filter `deletedAt IS NULL` — soft-deleted users leak into admin lists, search, counts, expired-guest cron. | Add `AND u.deletedAt IS NULL` to lines 49, 62, 70, 75, 83, 92, 105, 109, 116. Or land #P0-2's `@SQLRestriction` and they get filtered for free. (Prefer the latter — single annotation fixes all.) | 1 h | agent | +| P0-4 | Embedding-encryption is half-deployed: PR #65 writes ciphertext for new rows; existing rows remain plaintext-only until operator runs `backfill_embedding_ciphertext` script. KVKK Decision 2018/10 is currently violated for any pre-2026-05-04 enrolment. | Operator: `FIVUCSAS_EMBEDDING_KEY` set + `alembic upgrade head` + run backfill script + verify zero plaintext-only rows. Then write 0006 promoting `embedding_ciphertext NOT NULL` + dropping plaintext `embedding` once read paths confirm migration. | 4 h ops + 6 h dev | operator + agent | +| P0-5 | V57 on disk without V56; once V57 has been applied, Flyway with `out-of-order=false` will REFUSE to start as soon as a V56 is added. The production deploy that ships that V56 bricks.
| Either renumber V57 → V56 before merge, OR add `V56__noop.sql`, OR set `spring.flyway.out-of-order=true`. Recommend (a) renumber. | 30 min | agent | + +### P1 — perf or major hygiene + +| # | Finding | Action | Effort | Type | +|---|---|---|---|---| +| P1-1 | `audit_logs` is a plain heap with the V40/V41 BASELINE SKIP shadow markers. V57 (pg_partman) is on disk but `pg_partman` is not installed at the OS level on `shared-postgres`. | Build a custom postgres-17 image bundling `postgresql-17-partman` + push to registry; rebuild shared-postgres with the new image; run `flyway repair` to scrub V40/V41 NULL checksums; apply V57 via partman path. Set `spring.flyway.out-of-order=true` for the rollout. | 6 h | operator | +| P1-2 | Cross-DB orphan biometrics on user hard-purge: `SoftDeletePurgeJob.purgeBatch` issues SQL DELETE but [NEEDS APP VERIFY] does not call `BiometricProcessorClient.deleteEnrollment(userId)` first. KVKK Art. 17 violation. | Add `biometricProcessorClient.deleteAllEnrollments(userId)` to `SoftDeletePurgeJob.purgeBatch` BEFORE the SQL DELETE. Idempotent (404 OK). | 2 h | agent | +| P1-3 | `audit_logs.tenant_id IS NULL` for ~12% of rows (`@Async` thread-local leak). V46 backfilled history; new rows still drift. | Wrap the `@Async` `TaskExecutor` in `DelegatingSecurityContextExecutor` + a custom `DelegatingTenantContextExecutor`. Add an integration test that asserts an async-emitted audit row carries the original thread's tenant_id. | 4 h | agent | +| P1-4 | `face_embeddings` carries TWO vector indexes (ivfflat from Alembic 0003, hnsw from raw repository CREATE TABLE). Write amplification on every enrolment. | Drop `idx_embeddings_vector_ivfflat` in Alembic 0006. Keep HNSW. Confirm via `\di face_embeddings` post-apply. | 1 h | agent | +| P1-5 | `voice_enrollments` orphan in `identity_core` (V33-created, 0 rows) confuses future migrations. | Write V58 `DROP TABLE voice_enrollments;` in `identity_core`. The biometric data lives in `biometric_db`. 
| 30 min | agent | +| P1-6 | Five tenant FKs are NO ACTION (predecessor §10): `auth_sessions, oauth2_clients, user_devices, user_enrollments, verification_sessions`. Hard-purge job (legitimately bypassing V53 trigger) will fail on these constraints. | V58 `ALTER TABLE … DROP CONSTRAINT …` + `ADD CONSTRAINT … ON DELETE CASCADE` for each (PostgreSQL cannot change the delete action via `ALTER CONSTRAINT`), mirroring `users.tenant_id ON DELETE CASCADE`. | 3 h | agent | +| P1-7 | Hikari `leak-detection-threshold` missing. A handler bug holding a connection >60s exhausts the pool silently. | Add `leak-detection-threshold: 60000` to `application-prod.yml:20-29`. | 30 min | agent | +| P1-8 | `pg_stat_statements` not loaded; no slow-query telemetry. | Add `-c shared_preload_libraries=pg_stat_statements -c pg_stat_statements.track=all -c log_min_duration_statement=1000 -c log_lock_waits=on -c idle_in_transaction_session_timeout=600000` to compose. Rolling restart needed. `CREATE EXTENSION pg_stat_statements;` per DB. | 2 h | operator | +| P1-9 | pgBackRest `archive_mode` not actually on per predecessor §9. PITR is paper-only. | Recreate shared-postgres so compose flags take effect; verify `pg_stat_archiver.archived_count > 0`. Then run a sandbox WAL-replay restore drill. | 1 h | operator | +| P1-10 | TC Kimlik (`users.id_number`) stored plaintext. KVKK Art 6 special category. | Add `IdNumberAttributeConverter` mirroring `TotpSecretAttributeConverter`. AES-GCM via existing key. Add CHECK constraint `id_number IS NULL OR id_number LIKE 'enc:v1:%'` after backfill. | 6 h | agent | + +### P2 — polish + +| # | Finding | Action | Effort | Type | +|---|---|---|---|---| +| P2-1 | Unused indexes per predecessor §11 (`idx_audit_resource`, `idx_audit_failed_operations`, `idx_audit_request_timing`, dup `idx_api_keys_key_hash`, dup `idx_webauthn_credentials_credential_id`). | Single migration V58a dropping them. | 1 h | agent | +| P2-2 | Missing FK index `idx_users_invited_by`. | One-line partial index.
| 15 min | agent | +| P2-3 | Missing failed-event index for security alerting. | `CREATE INDEX idx_audit_failed_recent ON audit_logs(created_at DESC) WHERE success = false;` | 15 min | agent | +| P2-4 | `audit_logs.user_agent_v2` shadow column; `getEffectiveUserAgent()` proves the tech debt. | After verifying zero rows have only `user_agent` set, drop `user_agent` and rename `user_agent_v2 → user_agent`. | 1 h | agent | +| P2-5 | `mv_audit_statistics` has no scheduled REFRESH. View is stale. | Add `@Scheduled(fixedDelay=1h)` or pgcron job. | 30 min | agent | +| P2-6 | Per-table autovacuum tuning for `webauthn_credentials, user_roles, user_enrollments, mfa_sessions` (predecessor §13). | One V58 `ALTER TABLE SET (autovacuum_vacuum_scale_factor=0.05, autovacuum_vacuum_threshold=10);` block. | 30 min | agent | +| P2-7 | `mfa_sessions` cleanup not scheduled (predecessor §14). | `@Scheduled(fixedDelay=1h) MfaSessionRepository.deleteExpiredAndIncomplete(now())`. | 30 min | agent | +| P2-8 | Type mismatch `users.id UUID` ↔ `face_embeddings.user_id VARCHAR(255)` (predecessor §16). | Alembic 0007: `ALTER TABLE face_embeddings ALTER COLUMN user_id TYPE UUID USING user_id::UUID;` + add CHECK constraint pre-cutover. | 2 h | agent | +| P2-9 | `flyway_schema_history` retains BASELINE SKIP rows for V40/V41/V42/V43 (NULL checksums). | `flyway repair` + delete V40/V41 rows after V57 migration owns partitioning. | 1 h | operator | +| P2-10 | `current_tenant_id()` function uses EXCEPTION-driven NULL fallback. | Rewrite to use `current_setting(name, true)` pattern (already used by V53 forbid_hard_delete). | 15 min | agent | +| P2-11 | `users.user_type` + `users.expires_at` consistency CHECK missing (`GUEST` should always have `expires_at`). | `ADD CONSTRAINT chk_user_guest_expiry CHECK (user_type <> 'GUEST' OR expires_at IS NOT NULL)`. | 30 min | agent | +| P2-12 | V15 seed migration writes prod data. 
| Move out of Flyway into a dev-profile-only seeder, OR gate with `WHERE NOT EXISTS (... fivucsas.local)`. | 1 h | agent | + +### P3 — defer + +- **HNSW is over-indexing at <50 rows.** Lower `m` from 16 to 8 once we know the real enrolment ceiling. +- **Single-container 5-DB layout** is fine until ~30 RPS sustained or a 6th app. Re-evaluate Q3 2026. +- **Audit-log volume** is low enough today that the partition story is forward-looking only — but with V57 in flight, finish it now rather than keep the option open. + +--- + +## What's working well (top 5) + +1. **`Tenant` soft-delete contract** is fully wired (V49 + `@SQLDelete + @SQLRestriction`). The pattern is the canonical answer for `User`. Copy verbatim. +2. **TOTP at-rest encryption** (V39 + V42 + `TotpSecretAttributeConverter` + DB CHECK constraint) is the textbook example of defence-in-depth: code can't bypass DB constraint, DB can't accept un-encrypted, V42 enforces it. +3. **Refresh-token rotation family + secret hashing** (V50 + V55 + RFC 6749 §10.4 reuse-detection) is industry-best-practice and was shipped quickly post-Sec-P2 #6. +4. **Idempotent migration discipline** — V44, V46, V49, V53, V54 all use `IF NOT EXISTS / DO $$ … END $$ / ON CONFLICT DO NOTHING` defensively. Re-runs are harmless. This is rare in commercial Java codebases. +5. **Partial-index hygiene** — `WHERE deleted_at IS NULL`, `WHERE is_revoked = false`, `WHERE is_primary = true`, `WHERE completed_at IS NULL` patterns are used throughout V2/V8/V44/V49. Index size today is 30–40% smaller than it would be with full indexes. + +--- + +## Action plan ordering + +### Day 0 (today, agent) +1. **P0-5**: rename V57 → V56 OR add `V56__noop.sql`. *Without this, the first deploy that adds a V56 after V57 has been applied bricks prod.* +2. **P0-2**: add `@SQLDelete + @SQLRestriction` to `User.java`; replace line 288 of `ManageUserService` with `user.softDelete(); userRepository.save(user);`. +3. **P0-3**: subsumed by P0-2 — `@SQLRestriction` filters all 9 leaky finders for free.
+4. **P0-4 part 1 (agent prep)**: write the operator runbook for the embedding-encryption operator step (alembic stamp + backfill + key set + verify). + +### Day 1 (operator) +5. **P0-4 part 2 (operator)**: stamp alembic to 0001_initial, run `alembic upgrade head`, set `FIVUCSAS_EMBEDDING_KEY`, run backfill script, verify zero plaintext-only rows. +6. **P1-9**: recreate shared-postgres so pgBackRest archive_mode flags take effect; verify `pg_stat_archiver`. +7. **P2-9**: `flyway repair` to scrub the BASELINE SKIP rows. + +### Week 1 (agent + operator) +8. **P0-1**: create `app_identity` role + switch DATABASE_USERNAME + FORCE RLS migration. *Single highest-leverage change.* Stage on dev first. +9. **P1-1**: build custom postgres-17 image bundling pg_partman + recreate shared-postgres + apply V57. +10. **P1-2**: wire bio-side delete into `SoftDeletePurgeJob.purgeBatch`. +11. **P1-3**: tenant-aware `@Async` executor. +12. **P1-7**: Hikari leak-detection-threshold. +13. **P1-8**: pg_stat_statements + slow-query logging. + +### Week 2 (agent) +14. **P1-4 / P1-5 / P1-6 / P1-10**: V58 migration covering vector-index dedup, orphan-table drop, tenant FK CASCADE fixes, TC Kimlik encryption. +15. **P2 batch (P2-1 through P2-12)**: bundled into V58a/V58b/V58c. + +--- + +## Appendix A — severity tally + +| Severity | Count | +|----------|-------| +| P0 | 5 | +| P1 | 10 | +| P2 | 12 | +| P3 | 3 | +| **Total**| **30**| + +## Appendix B — facts captured at HEAD (without prod psql) + +| Repo | HEAD | Last migration on disk | Last entity touched | +|---|---|---|---| +| identity-core-api | `2d958c5` | V57 (pg_partman, in-flight, gated) | `User.java` (P2.10 equality fix, no `@SQLDelete`) | +| biometric-processor | `22bd33c` | Alembic 0005 (`embedding_ciphertext`) | repository_factory.py (Fernet writer wired) | +| parent fivucsas | `e0e87b5` | n/a | N/A | + +## Appendix C — items that need a fresh `psql` snapshot + +These claims were carried forward from `DB_REVIEW_2026-04-30.md`. 
Re-confirm after operator runs the day-0/1 actions: + +1. `flyway_schema_history` rows for V40/V41/V42/V43 still NULL-checksum +2. `archive_mode = off` despite compose flags +3. `audit_logs.tenant_id IS NULL` count (was 134 / 1082 = 12.4%) +4. `pg_stat_user_indexes.idx_scan = 0` for the listed unused indexes +5. `webauthn_credentials.n_dead_tup / n_live_tup = 8.66` ratio +6. `alembic_version` table existence in `biometric_db` (was missing entirely) +7. 19 face / 35 voice / 2 fingerprint embedding row counts + +End of review. diff --git a/SENIOR_UIUX_REVIEW_2026-05-04.md b/SENIOR_UIUX_REVIEW_2026-05-04.md new file mode 100644 index 0000000..48cd9ce --- /dev/null +++ b/SENIOR_UIUX_REVIEW_2026-05-04.md @@ -0,0 +1,463 @@ +# Senior UI/UX Designer Review — FIVUCSAS Auth Platform + +**Date:** 2026-05-04 +**Reviewer:** Senior UI/UX Designer (product design lens) +**Audience:** Founder / sole operator +**Surfaces in scope:** `app.fivucsas.com` (admin dashboard), `verify.fivucsas.com` (hosted login + embeddable widget), `demo.fivucsas.com` (Marmara BYS showcase), `fivucsas.com` (landing — brand cohesion only) +**Verification basis:** Live HTML for all four surfaces (curl), `/opt/projects/fivucsas/web-app/src` source at HEAD `319b457`, recent 30-commit window confirmed before each finding so this report does not relitigate items already shipped (Profile date-i18n leak, tenant hardcode, biometric label, dead Redux, polling pause-on-hidden, PWA `navigateFallback`, FacePuzzle/HandPuzzle overlays, USER-BUG-1..10 — all verified merged). + +This is a design review. It does not duplicate the engineering, principal, backend, or DB lenses already in the repo. Where a finding overlaps an existing review item, I flag it and only re-state it if the user-experience angle is materially different. + +--- + +## Executive summary + +FIVUCSAS already feels like a real product. 
The design system is coherent — violet/iris primary, Inter+Poppins typographic pairing, calibrated 25-step shadow ramp, tasteful gradient brand mark, dark mode with `prefers-color-scheme` boot, `prefers-reduced-motion` honored, skip-to-content link, breadcrumbs everywhere, perfect 1700/1700 i18n key parity between `en.json` and `tr.json`. That is a level of polish most six-month-old SaaS prototypes never reach. + +The friction is concentrated in three places: + +1. **The verify.fivucsas.com developer entry experience is invisible.** A tenant developer landing on the bare URL sees `FIVUCSAS Verify` and a blank page (verify-app only mounts when an OAuth `client_id` query is present). There is no "Hello, integrator" page, no SDK snippet, no health/status indicator. Compare to Stripe's `js.stripe.com` 200-with-explanation page or Auth0's hosted-login self-document mode. +2. **The embeddable-widget developer journey is split across `app.fivucsas.com/developer-portal` and `app.fivucsas.com/widget-demo`** — both behind admin auth. A prospective tenant developer cannot evaluate the SDK without first being onboarded as an admin user. This is the single biggest *product*-level UX gap I found, and it is the same one the Principal Review called out at "no self-serve tenant signup" — viewed from the design side, it manifests as the docs being un-shareable. +3. **The admin dashboard is a polished tool but its information architecture is power-user-shaped.** A new admin opening it for the first time will see 18 sidebar entries grouped into 5 categories, three of which (Biometric Tools, Biometric Puzzles, Auth Methods Testing) are debug surfaces that should be feature-flagged or moved behind a "Developer" expander. + +The good news: items 1 and 2 are agent-actionable; item 3 is a 30-minute cleanup. Nothing in this review requires the design-team-and-mockups arc that a typical senior UX review produces. + +--- + +## 1. 
First impressions — the 6-second test + +### 1.1 `app.fivucsas.com` (cold load, no session) + +What loads — `/opt/projects/fivucsas/web-app/dist/index.html`: +- HTTP 200 in **342 ms** from a Hetzner-routed request, **8.1 KB** index. +- Title: *FIVUCSAS — Biometric Identity Verification Platform.* +- Meta description, OG, Twitter, JSON-LD Organization + WebPage, canonical, hreflang considerations all present. SEO hygiene is a 9/10. +- Five `` for the critical-path bundles + a clever `` for the MediaPipe face-landmarker WASM and `.task` model — keeps the face-capture screen from stalling 1–2 s on first paint. Comment block in the HTML explains *why*. This is exactly the level of "design + perf" thinking I want to see and it's invisible to the user, which is the point. +- CSP includes `script-src 'unsafe-eval' 'wasm-unsafe-eval'`. Necessary for ONNX runtime; correct trade-off, but worth noting that you will fail a strict-CSP scanner at e.g. Mozilla Observatory. Defensible in the security write-up; mention it on the security page. + +What the user sees post-bundle: +- `LoginPage.tsx` (1031 LOC). I did not load it visually but the source shows: full-bleed gradient hero, gradient brand-shield, tabbed Email-vs-Magic-Link entry, *Continue with passkey* CTA, language toggle (EN/TR with two-letter chip — nice touch), light/dark toggle, FIVUCSAS wordmark with the violet→iris gradient WebkitBackgroundClip text effect. +- The login page hardcodes `color: '#1a1a2e'` for the form-card background (RegisterPage.tsx:255, also LoginPage). The rest of the app uses `theme.palette.background.paper`. **Issue (P2):** when a user toggles dark mode on the login page, the inputs render dark-on-dark because the override beats the theme. Verify visually; if confirmed, swap to `(th) => th.palette.background.paper`. + +**6-second verdict:** Brand impression is "high-end fintech security tool, takes itself seriously." Fast, no layout shift, accessible language pivot. 
I'd let an enterprise prospect see this. + +### 1.2 `verify.fivucsas.com` (cold load, no query params) + +What loads: +- HTTP 200 in **111 ms**, **1.8 KB** index. Notably leaner than the admin bundle (separate `dist-verify/` build). +- Title: *FIVUCSAS Verify.* No description, no OG. Robots `noindex, nofollow` (correct — this is a transactional surface). +- Body: just `
`. The verify-app SPA reads OAuth params from the URL; with no params it surfaces `t('hosted.missingParams')` inside an error Alert. + +**Finding (P1, agent-actionable, S):** A naked GET to `https://verify.fivucsas.com/` is the URL a tenant developer types into their browser to see what they bought. Currently they get a red error alert that says "Missing required parameters." There is no "this is what this surface does," no link to docs, no test-mode button. Stripe, Auth0, Okta, and Keycloak all show a tasteful explainer page in this case ("This page is the FIVUCSAS hosted sign-in. Tenants integrate via the SDK at fivucsas.com/docs"). Two options: + + - **Option A (smaller):** add a `?demo=1` short-circuit in `HostedLoginApp.tsx` that renders a static explainer card with "Try the demo" → links to `verify.fivucsas.com/login?client_id=demo&redirect_uri=...&...`. + - **Option B (better):** when no query params are present, render an "Integrator landing" with three sections: *What this is* / *Try the demo* / *Read the docs*. Roughly 80 LOC, no design dependency. + +The page title `FIVUCSAS Verify` is also too terse — bookmark-unfriendly. Suggest *FIVUCSAS — Sign in with biometric MFA* (i18n it). + +The hosted-login surface itself (`HostedLoginApp.tsx`, lines 449-525) is genuinely well-designed — a "SECURED BY FIVUCSAS" pill chip, tenant-name interpolation in the headline (`t('hosted.signingInTo', { tenant: clientLabel })`), an iris→violet brand mark, ambient radial-gradient background that adapts to dark mode, and a `verify.fivucsas.com` microcopy footer that conveys the origin without screaming. **This is excellent work.** I would point any prospect to it as the best design surface in the platform. + +### 1.3 `demo.fivucsas.com` + +A faithful Turkish-language Marmara Üniversitesi Bilgi Yönetim Sistemi clone. Two strong design moves: +- It's **fully Turkish** end-to-end — students aren't context-switching to English to evaluate a Turkish product.
+- It explicitly tells the visitor what's happening: *"Bu sayfa, FIVUCSAS biyometrik kimlik doğrulamanın bir üniversite BYS'ye nasıl entegre edildiğini göstermektedir."* That micro-explainer is exactly the kind of demo-affordance most B2B platforms forget. + +**Finding (P3, S):** the e-Devlet button is currently disabled but visually present. To a Turkish user familiar with e-Devlet this looks broken, not coming-soon. Either remove it or wrap it with a "yakında" badge so it reads as roadmap, not bug. The third lens here (UX trust): a disabled-but-styled button on a security-product demo erodes the "this is real" feeling. + +### 1.4 `fivucsas.com` (landing — brand cohesion only) + +Loads in **67 ms**, **7.2 KB** index. Three Google Fonts loaded (Inter, Space Grotesk, JetBrains Mono — nice typographic system); SEO fully populated; SoftwareApplication JSON-LD declares Marmara University as `sourceOrganization`, which is a smart trust-signal play. Body class `noise` suggests a textural background overlay. + +**Cohesion check across the four surfaces:** +| Surface | Theme color (meta) | Primary brand color in CSS | Typography pairing | +|---|---|---|---| +| `fivucsas.com` (landing) | `#070713` | (couldn't introspect bundled CSS) | Inter + Space Grotesk + JetBrains Mono | +| `app.fivucsas.com` | `#6366f1` | `#6366f1` (violet) primary | Inter + Poppins | +| `verify.fivucsas.com` | (none) | `#6366f1` primary | Inter + Poppins | +| `demo.fivucsas.com` | n/a (mock university site) | n/a | n/a | +| `web-app/public/manifest.json` PWA theme_color | **`#1976d2`** (MUI default blue, not violet) | — | — | + +**Finding (P2, XS):** the admin PWA manifest declares a theme color `#1976d2` while the rest of the platform — meta `theme-color`, `theme.ts` `BRAND.violet`, the brand-shield gradient — is `#6366f1`. When a user installs the PWA on Android, the splash screen + tab strip will render in MUI default blue, breaking brand cohesion at the most installable moment. 
One-line edit in `public/manifest.json`. Same file: `theme_color: '#1976d2' → '#6366f1'`, `background_color: '#ffffff' → '#0f1220'` (or keep light, but match the dark-mode-default). + +**Finding (P2, S):** landing uses `Space Grotesk` for display copy; admin + verify use `Poppins`. Both are great fonts but they read differently — Space Grotesk has more geometric, slightly-condensed feeling; Poppins is rounder. A user clicking from the marketing site to the app gets a typographic micro-jolt. Pick one. My recommendation: Poppins everywhere (already loaded in admin, latin-ext supports Turkish). Drop Space Grotesk from landing. + +--- + +## 2. Information architecture + +### 2.1 Sidebar inventory (`Sidebar.tsx`) + +The sidebar groups 18 entries into 5 categories. Source of truth is `src/config/sidebarPermissions.ts`, with role filtering. Translations resolved via `nav.group.*` and `nav.*` keys. Active route gets a 3-px violet rail (`::before` pseudo-element). Admin-only items get an amber "Admin" chip. + +``` +Overview Dashboard +Access Users · Tenants · Roles · Guests +Security Auth flows · Auth sessions · Devices · Audit logs · Analytics +Biometrics Enrollments · Biometric tools · Biometric puzzles · Auth methods testing +Personal My profile · Settings +``` + +**Finding (P1, M):** *Biometric tools*, *Biometric puzzles*, and *Auth methods testing* are developer / debug surfaces. The Sidebar code makes them visible to every authenticated user (no admin gating that I can see in the icon map alone — confirm in `sidebarPermissions.ts`). For a tenant-admin from Marmara who logs in expecting to "manage faculty enrollments," seeing three sibling entries called *Puzzles* and *Testing* makes the product feel like a half-finished playground. Recommendations: + - Move all three behind a single `Developer Tools` collapse, default-collapsed. + - Gate the collapse on `user.isPlatformOwner()` (or a dedicated `developer-mode` toggle in Settings). 
+ - Either way, rename: *Auth methods testing* → *Method sandbox*; *Biometric puzzles* → *Liveness puzzles* (it's not the user's biometric they're puzzling). + +**Finding (P2, S):** the "Identity · Verified" tagline under the FIVUCSAS wordmark is hardcoded English (`Sidebar.tsx:193`). Same with *All systems operational* — it does pass through `t('sidebar.systemStatus', 'All systems operational')` so the EN string is the *fallback*, but i18n review note: confirm `sidebar.systemStatus` exists in both en.json and tr.json (key parity says yes; double-check the value reads correctly in TR — fallbacks are silent failures). + +**Finding (P3, XS):** the sidebar footer status indicator hardcodes a green dot and the text *All systems operational*. There is no actual wire to a status API. If any of the three Hetzner services degrades, the sidebar will reassuringly lie to the user. Either wire it to `status.fivucsas.com` (which is the URL it points at) or remove the live-dot affordance and make it a static "View status" link. + +### 2.2 Page-level layout pattern + +`DashboardLayout.tsx` does the right things: +- Skip-to-content link (visually hidden until focused, lands on `#main-content`). +- Breadcrumbs from `pathSegments` with UUID-skipping logic and an i18n map. Last crumb is `text.primary`, others `text.secondary`. **Solid.** +- Footer with platform / terms / privacy / version. Version uses JetBrains Mono. +- Ambient radial-gradient page background that adapts to dark mode. + +**Finding (P2, S):** there is no consistent `PageTitle` component. `TopBar.tsx` calls `getPageTitle()` (lines 46-66) which does a giant `if (path.startsWith(...))` chain. Add a one-line `` lookup that reuses `BREADCRUMB_I18N_MAP` so adding a new page = one map entry, not two. Ten new pages (`face-demo`, `voice-search`, etc.) silently fall through to `t('nav.dashboard')`. Verify by visiting `/face-demo` — top bar will say "Dashboard," not "Face Demo." 
A user looking at the URL bar and the top bar will see two different page identities. + +**Finding (P2, XS):** there is no global "page-level empty state" pattern. List pages render `t('common.noData')` inside a `` with no illustration, no CTA, no "create your first X" affordance. `EnrollmentsListPage`, `DevicesPage`, `AuthSessionsPage`, `GuestsPage` all share the same minimal-empty problem. Recommend creating `` as a `shared/components/` element and adopting it on the 6 list pages. **L** in aggregate but **XS per page** — agent-actionable as a single PR. + +### 2.3 Settings page mental model + +`SettingsPage.tsx` (603 LOC) splits into Profile / Security / TOTP+WebAuthn enrollment / Sessions / Language. The 10 auth methods are NOT all configurable here — they're configured per-tenant in `AuthFlowBuilderPage` (admin), and per-user enrollment lives in `MyProfilePage`. A user who wants to "turn on fingerprint for myself" has to: + 1. Go to Settings → Security → click *Add a passkey* → WebAuthn dialog. + 2. Or go to MyProfile → see enrolled methods → click into Enrollment. + +**Finding (P1, S):** the mental model is fragmented. *Settings* is for account-level toggles, *MyProfile* is for biometric enrollments. The split is correct in principle (settings = data; profile = identity) but the page titles lie about it. Suggest: **rename "MyProfile" to "My Identity & Biometrics"** (`nav.myIdentity` key), or merge the auth-method-enrollment subset *into* Settings under an "Authentication methods" section. Right now a non-technical tenant admin will not find their fingerprint enrollment without a guided tour. + +**Finding (P2, S):** the Settings page's removed-features comment (lines 58-65) reads: *"notification toggles … and appearance toggles (dark mode / compact view) were removed from this page — the backend had no storage wired for them."* You shipped the right thing — ghost UI is worse than missing UI — but the page now feels under-stuffed.
Add a one-line empty-section helper *"Notification preferences will land in v1.5"* so the absence reads as roadmap, not oversight. + +--- + +## 3. Accessibility (WCAG 2.1 AA target) + +### 3.1 What's already good + +- Skip-to-content link (`DashboardLayout.tsx:144-164`). +- `aria-current="page"` on the active sidebar item (`Sidebar.tsx:227`). +- `