Skip to content

fix(auth): P0/P1 bundle from senior reviews 2026-05-12 (aud claim, MFA fail-open, OTP exhausted, hs-2026-04 revoked)#100

Merged
ahmetabdullahgultekin merged 4 commits into
mainfrom
fix/2026-05-12-auth-p0-bundle
May 27, 2026
Merged

fix(auth): P0/P1 bundle from senior reviews 2026-05-12 (aud claim, MFA fail-open, OTP exhausted, hs-2026-04 revoked)#100
ahmetabdullahgultekin merged 4 commits into
mainfrom
fix/2026-05-12-auth-p0-bundle

Conversation

@ahmetabdullahgultekin

Copy link
Copy Markdown
Contributor

Summary

Four P0/P1 fixes from /opt/projects/SECURITY_REVIEW_2026-05-12.md and /opt/projects/BACKEND_REVIEW_2026-05-12.md.

Bug 1 (P0) — JWT aud claim not shipped (4th time)

Memory has missed this fix three times (2026-04-20, 2026-05-02, 2026-05-12). On HEAD only iss was bound on the parser; aud was never stamped or required.

  • New app.security.jwt.audience property (single string for now; multi-audience parser supported via JJWT requireAudience).
  • JwtService stamps the audience claim on every minted access token. Refresh tokens are opaque UUIDs in DB (V55/V60), not JWTs — no audience needed there.
  • Parser requires the configured audience; empty config stays permissive (dev/test parity with expectedIssuer).
  • application-prod.yml pins aud=fivucsas-api. JwtService.assertProdAudienceIsConfigured fails fast at boot if the prod profile loads with a blank audience.
  • New JwtServiceAudienceTest (9 tests) pins mint, parse, prod boot assertion, and the empty-config permissive path so this fix cannot disappear a fourth time.

Bug 2 (P0) — FaceVerifyMfaStepHandler cosine fallback

PR #83 removed the 0.7 cosine confidence fallback in FaceAuthHandler and AuthController.verify2FAMethod but missed this N-step MFA handler — which is the one hosted login actually uses. INVESTIGATION_FAILOPEN_2026-05-07.md F3.

Mirrors PR #83's pattern: trust only the bio processor's verified field, hard-reject when missing, cache user.getId() once to stay inside the UserDomainBoundary ratchet.

Bug 3 (P0) — 5 MFA OTP call-sites swallowing exhausted

NIST 800-63B 5-strike lockout is correct at the primitive level (OtpService.validateWithResult) but all 5 MFA OTP call-sites used the boolean validate() overload and threw the flag away. Users had to wait 5min for TTL instead of getting an immediate "send another OTP."

Call-sites switched to validateWithResult:

  • EmailOtpAuthHandler
  • SmsOtpAuthHandler (local-Redis fallback only — Twilio Verify path enforces its own counter)
  • EmailOtpVerifyMfaStepHandler
  • AuthController.verify2FAMethod SMS_OTP arm
  • AuthController.verify2FAMethod EMAIL_OTP arm

OtpAttemptsExhaustedException now maps to HTTP 429 with {errorCode:"OTP_ATTEMPTS_EXHAUSTED", action:"resend", remainingAttempts:0} + Retry-After: 300 (RFC 6585 §4) via a new GlobalExceptionHandler arm. verify2FAMethod's catch (Exception) now splits on this exception so it re-throws rather than being collapsed into a 500.

Bug 4 (P1) — Revoke leaked HS512 kid hs-2026-04

Bundled into the JWT commit because both fixes touch the same JwtService + application-prod.yml paths:

  • app.security.jwt.allow-hs512: false pinned explicitly in application-prod.yml (was relying on the application.yml default).
  • New app.security.jwt.revoked-kids config (comma-separated). Any JWT carrying a kid in this set is rejected during validation regardless of allow-hs512 state. Defence-in-depth even if the rollback flag flips back on.
  • application-prod.yml revokes hs-2026-04 (the leaked f8ee668:.env.gcp kid).
  • JwtServiceAudienceTest.revokedKidRejectedEvenIfHs512Allowed proves it: HS512 token signed with hs-2026-04 is rejected even when allowHs512=true.

Commits

  • 9b58250 — fix(jwt): bind and validate aud claim (also revokes hs-2026-04 — same files)
  • 6e955c3 — fix(mfa): close cosine fallback in FaceVerifyMfaStepHandler
  • 9e4e772 — fix(otp): propagate exhausted state through 5 MFA call-sites

(Per spec the 4th commit would have been security(jwt): revoke hs-2026-04 kid, but the revoked-kid code logic and application-prod.yml lines were so tightly coupled with the aud commit that splitting them would have created a no-op file split. The behavior is fully tested under JwtServiceAudienceTest.revokedKidRejected*.)

Test plan

  • mvn -T 2C test1271 passed, 0 failed, 58 skipped (skipped are pre-existing Testcontainers-gated integration tests).
  • Grep permitAll in SecurityConfig.javano changes (no new anonymous routes). git diff SecurityConfig.java returns empty.
  • YAML lint pass on application.yml + application-prod.yml (Python yaml.safe_load).
  • ArchUnit UserDomainBoundaryTest — baseline refreshed for line shifts; net +1 violation in verify2FAMethod for the new catch (OtpAttemptsExhaustedException) arm (cached as String exhaustedUid so the arm contributes only one getId() call).
  • Symptom grep after the fix (memory feedback_verify_completion_claims):
    • grep "audience" src/main/java/com/fivucsas/identity/security/JwtService.java — present in mint + parse + boot assert.
    • grep "validateWithResult" src/main/java/com/fivucsas/identity/application/service/{handler,mfa/handler}/*.java src/main/java/com/fivucsas/identity/controller/AuthController.java — all 5 call-sites flipped.

Operator notes

  • Default for APP_SECURITY_JWT_AUDIENCE env var: empty in application.yml, fivucsas-api in application-prod.yml. No .env.prod change required — the prod profile literal is the boot-time default.
  • After deploy, all existing access tokens minted before the rebuild will fail validation because they have no aud claim. Frontends will see 401s and silently refresh via the refresh-token rotation flow (V55). No user-visible logout expected, but operator should monitor /refresh traffic spike for ~15 min post-deploy.

Cross-references

  • /opt/projects/SECURITY_REVIEW_2026-05-12.md — §aud-claim, §P1 revoked kid
  • /opt/projects/BACKEND_REVIEW_2026-05-12.md — §FaceVerifyMfaStepHandler, §OTP-exhausted
  • INVESTIGATION_FAILOPEN_2026-05-07.md F3 — the fail-open vector this commit closes

🤖 Generated with Claude Code

ahmetabdullahgultekin and others added 3 commits May 12, 2026 17:51
SECURITY_REVIEW / BACKEND_REVIEW 2026-05-12 §aud-claim. Memory has missed
the same fix three times (2026-04-20, 2026-05-02, 2026-05-12): only `iss`
was bound on the parser, `aud` was never stamped or required.

- New `app.security.jwt.audience` property (single string for now;
  multi-audience parser already supported via JJWT requireAudience).
- JwtService stamps the audience claim on every minted access token
  (refresh tokens are opaque UUIDs in DB, not JWTs).
- Parser requires the configured audience on extractAllClaims; empty
  config leaves it permissive (dev/test parity with the issuer config).
- application-prod.yml pins aud=fivucsas-api; JwtService.assertProdAudienceIsConfigured
  fails fast at boot if the prod profile loads with a blank audience.
- New JwtServiceAudienceTest pins the contract on both sides so this
  fix cannot disappear a fourth time. 9 tests cover mint, parse, prod
  boot assertion, and the empty-config permissive path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BACKEND_REVIEW 2026-05-12 §P0. PR #83 removed the 0.7 cosine confidence
fallback in FaceAuthHandler and AuthController.verify2FAMethod but missed
this N-step MFA handler — which is the one hosted login actually uses.
INVESTIGATION_FAILOPEN_2026-05-07.md F3 documents this as a fail-open
vector: a handler-side fallback silently overrides the bio processor's
adaptive aging threshold (VERIFICATION_THRESHOLD_AGED_*).

Mirrors the exact pattern PR #83 used in FaceAuthHandler:
- Trust ONLY the bio processor's `verified` field.
- Missing/null `verified` is a hard reject + ERROR log.
- Cache user-id once to keep the entity.User boundary surface
  at a single call-site per method (UserDomainBoundary ratchet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SECURITY_REVIEW / BACKEND_REVIEW 2026-05-12 §OTP-exhausted. NIST 800-63B
§5.1.1.2 5-strike lockout is correct at primitive level (OtpService.
validateWithResult returns exhausted=true on the 5th wrong guess) but all
5 MFA OTP call-sites used the boolean validate() overload and threw the
flag away — users waited 5min for TTL instead of getting an immediate
"send another OTP" prompt.

Call-sites switched to validateWithResult:
- EmailOtpAuthHandler (N-step flow)
- SmsOtpAuthHandler (N-step flow, local Redis fallback path)
- EmailOtpVerifyMfaStepHandler (N-step MFA, hosted-login path)
- AuthController.verify2FAMethod SMS_OTP arm
- AuthController.verify2FAMethod EMAIL_OTP arm

On exhausted=true each call-site throws OtpAttemptsExhaustedException.
GlobalExceptionHandler now maps the exception to HTTP 429 with
{errorCode:"OTP_ATTEMPTS_EXHAUSTED", action:"resend", remainingAttempts:0}
plus Retry-After: 300 (RFC 6585 §4). Mirrors the existing contract from
OtpController. verify2FAMethod's catch (Exception) had to be split so the
OtpAttemptsExhaustedException re-throws instead of being collapsed into a
500.

Test updates: existing handler tests stubbed otpService.validate(); switched
to validateWithResult with the structured result (valid/invalid/notFound).
ArchUnit baseline refreshed for shifted line numbers (no net new violations
except +1 in verify2FAMethod for the new catch arm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 17:53

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@ahmetabdullahgultekin

Copy link
Copy Markdown
Contributor Author

CI note (2026-05-12)

The failing Integration tests (Testcontainers) job is pre-existing baseline rot, not a regression from this PR. Same root cause as #99: Script V40__partition_audit_logs.sql failed — V40 requires pg_partman extension which the Testcontainers postgres image (like prod's pgvector/pgvector:pg17) does not have. See OPERATOR_ACTIONS_2026-05-12.md item 1 in Rollingcat-Software/FIVUCSAS#67.

Maven unit tests pass: 1271 passed, 0 failed, 58 Testcontainers-gated tests skipped. GitGuardian + scan pass.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@ahmetabdullahgultekin

Copy link
Copy Markdown
Contributor Author

FYI: #101 fixes the V40 audit_logs_pkey collision that's causing the Integration tests (Testcontainers) job to fail on this PR. Once #101 merges and you rebase, this PR's IT check should go green.

@ahmetabdullahgultekin

Copy link
Copy Markdown
Contributor Author

Update on #101: my fixture resolves the first V40 failure (the audit_logs_pkey collision), but CI then surfaces a SECOND V40 bug — || operator inside COMMENT ON TABLE on line 270, which is invalid PG grammar. See #101 description for details. Fully fixing the IT job for your PR needs operator approval to either (a) edit V40 in place + flyway:repair on prod, or (b) keep accepting continue-on-error. #101 lands progress and clear diagnostics either way.

VerifyMfaStepService caught the handler's OtpAttemptsExhaustedException in its
generic catch(RuntimeException) blocks and collapsed it into a 200 ERROR,
defeating the rate-limit contract on the exact path hosted login uses. Re-throw
the exception ahead of those catches so GlobalExceptionHandler maps it to
429 + Retry-After. Adds a regression test (now 16 passing).

Addresses the senior-review finding on PR #100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ahmetabdullahgultekin ahmetabdullahgultekin merged commit d2c2d6c into main May 27, 2026
3 of 4 checks passed
ahmetabdullahgultekin added a commit that referenced this pull request May 28, 2026
…ision) (#101)

* test(flyway): V39_5 fixture renames audit_logs_pkey to unblock V40 in CI

V40 (partition audit_logs) renames audit_logs to audit_logs_legacy, then
creates a new partitioned audit_logs root with a composite PK named
audit_logs_pkey. On a fresh Testcontainers DB this fails with:

  ERROR: relation "audit_logs_pkey" already exists
  Location: db/migration/V40__partition_audit_logs.sql

because PG keeps the V5-declared inline PK index/constraint name attached
to the renamed (legacy) table. The new composite-PK ADD collides on the
shared name space.

Prod does not hit this: V40 was BASELINE-SKIPPED in 2026-04 (see V57
header comment + DB review 2026-04-30 §4), so the migration is recorded
as success=t in flyway_schema_history but never ran. Prod's audit_logs
is still relkind='r' and V57 (idempotent partman-aware partitioning) is
the prod-side conversion path; it's fail-soft when partman is missing
(current pgvector/pgvector:pg17 image) so deploys are safe.

CI runs every migration end-to-end on a fresh DB, so it hits the V40 PG
name collision. The Integration tests (Testcontainers) CI job has been
red on every PR (including #99 V61 and #100 P0 bundle) since pre-existing
2026-05-04 baseline; it's continue-on-error so it does not block merges
but is permanent noise on every PR's checks page.

Fix: test-only fixture V39_5 loaded only by the integration profile via
spring.flyway.locations=classpath:db/migration,classpath:db/test-fixtures.
It renames audit_logs_pkey -> audit_logs_v5_pkey before V40 runs, freeing
the global name. Idempotent: only renames when audit_logs is still a
plain heap with the V5-default constraint name.

Matches the V28_5__align_default_login_flow_for_v29.sql precedent exactly
(test-only fixture pre-fixing state for a checksum-locked production
migration). Zero prod surface area: the file lives in
src/test/resources/db/test-fixtures/ which is never on the prod classpath.

Verified locally against pgvector/pgvector:pg17:
  - V5-style CREATE TABLE + V40-style RENAME+CREATE+ADD-CONSTRAINT
    reproduces ERROR exactly.
  - Same sequence with V39_5 inserted between V5 and V40 succeeds; both
    the legacy table (audit_logs_v5_pkey) and the new partitioned root
    (audit_logs_pkey) coexist with distinct constraint names.
  - Idempotency confirmed: re-running V39_5 against a post-V40 state
    falls through the IF guards and is a no-op.

Why not edit V40 in place: three previous in-place V40 fixes (commits
2b00a7b, a30287b, 81101f0) were each reverted (43d0050, 614a94f,
f404d12) because V40's checksum is locked in prod flyway_schema_history.

Why not a new V62 production migration: V57 already provides the
prod-side path (idempotent heap->partitioned conversion + partman setup,
fail-soft when partman absent). A V62 would duplicate V57. The
CI-side problem is purely the V40 in-place hard failure, addressed
surgically by this test fixture.

Unblocks the Integration tests check on PR #99 (V61 audit_logs.tenant_id
NOT NULL) and #100 (P0/P1 senior-review bundle).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(gitleaks): bump application-integration.yml line refs after V39_5 yaml edit

The previous commit added 6 lines of comment to the spring.flyway section
of application-integration.yml, shifting the test-only jwt.secret /
totp.enc-key lines from 55,62 to 61,68. Update the .gitleaksignore
fingerprints to match so the scan job stays green. Secrets unchanged
(both are pre-existing test-only fixtures, allowlisted since the file
was added).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants