Skip to content

Add manual secret rotation policy and DCR key migration script#181

Open
IlonaShishov wants to merge 5 commits into
mainfrom
feature/manual-secret-rotation-policy
Open

Add manual secret rotation policy and DCR key migration script#181
IlonaShishov wants to merge 5 commits into
mainfrom
feature/manual-secret-rotation-policy

Conversation

@IlonaShishov
Copy link
Copy Markdown
Collaborator

Summary

  • Adds docs/MANUAL_SECRET_ROTATION.md — runbook documenting manual rotation procedures for all four secrets (RED_HAT_SSO_CLIENT_SECRET, GMA_CLIENT_SECRET, DATABASE_URL, DCR_ENCRYPTION_KEY) across three tiers, including emergency rotation procedures
  • Adds scripts/rotate_dcr_encryption_key.py — migration script for transactional re-encryption of DCR client secrets with dry-run support, pre-flight validation, and per-client progress logging

Test plan

  • Run ruff check scripts/rotate_dcr_encryption_key.py — passes
  • Run mypy scripts/rotate_dcr_encryption_key.py --ignore-missing-imports — passes
  • Test rotation script against local SQLite DB with dummy data (dry-run + production mode)
  • Verify rotated secrets decrypt with new key and old key is rejected
  • Review runbook procedures for accuracy against production environment

🤖 Generated with Claude Code

Comment thread docs/MANUAL_SECRET_ROTATION.md Outdated
Comment thread docs/MANUAL_SECRET_ROTATION.md

#### Step 1: Request New Secret from CIAM Contacts

Unlike Red Hat SSO (which has a self-service process), GMA client secrets must be requested directly from CIAM team contacts.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you check with them what happen with the clients generated by the previous GMA client? They will still work, right? Asking because that is what the customer Gemini Enterprise will have configured. And kind of separate but related question, do we need to rotate those? Or provide instructions for the customer to rotate those? Basically recreating the agent won't rotate the client ids as that would be stored in the dcr_clients database and will retrieve the existing ones instead of trying to get new ones

Copy link
Copy Markdown
Collaborator Author

@IlonaShishov IlonaShishov May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt ask this specifically, but it makes sense to me that clients created by an old GMA client should not stop working. They are standalone clients.
re rotating those, this might be a bit difficult, as a collective manual yearly rotation might include customers client who were just now added to the service. Im wondering if there is a expiration date to the order, which might prompt customers to renew the order, thus renewing client creds altogether.
As an additional layer of protection for those customer client creds, we do have the encryption key that we will be rotating yearly.

Comment thread docs/MANUAL_SECRET_ROTATION.md
Copy link
Copy Markdown
Collaborator

@luis5tb luis5tb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

Files: docs/MANUAL_SECRET_ROTATION.md (+1267), scripts/rotate_dcr_encryption_key.py (+309)


Critical (2)

1. registration_access_token_encrypted not rotated [scripts/rotate_dcr_encryption_key.py: rotate()]

DCRClientModel (db/models.py:95) has a second Fernet-encrypted field registration_access_token_encrypted that the script ignores. Currently always NULL (service.py never populates it), but if populated in the future, rotation would silently leave it encrypted with the old key — creating a split-key state.

Fix (pick one):

  • A (recommended): Add rotation for this field when non-NULL:
    if client.registration_access_token_encrypted:
        client.registration_access_token_encrypted = self.reencrypt_secret(
            client.registration_access_token_encrypted
        )
  • B: Add a pre-flight assertion that all values are NULL (fail-safe if the field gets populated before the script is updated)
  • C: Document as known limitation with a code comment

2. Detached SQLAlchemy instances across sessions [scripts/rotate_dcr_encryption_key.py: fetch_dcr_clients() + rotate()]

fetch_dcr_clients() loads client objects in one session, then rotate() calls session.add(client) in a different session. While SQLAlchemy can re-associate detached objects, this pattern is fragile and non-standard for async ORM usage.

Fix: Re-fetch clients within the rotation transaction:

async with self.async_session() as session, session.begin():
    result = await session.execute(select(DCRClientModel))
    clients = result.scalars().all()
    for client in clients:
        client.client_secret_encrypted = self.reencrypt_secret(
            client.client_secret_encrypted
        )
        # No session.add() needed — already tracked

Important (5)

3. CLI arguments expose keys in process listings [script:169-175, runbook:Steps 5-6]

--old-key, --new-key, and --database-url are visible via ps aux, system audit logs, and CI/CD logs. The script supports env vars but doesn't warn against CLI usage.

Fix: Add a security warning to the script's --help text and update the runbook to show env-var-only examples. Remove examples showing secrets as CLI args.

4. Plaintext secret memory zeroing is partially effective [script:107-117]

The bytearray zeroing only covers a copy — Fernet.decrypt() returns immutable bytes that persist in heap until GC'd. The current code provides false confidence.

Fix: Simplify to minimize plaintext lifetime (del + gc.collect), or document as best-effort. The practical risk depends on whether core dump forensics is in-scope for the Cloud Run threat model.

5. No automated tests [scripts/rotate_dcr_encryption_key.py]

309-line critical operations script with zero tests. The project has pytest + asyncio_mode="auto" + in-memory SQLite, making tests straightforward to add.

Fix: Add tests/test_rotate_dcr_encryption_key.py covering: key validation, dry-run isolation, production rotation, error cases (invalid keys, empty DB, connection failures).

6. Two TBD verification sections [runbook:307, runbook:829]

GMA_CLIENT_SECRET verification (Tier 1.2 Step 3) and DCR encryption key verification (Tier 3 Step 9) are both <TBD>. A runbook with missing verification is risky for production use.

Fix: Fill in concrete verification procedures, or mark these tiers as "draft" until complete.

7. Script import requires unstated prerequisites [script:line 15]

The script imports from lightspeed_agent.db.models import DCRClientModel but the runbook doesn't mention activating the venv or having the package installed.

Fix: Update runbook Step 5 to include source .venv/bin/activate before running the script (the doc does show this but the prerequisite relationship between venv activation and the import isn't explicit).


Suggestions (3)

8. history -c insufficient for shell history cleanup [runbook:~904]

Only clears in-memory history, not entries already written to ~/.bash_history.

Fix: Add history -w /dev/null or recommend HISTCONTROL=ignorespace (prefix sensitive commands with a space).

9. Runbook shows secrets in echo commands [runbook: Steps 2, 4]

echo -n "<secret>" | gcloud secrets versions add ... leaves secrets in shell history.

Fix: Recommend gcloud secrets versions add --data-file=- with interactive stdin.

10. Emergency contacts all TBD [runbook:1261-1271]

All contacts in Appendix B are placeholders. Not blocking merge, but should be populated before real use.


Strengths

  • Tiered approach — Secrets organized by complexity with appropriate procedures per tier
  • Transactional safety — All-or-nothing via session.begin(), no partial encryption states possible
  • Dry-run mode — Excellent operational safety, tests full decrypt/re-encrypt without database writes
  • Comprehensive rollback — Each tier has explicit rollback steps; Tier 3 has two options (reverse migration + DB restore)
  • Zero secret logging — All logger calls use safe identifiers (client_id, counts), never keys or plaintext
  • Pre-flight validation — Keys and DB connectivity checked before any mutation
  • Emergency rotation appendix — Covers incident response, communication templates, post-incident review
  • Documented exit codes — 0/1/2 mapping enables automation

Verdict: Request Changes

Must fix before merge (2 critical):

  1. Handle registration_access_token_encrypted (at minimum a safety assertion)
  2. Fix cross-session SQLAlchemy pattern (re-fetch within transaction or use merge())

Strongly recommended (5 important):
3. Add security warning about CLI arg key exposure
4. Address or document memory zeroing limitations
5. Add automated tests
6. Fill in TBD verification sections
7. Clarify venv prerequisite in runbook

Items 3-7 could be tracked as follow-up issues if the team wants to unblock the merge on just the two critical fixes.

@IlonaShishov IlonaShishov force-pushed the feature/manual-secret-rotation-policy branch 2 times, most recently from 1338995 to 7da2dff Compare May 11, 2026 14:23
IlonaShishov and others added 4 commits May 12, 2026 12:24
Documents rotation procedures for all four secrets (RED_HAT_SSO_CLIENT_SECRET,
GMA_CLIENT_SECRET, DATABASE_URL, DCR_ENCRYPTION_KEY) across three tiers with
step-by-step runbook. Includes scripts/rotate_dcr_encryption_key.py for
transactional re-encryption of DCR client secrets with dry-run support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloud Run env-var secrets are resolved at instance startup, not
auto-updated when Secret Manager versions change. Add explicit
service restart steps to Tier 1 rotation procedures and remove
the internal attestation section per PR review feedback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Handle the second Fernet-encrypted field in DCRClientModel so key
rotation does not leave it encrypted with the old key when non-NULL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ects

Remove fetch_dcr_clients() and move all DB operations into rotate().
Clients are fetched and mutated within the same session, eliminating the
cross-session detached-object pattern. Dry-run rolls back on exit,
production commits explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@IlonaShishov IlonaShishov force-pushed the feature/manual-secret-rotation-policy branch from 7da2dff to 273e1d5 Compare May 12, 2026 12:58
Update rotation doc to use DCR_OLD_KEY/DCR_NEW_KEY/DATABASE_URL environment
variables consistently instead of --old-key/--new-key/--database-url CLI
arguments, which expose secrets in process listings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@IlonaShishov IlonaShishov force-pushed the feature/manual-secret-rotation-policy branch from 273e1d5 to c381102 Compare May 12, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants