Add manual secret rotation policy and DCR key migration script#181
Add manual secret rotation policy and DCR key migration script#181IlonaShishov wants to merge 5 commits into
Conversation
|
|
||
| #### Step 1: Request New Secret from CIAM Contacts | ||
|
|
||
| Unlike Red Hat SSO (which has a self-service process), GMA client secrets must be requested directly from CIAM team contacts. |
There was a problem hiding this comment.
did you check with them what happen with the clients generated by the previous GMA client? They will still work, right? Asking because that is what the customer Gemini Enterprise will have configured. And kind of separate but related question, do we need to rotate those? Or provide instructions for the customer to rotate those? Basically recreating the agent won't rotate the client ids as that would be stored in the dcr_clients database and will retrieve the existing ones instead of trying to get new ones
There was a problem hiding this comment.
I didnt ask this specifically, but it makes sense to me that clients created by an old GMA client should not stop working. They are standalone clients.
re rotating those, this might be a bit difficult, as a collective manual yearly rotation might include customers client who were just now added to the service. Im wondering if there is a expiration date to the order, which might prompt customers to renew the order, thus renewing client creds altogether.
As an additional layer of protection for those customer client creds, we do have the encryption key that we will be rotating yearly.
luis5tb
left a comment
There was a problem hiding this comment.
PR Review Summary
Files: docs/MANUAL_SECRET_ROTATION.md (+1267), scripts/rotate_dcr_encryption_key.py (+309)
Critical (2)
1. registration_access_token_encrypted not rotated [scripts/rotate_dcr_encryption_key.py: rotate()]
DCRClientModel (db/models.py:95) has a second Fernet-encrypted field registration_access_token_encrypted that the script ignores. Currently always NULL (service.py never populates it), but if populated in the future, rotation would silently leave it encrypted with the old key — creating a split-key state.
Fix (pick one):
- A (recommended): Add rotation for this field when non-NULL:
if client.registration_access_token_encrypted: client.registration_access_token_encrypted = self.reencrypt_secret( client.registration_access_token_encrypted )
- B: Add a pre-flight assertion that all values are NULL (fail-safe if the field gets populated before the script is updated)
- C: Document as known limitation with a code comment
2. Detached SQLAlchemy instances across sessions [scripts/rotate_dcr_encryption_key.py: fetch_dcr_clients() + rotate()]
fetch_dcr_clients() loads client objects in one session, then rotate() calls session.add(client) in a different session. While SQLAlchemy can re-associate detached objects, this pattern is fragile and non-standard for async ORM usage.
Fix: Re-fetch clients within the rotation transaction:
async with self.async_session() as session, session.begin():
result = await session.execute(select(DCRClientModel))
clients = result.scalars().all()
for client in clients:
client.client_secret_encrypted = self.reencrypt_secret(
client.client_secret_encrypted
)
# No session.add() needed — already trackedImportant (5)
3. CLI arguments expose keys in process listings [script:169-175, runbook:Steps 5-6]
--old-key, --new-key, and --database-url are visible via ps aux, system audit logs, and CI/CD logs. The script supports env vars but doesn't warn against CLI usage.
Fix: Add a security warning to the script's --help text and update the runbook to show env-var-only examples. Remove examples showing secrets as CLI args.
4. Plaintext secret memory zeroing is partially effective [script:107-117]
The bytearray zeroing only covers a copy — Fernet.decrypt() returns immutable bytes that persist in heap until GC'd. The current code provides false confidence.
Fix: Simplify to minimize plaintext lifetime (del + gc.collect), or document as best-effort. The practical risk depends on whether core dump forensics is in-scope for the Cloud Run threat model.
5. No automated tests [scripts/rotate_dcr_encryption_key.py]
309-line critical operations script with zero tests. The project has pytest + asyncio_mode="auto" + in-memory SQLite, making tests straightforward to add.
Fix: Add tests/test_rotate_dcr_encryption_key.py covering: key validation, dry-run isolation, production rotation, error cases (invalid keys, empty DB, connection failures).
6. Two TBD verification sections [runbook:307, runbook:829]
GMA_CLIENT_SECRET verification (Tier 1.2 Step 3) and DCR encryption key verification (Tier 3 Step 9) are both <TBD>. A runbook with missing verification is risky for production use.
Fix: Fill in concrete verification procedures, or mark these tiers as "draft" until complete.
7. Script import requires unstated prerequisites [script:line 15]
The script imports from lightspeed_agent.db.models import DCRClientModel but the runbook doesn't mention activating the venv or having the package installed.
Fix: Update runbook Step 5 to include source .venv/bin/activate before running the script (the doc does show this but the prerequisite relationship between venv activation and the import isn't explicit).
Suggestions (3)
8. history -c insufficient for shell history cleanup [runbook:~904]
Only clears in-memory history, not entries already written to ~/.bash_history.
Fix: Add history -w /dev/null or recommend HISTCONTROL=ignorespace (prefix sensitive commands with a space).
9. Runbook shows secrets in echo commands [runbook: Steps 2, 4]
echo -n "<secret>" | gcloud secrets versions add ... leaves secrets in shell history.
Fix: Recommend gcloud secrets versions add --data-file=- with interactive stdin.
10. Emergency contacts all TBD [runbook:1261-1271]
All contacts in Appendix B are placeholders. Not blocking merge, but should be populated before real use.
Strengths
- Tiered approach — Secrets organized by complexity with appropriate procedures per tier
- Transactional safety — All-or-nothing via
session.begin(), no partial encryption states possible - Dry-run mode — Excellent operational safety, tests full decrypt/re-encrypt without database writes
- Comprehensive rollback — Each tier has explicit rollback steps; Tier 3 has two options (reverse migration + DB restore)
- Zero secret logging — All logger calls use safe identifiers (client_id, counts), never keys or plaintext
- Pre-flight validation — Keys and DB connectivity checked before any mutation
- Emergency rotation appendix — Covers incident response, communication templates, post-incident review
- Documented exit codes — 0/1/2 mapping enables automation
Verdict: Request Changes
Must fix before merge (2 critical):
- Handle
registration_access_token_encrypted(at minimum a safety assertion) - Fix cross-session SQLAlchemy pattern (re-fetch within transaction or use
merge())
Strongly recommended (5 important):
3. Add security warning about CLI arg key exposure
4. Address or document memory zeroing limitations
5. Add automated tests
6. Fill in TBD verification sections
7. Clarify venv prerequisite in runbook
Items 3-7 could be tracked as follow-up issues if the team wants to unblock the merge on just the two critical fixes.
1338995 to
7da2dff
Compare
Documents rotation procedures for all four secrets (RED_HAT_SSO_CLIENT_SECRET, GMA_CLIENT_SECRET, DATABASE_URL, DCR_ENCRYPTION_KEY) across three tiers with step-by-step runbook. Includes scripts/rotate_dcr_encryption_key.py for transactional re-encryption of DCR client secrets with dry-run support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloud Run env-var secrets are resolved at instance startup, not auto-updated when Secret Manager versions change. Add explicit service restart steps to Tier 1 rotation procedures and remove the internal attestation section per PR review feedback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Handle the second Fernet-encrypted field in DCRClientModel so key rotation does not leave it encrypted with the old key when non-NULL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ects Remove fetch_dcr_clients() and move all DB operations into rotate(). Clients are fetched and mutated within the same session, eliminating the cross-session detached-object pattern. Dry-run rolls back on exit, production commits explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7da2dff to
273e1d5
Compare
Update rotation doc to use DCR_OLD_KEY/DCR_NEW_KEY/DATABASE_URL environment variables consistently instead of --old-key/--new-key/--database-url CLI arguments, which expose secrets in process listings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
273e1d5 to
c381102
Compare
Summary
docs/MANUAL_SECRET_ROTATION.md— runbook documenting manual rotation procedures for all four secrets (RED_HAT_SSO_CLIENT_SECRET,GMA_CLIENT_SECRET,DATABASE_URL,DCR_ENCRYPTION_KEY) across three tiers, including emergency rotation proceduresscripts/rotate_dcr_encryption_key.py— migration script for transactional re-encryption of DCR client secrets with dry-run support, pre-flight validation, and per-client progress loggingTest plan
ruff check scripts/rotate_dcr_encryption_key.py— passesmypy scripts/rotate_dcr_encryption_key.py --ignore-missing-imports— passes🤖 Generated with Claude Code