feat(provision): post-provision runtime IAM policy + setup instructions by bnsoni · Pull Request #59 · iblai/infra-cli

bnsoni · 2026-05-20T08:18:29Z

⚠ Stacked on top of #58 (feat/tenant-launcher-and-defaults). Merge #58 first or rebase this after.

Summary

Today the AWS keys baked into /ibl/config.yml on the platform server have to span two accounts: IBL's ECR (image pulls) and the operator's S3 buckets. Operators either over-share (admin keys in .env.setup) or wait for IBL to mint a separate ECR-only user that still doesn't grant S3 — neither is a clean fit.

This PR closes that gap by printing the exact minimum-privilege IAM policy at the end of provision / provision-env, with three copy-paste aws iam commands the operator runs in their own account before setup-env.

What lands

New — src/iblai_infra/runtime_iam.py:

build_runtime_iam_policy(bucket_names) — returns the JSON document. Bucket ARNs derived from actual s3_bucket_{backups,media,static} terraform outputs (no hardcoded names). ECR scope targets IBL's registry via two centralized module constants (IBLAI_ECR_ACCOUNT_ID, IBLAI_ECR_REGION).
extract_bucket_names(outputs) — pulls the three bucket-name outputs out of a terraform outputs dict.
render_runtime_access_instructions(config, outputs, ws) — saves the policy to <workspace>/runtime-iam-policy.json, prints it verbatim under a rule, then the three aws iam commands and the .env.setup paste instructions. Skipped for DeploymentType.CALL.
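The output-extraction step can be sketched as follows. This is a minimal illustration, not the code in `runtime_iam.py`: only the three output key names come from this PR, while the return type, the tolerance for both plain values and `{"value": ...}` wrappers (the shape `terraform output -json` emits), and the skip-on-missing behavior are assumptions.

```python
from collections.abc import Mapping
from typing import Any

# Output keys named in the PR description.
BUCKET_OUTPUT_KEYS = ("s3_bucket_backups", "s3_bucket_media", "s3_bucket_static")


def extract_bucket_names(outputs: Mapping[str, Any]) -> list[str]:
    """Return the bucket names found in a Terraform outputs mapping.

    Assumption: an output may be a plain string or a {"value": ...} wrapper.
    Missing or blank outputs are skipped, so partial outputs give a shorter list.
    """
    names: list[str] = []
    for key in BUCKET_OUTPUT_KEYS:
        raw = outputs.get(key)
        if isinstance(raw, Mapping):
            raw = raw.get("value")
        if isinstance(raw, str) and raw.strip():
            names.append(raw.strip())
    return names
```

Skipping absent keys rather than raising is one way to satisfy the "partial / empty terraform outputs" cases listed under Tests; the real module may handle them differently.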

Integration — app.show_results() calls the renderer after every provision (both interactive provision and provision-env).

Policy scope (no wildcard actions, no policy-mutation rights):

| Service | Resource | Verbs |
|---|---|---|
| S3 | The three literal bucket ARNs Terraform created | `GetObject` `PutObject` `DeleteObject` `GetObjectAcl` `PutObjectAcl` `ListBucket` `GetBucketLocation` |
| ECR auth | `*` (AWS requires this Resource shape for `GetAuthorizationToken`) | `GetAuthorizationToken` |
| ECR pull | `arn:aws:ecr:<region>:<iblai-account>:repository/*` | `BatchCheckLayerAvailability` `BatchGetImage` `GetDownloadUrlForLayer` |
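A policy document with the scope listed above might be assembled roughly like this. It is a sketch under stated assumptions: the PR keeps the ECR account and region in module constants whose values are not shown here, so this version takes them as parameters and uses placeholder values. The `Sid` names are borrowed from a later commit message in this PR. Object-level verbs are attached to `<bucket-arn>/*` because that is how S3 addresses the objects inside a bucket; whether the real module splits the statements this way is not confirmed by the description.

```python
import json

S3_OBJECT_ACTIONS = [
    "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
    "s3:GetObjectAcl", "s3:PutObjectAcl",
]
S3_BUCKET_ACTIONS = ["s3:ListBucket", "s3:GetBucketLocation"]
ECR_PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_runtime_iam_policy(bucket_names, ecr_account_id, ecr_region):
    """Build the policy dict; account id and region are hypothetical parameters."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    statements = []
    if bucket_arns:  # assumption: emit no S3 statements when there are no buckets
        statements.append({
            "Sid": "PlatformBucketObjects", "Effect": "Allow",
            "Action": S3_OBJECT_ACTIONS,
            "Resource": [f"{arn}/*" for arn in bucket_arns],
        })
        statements.append({
            "Sid": "PlatformBucketList", "Effect": "Allow",
            "Action": S3_BUCKET_ACTIONS,
            "Resource": bucket_arns,
        })
    statements.append({
        "Sid": "ECRAuth", "Effect": "Allow",
        "Action": ["ecr:GetAuthorizationToken"],
        "Resource": "*",
    })
    statements.append({
        "Sid": "ECRPullPlatformImages", "Effect": "Allow",
        "Action": ECR_PULL_ACTIONS,
        "Resource": f"arn:aws:ecr:{ecr_region}:{ecr_account_id}:repository/*",
    })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Placeholder account id and region, for illustration only.
    demo = build_runtime_iam_policy(["test-backups"], "123456789012", "us-east-1")
    print(json.dumps(demo, indent=2))
```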

Docs:

README.md — new sub-section under "Provision infrastructure" with the rendered commands + scope table. Section 4 (non-interactive .env flow) renumbered to a clean 3-step sequence (provision → mint runtime user → setup-env) so the IAM step isn't missed.
.env.setup.example — AWS_ACCESS_KEY_ID comment block now directs the operator to the runtime user, not their admin keys.
CHANGELOG.md — ## [1.11.0] entry.

Tests — 11 new in tests/test_runtime_iam.py: policy shape, ARN generation, tight-verb invariants (s3:* and bucket-policy mutations explicitly absent), ECR account targeting, call-server skip, partial / empty terraform outputs, JSON round-trip. Full suite: 576 passing in ~1.3 s.

Codebase-cleanliness checks

No hardcoded bucket names — runtime_iam.py reads only terraform output keys (s3_bucket_backups, etc.), then derives ARNs from whatever values the outputs supplied.
No client / customer references — grep -riE "kaplan|syracuse|ibleducation|iblai\.nonprod" returns zero hits across the new files.
No secrets in tests — fixtures use generic names like test-backups, p-staging-dm-media.
IBL ECR account / region centralized as module constants in runtime_iam.py rather than scattered across the diff.

Operator UX after this lands

# 1. Provision (Terraform)
iblai infra provision-env -f .env.provision

# 2. (Printed verbatim by step 1) — paste into shell
aws iam create-user --user-name <project>-<env>-runtime
aws iam put-user-policy \
    --user-name <project>-<env>-runtime \
    --policy-name iblai-runtime \
    --policy-document file://<workspace>/runtime-iam-policy.json
aws iam create-access-key --user-name <project>-<env>-runtime

# 3. Paste AccessKeyId + SecretAccessKey into .env.setup, then:
iblai infra setup-env <project> -f .env.setup

No IBL-side credential handoff. No admin keys on the server. No clicking through the IAM console.
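For illustration, the three commands in step 2 could be produced by a helper along these lines. The function name and signature are hypothetical; only the `aws iam` subcommands, their flags, the `<project>-<env>-runtime` user name, and the `iblai-runtime` policy name come from the PR description.

```python
import shlex


def render_runtime_user_commands(project, env, policy_path, policy_name="iblai-runtime"):
    """Return the three `aws iam` command lines as shell-safe strings."""
    user = shlex.quote(f"{project}-{env}-runtime")
    document = shlex.quote(f"file://{policy_path}")
    return [
        f"aws iam create-user --user-name {user}",
        (
            f"aws iam put-user-policy --user-name {user} "
            f"--policy-name {shlex.quote(policy_name)} "
            f"--policy-document {document}"
        ),
        f"aws iam create-access-key --user-name {user}",
    ]


if __name__ == "__main__":
    for line in render_runtime_user_commands("acme", "staging", "/tmp/ws/runtime-iam-policy.json"):
        print(line)
```

Quoting each argument with `shlex.quote` keeps the printed lines safe to paste even if a project name or workspace path contains spaces.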

Test plan

uv run pytest tests/ -q — 576 passing
Local iblai infra provision-env -f .env.provision against a real AWS account — verify the rendered policy lists the literal bucket names from Terraform, three aws iam commands render correctly, <workspace>/runtime-iam-policy.json exists and is valid JSON
aws iam put-user-policy --policy-document file://<workspace>/runtime-iam-policy.json — AWS accepts the policy
Resulting runtime keys plugged into .env.setup → iblai infra setup-env completes (ECR pulls succeed, DM/edX can read/write S3)
iblai infra provision-env with --deployment-type call-server (if exposed via env) — confirm the IAM instructions are skipped silently

🤖 Generated with Claude Code

Adds a new `ibl_tenant_platform` ansible role that launches a tenant Platform (Platform + admin User + UserPlatformLink) via `run_launch_steps` when `PLATFORM_NAME` is set to a non-default value, plus a sweep of defense-in-depth defaults so a fresh single-server bootstrap comes up production-safe out of the box.

Highlights:

* Tenant launcher — new role wired into both `playbook.yml` (setup / setup-env) and `launch_playbook.yml` (launch / launch-env). Gated on `PLATFORM_NAME != 'main'`, skips + logs on re-runs when the tenant already exists, surfaces the generated admin password via the `IBLAI_FIXTURE_OUTPUT` pipeline (never persisted to disk). Also writes `PLATFORM_NAME=<KEY>` (uppercase) at the root of `/ibl/config.yml` and enforces `Platform.show_paywall=False` + `Platform.is_advertising=False` via `Platform.objects.filter().update()`.
* Reserved names —
  - `ADMIN_USERNAME=ibl_admin` is rejected at every input layer (interactive prompt, .env, --admin-username); reserved for the SPA OAuth Application owner the platform itself maintains. New default suggestion is `platform_admin`. Backed by a Pydantic field_validator on `SetupConfig.admin_username`.
  - `PLATFORM_NAME=main` is rejected as an explicit input. Unset / blank silently resolves to `main` (preserving SSO `backend_name=main-oauth2` and skipping the tenant launcher).
* Safer SPA defaults — `IBL_SPA.MENTOR.STRIPE_ENABLED=false` and `IBL_SPA.MENTOR.ENABLE_ADVERTISING=false` are written unconditionally in `ibl_spa` (fresh installs) and `ibl_launch_services` (AMI launches) so a deploy without explicit billing setup never surfaces monetization UI by accident.
* Microsoft SSO completeness — `microsoft_sso_config` now also patches `IBL_SPA.AUTH.EXTERNAL_IDP_LOGOUT_URL` and `IBL_SPA.AUTH.IBL_DIRECT_SSO_URL` (with `microsoft_sso_tenant_id` falling back to `common`), then restarts the Auth + Mentor SPAs so they pick up the new auth flow.
* Final `ibl global-proxy reload` — added as `post_tasks` in both `playbook.yml` and `launch_playbook.yml`, so any nginx state touched by SSO roles (edX restarts in google_sso_config / microsoft_sso_config) is reloaded before the playbook exits.
* 100 GB volume floor for single / multi server — Pydantic validators (`InfraConfig` model_validator gated on `DeploymentType.SINGLE`, plus `MultiServerConfig.validate_volume_sizes`), matching interactive + CLI + .env input checks. Defaults bumped accordingly. Call-server unchanged (LiveKit only needs ~40 GB).
* 32 GB memory warning — new `INSTANCE_RAM_GB` mapping + helper. Non-blocking warning suggesting 64 GB (m5.4xlarge / r5.2xlarge) when the operator picks a 32 GB instance — in the interactive provision wizard, in `provision-env`, and in `launch` / `launch-env` (only when AI is enabled).
* Codebase scrub — removed all references to the canonical client name from comments, docstrings, prompt instructions, error hints, and example .env files. Replacement placeholders: `<client>` for monorepo org names, `acme` for tenant-key examples.
* Test fix — the five `_test_ssh()` retry-path tests in `tests/ansible/test_runner.py` no longer sleep for ~135 s each; they now mock `time.sleep` alongside the existing `subprocess.run` mock, cutting ~11 minutes off the full suite.

Test suite: 562 passing in ~1.3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
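The reserved-name rules in this commit can be sketched in plain Python. The commit backs the username check with a Pydantic `field_validator`; the stdlib-only version below is an approximation for illustration, and the function names, the case-insensitive comparison, and the error messages are assumptions.

```python
from typing import Optional

RESERVED_ADMIN_USERNAMES = frozenset({"ibl_admin"})
DEFAULT_PLATFORM_NAME = "main"


def validate_admin_username(value: str) -> str:
    """Reject the reserved admin username; return the trimmed name otherwise."""
    name = value.strip()
    if name.lower() in RESERVED_ADMIN_USERNAMES:  # assumption: case-insensitive
        raise ValueError(f"'{name}' is reserved; try 'platform_admin' instead")
    return name


def resolve_platform_name(value: Optional[str]) -> str:
    """Unset or blank resolves to 'main'; an explicit 'main' is rejected."""
    name = (value or "").strip()
    if not name:
        return DEFAULT_PLATFORM_NAME
    if name.lower() == DEFAULT_PLATFORM_NAME:
        raise ValueError("PLATFORM_NAME=main is reserved; leave it unset instead")
    return name


def should_launch_tenant(platform_name: str) -> bool:
    """The tenant launcher is gated on PLATFORM_NAME != 'main'."""
    return platform_name != DEFAULT_PLATFORM_NAME
```

With these helpers, `resolve_platform_name("")` yields `"main"` and the launcher is skipped, while `resolve_platform_name("acme")` yields `"acme"` and the launcher runs.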

…, and defaults

* Fixed broken `iblai-cli-ops` link (was `ibl-cli-ops`).
* Replaced the stale 9-row role table with a phase-grouped table covering the actual 16 roles in `playbook.yml` (host setup, platform install, core services, finalization, optional integrations, post-tasks). Removed the dead `final_steps` row.
* Provision section now mentions the three deployment topologies (single / multi / call), the 100 GB volume floor, and the 32 GB memory warning.
* Setup section notes the tenant `Platform` launch when `PLATFORM_NAME` is set to anything other than `main`, that reserved usernames (`ibl_admin`) are rejected with `platform_admin` as the new default suggestion, and that Stripe / advertising are off by default.
* Section 6 (Launch from AMI) collapsed from three near-duplicate examples to one `.env-driven` + one `--flag-driven` block. Cleanup reference removed (covered in section 8 / Manage environments).
* Section 4 (non-interactive provision + setup) trimmed; same content in fewer paragraphs.
* Project-structure tree: added `env_provision.py` + `env_setup.py`, added `launch_playbook.yml` + `service_update_playbook.yml`, removed the inaccurate "9 Ansible roles" annotation, bumped test count 357 → 562.

Net: -50 lines, no client-specific examples or hosts, all instructions match the current code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Bump `__version__` to 1.10.0
- Add CHANGELOG entry covering the tenant launcher, reserved-name rules, safer SPA defaults, 100 GB volume floor, 32 GB memory warning, Microsoft SSO IBL_SPA.AUTH completion, final proxy reload, codebase scrub, and the slow-test fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `src/iblai_infra/runtime_iam.py` — a small helper that runs at the tail end of `provision` / `provision-env` and prints the exact minimum-privilege IAM policy the operator needs to attach to a scoped "runtime" user in their own AWS account before `setup-env` runs.

The motivation: today the AWS keys baked into `/ibl/config.yml` on the platform server have to serve TWO accounts at once — IBL's ECR (image pulls) and the operator's own S3 buckets. Reusing the provisioning admin keys is overkill and minting a separate user by hand is friction.

This change closes that gap by:

1. **Computing the policy JSON at runtime** — bucket ARNs come from the actual `s3_bucket_{backups,media,static}` terraform outputs, not from any hardcoded list. ECR scope targets IBL's `arn:aws:ecr:<region>:<iblai-account>:repository/*` via two centralized module constants.
2. **Saving it to `<workspace>/runtime-iam-policy.json`** so the operator can pipe it into `aws iam put-user-policy --policy-document file://...` without copy-pasting JSON.
3. **Printing three ready-to-run `aws iam` commands** (`create-user`, `put-user-policy`, `create-access-key`) with the project / environment substituted into the user name.
4. **Pointing the operator at `.env.setup`** with the exact lines to update (`AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`).

Policy scope:

- S3: literal bucket ARNs only (no wildcards, no bucket-policy mutation, no lifecycle config) with `Get/Put/Delete/Acl + ListBucket / GetBucketLocation`.
- ECR: `GetAuthorizationToken` on `*` (AWS requires this) plus `BatchGetImage`, `BatchCheckLayerAvailability`, `GetDownloadUrlForLayer` scoped to IBL's ECR repos.

Skipped automatically for `DeploymentType.CALL` (no S3 buckets, separate credential flow).

Other changes:

- `.env.setup.example` — `AWS_ACCESS_KEY_ID` comment block now directs the operator to use the runtime user from the post-provision step, not their provisioning admin keys.
- `README.md` — new sub-section under "Provision infrastructure" documenting the runtime IAM step + the scope table. Section 4 (non-interactive `.env` flow) renumbered as a 3-step sequence so the IAM step isn't missed.
- `__version__` 1.10.0 → 1.11.0 + CHANGELOG entry.

11 new tests in `tests/test_runtime_iam.py` (policy shape, ARN generation, tight verb set, call-server skip, empty-output handling, JSON round-trip). Full suite: 576 passing in ~1.3 s.

No hardcoded bucket names, no client references — the policy is constructed entirely from terraform outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Earlier rev folded both into one customer-minted policy. Correcting per spec: the customer creates an S3-only IAM user in their own account; ECR pull credentials for IBL's image registry are a separate IBL-provided handoff and are explicitly out of scope for this module.

- `build_runtime_iam_policy` now emits only `PlatformBucketObjects` + `PlatformBucketList` statements. Dropped the `ECRAuth` and `ECRPullPlatformImages` statements, and removed the now-unused `IBLAI_ECR_ACCOUNT_ID` / `IBLAI_ECR_REGION` module constants.
- Renderer rewritten:
  - Section title is now "Next: create the S3 IAM user".
  - Two-sentence opening explains it's the S3 set only.
  - User name template is `<project>-<env>-s3-runtime` (was `-runtime`) so it's unambiguous which set this is.
  - Policy name is `iblai-s3-runtime`.
  - Closing line explicitly notes ECR pull credentials are provided separately by IBL and are NOT set up here.
- Tests: added `test_no_ecr_statements` (negative assertion sweeping every Statement's Action list for `ecr:*` and failing on any hit). Dropped the ECR-resource-shape tests since those statements no longer exist. Net: 11 → 10 tests; full suite 575 passing.
- README sub-section gains a leading two-row table making the "S3 (customer) vs ECR (IBL handoff)" split crystal-clear, then walks through the S3 user creation; the ECR row points back to IBL's handoff procedure.
- `.env.setup.example` comment block restated: keys here are S3-only, ECR is a separate IBL handoff.
- CHANGELOG 1.11.0 entry updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
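The negative assertion described for `test_no_ecr_statements` could look roughly like the sketch below. The helper name `find_ecr_actions` and the sample policy are hypothetical; the real test presumably calls `build_runtime_iam_policy` rather than a hand-written dict. The sweep accepts `Action` as either a string or a list, since IAM policy JSON allows both.

```python
def find_ecr_actions(policy):
    """Collect every action in the policy that belongs to the ecr: namespace."""
    hits = []
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        hits.extend(a for a in actions if a.lower().startswith("ecr:"))
    return hits


def test_no_ecr_statements():
    # Hand-written S3-only sample standing in for the builder's output.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PlatformBucketObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": ["arn:aws:s3:::test-backups/*"],
            },
            {
                "Sid": "PlatformBucketList",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": ["arn:aws:s3:::test-backups"],
            },
        ],
    }
    assert find_ecr_actions(policy) == []
```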

Replaces the longer "credentials are provided separately by IBL" wording across runtime_iam.py, .env.setup.example, and README with:

"For ECR images, use AWS credentials provided by ibl.ai — or contact us at https://ibl.ai/contact"

Same surface area; tighter copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dentials

Closes a long-standing conflation in the .env.setup credential model. Previously a single AWS access key had to satisfy two distinct accounts at once: ECR pulls against IBL's image registry AND S3 access against the buckets in the operator's own account. Worked only when that one key happened to have both scopes.

Now the two sets are first-class and land in the right place on the host:

S3 keys → /ibl/config.yml root (AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY at top level). Consumed by DM / edX containers at runtime via iblai-cli-ops templating. Source: customer creates this user post-provision using the runtime-iam-policy.json the CLI prints.

ECR keys → ~/.aws/credentials [default] profile. Consumed by `aws ecr get-login-password` in every Login to ECR task, without env-var overrides anywhere. Source: ibl.ai-provided handoff.

Implementation:

* `SetupConfig` gains optional `ecr_aws_access_key_id` / `ecr_aws_secret_access_key` / `ecr_aws_default_region` (secret is `Field(exclude=True)`).
* `env_setup.py` reads new `ECR_AWS_*` env vars.
* `runner.py::_build_extra_vars` passes both sets as separate ansible extra-vars (`aws_*` and `ecr_aws_*`). When ECR is empty, the S3 keys fall through — backwards-compatible with older single-key-set deployments.
* `awscli` role: writes ECR keys (not S3) to ~/.aws/credentials default profile.
* `ibl_platform` role: new task writes S3 keys to /ibl/config.yml root via three `ibl config save --set` calls. Gated `no_log: true`.
* Four `Login to ECR` tasks across `ibl_spa`, `ibl_launch_services`, `ibl_platform`, `ibl_service_update` strip the env-var overrides — they now use whatever ~/.aws/credentials [default] holds, which is exactly the ECR set.

Docs / examples:

* `.env.setup.example` — two clearly-labeled AWS_* blocks (S3 + ECR) with destination + usage inline. Comments call out the fall-through behavior for older deployments.
* `README` — credential-set table under "Provision infrastructure" gains a "Lives in" column making the split unambiguous.
* `CHANGELOG` — 1.11.0 entry expanded with the split details.

Full suite: 575 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
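The fall-through behavior described for `_build_extra_vars` might be approximated as below. This is a sketch, not the code in `runner.py`: the function signature and the exact extra-var key names are assumptions (the commit only names the `aws_*` and `ecr_aws_*` prefixes), and treating a half-filled ECR pair as empty is a choice made for this illustration.

```python
from typing import Dict


def build_extra_vars(
    s3_key_id: str,
    s3_secret: str,
    s3_region: str,
    ecr_key_id: str = "",
    ecr_secret: str = "",
    ecr_region: str = "",
) -> Dict[str, str]:
    """Return both credential sets; ECR falls back to S3 when not supplied."""
    # Assumption: the ECR set counts as present only if both key parts are set.
    use_ecr = bool(ecr_key_id and ecr_secret)
    return {
        "aws_access_key_id": s3_key_id,
        "aws_secret_access_key": s3_secret,
        "aws_default_region": s3_region,
        "ecr_aws_access_key_id": ecr_key_id if use_ecr else s3_key_id,
        "ecr_aws_secret_access_key": ecr_secret if use_ecr else s3_secret,
        "ecr_aws_default_region": (ecr_region or s3_region) if use_ecr else s3_region,
    }
```

With only the S3 arguments supplied, both prefixes carry the same key pair, which mirrors the older single-key-set deployments the commit says remain supported.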

Root cause: SPA images do NOT ship with node_modules baked in. The container runs `pnpm install` on first boot (~80–120s observed) before Next.js can start. Combined with `docker compose pull` and image-extraction overhead, total cold-start can comfortably exceed the older 150s budget on a slower instance or marginal network — the wait task gives up, the playbook bails, but the SPA finishes installing seconds later and ends up serving 200. False negative.

Repro: a fresh `iblai infra setup-env <name>` run failed at the Auth SPA wait with 10 attempts of `non-zero return code`. SSH'd in immediately after, container was Up 17 minutes, curl `localhost:5000` returned 200. The SPA was healthy — the wait just didn't wait long enough.

Fix: 30 retries × 15s = 450s (7.5 min). Applied to all six SPA wait tasks across both flows:

ibl_spa role (initial setup / setup-env)
- Wait for Auth SPA
- Wait for Mentor SPA
- Wait for Skills SPA

ibl_launch_services (AMI launch / launch-env)
- Wait for Auth SPA
- Wait for Mentor SPA
- Wait for Skills SPA

Each task gets an inline comment explaining the 450s budget rationale so a future maintainer doesn't shrink it without re-tracing this.

Note: a node_modules-prebake at the image level would fix this more elegantly, but that's an iblai-prod-images concern, outside this repo. This change makes the ansible-side wait robust to the current image shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
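The retry budget in this commit follows a plain poll-and-sleep pattern. The actual wait tasks are Ansible; the Python sketch below only illustrates the arithmetic and control flow, with a hypothetical `probe` callable standing in for the HTTP check and an injectable `sleep` so the loop can be exercised without real delays (the same idea as the `time.sleep` mock mentioned in an earlier commit).

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 30,
    delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `retries` times, pausing `delay` seconds between tries.

    With the defaults the nominal budget is 30 x 15 s = 450 s. This loop does
    not sleep after the final attempt, so it gives up slightly earlier.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(delay)
    return False
```

A service that becomes healthy on the ninth probe returns `True` after eight sleeps of 15 s (120 s), which falls inside the ~80–120s `pnpm install` window the commit reports, before pull and extraction overhead are added.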

bnsoni and others added 8 commits May 20, 2026 08:39

bnsoni wants to merge 8 commits into main from feat/runtime-iam-instructions
