feat(provision): post-provision runtime IAM policy + setup instructions#59

Open
bnsoni wants to merge 8 commits into main from feat/runtime-iam-instructions

@bnsoni bnsoni commented May 20, 2026

Stacked on top of #58 (feat/tenant-launcher-and-defaults). Merge #58 first or rebase this after.

Summary

Today the AWS keys baked into /ibl/config.yml on the platform server have to span two accounts: IBL's ECR (image pulls) and the operator's S3 buckets. Operators either over-share (admin keys in .env.setup) or wait for IBL to mint a separate ECR-only user that still doesn't grant S3 — neither is a clean fit.

This PR closes that gap by printing the exact minimum-privilege IAM policy at the end of provision / provision-env, with three copy-paste aws iam commands the operator runs in their own account before setup-env.

What lands

New src/iblai_infra/runtime_iam.py:

  • build_runtime_iam_policy(bucket_names) — returns the JSON document. Bucket ARNs derived from actual s3_bucket_{backups,media,static} terraform outputs (no hardcoded names). ECR scope targets IBL's registry via two centralized module constants (IBLAI_ECR_ACCOUNT_ID, IBLAI_ECR_REGION).
  • extract_bucket_names(outputs) — pulls the three bucket-name outputs out of a terraform outputs dict.
  • render_runtime_access_instructions(config, outputs, ws) — saves the policy to <workspace>/runtime-iam-policy.json, prints it verbatim under a rule, then the three aws iam commands and the .env.setup paste instructions. Skipped for DeploymentType.CALL.
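
As a rough illustration of the `extract_bucket_names` helper described above, the sketch below pulls the three named outputs from a mapping. It is not the code in this PR: the `BUCKET_OUTPUT_KEYS` constant and the handling of the `{"value": ...}` shape emitted by `terraform output -json` are assumptions.

```python
from typing import Any, Mapping

# Output keys named in this PR; the constant itself is a hypothetical name.
BUCKET_OUTPUT_KEYS = ("s3_bucket_backups", "s3_bucket_media", "s3_bucket_static")


def extract_bucket_names(outputs: Mapping[str, Any]) -> list:
    """Return the bucket names found in a Terraform outputs mapping.

    Assumption: values may be plain strings or the {"value": ...} objects
    that `terraform output -json` produces. Missing or blank outputs are
    skipped instead of raising, so partial outputs still work.
    """
    names = []
    for key in BUCKET_OUTPUT_KEYS:
        raw = outputs.get(key)
        if isinstance(raw, Mapping):
            raw = raw.get("value")
        if isinstance(raw, str) and raw.strip():
            names.append(raw.strip())
    return names


if __name__ == "__main__":
    sample = {
        "s3_bucket_backups": "test-backups",
        "s3_bucket_media": {"value": "p-staging-dm-media"},
    }
    print(extract_bucket_names(sample))
```

Skipping absent keys rather than failing matches the "partial / empty terraform outputs" cases the test list mentions, though the real behavior may differ.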

Integration: app.show_results() calls the renderer after every provision (both interactive provision and provision-env).

Policy scope (no wildcard S3 actions or bucket names, no policy-mutation rights):

| | Resource | Verbs |
|---|---|---|
| S3 | The three literal bucket ARNs Terraform created | GetObject, PutObject, DeleteObject, GetObjectAcl, PutObjectAcl, ListBucket, GetBucketLocation |
| ECR auth | * (AWS requires this Resource shape for GetAuthorizationToken) | GetAuthorizationToken |
| ECR pull | arn:aws:ecr:<region>:<iblai-account>:repository/* | BatchCheckLayerAvailability, BatchGetImage, GetDownloadUrlForLayer |
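
A minimal sketch of a policy document with the scope in the table above might look like the following. The statement Sids are taken from the commit messages in this PR. The constant values are placeholders (the real account ID and region are not shown here), and the split of verbs between object-level and bucket-level statements is an assumption based on how S3 scopes those actions.

```python
import json

# Placeholders only: the real IBLAI_ECR_ACCOUNT_ID / IBLAI_ECR_REGION values
# live in runtime_iam.py and do not appear in this PR text.
IBLAI_ECR_ACCOUNT_ID = "000000000000"
IBLAI_ECR_REGION = "us-east-1"

OBJECT_ACTIONS = [
    "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
    "s3:GetObjectAcl", "s3:PutObjectAcl",
]
BUCKET_ACTIONS = ["s3:ListBucket", "s3:GetBucketLocation"]
ECR_PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_runtime_iam_policy(bucket_names):
    """Build an IAM policy dict scoped to the given buckets plus ECR pulls."""
    statements = []
    if bucket_names:
        # Object-level verbs apply to keys inside each bucket.
        statements.append({
            "Sid": "PlatformBucketObjects",
            "Effect": "Allow",
            "Action": OBJECT_ACTIONS,
            "Resource": [f"arn:aws:s3:::{b}/*" for b in bucket_names],
        })
        # Bucket-level verbs apply to the bucket ARN itself.
        statements.append({
            "Sid": "PlatformBucketList",
            "Effect": "Allow",
            "Action": BUCKET_ACTIONS,
            "Resource": [f"arn:aws:s3:::{b}" for b in bucket_names],
        })
    statements.append({
        "Sid": "ECRAuth",
        "Effect": "Allow",
        "Action": ["ecr:GetAuthorizationToken"],
        "Resource": "*",  # GetAuthorizationToken cannot be resource-scoped
    })
    statements.append({
        "Sid": "ECRPullPlatformImages",
        "Effect": "Allow",
        "Action": ECR_PULL_ACTIONS,
        "Resource": (
            f"arn:aws:ecr:{IBLAI_ECR_REGION}:{IBLAI_ECR_ACCOUNT_ID}:repository/*"
        ),
    })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_runtime_iam_policy(["test-backups"]), indent=2))
```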

Docs:

  • README.md — new sub-section under "Provision infrastructure" with the rendered commands + scope table. Section 4 (non-interactive .env flow) renumbered to a clean 3-step sequence (provision → mint runtime user → setup-env) so the IAM step isn't missed.
  • .env.setup.example: AWS_ACCESS_KEY_ID comment block now directs the operator to the runtime user, not their admin keys.
  • CHANGELOG.md: ## [1.11.0] entry.

Tests — 11 new in tests/test_runtime_iam.py: policy shape, ARN generation, tight-verb invariants (s3:* and bucket-policy mutations explicitly absent), ECR account targeting, call-server skip, partial / empty terraform outputs, JSON round-trip. Full suite: 576 passing in ~1.3 s.

Codebase-cleanliness checks

  • No hardcoded bucket names: runtime_iam.py reads only terraform output keys (s3_bucket_backups, etc.), then derives ARNs from whatever values the outputs supplied.
  • No client / customer references: grep -riE "kaplan|syracuse|ibleducation|iblai\.nonprod" returns zero hits across the new files.
  • No secrets in tests — fixtures use generic names like test-backups, p-staging-dm-media.
  • IBL ECR account / region centralized as module constants in runtime_iam.py rather than scattered across the diff.

Operator UX after this lands

# 1. Provision (Terraform)
iblai infra provision-env -f .env.provision

# 2. (Printed verbatim by step 1) — paste into shell
aws iam create-user --user-name <project>-<env>-runtime
aws iam put-user-policy \
    --user-name <project>-<env>-runtime \
    --policy-name iblai-runtime \
    --policy-document file://<workspace>/runtime-iam-policy.json
aws iam create-access-key --user-name <project>-<env>-runtime

# 3. Paste AccessKeyId + SecretAccessKey into .env.setup, then:
iblai infra setup-env <project> -f .env.setup

No IBL-side credential handoff. No admin keys on the server. No clicking through the IAM console.
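
The three commands in step 2 could be rendered with something like the sketch below. The function name and signature are hypothetical; only the command text, the `<project>-<env>-runtime` user name, and the `iblai-runtime` policy name come from this description.

```python
import shlex


def render_iam_commands(project: str, env: str, policy_path: str) -> list:
    """Return the three `aws iam` commands as copy-paste shell strings.

    Hypothetical helper: it substitutes project and environment into the
    user name and shell-quotes every interpolated value.
    """
    user = shlex.quote(f"{project}-{env}-runtime")
    document = shlex.quote(f"file://{policy_path}")
    return [
        f"aws iam create-user --user-name {user}",
        (
            f"aws iam put-user-policy --user-name {user} "
            f"--policy-name iblai-runtime --policy-document {document}"
        ),
        f"aws iam create-access-key --user-name {user}",
    ]


if __name__ == "__main__":
    for line in render_iam_commands(
        "acme", "staging", "/tmp/ws/runtime-iam-policy.json"
    ):
        print(line)
```

Quoting the interpolated values keeps the printed commands safe to paste even if a workspace path contains spaces.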

Test plan

  • uv run pytest tests/ -q — 576 passing
  • Local iblai infra provision-env -f .env.provision against a real AWS account — verify the rendered policy lists the literal bucket names from Terraform, three aws iam commands render correctly, <workspace>/runtime-iam-policy.json exists and is valid JSON
  • aws iam put-user-policy --policy-document file://<workspace>/runtime-iam-policy.json — AWS accepts the policy
  • Resulting runtime keys plugged into .env.setup; iblai infra setup-env completes (ECR pulls succeed, DM/edX can read/write S3)
  • iblai infra provision-env with --deployment-type call-server (if exposed via env) — confirm the IAM instructions are skipped silently

🤖 Generated with Claude Code

bnsoni and others added 8 commits May 20, 2026 08:39
Adds a new `ibl_tenant_platform` ansible role that launches a tenant
Platform (Platform + admin User + UserPlatformLink) via `run_launch_steps`
when `PLATFORM_NAME` is set to a non-default value, plus a sweep of
defense-in-depth defaults so a fresh single-server bootstrap comes up
production-safe out of the box.

Highlights:

* Tenant launcher — new role wired into both `playbook.yml` (setup /
  setup-env) and `launch_playbook.yml` (launch / launch-env). Gated on
  `PLATFORM_NAME != 'main'`, skips + logs on re-runs when the tenant
  already exists, surfaces the generated admin password via the
  `IBLAI_FIXTURE_OUTPUT` pipeline (never persisted to disk). Also writes
  `PLATFORM_NAME=<KEY>` (uppercase) at the root of `/ibl/config.yml` and
  enforces `Platform.show_paywall=False` + `Platform.is_advertising=False`
  via `Platform.objects.filter().update()`.

* Reserved names —
  - `ADMIN_USERNAME=ibl_admin` is rejected at every input layer
    (interactive prompt, .env, --admin-username); reserved for the SPA
    OAuth Application owner the platform itself maintains. New default
    suggestion is `platform_admin`. Backed by a Pydantic field_validator
    on `SetupConfig.admin_username`.
  - `PLATFORM_NAME=main` is rejected as an explicit input. Unset /
    blank silently resolves to `main` (preserving SSO
    `backend_name=main-oauth2` and skipping the tenant launcher).

* Safer SPA defaults — `IBL_SPA.MENTOR.STRIPE_ENABLED=false` and
  `IBL_SPA.MENTOR.ENABLE_ADVERTISING=false` are written unconditionally
  in `ibl_spa` (fresh installs) and `ibl_launch_services` (AMI launches)
  so a deploy without explicit billing setup never surfaces monetization
  UI by accident.

* Microsoft SSO completeness — `microsoft_sso_config` now also patches
  `IBL_SPA.AUTH.EXTERNAL_IDP_LOGOUT_URL` and
  `IBL_SPA.AUTH.IBL_DIRECT_SSO_URL` (with `microsoft_sso_tenant_id`
  falling back to `common`), then restarts the Auth + Mentor SPAs so
  they pick up the new auth flow.

* Final `ibl global-proxy reload` — added as `post_tasks` in both
  `playbook.yml` and `launch_playbook.yml`, so any nginx state touched
  by SSO roles (edX restarts in google_sso_config / microsoft_sso_config)
  is reloaded before the playbook exits.

* 100 GB volume floor for single / multi server — Pydantic validators
  (`InfraConfig` model_validator gated on `DeploymentType.SINGLE`, plus
  `MultiServerConfig.validate_volume_sizes`), matching interactive +
  CLI + .env input checks. Defaults bumped accordingly. Call-server
  unchanged (LiveKit only needs ~40 GB).

* 32 GB memory warning — new `INSTANCE_RAM_GB` mapping + helper.
  Non-blocking warning suggesting 64 GB (m5.4xlarge / r5.2xlarge) when
  the operator picks a 32 GB instance — in the interactive provision
  wizard, in `provision-env`, and in `launch` / `launch-env` (only when
  AI is enabled).

* Codebase scrub — removed all references to the canonical client name
  from comments, docstrings, prompt instructions, error hints, and
  example .env files. Replacement placeholders: `<client>` for monorepo
  org names, `acme` for tenant-key examples.

* Test fix — the five `_test_ssh()` retry-path tests in
  `tests/ansible/test_runner.py` no longer sleep for ~135 s each; they
  now mock `time.sleep` alongside the existing `subprocess.run` mock,
  cutting ~11 minutes off the full suite.
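
The test fix in the last bullet follows a common pattern: patch `time.sleep` alongside `subprocess.run` so the retry path runs without real delays. A minimal sketch, assuming a hypothetical `probe_ssh` stand-in for `_test_ssh()` (its retry count, delay, and the host address are illustrative):

```python
import subprocess
import time
from unittest import mock


def probe_ssh(host: str, retries: int = 3, delay: float = 45.0) -> bool:
    """Hypothetical stand-in for the retrying SSH check."""
    for attempt in range(retries):
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            capture_output=True,
        )
        if result.returncode == 0:
            return True
        if attempt < retries - 1:
            time.sleep(delay)  # the real delay is what made the tests slow
    return False


def run_retry_path_without_sleeping():
    """Exercise the all-failures path with both calls mocked out."""
    failed = subprocess.CompletedProcess(args=[], returncode=255)
    with mock.patch("subprocess.run", return_value=failed) as run, \
            mock.patch("time.sleep") as sleep:
        ok = probe_ssh("203.0.113.10")  # documentation-range address
    return ok, run.call_count, sleep.call_count


if __name__ == "__main__":
    print(run_retry_path_without_sleeping())
```

Because the mocked `sleep` records its calls, the test can still assert that the retry loop backed off the expected number of times.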

Test suite: 562 passing in ~1.3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
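
The reserved-name rules in the commit message above can be sketched in plain Python. The commit says the real check is a Pydantic `field_validator` on `SetupConfig.admin_username`; the function names below and the exact-match comparison are assumptions made for illustration.

```python
from typing import Optional

RESERVED_ADMIN_USERNAMES = {"ibl_admin"}  # reserved for the SPA OAuth owner
DEFAULT_PLATFORM_NAME = "main"


def validate_admin_username(value: str) -> str:
    """Reject reserved admin usernames (exact match is an assumption)."""
    if value.strip() in RESERVED_ADMIN_USERNAMES:
        raise ValueError(
            f"'{value}' is reserved; try 'platform_admin' instead"
        )
    return value


def resolve_platform_name(value: Optional[str]) -> str:
    """Unset or blank resolves to 'main'; an explicit 'main' is rejected."""
    if value is None or not value.strip():
        return DEFAULT_PLATFORM_NAME
    if value.strip() == DEFAULT_PLATFORM_NAME:
        raise ValueError("'main' cannot be set explicitly as PLATFORM_NAME")
    return value.strip()


def should_launch_tenant(platform_name: str) -> bool:
    """The tenant launcher is gated on a non-default platform name."""
    return platform_name != DEFAULT_PLATFORM_NAME


if __name__ == "__main__":
    name = resolve_platform_name("acme")
    print(name, should_launch_tenant(name))
```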
…, and defaults

* Fixed broken `iblai-cli-ops` link (was `ibl-cli-ops`).
* Replaced the stale 9-row role table with a phase-grouped table covering
  the actual 16 roles in `playbook.yml` (host setup, platform install,
  core services, finalization, optional integrations, post-tasks).
  Removed the dead `final_steps` row.
* Provision section now mentions the three deployment topologies (single
  / multi / call), the 100 GB volume floor, and the 32 GB memory warning.
* Setup section notes the tenant `Platform` launch when `PLATFORM_NAME`
  is set to anything other than `main`, that reserved usernames
  (`ibl_admin`) are rejected with `platform_admin` as the new default
  suggestion, and that Stripe / advertising are off by default.
* Section 6 (Launch from AMI) collapsed from three near-duplicate
  examples to one `.env-driven` + one `--flag-driven` block. Cleanup
  reference removed (covered in section 8 / Manage environments).
* Section 4 (non-interactive provision + setup) trimmed; same content
  in fewer paragraphs.
* Project-structure tree: added `env_provision.py` + `env_setup.py`,
  added `launch_playbook.yml` + `service_update_playbook.yml`, removed
  the inaccurate "9 Ansible roles" annotation, bumped test count
  357 → 562.

Net: -50 lines, no client-specific examples or hosts, all instructions
match the current code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Bump `__version__` to 1.10.0
- Add CHANGELOG entry covering the tenant launcher, reserved-name
  rules, safer SPA defaults, 100 GB volume floor, 32 GB memory
  warning, Microsoft SSO IBL_SPA.AUTH completion, final proxy reload,
  codebase scrub, and the slow-test fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `src/iblai_infra/runtime_iam.py` — a small helper that runs at the
tail end of `provision` / `provision-env` and prints the exact
minimum-privilege IAM policy the operator needs to attach to a scoped
"runtime" user in their own AWS account before `setup-env` runs.

The motivation: today the AWS keys baked into `/ibl/config.yml` on the
platform server have to serve TWO accounts at once — IBL's ECR (image
pulls) and the operator's own S3 buckets. Reusing the provisioning admin
keys is overkill and minting a separate user by hand is friction. This
change closes that gap by:

1. **Computing the policy JSON at runtime** — bucket ARNs come from the
   actual `s3_bucket_{backups,media,static}` terraform outputs, not from
   any hardcoded list. ECR scope targets IBL's `arn:aws:ecr:<region>:
   <iblai-account>:repository/*` via two centralized module constants.
2. **Saving it to `<workspace>/runtime-iam-policy.json`** so the operator
   can pipe it into `aws iam put-user-policy --policy-document file://...`
   without copy-pasting JSON.
3. **Printing three ready-to-run `aws iam` commands** (`create-user`,
   `put-user-policy`, `create-access-key`) with the project /
   environment substituted into the user name.
4. **Pointing the operator at `.env.setup`** with the exact lines to
   update (`AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`).

Policy scope:
- S3: literal bucket ARNs only (no wildcards, no bucket-policy mutation,
  no lifecycle config) with `Get/Put/Delete/Acl + ListBucket /
  GetBucketLocation`.
- ECR: `GetAuthorizationToken` on `*` (AWS requires this) plus
  `BatchGetImage`, `BatchCheckLayerAvailability`,
  `GetDownloadUrlForLayer` scoped to IBL's ECR repos.

Skipped automatically for `DeploymentType.CALL` (no S3 buckets, separate
credential flow).

Other changes:
- `.env.setup.example` — `AWS_ACCESS_KEY_ID` comment block now directs
  the operator to use the runtime user from the post-provision step,
  not their provisioning admin keys.
- `README.md` — new sub-section under "Provision infrastructure"
  documenting the runtime IAM step + the scope table. Section 4 (non-
  interactive `.env` flow) renumbered as a 3-step sequence so the IAM
  step isn't missed.
- `__version__` 1.10.0 → 1.11.0 + CHANGELOG entry.

11 new tests in `tests/test_runtime_iam.py` (policy shape, ARN
generation, tight verb set, call-server skip, empty-output handling,
JSON round-trip). Full suite: 576 passing in ~1.3 s.

No hardcoded bucket names, no client references — the policy is
constructed entirely from terraform outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier rev folded both into one customer-minted policy. Correcting per
spec: the customer creates an S3-only IAM user in their own account; ECR
pull credentials for IBL's image registry are a separate IBL-provided
handoff and are explicitly out of scope for this module.

- `build_runtime_iam_policy` now emits only `PlatformBucketObjects` +
  `PlatformBucketList` statements. Dropped the `ECRAuth` and
  `ECRPullPlatformImages` statements, and removed the now-unused
  `IBLAI_ECR_ACCOUNT_ID` / `IBLAI_ECR_REGION` module constants.
- Renderer rewritten:
  - Section title is now "Next: create the S3 IAM user".
  - Two-sentence opening explains it's the S3 set only.
  - User name template is `<project>-<env>-s3-runtime` (was `-runtime`)
    so it's unambiguous which set this is.
  - Policy name is `iblai-s3-runtime`.
  - Closing line explicitly notes ECR pull credentials are provided
    separately by IBL and are NOT set up here.
- Tests: added `test_no_ecr_statements` (negative assertion sweeping
  every Statement's Action list for `ecr:*` and failing on any hit).
  Dropped the ECR-resource-shape tests since those statements no longer
  exist. Net: 11 → 10 tests; full suite 575 passing.
- README sub-section gains a leading two-row table making the
  "S3 (customer) vs ECR (IBL handoff)" split crystal-clear, then walks
  through the S3 user creation; the ECR row points back to IBL's
  handoff procedure.
- `.env.setup.example` comment block restated: keys here are S3-only,
  ECR is a separate IBL handoff.
- CHANGELOG 1.11.0 entry updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
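
The negative assertion that `test_no_ecr_statements` is described as making in the commit above could be approximated as follows. The helper name and the sample policy are hypothetical; only the idea of sweeping every statement's Action list for `ecr:` entries comes from the commit message.

```python
def find_ecr_actions(policy: dict) -> list:
    """Collect every action in the policy that targets the ECR service."""
    hits = []
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a bare string here
            actions = [actions]
        hits.extend(a for a in actions if a.lower().startswith("ecr:"))
    return hits


# Illustrative S3-only policy using the two Sids named in the commit.
S3_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PlatformBucketObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::test-backups/*"],
        },
        {
            "Sid": "PlatformBucketList",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": ["arn:aws:s3:::test-backups"],
        },
    ],
}

if __name__ == "__main__":
    print(find_ecr_actions(S3_ONLY_POLICY))
```

Normalizing a bare-string Action keeps the sweep from silently passing if a statement is ever written in the single-action form.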
Replaces the longer "credentials are provided separately by IBL"
wording across runtime_iam.py, .env.setup.example, and README with:

  "For ECR images, use AWS credentials provided by ibl.ai —
   or contact us at https://ibl.ai/contact"

Same surface area; tighter copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dentials

Closes a long-standing conflation in the .env.setup credential model.
Previously a single AWS access key had to satisfy two distinct accounts
at once: ECR pulls against IBL's image registry AND S3 access against
the buckets in the operator's own account. Worked only when that one
key happened to have both scopes.

Now the two sets are first-class and land in the right place on the
host:

  S3 keys  → /ibl/config.yml root (AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
             at top level). Consumed by DM / edX containers at runtime
             via iblai-cli-ops templating. Source: customer creates this
             user post-provision using the runtime-iam-policy.json the
             CLI prints.
  ECR keys → ~/.aws/credentials [default] profile. Consumed by
             `aws ecr get-login-password` in every Login to ECR task,
             without env-var overrides anywhere. Source: ibl.ai-provided
             handoff.

Implementation:

* `SetupConfig` gains optional `ecr_aws_access_key_id` /
  `ecr_aws_secret_access_key` / `ecr_aws_default_region` (secret is
  `Field(exclude=True)`).
* `env_setup.py` reads new `ECR_AWS_*` env vars.
* `runner.py::_build_extra_vars` passes both sets as separate ansible
  extra-vars (`aws_*` and `ecr_aws_*`). When ECR is empty, the S3 keys
  fall through — backwards-compatible with older single-key-set
  deployments.
* `awscli` role: writes ECR keys (not S3) to ~/.aws/credentials default
  profile.
* `ibl_platform` role: new task writes S3 keys to /ibl/config.yml root
  via three `ibl config save --set` calls. Gated `no_log: true`.
* Four `Login to ECR` tasks across `ibl_spa`, `ibl_launch_services`,
  `ibl_platform`, `ibl_service_update` strip the env-var overrides —
  they now use whatever ~/.aws/credentials [default] holds, which is
  exactly the ECR set.

Docs / examples:

* `.env.setup.example` — two clearly-labeled AWS_* blocks (S3 + ECR)
  with destination + usage inline. Comments call out the fall-through
  behavior for older deployments.
* `README` — credential-set table under "Provision infrastructure"
  gains a "Lives in" column making the split unambiguous.
* `CHANGELOG` — 1.11.0 entry expanded with the split details.

Full suite: 575 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
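
The fall-through behavior described for `_build_extra_vars` in the commit above (S3 keys stand in when no ECR keys are supplied) can be sketched like this. The `Creds` dataclass and the exact extra-var key names are assumptions; the commit only states that the two sets are passed as `aws_*` and `ecr_aws_*`.

```python
from dataclasses import dataclass


@dataclass
class Creds:
    """Hypothetical container for one AWS credential set."""
    access_key_id: str = ""
    secret_access_key: str = ""
    region: str = ""


def build_extra_vars(s3: Creds, ecr: Creds) -> dict:
    """Return ansible extra-vars with separate S3 and ECR credential sets.

    If the ECR key pair is incomplete, the S3 keys fall through so older
    single-key-set deployments keep working.
    """
    has_ecr = bool(ecr.access_key_id and ecr.secret_access_key)
    source = ecr if has_ecr else s3
    return {
        "aws_access_key_id": s3.access_key_id,
        "aws_secret_access_key": s3.secret_access_key,
        "aws_default_region": s3.region,
        "ecr_aws_access_key_id": source.access_key_id,
        "ecr_aws_secret_access_key": source.secret_access_key,
        # Region falls back independently when the ECR one is blank.
        "ecr_aws_default_region": (ecr.region if has_ecr else "") or s3.region,
    }


if __name__ == "__main__":
    s3 = Creds("S3KEY", "s3-secret", "us-east-1")
    print(build_extra_vars(s3, Creds())["ecr_aws_access_key_id"])
```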
Root cause: SPA images do NOT ship with node_modules baked in. The
container runs `pnpm install` on first boot (~80–120s observed) before
Next.js can start. Combined with `docker compose pull` and image-
extraction overhead, total cold-start can comfortably exceed the older
150s budget on a slower instance or marginal network — the wait task
gives up, the playbook bails, but the SPA finishes installing seconds
later and ends up serving 200. False negative.

Repro: a fresh `iblai infra setup-env <name>` run failed at the Auth
SPA wait with 10 attempts of `non-zero return code`. SSH'd in
immediately after, container was Up 17 minutes, curl `localhost:5000`
returned 200. The SPA was healthy — the wait just didn't wait long
enough.

Fix: 30 retries × 15s = 450s (7.5 min). Applied to all six SPA wait
tasks across both flows:

  ibl_spa role           (initial setup / setup-env)
    - Wait for Auth SPA
    - Wait for Mentor SPA
    - Wait for Skills SPA
  ibl_launch_services    (AMI launch / launch-env)
    - Wait for Auth SPA
    - Wait for Mentor SPA
    - Wait for Skills SPA

Each task gets an inline comment explaining the 450s budget rationale
so a future maintainer doesn't shrink it without re-tracing this.

Note: a node_modules-prebake at the image level would fix this more
elegantly, but that's an iblai-prod-images concern, outside this repo.
This change makes the ansible-side wait robust to the current image
shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
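
The retry budget in the commit above is simple arithmetic: retries multiplied by delay. The sketch below shows that calculation and a generic polling loop in Python. It is not the Ansible task itself; the function names and the injectable `sleep` parameter are illustrative.

```python
import time
from typing import Callable


def max_wait_seconds(retries: int, delay: float) -> float:
    """Upper bound on the wait budget: retries multiplied by delay."""
    return retries * delay


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 30,
    delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the attempts run out.

    This loop sleeps only between attempts, so its worst case is slightly
    under the retries * delay budget.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(delay)
    return False


if __name__ == "__main__":
    print(max_wait_seconds(10, 15))  # the older budget
    print(max_wait_seconds(30, 15))  # the new budget
```

Passing `sleep` as a parameter lets a test substitute a recorder, the same idea as mocking `time.sleep` in the SSH retry tests.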