feat(provision): post-provision runtime IAM policy + setup instructions#59
Open
bnsoni wants to merge 8 commits into
Adds a new `ibl_tenant_platform` ansible role that launches a tenant
Platform (Platform + admin User + UserPlatformLink) via `run_launch_steps`
when `PLATFORM_NAME` is set to a non-default value, plus a sweep of
defense-in-depth defaults so a fresh single-server bootstrap comes up
production-safe out of the box.
Highlights:
* Tenant launcher — new role wired into both `playbook.yml` (setup /
setup-env) and `launch_playbook.yml` (launch / launch-env). Gated on
`PLATFORM_NAME != 'main'`, skips + logs on re-runs when the tenant
already exists, surfaces the generated admin password via the
`IBLAI_FIXTURE_OUTPUT` pipeline (never persisted to disk). Also writes
`PLATFORM_NAME=<KEY>` (uppercase) at the root of `/ibl/config.yml` and
enforces `Platform.show_paywall=False` + `Platform.is_advertising=False`
via `Platform.objects.filter().update()`.
* Reserved names —
- `ADMIN_USERNAME=ibl_admin` is rejected at every input layer
(interactive prompt, .env, --admin-username); reserved for the SPA
OAuth Application owner the platform itself maintains. New default
suggestion is `platform_admin`. Backed by a Pydantic field_validator
on `SetupConfig.admin_username`.
- `PLATFORM_NAME=main` is rejected as an explicit input. Unset /
blank silently resolves to `main` (preserving SSO
`backend_name=main-oauth2` and skipping the tenant launcher).
* Safer SPA defaults — `IBL_SPA.MENTOR.STRIPE_ENABLED=false` and
`IBL_SPA.MENTOR.ENABLE_ADVERTISING=false` are written unconditionally
in `ibl_spa` (fresh installs) and `ibl_launch_services` (AMI launches)
so a deploy without explicit billing setup never surfaces monetization
UI by accident.
* Microsoft SSO completeness — `microsoft_sso_config` now also patches
`IBL_SPA.AUTH.EXTERNAL_IDP_LOGOUT_URL` and
`IBL_SPA.AUTH.IBL_DIRECT_SSO_URL` (with `microsoft_sso_tenant_id`
falling back to `common`), then restarts the Auth + Mentor SPAs so
they pick up the new auth flow.
* Final `ibl global-proxy reload` — added as `post_tasks` in both
`playbook.yml` and `launch_playbook.yml`, so any nginx state touched
by SSO roles (edX restarts in google_sso_config / microsoft_sso_config)
is reloaded before the playbook exits.
* 100 GB volume floor for single / multi server — Pydantic validators
(`InfraConfig` model_validator gated on `DeploymentType.SINGLE`, plus
`MultiServerConfig.validate_volume_sizes`), matching interactive +
CLI + .env input checks. Defaults bumped accordingly. Call-server
unchanged (LiveKit only needs ~40 GB).
* 32 GB memory warning — new `INSTANCE_RAM_GB` mapping + helper.
Non-blocking warning suggesting 64 GB (m5.4xlarge / r5.2xlarge) when
the operator picks a 32 GB instance — in the interactive provision
wizard, in `provision-env`, and in `launch` / `launch-env` (only when
AI is enabled).
* Codebase scrub — removed all references to the canonical client name
from comments, docstrings, prompt instructions, error hints, and
example .env files. Replacement placeholders: `<client>` for monorepo
org names, `acme` for tenant-key examples.
* Test fix — the five `_test_ssh()` retry-path tests in
`tests/ansible/test_runner.py` no longer sleep for ~135 s each; they
now mock `time.sleep` alongside the existing `subprocess.run` mock,
cutting ~11 minutes off the full suite.
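The test fix in the last item can be sketched as follows. This is a minimal illustration, assuming a hypothetical `probe_with_retries` helper in place of the real `_test_ssh()` (whose signature is not shown here); the point is only that patching `time.sleep` next to `subprocess.run` lets the retry path run without waiting.

```python
import subprocess
import time
from unittest import mock


def probe_with_retries(host: str, attempts: int = 5, delay: float = 15.0) -> bool:
    """Hypothetical stand-in for a retrying SSH check (not the real _test_ssh)."""
    for attempt in range(attempts):
        result = subprocess.run(["ssh", host, "true"], capture_output=True)
        if result.returncode == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # the real wait that makes an unmocked test slow
    return False


def run_retry_test():
    """Drive the failure path with subprocess.run and time.sleep both mocked."""
    failed = subprocess.CompletedProcess(args=[], returncode=255)
    with mock.patch("subprocess.run", return_value=failed) as fake_run, \
            mock.patch("time.sleep") as fake_sleep:
        ok = probe_with_retries("203.0.113.10")
        assert fake_run.call_count == 5  # every attempt was made
        return ok, fake_sleep.call_count
```

Because the mocked `time.sleep` returns immediately, the retry logic is still exercised attempt by attempt; only the wall-clock delay disappears.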
Test suite: 562 passing in ~1.3 s.
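The reserved-name rules from the highlights can be sketched in plain Python. The source backs the username check with a Pydantic `field_validator`; the version below swaps in standard-library functions so it runs anywhere. All function names are hypothetical, and exact-match comparison after trimming whitespace is an assumption.

```python
from typing import Optional

RESERVED_ADMIN_USERNAMES = frozenset({"ibl_admin"})  # SPA OAuth Application owner
DEFAULT_PLATFORM_NAME = "main"


def validate_admin_username(value: str) -> str:
    """Reject reserved usernames and point at the suggested default."""
    if value.strip() in RESERVED_ADMIN_USERNAMES:
        raise ValueError(f"{value!r} is reserved; consider 'platform_admin'")
    return value


def resolve_platform_name(raw: Optional[str]) -> str:
    """Unset or blank resolves to 'main'; an explicit 'main' is rejected."""
    if raw is None or not raw.strip():
        return DEFAULT_PLATFORM_NAME
    name = raw.strip()
    if name == DEFAULT_PLATFORM_NAME:
        raise ValueError("'main' is reserved; leave PLATFORM_NAME unset to use it")
    return name


def should_launch_tenant(platform_name: str) -> bool:
    """The tenant launcher is gated on PLATFORM_NAME != 'main'."""
    return platform_name != DEFAULT_PLATFORM_NAME


def config_platform_key(platform_name: str) -> str:
    """Key written as PLATFORM_NAME=<KEY> (uppercase) in /ibl/config.yml."""
    return platform_name.upper()
```

For example, `resolve_platform_name("acme")` returns `"acme"`, which enables the launcher and yields the config key `ACME`, while `resolve_platform_name("")` quietly falls back to `main` and skips it.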
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
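The sizing checks from this commit (100 GB volume floor, 32 GB memory warning) can be illustrated roughly like this. It is a sketch under assumptions: the source uses Pydantic validators, the deployment-type strings are hypothetical, and the `INSTANCE_RAM_GB` entries below are an illustrative subset rather than the real mapping.

```python
from typing import Optional

MIN_VOLUME_GB = 100  # floor for single / multi server deployments

# Illustrative subset only; the real INSTANCE_RAM_GB mapping is not shown here.
INSTANCE_RAM_GB = {
    "m5.2xlarge": 32,
    "r5.xlarge": 32,
    "m5.4xlarge": 64,
    "r5.2xlarge": 64,
}


def check_volume_size(size_gb: int, deployment_type: str) -> int:
    """Enforce the floor for everything except the call server."""
    if deployment_type != "call" and size_gb < MIN_VOLUME_GB:
        raise ValueError(
            f"volume must be at least {MIN_VOLUME_GB} GB for {deployment_type}"
        )
    return size_gb


def memory_warning(instance_type: str, ai_enabled: bool = True) -> Optional[str]:
    """Return a non-blocking warning for 32 GB instances, else None."""
    if not ai_enabled:
        return None  # launch / launch-env only warn when AI is enabled
    if INSTANCE_RAM_GB.get(instance_type) == 32:
        return (
            f"{instance_type} has 32 GB RAM; consider 64 GB "
            "(m5.4xlarge / r5.2xlarge)"
        )
    return None
```

The warning is returned rather than raised so that callers can print it and continue, which matches the non-blocking behavior described above.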
…, and defaults
* Fixed broken `iblai-cli-ops` link (was `ibl-cli-ops`).
* Replaced the stale 9-row role table with a phase-grouped table covering the actual 16 roles in `playbook.yml` (host setup, platform install, core services, finalization, optional integrations, post-tasks). Removed the dead `final_steps` row.
* Provision section now mentions the three deployment topologies (single / multi / call), the 100 GB volume floor, and the 32 GB memory warning.
* Setup section notes the tenant `Platform` launch when `PLATFORM_NAME` is set to anything other than `main`, that reserved usernames (`ibl_admin`) are rejected with `platform_admin` as the new default suggestion, and that Stripe / advertising are off by default.
* Section 6 (Launch from AMI) collapsed from three near-duplicate examples to one `.env-driven` + one `--flag-driven` block. Cleanup reference removed (covered in section 8 / Manage environments).
* Section 4 (non-interactive provision + setup) trimmed; same content in fewer paragraphs.
* Project-structure tree: added `env_provision.py` + `env_setup.py`, added `launch_playbook.yml` + `service_update_playbook.yml`, removed the inaccurate "9 Ansible roles" annotation, bumped test count 357 → 562.
Net: -50 lines, no client-specific examples or hosts, all instructions match the current code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Bump `__version__` to 1.10.0
- Add CHANGELOG entry covering the tenant launcher, reserved-name rules, safer SPA defaults, 100 GB volume floor, 32 GB memory warning, Microsoft SSO IBL_SPA.AUTH completion, final proxy reload, codebase scrub, and the slow-test fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `src/iblai_infra/runtime_iam.py` — a small helper that runs at the
tail end of `provision` / `provision-env` and prints the exact
minimum-privilege IAM policy the operator needs to attach to a scoped
"runtime" user in their own AWS account before `setup-env` runs.
The motivation: today the AWS keys baked into `/ibl/config.yml` on the
platform server have to serve TWO accounts at once — IBL's ECR (image
pulls) and the operator's own S3 buckets. Reusing the provisioning admin
keys is overkill and minting a separate user by hand is friction. This
change closes that gap by:
1. **Computing the policy JSON at runtime** — bucket ARNs come from the
actual `s3_bucket_{backups,media,static}` terraform outputs, not from
any hardcoded list. ECR scope targets IBL's `arn:aws:ecr:<region>:
<iblai-account>:repository/*` via two centralized module constants.
2. **Saving it to `<workspace>/runtime-iam-policy.json`** so the operator
can pipe it into `aws iam put-user-policy --policy-document file://...`
without copy-pasting JSON.
3. **Printing three ready-to-run `aws iam` commands** (`create-user`,
`put-user-policy`, `create-access-key`) with the project /
environment substituted into the user name.
4. **Pointing the operator at `.env.setup`** with the exact lines to
update (`AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`).
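Step 3 above could be rendered along these lines. This is a sketch, not the module's renderer: `render_iam_commands` is a hypothetical name, and the user and policy names are passed in because this commit message does not spell them out. The `aws iam` subcommands and flags are the standard AWS CLI ones.

```python
import shlex


def render_iam_commands(user_name: str, policy_name: str, policy_path: str) -> list:
    """Return the three aws iam commands as ready-to-paste shell strings."""
    user = shlex.quote(user_name)
    policy = shlex.quote(policy_name)
    document = shlex.quote(f"file://{policy_path}")  # file:// reads the JSON from disk
    return [
        f"aws iam create-user --user-name {user}",
        f"aws iam put-user-policy --user-name {user} "
        f"--policy-name {policy} --policy-document {document}",
        f"aws iam create-access-key --user-name {user}",
    ]


if __name__ == "__main__":
    for command in render_iam_commands(
        "acme-prod-runtime", "runtime-access", "/tmp/ws/runtime-iam-policy.json"
    ):
        print(command)
```

Quoting each substituted value with `shlex.quote` keeps the printed commands safe to paste even if a project or environment name contains shell metacharacters.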
Policy scope:
- S3: literal bucket ARNs only (no wildcards, no bucket-policy mutation,
no lifecycle config) with `Get/Put/Delete/Acl + ListBucket /
GetBucketLocation`.
- ECR: `GetAuthorizationToken` on `*` (AWS requires this) plus
`BatchGetImage`, `BatchCheckLayerAvailability`,
`GetDownloadUrlForLayer` scoped to IBL's ECR repos.
Skipped automatically for `DeploymentType.CALL` (no S3 buckets, separate
credential flow).
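The policy construction described in this commit might look roughly like the sketch below. Assumptions are labeled in the comments: the ECR constants hold placeholder values, the terraform outputs are taken to be a flat `{key: value}` dict, the `/*` suffix on object-level ARNs and the split of actions across statement IDs are inferred, and omitting the S3 statements when no buckets are present is a guess at the empty-output behavior.

```python
import json
from pathlib import Path

# Placeholders: the real values live in runtime_iam.py and are not shown here.
IBLAI_ECR_ACCOUNT_ID = "000000000000"
IBLAI_ECR_REGION = "us-east-1"

BUCKET_OUTPUT_KEYS = ("s3_bucket_backups", "s3_bucket_media", "s3_bucket_static")


def extract_bucket_names(outputs: dict) -> list:
    """Return the non-empty bucket names from a flat terraform outputs dict."""
    return [outputs[key] for key in BUCKET_OUTPUT_KEYS if outputs.get(key)]


def build_runtime_iam_policy(bucket_names: list) -> dict:
    """Build the policy document from literal bucket names."""
    statements = []
    if bucket_names:  # assumption: skip S3 statements when there are no buckets
        statements.append({
            "Sid": "PlatformBucketObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                       "s3:GetObjectAcl", "s3:PutObjectAcl"],
            # Object-level actions target keys inside each named bucket.
            "Resource": [f"arn:aws:s3:::{name}/*" for name in bucket_names],
        })
        statements.append({
            "Sid": "PlatformBucketList",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [f"arn:aws:s3:::{name}" for name in bucket_names],
        })
    statements.append({
        "Sid": "ECRAuth",
        "Effect": "Allow",
        "Action": "ecr:GetAuthorizationToken",
        "Resource": "*",  # this action cannot be scoped to a repository
    })
    statements.append({
        "Sid": "ECRPullPlatformImages",
        "Effect": "Allow",
        "Action": ["ecr:BatchGetImage", "ecr:BatchCheckLayerAvailability",
                   "ecr:GetDownloadUrlForLayer"],
        "Resource": f"arn:aws:ecr:{IBLAI_ECR_REGION}:{IBLAI_ECR_ACCOUNT_ID}:repository/*",
    })
    return {"Version": "2012-10-17", "Statement": statements}


def save_policy(policy: dict, workspace: Path) -> Path:
    """Write the policy to <workspace>/runtime-iam-policy.json."""
    path = workspace / "runtime-iam-policy.json"
    path.write_text(json.dumps(policy, indent=2) + "\n")
    return path
```

Deriving every S3 ARN from the names passed in is what keeps bucket names out of the code: whatever the terraform outputs supply is what ends up in the policy.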
Other changes:
- `.env.setup.example` — `AWS_ACCESS_KEY_ID` comment block now directs
the operator to use the runtime user from the post-provision step,
not their provisioning admin keys.
- `README.md` — new sub-section under "Provision infrastructure"
documenting the runtime IAM step + the scope table. Section 4 (non-
interactive `.env` flow) renumbered as a 3-step sequence so the IAM
step isn't missed.
- `__version__` 1.10.0 → 1.11.0 + CHANGELOG entry.
11 new tests in `tests/test_runtime_iam.py` (policy shape, ARN
generation, tight verb set, call-server skip, empty-output handling,
JSON round-trip). Full suite: 576 passing in ~1.3 s.
No hardcoded bucket names, no client references — the policy is
constructed entirely from terraform outputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier rev folded both into one customer-minted policy. Correcting per
spec: the customer creates an S3-only IAM user in their own account; ECR
pull credentials for IBL's image registry are a separate IBL-provided
handoff and are explicitly out of scope for this module.
- `build_runtime_iam_policy` now emits only `PlatformBucketObjects` +
`PlatformBucketList` statements. Dropped the `ECRAuth` and
`ECRPullPlatformImages` statements, and removed the now-unused
`IBLAI_ECR_ACCOUNT_ID` / `IBLAI_ECR_REGION` module constants.
- Renderer rewritten:
- Section title is now "Next: create the S3 IAM user".
- Two-sentence opening explains it's the S3 set only.
- User name template is `<project>-<env>-s3-runtime` (was `-runtime`)
so it's unambiguous which set this is.
- Policy name is `iblai-s3-runtime`.
- Closing line explicitly notes ECR pull credentials are provided
separately by IBL and are NOT set up here.
- Tests: added `test_no_ecr_statements` (negative assertion sweeping
every Statement's Action list for `ecr:*` and failing on any hit).
Dropped the ECR-resource-shape tests since those statements no longer
exist. Net: 11 → 10 tests; full suite 575 passing.
- README sub-section gains a leading two-row table making the
"S3 (customer) vs ECR (IBL handoff)" split crystal-clear, then walks
through the S3 user creation; the ECR row points back to IBL's
handoff procedure.
- `.env.setup.example` comment block restated: keys here are S3-only,
ECR is a separate IBL handoff.
- CHANGELOG 1.11.0 entry updated.
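The negative assertion behind `test_no_ecr_statements` can be sketched as a small sweep over the policy document. `find_ecr_actions` and the sample policy are hypothetical illustrations, not the code in `tests/test_runtime_iam.py`; the sweep accepts `Action` as either a string or a list, since IAM permits both.

```python
def find_ecr_actions(policy: dict) -> list:
    """Return every ecr:* action found in any statement of the policy."""
    hits = []
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a bare string here
            actions = [actions]
        hits.extend(a for a in actions if a.lower().startswith("ecr:"))
    return hits


# Illustrative S3-only policy with the two statement IDs named above.
S3_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PlatformBucketObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::test-backups/*"],
        },
        {
            "Sid": "PlatformBucketList",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": ["arn:aws:s3:::test-backups"],
        },
    ],
}

if __name__ == "__main__":
    assert find_ecr_actions(S3_ONLY_POLICY) == [], "policy must stay S3-only"
```

Returning the offending actions rather than a boolean gives a failing test something concrete to print.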
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the longer "credentials are provided separately by IBL" wording across runtime_iam.py, .env.setup.example, and README with:
"For ECR images, use AWS credentials provided by ibl.ai — or contact us at https://ibl.ai/contact"
Same surface area; tighter copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dentials
Closes a long-standing conflation in the .env.setup credential model.
Previously a single AWS access key had to satisfy two distinct accounts
at once: ECR pulls against IBL's image registry AND S3 access against
the buckets in the operator's own account. Worked only when that one
key happened to have both scopes.
Now the two sets are first-class and land in the right place on the
host:
S3 keys → /ibl/config.yml root (AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
at top level). Consumed by DM / edX containers at runtime
via iblai-cli-ops templating. Source: customer creates this
user post-provision using the runtime-iam-policy.json the
CLI prints.
ECR keys → ~/.aws/credentials [default] profile. Consumed by
`aws ecr get-login-password` in every Login to ECR task,
without env-var overrides anywhere. Source: ibl.ai-provided
handoff.
Implementation:
* `SetupConfig` gains optional `ecr_aws_access_key_id` /
`ecr_aws_secret_access_key` / `ecr_aws_default_region` (secret is
`Field(exclude=True)`).
* `env_setup.py` reads new `ECR_AWS_*` env vars.
* `runner.py::_build_extra_vars` passes both sets as separate ansible
extra-vars (`aws_*` and `ecr_aws_*`). When ECR is empty, the S3 keys
fall through — backwards-compatible with older single-key-set
deployments.
* `awscli` role: writes ECR keys (not S3) to ~/.aws/credentials default
profile.
* `ibl_platform` role: new task writes S3 keys to /ibl/config.yml root
via three `ibl config save --set` calls. Gated `no_log: true`.
* Four `Login to ECR` tasks across `ibl_spa`, `ibl_launch_services`,
`ibl_platform`, `ibl_service_update` strip the env-var overrides —
they now use whatever ~/.aws/credentials [default] holds, which is
exactly the ECR set.
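The fall-through in `_build_extra_vars` can be illustrated with the sketch below. The function name and the exact extra-var keys are assumptions (the source only says the two sets are passed as `aws_*` and `ecr_aws_*`), and treating the ECR key and secret as a pair that falls back together is a design guess made to avoid mixing credentials from two users.

```python
from typing import Optional


def build_credential_vars(
    s3_key: str,
    s3_secret: str,
    s3_region: str,
    ecr_key: Optional[str] = None,
    ecr_secret: Optional[str] = None,
    ecr_region: Optional[str] = None,
) -> dict:
    """Return separate S3 and ECR extra-vars, with ECR falling back to S3."""
    # Fall back as a pair: never combine one user's key with another's secret.
    if not (ecr_key and ecr_secret):
        ecr_key, ecr_secret = s3_key, s3_secret
    return {
        # S3 set: written to the root of /ibl/config.yml
        "aws_access_key_id": s3_key,
        "aws_secret_access_key": s3_secret,
        "aws_default_region": s3_region,
        # ECR set: written to the ~/.aws/credentials [default] profile
        "ecr_aws_access_key_id": ecr_key,
        "ecr_aws_secret_access_key": ecr_secret,
        "ecr_aws_default_region": ecr_region or s3_region,
    }
```

With only the S3 arguments supplied, both sets come out identical, which is the behavior an older single-key-set deployment relies on.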
Docs / examples:
* `.env.setup.example` — two clearly-labeled AWS_* blocks (S3 + ECR)
with destination + usage inline. Comments call out the fall-through
behavior for older deployments.
* `README` — credential-set table under "Provision infrastructure"
gains a "Lives in" column making the split unambiguous.
* `CHANGELOG` — 1.11.0 entry expanded with the split details.
Full suite: 575 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: SPA images do NOT ship with node_modules baked in. The
container runs `pnpm install` on first boot (~80–120s observed) before
Next.js can start. Combined with `docker compose pull` and image-
extraction overhead, total cold-start can comfortably exceed the older
150s budget on a slower instance or marginal network — the wait task
gives up, the playbook bails, but the SPA finishes installing seconds
later and ends up serving 200. False negative.
Repro: a fresh `iblai infra setup-env <name>` run failed at the Auth
SPA wait with 10 attempts of `non-zero return code`. SSH'd in
immediately after, container was Up 17 minutes, curl `localhost:5000`
returned 200. The SPA was healthy — the wait just didn't wait long
enough.
Fix: 30 retries × 15s = 450s (7.5 min). Applied to all six SPA wait
tasks across both flows:
ibl_spa role (initial setup / setup-env)
- Wait for Auth SPA
- Wait for Mentor SPA
- Wait for Skills SPA
ibl_launch_services (AMI launch / launch-env)
- Wait for Auth SPA
- Wait for Mentor SPA
- Wait for Skills SPA
Each task gets an inline comment explaining the 450s budget rationale
so a future maintainer doesn't shrink it without re-tracing this.
Note: a node_modules-prebake at the image level would fix this more
elegantly, but that's an iblai-prod-images concern, outside this repo.
This change makes the ansible-side wait robust to the current image
shape.
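The retry budget above, expressed as a Python sketch rather than the Ansible tasks themselves (which presumably set `retries` and `delay` on each wait). `wait_until_ready` and its `check` callback are hypothetical; the `sleep` parameter is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable

RETRIES = 30
DELAY_SECONDS = 15
BUDGET_SECONDS = RETRIES * DELAY_SECONDS  # 450 s nominal, i.e. 7.5 min


def wait_until_ready(
    check: Callable[[], bool],
    retries: int = RETRIES,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check() up to `retries` times; return the attempt that succeeded."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(delay)  # pause only between attempts
    raise TimeoutError(f"not ready after {retries} attempts")
```

A check that starts passing on its third call returns `3` after two sleeps of 15 s; under the older 150 s budget, the same loop would have given up before a slow `pnpm install` finished.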
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Today the AWS keys baked into `/ibl/config.yml` on the platform server have to span two accounts: IBL's ECR (image pulls) and the operator's S3 buckets. Operators either over-share (admin keys in `.env.setup`) or wait for IBL to mint a separate ECR-only user that still doesn't grant S3; neither is a clean fit.
This PR closes that gap by printing the exact minimum-privilege IAM policy at the end of `provision` / `provision-env`, with three copy-paste `aws iam` commands the operator runs in their own account before `setup-env`.

What lands
New: `src/iblai_infra/runtime_iam.py`:
- `build_runtime_iam_policy(bucket_names)`: returns the JSON document. Bucket ARNs derived from actual `s3_bucket_{backups,media,static}` terraform outputs (no hardcoded names). ECR scope targets IBL's registry via two centralized module constants (`IBLAI_ECR_ACCOUNT_ID`, `IBLAI_ECR_REGION`).
- `extract_bucket_names(outputs)`: pulls the three bucket-name outputs out of a terraform outputs dict.
- `render_runtime_access_instructions(config, outputs, ws)`: saves the policy to `<workspace>/runtime-iam-policy.json`, prints it verbatim under a rule, then the three `aws iam` commands and the `.env.setup` paste instructions. Skipped for `DeploymentType.CALL`.

Integration: `app.show_results()` calls the renderer after every provision (both interactive `provision` and `provision-env`).

Policy scope (zero wildcards, zero policy-mutation rights):

| Service | Resource | Actions |
| --- | --- | --- |
| S3 | literal bucket ARNs from the terraform outputs | `GetObject`, `PutObject`, `DeleteObject`, `GetObjectAcl`, `PutObjectAcl`, `ListBucket`, `GetBucketLocation` |
| ECR | `*` (AWS requires this Resource shape for `GetAuthorizationToken`) | `GetAuthorizationToken` |
| ECR | `arn:aws:ecr:<region>:<iblai-account>:repository/*` | `BatchCheckLayerAvailability`, `BatchGetImage`, `GetDownloadUrlForLayer` |

Docs:
- `README.md`: new sub-section under "Provision infrastructure" with the rendered commands + scope table. Section 4 (non-interactive `.env` flow) renumbered to a clean 3-step sequence (provision → mint runtime user → setup-env) so the IAM step isn't missed.
- `.env.setup.example`: `AWS_ACCESS_KEY_ID` comment block now directs the operator to the runtime user, not their admin keys.
- `CHANGELOG.md`: `## [1.11.0]` entry.

Tests: 11 new in `tests/test_runtime_iam.py`: policy shape, ARN generation, tight-verb invariants (`s3:*` and bucket-policy mutations explicitly absent), ECR account targeting, call-server skip, partial / empty terraform outputs, JSON round-trip. Full suite: 576 passing in ~1.3 s.

Codebase-cleanliness checks
- `runtime_iam.py` reads only terraform output keys (`s3_bucket_backups`, etc.), then derives ARNs from whatever values the outputs supplied.
- `grep -riE "kaplan|syracuse|ibleducation|iblai\.nonprod"` returns zero hits across the new files.
- … `test-backups`, `p-staging-dm-media`.
- … `runtime_iam.py` rather than scattered across the diff.

Operator UX after this lands
No IBL-side credential handoff. No admin keys on the server. No clicking through the IAM console.

Test plan
- `uv run pytest tests/ -q`: 576 passing
- `iblai infra provision-env -f .env.provision` against a real AWS account: verify the rendered policy lists the literal bucket names from Terraform, three `aws iam` commands render correctly, `<workspace>/runtime-iam-policy.json` exists and is valid JSON
- `aws iam put-user-policy --policy-document file://<workspace>/runtime-iam-policy.json`: AWS accepts the policy
- `.env.setup` → `iblai infra setup-env` completes (ECR pulls succeed, DM/edX can read/write S3)
- `iblai infra provision-env` with `--deployment-type call-server` (if exposed via env): confirm the IAM instructions are skipped silently

🤖 Generated with Claude Code