fix(ibl_edx): layout-agnostic LMS readiness probe + explicit migration check #40
Open
bnsoni wants to merge 1 commit into
…n check
The `Wait for LMS to be ready` task in the `ibl_edx` ansible role has been
the silent failure point of every fresh-bootstrap deployment that takes
the non-AMI compose path. Two compounding bugs land the operator at the
same misleading "wait timed out after 40 retries" message after 10 minutes:
1. The probe target — `http://localhost:8600/heartbeat` — only exists on
tutor's published-port layout. On bootstrap paths that render a
Caddy-fronted layout (LMS reachable only via a Host header on the
local reverse proxy), `:8600` is never bound on the host. The wait
times out even though LMS is healthy on a different port.
2. The launch step has `ignore_errors: true`. When `ibl tutor local
launch -I` silently skips its `tutor local do init` migration step,
the openedx schema stays empty, LMS crash-loops on
`ProgrammingError: Table 'openedx.<table>' doesn't exist`, and the
wait task — not the launch task — is the one that fails. The actual
error gets buried 10 minutes deep in retry noise.
This commit makes the role self-diagnose both failure modes:
- Replace the `:8600` probe with a layout-agnostic
`--resolve learn.{{ base_domain }}:80:127.0.0.1` probe through the
host nginx (always on `:80`, routes by Host to whichever upstream the
deployment rendered — direct LMS, Caddy, etc.). Works uniformly on
every bootstrap path.
- Add a dedicated `Verify openedx migrations ran` task between launch
and the readiness wait. Polls for `waffle_flag` (created early in the
migration sequence) and fails in ≤3 minutes with an actionable
recovery message ("SSH in and run `ibl tutor local do init`, OR
`docker compose down -v` and re-run `ibl edx launch`") if the
schema is empty.
- Move `RestartCount` sampling INTO the readiness loop. A real LMS
crash loop now returns rc=2 from the second iteration (~30s) with the
last error from `docker logs` in stderr, instead of consuming the
full 40-retry budget. rc=1 (not yet ready) remains the normal retry
signal.
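
The probe and the in-loop `RestartCount` sampling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the role's actual task. The function names, the `/heartbeat` path on the new probe, the 15-second delay (inferred from 40 retries over 10 minutes), and the rule "restart count grew since the previous sample means crash loop" are all assumptions. `probe` and `sample_restart_count` are hypothetical callables that, in a real role, would wrap the `curl` command and something like `docker inspect --format '{{.RestartCount}}'`.

```python
import time
from typing import Callable, List, Optional

# Return codes mirroring the commit message: 0 ready, 1 retry, 2 crash loop.
READY, NOT_READY, CRASH_LOOP = 0, 1, 2


def build_probe_command(base_domain: str) -> List[str]:
    """Build a curl command that reaches the LMS through the host nginx on :80.

    --resolve pins learn.<base_domain>:80 to 127.0.0.1 so nginx can route by
    Host header. The /heartbeat path is an assumption carried over from the
    old :8600 probe.
    """
    host = f"learn.{base_domain}"
    return [
        "curl", "--silent", "--fail", "--max-time", "10",
        "--resolve", f"{host}:80:127.0.0.1",
        f"http://{host}/heartbeat",
    ]


def classify_iteration(
    probe_ok: bool,
    restart_count: int,
    previous_restart_count: Optional[int],
) -> int:
    """Map one loop iteration to a return code."""
    if probe_ok:
        return READY
    # Assumed rule: a restart count that grew since the previous sample
    # indicates a crash loop. The first iteration has nothing to compare to.
    if previous_restart_count is not None and restart_count > previous_restart_count:
        return CRASH_LOOP
    return NOT_READY


def wait_for_lms(
    probe: Callable[[], bool],
    sample_restart_count: Callable[[], int],
    retries: int = 40,
    delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry until ready, but stop early when a crash loop is detected."""
    previous: Optional[int] = None
    for _ in range(retries):
        current = sample_restart_count()
        rc = classify_iteration(probe(), current, previous)
        if rc != NOT_READY:
            return rc
        previous = current
        sleep(delay)
    return NOT_READY
```

Because the first iteration has no earlier sample to compare against, a crash loop can only be reported from the second iteration onward, which matches the behaviour the commit message describes.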
Net: an empty-schema LMS now surfaces in ~3 min with the right diagnosis;
a crashing LMS surfaces in ~30s with the actual log error; a healthy
LMS reaches ready on every layout that has a working host nginx.
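
The bounded migration check (a poll for `waffle_flag` that gives up after about 3 minutes) might look roughly like the sketch below. It assumes a hypothetical `table_exists` callable that runs the actual schema query. The function name, the 10-second polling interval, and the injectable `clock` and `sleep` parameters are illustrative choices, not details taken from the role. The recovery text is the message quoted in the commit.

```python
import time
from typing import Callable

# Recovery hint quoted from the commit message.
RECOVERY_MESSAGE = (
    "SSH in and run `ibl tutor local do init`, OR "
    "`docker compose down -v` and re-run `ibl edx launch`"
)


def migrations_ran(
    table_exists: Callable[[], bool],
    timeout_s: float = 180.0,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the sentinel table exists or the deadline passes.

    table_exists is a hypothetical callable; in a real role it would check
    whether the waffle_flag table is present in the openedx schema.
    """
    deadline = clock() + timeout_s
    while True:
        if table_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def verify_migrations(table_exists: Callable[[], bool]) -> None:
    """Fail with the actionable message when the schema is still empty."""
    if not migrations_ran(table_exists):
        raise RuntimeError(f"openedx schema is empty. {RECOVERY_MESSAGE}")
```

Injecting `clock` and `sleep` keeps the timeout logic testable without waiting the full three minutes.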
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why

The `Wait for LMS to be ready` task in the `ibl_edx` ansible role has been the silent failure point of every fresh-bootstrap deployment that takes the non-AMI compose path. Two compounding bugs land the operator at the same misleading "wait timed out after 40 retries" message 10 minutes after the real failure:

1. Wrong probe target: `curl http://localhost:8600/heartbeat` only works on tutor's published-port layout. On bootstrap paths that render a Caddy-fronted layout (LMS reachable only via Host header on the local reverse proxy), `:8600` is never bound on the host. The wait times out even though LMS is healthy on a different port.
2. Silent migration skip: the launch step has `ignore_errors: true`. When `ibl tutor local launch -I` silently skips its `tutor local do init` migration step, the openedx schema stays empty, LMS crash-loops on `ProgrammingError: Table 'openedx.<table>' doesn't exist`, and the wait task is the one that fails. The actual error gets buried 10 minutes deep in retry noise.

What changes
1. Layout-agnostic LMS readiness probe

Probes through the host nginx (always on `:80`), which routes by Host header to whichever LMS upstream the deployment rendered: published-port `:8600`, Caddy on `:81`, or anything else. Works uniformly on every bootstrap path.

2. Explicit migration verification
New task between launch and readiness wait:
If the schema is empty, the role fails in ≤3 min with an actionable recovery message:
`waffle_flag` is created early in the migration sequence (alphabetically among Django apps); if it exists, ~570 other tables also do.

3. Crash-loop early-exit inside the readiness loop
`RestartCount` sampling was previously a separate task that ran after the wait succeeded. If LMS was crash-looping it never ran. Moved it inline:

Real crash loops now surface in ~30s with the actual log error in stderr instead of after 10 min of retries. `rc=1` (not yet ready) remains the normal retry signal; `rc=2` (crash-loop) fails fast.

Failure-mode comparison
[Comparison table not recoverable; surviving header fragments reference `:8600` and `:81`.]

Test plan

- `tests/ansible/test_runner.py::TestConstants` / `TestBuildExtraVars` / `TestExtractRoleFromLine`: all green
- `iblai infra setup <env>` on a fresh box where the previous attempt timed out: should pass through the readiness wait without the 10-minute timeout
- `iblai infra setup <env>` on a deliberately broken box (LMS image with missing schema): should fail in ≤3 min with the migration-check error message, not the readiness-wait timeout

🤖 Generated with Claude Code