fix(ibl_edx): layout-agnostic LMS readiness probe + explicit migration check #40

Open
bnsoni wants to merge 1 commit into main from feat/edx-wait-layout-agnostic


bnsoni commented Apr 29, 2026
Contributor

Why

The Wait for LMS to be ready task in the ibl_edx ansible role has been the silent failure point of every fresh-bootstrap deployment that takes the non-AMI compose path. Two compounding bugs land the operator at the same misleading "wait timed out after 40 retries" message 10 minutes after the real failure:

  1. Wrong probe target: curl http://localhost:8600/heartbeat only works on tutor's published-port layout. On bootstrap paths that render a Caddy-fronted layout (LMS reachable only via Host header on the local reverse proxy), :8600 is never bound on the host. The wait times out even though LMS is healthy on a different port.

  2. Silent migration skip — the launch step has ignore_errors: true. When ibl tutor local launch -I silently skips its tutor local do init migration step, the openedx schema stays empty, LMS crash-loops on ProgrammingError: Table 'openedx.<table>' doesn't exist, and the wait task is the one that fails. The actual error gets buried 10 minutes deep in retry noise.

What changes

1. Layout-agnostic LMS readiness probe

- curl -s -o /dev/null -w '%{http_code}' http://localhost:8600/heartbeat | grep -q 200
+ curl -sk --resolve learn.{{ base_domain }}:80:127.0.0.1 \
+   -o /dev/null -w '%{http_code}' --max-time 5 \
+   http://learn.{{ base_domain }}/heartbeat

Probes through the host nginx (always on :80) which routes by Host header to whichever LMS upstream the deployment rendered — published-port :8600, Caddy on :81, or anything else. Works uniformly on every bootstrap path.
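
A minimal sketch of how the probe's status code could be turned into a pass/fail result. The helper names `probe_code` and `is_ready` are hypothetical and not taken from the role, and the domain argument stands in for `{{ base_domain }}`. curl prints `000` for `%{http_code}` when it gets no HTTP response at all, so anything other than `200` is treated as not ready.

```shell
# Hypothetical helpers for illustration; the role's actual task may differ.

# Print the HTTP status of the LMS heartbeat, pinning the hostname to
# 127.0.0.1:80 so the request goes through the host nginx.
probe_code() {
    base_domain="$1"   # stand-in for the role's {{ base_domain }}
    curl -s --resolve "learn.${base_domain}:80:127.0.0.1" \
        -o /dev/null -w '%{http_code}' --max-time 5 \
        "http://learn.${base_domain}/heartbeat"
}

# Succeed only when the heartbeat answered 200.
is_ready() {
    [ "$1" = "200" ]
}

# Example wiring (left commented out so that sourcing this has no side effects):
# is_ready "$(probe_code example.com)"
```

Keeping the comparison in its own function makes the pass condition explicit and easy to check without a running LMS.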

2. Explicit migration verification

New task between launch and readiness wait:

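- name: Verify openedx migrations ran (waffle_flag must exist)
  # Block scalar so the backslash continuation reaches the shell intact.
  shell: |
    docker exec ibl_prod_mysql mysql -uroot -p"$PW" \
      -e "SHOW TABLES LIKE 'waffle_flag';" openedx | grep -q waffle_flag
  register: openedx_migrations_ok   # referenced by the until condition below
  retries: 6
  delay: 30
  until: openedx_migrations_ok.rc == 0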

If the schema is empty, the role fails in ≤3 min with an actionable recovery message:

ibl edx launch returned but openedx.waffle_flag does not exist — the tutor local do init step did not complete migrations. SSH in and run ibl tutor local do init manually, OR docker compose down -v from {{ ibl_root }}/app/ibl-edx/ibl-edx-pro/env/local/ and re-run ibl edx launch.

waffle_flag is created early in the migration sequence (alphabetically among Django apps); if it exists, ~570 other tables also do.
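
The ≤3 min bound follows from the retry budget: 6 attempts spaced 30 s apart is at most 180 s of waiting. The sketch below shows the same bounded-polling idea in plain shell. `poll_until` is a hypothetical helper, and Ansible's own attempt counting for `retries` may differ by one from this loop, so treat the numbers as approximate.

```shell
# poll_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS runs have failed.
# Hypothetical helper, not part of the role.
poll_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example wiring (assumes a check_waffle_flag function wrapping the mysql query):
# poll_until 6 30 check_waffle_flag || echo "migrations did not complete" >&2
```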

3. Crash-loop early-exit inside the readiness loop

RestartCount sampling was previously a separate task that ran after the wait succeeded. If LMS was crash-looping it never ran. Moved it inline:

RC=$(docker inspect --format '{{.RestartCount}}' ibl_prod_lms)
if [ "$RC" -gt 5 ]; then
    # docker logs replays the container's stderr on stderr, so merge it before grepping.
    LAST=$(docker logs --tail 30 ibl_prod_lms 2>&1 | grep -Ei 'error|exception|traceback|fatal' | tail -3)
    echo "LMS_CRASHLOOP RestartCount=$RC last_error: $LAST" >&2
    exit 2   # <- ansible until-loop fails immediately, doesn't burn retries
fi

Real crash loops now surface in ~30s with the actual log error in stderr instead of after 10 min of retries. rc=1 (not yet ready) remains the normal retry signal; rc=2 (crash-loop) fails fast.
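
The exit-code contract can be sketched as a pure function of the two sampled values. `readiness_rc` is a hypothetical name, the threshold of 5 mirrors the snippet above, and checking the restart count before the HTTP status is an assumption about ordering rather than something this description confirms.

```shell
# readiness_rc HTTP_CODE RESTART_COUNT [MAX_RESTARTS]
# Returns 0 = ready, 1 = not ready yet (retry), 2 = crash loop (stop retrying).
# Hypothetical helper for illustration only.
readiness_rc() {
    http_code="$1"
    restart_count="$2"
    max_restarts="${3:-5}"
    # Assumed ordering: a crash loop wins over any HTTP result.
    if [ "$restart_count" -gt "$max_restarts" ]; then
        return 2
    fi
    if [ "$http_code" = "200" ]; then
        return 0
    fi
    return 1
}
```

Separating the decision from the `docker inspect` and `curl` calls lets the three outcomes be exercised without a running container.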

Failure-mode comparison

| Scenario | Before | After |
| --- | --- | --- |
| LMS healthy on AMI layout (:8600) | wait passes ✓ | wait passes ✓ |
| LMS healthy on Caddy layout (:81) | wait times out (10 min) ✗ | wait passes ✓ |
| LMS crash-loops on missing schema | wait times out (10 min), buries error ✗ | migration check fails (3 min) with recovery instructions ✓ |
| LMS crash-loops for another reason | wait times out (10 min), buries error ✗ | wait fails fast (30s) with last log error ✓ |

Test plan

  • YAML parses
  • tests/ansible/test_runner.py::TestConstants/TestBuildExtraVars/TestExtractRoleFromLine — all green
  • Re-run iblai infra setup <env> on a fresh box where the previous attempt timed out — should pass through the readiness wait without 10-minute timeout
  • Re-run iblai infra setup <env> on a deliberately broken box (LMS image with missing schema) — should fail in ≤3 min with the migration-check error message, not the readiness-wait timeout

🤖 Generated with Claude Code

…n check

The `Wait for LMS to be ready` task in the `ibl_edx` ansible role has been
the silent failure point of every fresh-bootstrap deployment that takes
the non-AMI compose path. Two compounding bugs land the operator at the
same misleading "wait timed out after 40 retries" message after 10 minutes:

1. The probe target — `http://localhost:8600/heartbeat` — only exists on
   tutor's published-port layout. On bootstrap paths that render a
   Caddy-fronted layout (LMS reachable only via a Host header on the
   local reverse proxy), `:8600` is never bound on the host. The wait
   times out even though LMS is healthy on a different port.

2. The launch step has `ignore_errors: true`. When `ibl tutor local
   launch -I` silently skips its `tutor local do init` migration step,
   the openedx schema stays empty, LMS crash-loops on
   `ProgrammingError: Table 'openedx.<table>' doesn't exist`, and the
   wait task — not the launch task — is the one that fails. The actual
   error gets buried 10 minutes deep in retry noise.

This commit makes the role self-diagnose both failure modes:

- Replace the `:8600` probe with a layout-agnostic
  `--resolve learn.{{ base_domain }}:80:127.0.0.1` probe through the
  host nginx (always on `:80`, routes by Host to whichever upstream the
  deployment rendered — direct LMS, Caddy, etc.). Works uniformly on
  every bootstrap path.

- Add a dedicated `Verify openedx migrations ran` task between launch
  and the readiness wait. Polls for `waffle_flag` (created early in the
  migration sequence) and fails in ≤3 minutes with an actionable
  recovery message ("SSH in and run `ibl tutor local do init`, OR
  `docker compose down -v` and re-run `ibl edx launch`") if the
  schema is empty.

- Move `RestartCount` sampling INTO the readiness loop. A real LMS
  crash loop now returns rc=2 from the second iteration (~30s) with the
  last error from `docker logs` in stderr, instead of consuming the
  full 40-retry budget. rc=1 (not yet ready) remains the normal retry
  signal.

Net: an empty-schema LMS now surfaces in ~3 min with the right diagnosis;
a crashing LMS surfaces in ~30s with the actual log error; a healthy
LMS reaches ready on every layout that has a working host nginx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>