Skip to content

[spark-compete] Narrow bare except in spawner-ui liveness urlopen in cli.py to (urllib.error.URLError, TimeoutError)#537

Closed
4gjnbzb4zf-sudo wants to merge 1 commit into
vibeforge1111:masterfrom
4gjnbzb4zf-sudo:sentinel/exception-narrow/cli-spawner-runtime-health-readycheck
Closed

[spark-compete] Narrow bare except in spawner-ui liveness urlopen in cli.py to (urllib.error.URLError, TimeoutError)#537
4gjnbzb4zf-sudo wants to merge 1 commit into
vibeforge1111:masterfrom
4gjnbzb4zf-sudo:sentinel/exception-narrow/cli-spawner-runtime-health-readycheck

Conversation

@4gjnbzb4zf-sudo

@4gjnbzb4zf-sudo 4gjnbzb4zf-sudo commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

{
"schema": "spark-compete-hotfix-v1",
"event": "spark-compete-first-event",
"submission_mode": "public_repo_pr",
"submission_target_url": "#537",
"team": {
"name": "SparkThisUp",
"members": ["ValHallaBuilder", "Baz707", "DanFireDash"],
"github_accounts": ["4gjnbzb4zf-sudo"],
"llm_device_holder": "ValHallaBuilder",
"device_holder_github": "4gjnbzb4zf-sudo"
},
"target_repo": {
"id": "vibeforge1111/spark-cli",
"source": "https://github.com/vibeforge1111/spark-cli",
"owner_surface": "spark-cli"
},
"issue": {
"type": "bug",
"severity": "low",
"title": "evaluate_module_health swallows programming bugs as a fake 'Spawner UI live health failed' status",
"actual_behavior": "src/spark_cli/cli.py L5039 evaluate_module_health issues a GET against the spawner-ui liveness endpoint and wraps the urllib.request.urlopen call in try: ... except Exception as exc: healthy=False; detail=f'Spawner UI live health failed: {exc}'. The try body contains only urllib.request.Request construction and urllib.request.urlopen plus an int() cast on response.status; the only exceptions those calls can raise are urllib.error.URLError (the parent of HTTPError, covering connection-refused, DNS failure, and socket-level failures) and TimeoutError (when the socket-level timeout fires). The bare Exception folds unrelated runtime errors (NameError from a typo in the helper, AttributeError if module.manifest gets mis-typed by a refactor) into the same 'Spawner UI live health failed' detail string; operators looking at spark status see 'live health failed' and start debugging the spawner-ui process when really the bug is in this branch of cli.py.",
"expected_behavior": "The 'live health failed' detail string fires only for the documented failure modes: urllib.error.URLError when the GET cannot complete (connection refused, DNS, HTTP error) and TimeoutError when the socket-level timeout fires. Programming bugs propagate as a real traceback.",
"repro_steps": [
"gh pr checkout ",
"PYTHONPATH=src python3 -m unittest tests.test_cli.SparkCliTests.test_spawner_health_records_url_error_via_narrow_tuple tests.test_cli.SparkCliTests.test_spawner_health_records_timeout_via_narrow_tuple -> 2 tests OK; URLError and TimeoutError both still route through the narrow tuple and produce the documented detail.",
"PYTHONPATH=src python3 -c "import urllib.error; print(issubclass(urllib.error.HTTPError, urllib.error.URLError)); print(issubclass(urllib.error.URLError, OSError)); print(issubclass(TimeoutError, OSError))" -> True True True. HTTPError is a URLError; URLError and TimeoutError are both OSError subclasses, so the narrow tuple is the right shape for stdlib urlopen.",
"Inverse: drop a Python-side typo into the try body (e.g. rename urllib.request.urlopen to urllib.request.urlopenZZZ); before the change, the AttributeError gets swallowed and spark status reports 'Spawner UI live health failed: module ...has no attribute urlopenZZZ' as if the spawner were down. After the change, the AttributeError propagates so the operator sees the real bug."
],
"affected_workflow": "evaluate_module_health is called by spark status to surface per-module health to the operator. The spawner-ui liveness check is the gating signal that tells the operator whether the spawner is reachable and serving the Spark Live container's mission-control endpoints. When this branch silently swallows a Python bug, spark status reports 'Spawner UI live health failed' and the operator starts debugging the spawner-ui process, port mapping, or the SPARK_LIVE_CONTAINER setup; they never see the real Python traceback that would point at the cli.py helper. Downstream, that wrong status feeds into the mission-control posture and the spawner-ui ready-watch loop -- they assume the spawner is unreachable and skip starting missions that depend on it."
},
"evidence": {
"safe_links_only": true,
"before_after_proof": "Single try/except pattern at cli.py L5058. Before: except Exception as exc: healthy=False; detail=f'Spawner UI live health failed: {exc}'. After: except (urllib.error.URLError, TimeoutError) as exc: healthy=False; detail=f'Spawner UI live health failed: {exc}'. The except body is byte-identical; only the catch tuple changes. urllib.error.URLError is the parent class for HTTPError (4xx/5xx and protocol errors) and wraps the connection-refused / DNS / socket failures urlopen produces; TimeoutError covers the socket-level timeout from urlopen(timeout=N). Both are OSError subclasses. Diff: 1 insertion, 1 deletion in src/spark_cli/cli.py plus 51 insertions in tests/test_cli.py for two focused unittest.mock.patch tests.",
"links": ["https://github.com//pull/537"],
"forbidden": ["pdf","zip","exe","unknown downloads","shortened links","archives","binaries","tokens","browser cookies","wallet material","raw logs","raw conversations","raw memory","raw patches","private repo maps","private scoring details"]
},
"proposed_fix": {
"approach": "Replace except Exception as exc: with except (urllib.error.URLError, TimeoutError) as exc: in evaluate_module_health (cli.py L5063). The narrow tuple is exhaustive for the urlopen path: urllib.error.URLError is the parent of HTTPError and covers every connection-level and HTTP-protocol failure urlopen produces, and TimeoutError covers the socket-timeout case the urlopen(timeout=N) keyword can surface directly. urllib.error is already imported at the top of cli.py (line 25). Two unittest.mock.patch tests assert URLError and TimeoutError both still route through the narrow tuple and produce the documented 'Spawner UI live health failed' detail.",
"files_expected": ["src/spark_cli/cli.py", "tests/test_cli.py"],
"tests_or_smoke": "Manual smoke verified: subclass tree -- HTTPError IS URLError True, URLError IS OSError True, TimeoutError IS OSError True. AST parse of cli.py is clean. The two new unittest.mock.patch tests pass: PYTHONPATH=src python3 -m unittest tests.test_cli.SparkCliTests.test_spawner_health_records_url_error_via_narrow_tuple tests.test_cli.SparkCliTests.test_spawner_health_records_timeout_via_narrow_tuple -> 2 tests OK. Sibling test_spawner_health_uses_liveness_endpoint_in_hosted_mode still passes."
},
"pr": {
"branch": "sentinel/exception-narrow/cli-spawner-runtime-health-readycheck",
"title_prefix": "[spark-compete]",
"author_github": "4gjnbzb4zf-sudo",
"body_must_include": ["packet","team","pr_author","repo","actual_behavior","expected_behavior","repro_steps","before_after_proof","tests_or_smoke","duplicate_notes","risk_notes","review_claim"],
"url": "#537"
},
"review_claim": {
"impact_claim": "low",
"evidence_types": ["redacted_terminal_excerpt"],
"duplicate_notes": "Pre-flight gh pr list --search 'spawner_runtime_health OR evaluate_module_health narrow' returned zero hits. Our own #527 narrowed urllib+json paths elsewhere in cli.py, #528 narrowed save_json, and #532-#536 covered system_map.py. This single-site fix in evaluate_module_health is a distinct branch (L5063, spawner-ui liveness urlopen) that no prior or competitor PR touches.",
"risk_notes": "Local scope: one file changed for the fix (one line), plus tests. The except body is byte-identical pre/post. The narrow tuple is exhaustive for the urlopen path -- HTTPError flows through URLError, socket-level connection failures flow through URLError (which itself is an OSError subclass), and socket-timeout flows through TimeoutError. Two unittest.mock.patch tests assert both classes still route to the documented detail string. The only observable behavior change is that programming bugs no longer surface as a fake 'Spawner UI live health failed' status.",
"review_state_requested": "pr_review"
}
}

…b.error.URLError, TimeoutError)

evaluate_module_health (cli.py L5063) issues a GET against the spawner-ui
liveness endpoint and wraps the urllib.request.urlopen call in
except Exception: healthy=False; detail=f'Spawner UI live health failed: {exc}'.
The try body contains only urllib.request.Request construction and
urllib.request.urlopen; the only exceptions those calls can raise are
urllib.error.URLError (parent of HTTPError, covering connection-refused,
DNS failure, and socket-level failures) and TimeoutError (when the
socket-level timeout fires).

The bare Exception swallows unrelated runtime errors (NameError from
a typo in ready_check_headers, AttributeError if module.manifest gets
mis-typed) into the same 'live health failed' detail, hiding programming
bugs behind a fake 'spawner is down' status. Operators looking at
spark status see 'Spawner UI live health failed' and start debugging
the spawner-ui process when really the bug is in this helper.

Narrow tuple is exhaustive for the urlopen path; except body is
byte-identical pre/post. Two unittest.mock.patch tests assert URLError
and TimeoutError both still route through the narrow tuple and produce
the documented 'live health failed' detail.
@4gjnbzb4zf-sudo

Copy link
Copy Markdown
Contributor Author

QA write-up — for reviewer eyes

TL;DR. evaluate_module_health (src/spark_cli/cli.py L5039) issues a GET against the spawner-ui liveness endpoint to back spark status. The urlopen call is wrapped in except Exception: so a Python-side typo in the helper surfaces as a fake "Spawner UI live health failed" status, sending the operator down a wrong debugging trail. Narrow to (urllib.error.URLError, TimeoutError) -- the exhaustive set for stdlib urlopen.

Before

try:
    request = urllib.request.Request(health_url, headers=ready_check_headers(health_url))
    with urllib.request.urlopen(request, timeout=ready_timeout_seconds(module)) as response:
        healthy = 200 <= int(response.status) < 300
        detail = f"Spawner UI live health {'OK' if healthy else 'failed'}: HTTP {response.status}"
except Exception as exc:
    healthy = False
    detail = f"Spawner UI live health failed: {exc}"

If a typo or refactor lands a NameError or AttributeError inside the try block (e.g. a rename misses urllib.request.urlopen), spark status reports "Spawner UI live health failed: module ...has no attribute urlopenZZZ" as if the spawner were down. The operator restarts the spawner-ui process, the symptom stays, and the real bug stays hidden in cli.py.

After

except (urllib.error.URLError, TimeoutError) as exc:
    healthy = False
    detail = f"Spawner UI live health failed: {exc}"

Same "live health failed" short-circuit fires for the documented failure modes: urllib.error.URLError (parent of HTTPError; covers connection-refused, DNS, socket-level failure, 4xx/5xx) and TimeoutError (socket-level timeout from urlopen(timeout=N)). Programming bugs propagate as a real traceback.

Smoke

$ PYTHONPATH=src python3 -c "
import urllib.error
print('HTTPError IS URLError:', issubclass(urllib.error.HTTPError, urllib.error.URLError))
print('URLError IS OSError:', issubclass(urllib.error.URLError, OSError))
print('TimeoutError IS OSError:', issubclass(TimeoutError, OSError))

try:
    raise AttributeError('typo')
except (urllib.error.URLError, TimeoutError):
    print('FAIL: AttributeError caught')
except AttributeError:
    print('OK: AttributeError propagates past narrow tuple')
"
HTTPError IS URLError: True
URLError IS OSError: True
TimeoutError IS OSError: True
OK: AttributeError propagates past narrow tuple

Two unittest.mock.patch tests pass:

$ PYTHONPATH=src python3 -m unittest \
    tests.test_cli.SparkCliTests.test_spawner_health_records_url_error_via_narrow_tuple \
    tests.test_cli.SparkCliTests.test_spawner_health_records_timeout_via_narrow_tuple -v
test_spawner_health_records_url_error_via_narrow_tuple ... ok
test_spawner_health_records_timeout_via_narrow_tuple ... ok
Ran 2 tests in 0.001s
OK

Sibling test_spawner_health_uses_liveness_endpoint_in_hosted_mode still passes (verified locally).

What didn't change

One file changed for the fix (one line), plus tests. Except body byte-identical pre/post. The spawner_should_use_liveness_endpoint / spawner_liveness_can_trust_local_port fast-paths above this branch are unchanged. URLError + TimeoutError is exhaustive for stdlib urlopen. Only observable behavior change: programming bugs in the helper no longer surface as a fake "Spawner UI live health failed" status.

@vibeforge1111

Copy link
Copy Markdown
Owner

Thanks, this was useful. We maintainer-adopted this exception-boundary improvement in #603 and merged it into master at 70b1034a1ce490fddfb3a5d0ba5d6ab098a9f89f.

What shipped:

  • Spawner UI liveness failures now catch only expected URL/timeout failures.
  • System-map filesystem readers now catch expected OSError failures.
  • Builder SQLite readers now catch expected (sqlite3.Error, OSError) failures.

Verification on the adopted PR:

  • focused Spawner/system-map tests passed
  • full tests/test_system_map.py passed
  • full tests/test_cli.py passed
  • compileall, git diff --check, registry-pin verification, CodeQL, scorecard, secret-scan, and test-and-audit all passed

Source credit is recorded for this PR, and #603 itself should not receive participant credit. Points are still gated because this PR body had spark-compete-hotfix-v1 schema text but not a clean fenced JSON packet that the validator can extract. To unlock review for points, add a clean fenced JSON packet in a PR comment or body update, then validate it at https://compete.sparkswarm.ai/api/packet/validate.

Closing this PR as adopted into #603. No code action is needed unless you believe #603 missed part of the issue.

@vibeforge1111

Copy link
Copy Markdown
Owner

R22/R23 final scoring update: this PR did receive final R22/R23 public leaderboard credit. Earlier gate, credit-review, or points-lock wording described the pre-final review state, not a final rejection or permanent zero. Final R22/R23 credit for this PR: 48 points. Thanks for the contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants