Skip to content

fix: flaky Windows cross-user test and record_pr workflow error#178

Open
mwiebe wants to merge 1 commit into
OpenJobDescription:mainfrom
mwiebe:flaky-test
Open

fix: flaky Windows cross-user test and record_pr workflow error#178
mwiebe wants to merge 1 commit into
OpenJobDescription:mainfrom
mwiebe:flaky-test

Conversation

@mwiebe
Copy link
Copy Markdown
Contributor

@mwiebe mwiebe commented May 12, 2026

What was the problem/requirement? (What/Why)

Root caused a flaky test to an issue needing a fix, and need to fix the CI reusable_record_pr_details.yml workflow.

What was the solution? (How)

CreateEnvironmentBlock reads from the user's registry hive at HKEY_USERS<SID>. Windows unloads that hive asynchronously when the profile's refcount drops to zero. Back-to-back logons for the same user can race with an in-progress unload and surface as:

  • 0x800703FA ERROR_KEY_DELETED - 'Illegal operation attempted on a registry key that has been marked for deletion'
  • 0x80070005 E_ACCESSDENIED - documented by Microsoft as a related symptom of the same race

Observed as a flaky failure in the Cross-User (Windows) CI job on test_cross_user_helpers_dir_dacl_protects_from_session_user after the preceding test's session cleanup. All other cross-user tests in the same run passed, confirming a transient race rather than a logic bug.

Wrap CreateEnvironmentBlock in environment_for_user with a bounded retry: 5 attempts with geometric backoff (20/40/80/160 ms, worst-case ~300 ms). Only the two transient HRESULTs above trigger a retry; every other error returns immediately so real failures still surface promptly.

Preserves the defensive null-pointer guard added in 8a48851 by moving it into the Ok(()) arm of the retry loop - block_ptr is freshly initialized on each attempt and must be checked before being passed to parse_environment_block or DestroyEnvironmentBlock.

The upstream Python openjd-sessions has a first-hand comment describing this same class of error in _win32/_locate_executable.py; its mitigation is forcing handle cleanup between calls, which is more fragile than a bounded retry.

Updates specs/sessions/subprocess.md with a new 'CreateEnvironmentBlock profile-unload race' subsection documenting the retry policy.

The reusable workflow at
OpenJobDescription/.github/.github/workflows/reusable_record_pr_details.yml declares 'permissions: contents: read' in its workflow_call block. GitHub Actions enforces that a caller cannot invoke a reusable workflow requesting more permissions than the caller itself has granted, and the workflow-root 'permissions: {}' hardening added in 8a48851 zeroed out every permission - including the 'contents: read' the reusable workflow needs.

Add a job-level 'permissions: contents: read' block on the job that invokes the reusable workflow, matching the pattern already used by codeql.yml, responded.yml, and stale_prs_and_issues.yml.

Resolves the 'requesting contents: read, but is only allowed
contents: none' validation error.

  • cargo check -p openjd-sessions (Linux; win32.rs is #[cfg(windows)])
  • cargo clippy -p openjd-sessions --all-targets -- -D warnings
  • cargo fmt --all -- --check
  • cargo test -p openjd-sessions --lib
  • python3 YAML lint on .github/workflows/record_pr.yml

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

CreateEnvironmentBlock reads from the user's registry hive at
HKEY_USERS\<SID>. Windows unloads that hive asynchronously when the
profile's refcount drops to zero. Back-to-back logons for the same user
can race with an in-progress unload and surface as:

  * 0x800703FA ERROR_KEY_DELETED - 'Illegal operation attempted on a
    registry key that has been marked for deletion'
  * 0x80070005 E_ACCESSDENIED - documented by Microsoft as a related
    symptom of the same race

Observed as a flaky failure in the Cross-User (Windows) CI job on
test_cross_user_helpers_dir_dacl_protects_from_session_user after the
preceding test's session cleanup. All other cross-user tests in the same
run passed, confirming a transient race rather than a logic bug.

Wrap CreateEnvironmentBlock in environment_for_user with a bounded
retry: 5 attempts with geometric backoff (20/40/80/160 ms,
worst-case ~300 ms). Only the two transient HRESULTs above trigger a
retry; every other error returns immediately so real failures still
surface promptly.

Preserves the defensive null-pointer guard added in 8a48851 by moving
it into the Ok(()) arm of the retry loop - block_ptr is freshly
initialized on each attempt and must be checked before being passed to
parse_environment_block or DestroyEnvironmentBlock.

The upstream Python openjd-sessions has a first-hand comment describing
this same class of error in _win32/_locate_executable.py; its
mitigation is forcing handle cleanup between calls, which is more
fragile than a bounded retry.

Updates specs/sessions/subprocess.md with a new 'CreateEnvironmentBlock
profile-unload race' subsection documenting the retry policy.

The reusable workflow at
OpenJobDescription/.github/.github/workflows/reusable_record_pr_details.yml
declares 'permissions: contents: read' in its workflow_call block.
GitHub Actions enforces that a caller cannot invoke a reusable workflow
requesting more permissions than the caller itself has granted, and the
workflow-root 'permissions: {}' hardening added in 8a48851 zeroed out
every permission - including the 'contents: read' the reusable workflow
needs.

Add a job-level 'permissions: contents: read' block on the job that
invokes the reusable workflow, matching the pattern already used by
codeql.yml, responded.yml, and stale_prs_and_issues.yml.

Resolves the 'requesting contents: read, but is only allowed
contents: none' validation error.

  * cargo check -p openjd-sessions (Linux; win32.rs is #[cfg(windows)])
  * cargo clippy -p openjd-sessions --all-targets -- -D warnings
  * cargo fmt --all -- --check
  * cargo test -p openjd-sessions --lib
  * python3 YAML lint on .github/workflows/record_pr.yml

Signed-off-by: Mark <399551+mwiebe@users.noreply.github.com>
@mwiebe mwiebe requested a review from a team as a code owner May 12, 2026 00:52
@mwiebe mwiebe enabled auto-merge (rebase) May 12, 2026 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant