Skip to content

Add EFT-3295 Tentacle script abandon design doc#1226

Open
jimmyp wants to merge 41 commits into
mainfrom
jimpelletier/eft-3295-tentacle-script-abandonment-to-release-the-mutex
Open

Add EFT-3295 Tentacle script abandon design doc#1226
jimmyp wants to merge 41 commits into
mainfrom
jimpelletier/eft-3295-tentacle-script-abandonment-to-release-the-mutex

Conversation

@jimmyp
Copy link
Copy Markdown

@jimmyp jimmyp commented May 21, 2026

Background

EFT-3295. Tentacle script can hang inside endpoint protection (Philips' CrowdStrike + Rapid7 contention). Cancel doesn't propagate, the script-isolation mutex stays held, subsequent deployments queue forever. Customer's only recovery today is RDP-in-and-kill or reboot.

Results

Adds a design doc at docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md proposing a new AbandonScript verb on IScriptServiceV2. Tentacle releases the mutex, logs honestly that the script may still be running, frees the .NET thread, does NOT kill the OS process.

Implementation approach: async. Replace SilentProcessRunner.ExecuteCommand outright with ExecuteCommandAsync (no additive overload). All ~20 callers across production code, Kubernetes integration tests, and Tentacle integration test scaffolding migrate to await.

Server-side contract is locked with the parallel session via the EFT-3295 Linear thread. Capability check up front. Single feature flag, server-side: AbandonTentacleScriptOnCancellationTimeoutFeatureToggle. Tentacle's capability advertisement is binary on build version.

Before

Status quo. Tentacle hangs indefinitely on process.WaitForExit() when Cancel can't kill the process. Mutex stays held. Deployments queue forever until reboot.

After

Server escalates to AbandonScript after 2 min. Tentacle's await process.WaitForExitAsync(abandon) returns. The existing using block disposes naturally, mutex released. Tentacle writes an honest line into the script log and accepts new work. OS process is left alone per the ticket.

How to review this PR

Doc-only PR. One open question remaining:

  1. Workspace cleanup policy. Best-effort leak + log vs janitor task (Section 3).

Earlier review questions resolved in the latest commit:

  • Async approach locked (was Option 1 vs Option 2).
  • Tentacle-side feature flag dropped (Luke flagged it can't be cleanly toggled without versioning the contract).
  • stdout/stderr-after-abandon paragraph fixed to describe the real behaviour (Process disposal closes our pipe handles on method unwind).

Tentacle-side scope only. Server-side has its own design doc in OctopusDeploy.

Quality ✔️
Design review only.

🤖 Generated with Claude Code

[JIM_BOT.EXE v2.13]

Design proposing new AbandonScript verb on IScriptServiceV2 to release
the script-isolation mutex when Cancel can't propagate. Contract locked
with parallel server-side session via the EFT-3295 Linear thread.
Implementation choice (Option 1 async vs Option 2 surgical
ManualResetEventSlim swap) preserved for external review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimmyp jimmyp requested a review from a team as a code owner May 21, 2026 03:57
jimmyp and others added 3 commits May 21, 2026 14:02
Adds question 3 to "Open questions for external reviewer" asking whether
the Tentacle-side feature flag is pulling weight given the server-side
toggle already governs whether abandon dispatches at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Section 4.3 now lists ScriptBuilder, TestExecuteShellScriptCommandBuilder,
TentacleServiceDecoratorBuilder, and Wait.For by name and points at
ClientScriptExecutionIsolationMutex.cs as the exemplar. File-signal
pattern (.CreateFile / .WaitForFileToExist) replaces sleep-based timing
in stuck-script tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Decision: replace ExecuteCommand with ExecuteCommandAsync outright. All
~20 callers across production code, Kubernetes integration tests, and
Tentacle integration test scaffolding migrate to await. Section 2
rewritten to commit to the async approach; the sync ManualResetEventSlim
variant moves to the rejected-alternatives list. Section 4 timing and
thread-leak tests consolidated. Open questions trimmed to two
(workspace cleanup, Tentacle-side flag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs`
- `source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs`

**What happens to stdout/stderr after abandon.** The OS process is still running and may still write to its stdout/stderr pipes. The `OutputDataReceived` handler we registered is still wired. On the abandon catch, we let the Process object's handles release through normal GC — the OS process eventually blocks on its next stdout write when the pipe buffer fills (~64KB), which is acceptable per the ticket's "we are not interfering" framing. Tentacle's resources are bounded.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude is this up to date? If we trigger the abandon token the very next thing looks like it call methods that stops process writing to its pipes

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught. Spec was lying. Fixed in the latest commit. The outer using (var process = new Process()) disposes the Process on method unwind, which closes our pipe handles. OS process may get EPIPE on its next stdout/stderr write. Consistent with the ticket because we're closing our own handles, not killing the process. Updated paragraph explains it honestly.

Comment thread docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md

**Normal-path timing regression check.** Run a 100-iteration benchmark of normal short-script execution (`Write-Host "x"`); compare median wall-clock time vs. a baseline build without the changes. **Verify:** median delta within margin of error. The async swap should not measurably slow normal script execution.

### 4.4 Feature flag verification
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LukeButters we can't really effectively feature flag this on tentacle can we? Our only option would be to create a script service v2 which we said isn't worth the overhead of supporting that version forever in the wild right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Tentacle-side flag dropped in the latest commit. Capability advertisement is binary on build version. Server-side AbandonTentacleScriptOnCancellationTimeoutFeatureToggle is the only off-switch. @LukeButters if you disagree, this is the load-bearing comment to push back on.


## Section 4 — Automated test strategy

### 4.1 `SilentProcessRunner` unit tests (both options)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have selected an option now. So check the whole spec for optionality like this where its not needed. Also make things a bit more succinct

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up. Async is locked, "(both options)" headers and Option 1/2 comparison language removed throughout. Section 4.4 (Tentacle-flag tests) deleted, Section 5 M5 rewritten around the server-side flag, ~30 lines trimmed in the latest commit.

PR comments addressed:
- Tentacle-side feature flag dropped entirely. Can't be cleanly toggled
  at runtime without versioning the contract. Server-side toggle is the
  only off-switch.
- stdout/stderr-after-abandon paragraph rewritten honestly: the outer
  using (var process = new Process()) disposes Process on method
  unwind, closing our pipe handles. The OS process may get EPIPE on
  next write; consistent with the ticket's "we're not killing it"
  framing because we're closing our handles, not the process.
- "(both options)" header noise removed from Section 4.1/4.2 and
  Section 3. Section 4.4 (Tentacle-side flag tests) deleted.
- Coordination section reflects the single server-side flag, plus the
  V1+V2 execution-pipeline scope and polling-loop trigger updates the
  server-side session posted on Linear (no Tentacle-side impact).
- M5 manual scenario rewritten around the server-side flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- **Manual `Thread` instead of `Task`.** Same leak problem, just trades threadpool for kernel thread handles + stack memory.
- **`Thread.Abort` / `Thread.Interrupt` / `TerminateThread` P/Invoke.** No safe managed mechanism to release a thread parked in unmanaged code. `TerminateThread` doesn't unwind stack or release locks; can corrupt Tentacle's own state.
- **Out-of-process script worker.** Cleanly isolates the stuck-process problem from Tentacle, but is a massive refactor far outside EFT-3295's scope. Worth a separate proposal someday.
- **Sync cancellable wait via `ManualResetEventSlim.Wait()`** (the earlier "Option 2" we held open for external input). Replaces only the blocking primitive inside `SilentProcessRunner`, leaves everything else synchronous. Smaller diff, but preserves a parked thread per running script in the normal case (same cost as today) and doesn't move the codebase toward async. Rejected in favour of the async approach below — direction matters, not just diff size.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And tentacle is very well tested so we believe we can make this change safely

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Worked it into the rejected bullet: "Rejected in favour of the async approach below. Tentacle's existing test coverage gives us confidence the wider async migration is safe to ship, so the smaller-diff defensiveness isn't compelling."


**Why a new verb (not a "force" flag on Cancel).** Different semantics: Cancel = "try to stop the OS process gracefully". Abandon = "give up tracking; release the mutex; the OS process may still be running". Two verbs map cleanly to ProcessExecution's two-step escalation (cancel first, abandon if cancel doesn't propagate).

## Section 2 — Mutex release mechanics (locked: async)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(locked: async) still refers to earlier optionality. You can remove all of this and just leave the considered section as the only mention of optionality

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Dropped (locked: async) from Section 2 header and (locked) from Section 1 header for consistency. Status line at the top simplified too. The rejected-alternatives section is now the only place that mentions past optionality.

Comment thread docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.md
jimmyp and others added 2 commits May 21, 2026 16:13
CompleteScript reads stateStore exit code. If AbandonedExitCode, the
delete is wrapped in try/catch and leaks the directory with a Warn
log. Any other exit code follows existing behaviour (exceptions
propagate), so bugs that leak handles on normal completion paths
can't hide under a blanket try/catch. No janitor; OS-level host state
is the customer's responsibility per the ticket.

Closes the last open question. Spec status: ready for implementation
planning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
catch, note test coverage as confidence basis

- Section headers no longer flag past optionality. Rejected
  alternatives section is where the option history lives.
- Abandon catch now calls SafelyCancelRead before returning, mirroring
  the non-abandon cleanup path so async readers don't keep firing
  during method unwind.
- Sync-MRES rejected bullet explicitly cites Tentacle's existing test
  coverage as why the smaller-diff defensiveness isn't compelling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `TentacleServiceDecoratorBuilder.RecordMethodUsages(...)` decorates the script service so the test can assert how many times each method was called. Use this to verify capability negotiation and call counts for the new `AbandonScript` verb.
- `Wait.For(condition, timeout, onFail, ct)` is the event-driven polling helper. Always preferred over `Task.Delay` in test bodies.

**Pattern to follow:** mirror `ClientScriptExecutionIsolationMutex.cs`. Stuck-script tests should use `ScriptBuilder.WaitForFileToExist(...)` as the "kernel-blocked" simulant rather than `sleep 600`. The file-wait is event-driven and the test can release it on demand by creating the file. For the unkillable variant, combine the file-wait pattern with the `Tentacle.Debug.DisableProcessKill` flag described in the manual test setup so `Hitman` becomes a no-op for the test's duration.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LukeButters I can't think of a way to simulate being blocked at the kernel level than actually creating an alternate path in hitman that doesn't do the killing... lmk if you can think of anything better. I dont love this

jimmyp and others added 13 commits May 21, 2026 16:51
Sixteen-task plan covering contract additions, async migration of
SilentProcessRunner.ExecuteCommand (~20 callers), RunningScript and
ScriptServiceV2 abandon wiring, targeted CompleteScript best-effort
policy, capability advertisement, and end-to-end integration tests
that use the existing ClientScriptExecutionIsolationMutex pattern
with TentacleDebugDisableProcessKill for the stuck-process simulant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Question note Jim left in the code while reading; subagent
implementing Task 11/12 should either answer this or remove it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…isable flag

- Rename ExecuteCommand → ExecuteCommandAsync (both overloads), return Task<int>
- Add abandon CancellationToken parameter; await process.WaitForExitAsync(abandon)
- On OperationCanceledException from abandon (process still running): log honest
  message, SafelyCancelRead x2, set running=false, return AbandonedExitCode (-48)
- Add net48 polyfill WaitForExitAsyncNetFramework via TaskCompletionSource+Exited event
- Wire TentacleDebugDisableProcessKill env var into Hitman.TryKillProcessAndChildrenRecursively

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ons async

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimmyp jimmyp requested a review from a team as a code owner May 21, 2026 20:25
jimmyp and others added 15 commits May 22, 2026 06:29
- Add `abandonToken` field and constructor parameter to `RunningScript`
  (primary constructor); secondary constructor defaults to `CancellationToken.None`
  for backwards-compatible callers (ScriptService V1, existing fixture SetUp).
- Replace synchronous `RunScript` with async `RunScriptAsync` that awaits
  `SilentProcessRunner.ExecuteCommandAsync` and forwards `abandonToken`.
- Update `Execute()` and `RunPowershellScriptWithMonitoring` to await the new method.
- Update `ScriptServiceV2.LaunchShell` to pass `CancellationToken.None` as placeholder
  until Task 11 wires the real abandon token.
- Add `Execute_WhenAbandonTokenFires_ReturnsAbandonedExitCode` to
  `RunningScriptFixture` using a pidfile pattern for deterministic process-start
  detection before firing the abandon token.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test now passes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add abandonTokenSource to RunningScriptWrapper; expose AbandonToken
  getter and Abandon() method; dispose both CTS in Dispose().
- Replace NotImplementedException stub with real AbandonScriptAsync:
  fires runningScript.Abandon() then delegates to GetResponse, which
  returns UnknownScriptExitCode for unknown tickets and the real exit
  code for already-completed scripts.
- Update LaunchShell signature to accept abandonToken and thread it
  through to new RunningScript(...) instead of CancellationToken.None.
- Remove Jim's exploratory comment; replace with a brief, accurate
  explanation of why the state-check guard exists.
- Add three service-layer tests covering unknown ticket, running script
  abandon (fires token → AbandonedExitCode), and already-completed
  script (real exit code preserved).
- Fix pre-existing compile errors in KubernetesDirectoryInformationProviderFixture
  and LinuxTestUserPrincipal left by the Task 8 async migration
  (ExecuteCommand → ExecuteCommandAsync).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CompleteScriptAsync now reads the state store to detect abandon before
deleting the workspace. On the abandoned path it swallows IOException
with a Warn log; on normal-completion it propagates unchanged so bugs
that leak file handles still surface. Two new fixture tests cover both
branches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plumbing gap surfaced during Task 14 — the Halibut client async
contract needs the new method to mirror IScriptServiceV2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ClientScriptExecutionAbandon which starts a FullIsolation script,
calls AbandonScriptAsync via raw IAsyncClientScriptServiceV2 client
(TentacleDebugDisableProcessKill=1 so the process survives), then asserts
a second FullIsolation script on the same mutex name can start and
complete — proving the mutex is released on abandon.

Supporting changes:
- ClientAndTentacle.CreateScriptServiceV2Client() helper for raw V2 RPC access
- FuncDecoratingScriptServiceV2 and NoOPClientScriptService now implement
  AbandonScriptAsync (required by IAsyncClientScriptServiceV2 interface)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds AbandonScript_MultiLevelDeepHang_StillReleasesMutex which chains
outer shell → child shell → grandchild polling a release file. With
TentacleDebugDisableProcessKill=1 the stuck process is not killed, but
abandon still returns AbandonedExitCode and releases the isolation mutex
so a second FullIsolation script with the same mutex name can start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerShellPrerequisite (Octopus.Manager.Tentacle WPF installer) called
the deleted sync SilentProcessRunnerExtended.ExecuteCommand. Migrated
to ExecuteCommandAsync via GetAwaiter().GetResult() at the sync
boundary; the installer runs on a ThreadPool worker with no
synchronisation context so the bridge is deadlock-safe.

SetupHelpers.GetTentacleImageAndTag was assigning a Task<string?>
to a string after Task 8's async migration of DockerImageLoader.
Made the helper async and awaited at both call sites
(KubernetesClusterOneTimeSetUp, KubernetesClientCompatibilityTests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions in our async migration that CI surfaced:

1. process.Close() in DoOurBestToCleanUp races with the Exited event
   firing. With the new WaitForExitAsync path that depends on Exited
   to complete the TaskCompletionSource, closing the handle
   immediately after Kill can win the race and the await hangs
   forever — exactly the symptom of ScriptServiceFixture.ShouldCancelPing
   timing out at 10 minutes. Removed the Close() call. The Process
   gets disposed by the outer `using (var process = new Process())`
   when ExecuteCommandAsync returns; that's the safe time.

   Luke's original Close() was needed to unblock the OLD sync
   WaitForExit() from pipe-drain hangs when a grandchild held
   redirected pipes. With WaitForExitAsync we don't depend on pipe
   drain anymore, and SafelyWaitForAllOutput's existing 5-second
   timeout bounds the grandchild-pipe case.

2. Two CapabilitiesServiceV2Test assertions check the exact count of
   advertised capabilities. Now that AbandonScriptV2 is advertised on
   Latest tentacles, the expected count is +1 for the version == null
   (Latest) variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

The spec promises that AbandonScriptAsync returns ScriptStatusResponseV2(Complete, AbandonedExitCode) immediately. Before this fix
it fired the abandon token and returned the runningScript's current
state synchronously — which is still Running because Execute() hasn't
yet processed the abandon. Callers asserting on the response saw
stale (Running, 0) values.

Added a short bounded wait (up to 5s, polling every 25ms) for State
to transition to Complete. In practice this completes in
single-digit ms because the abandon path is just propagating an
exception through async unwind.

The single-level abandon test was previously passing only because it
never asserted on the response. Added the same Complete/AbandonedExitCode
assertion there so we don't regress this contract silently again.

The multi-level test had a separate cosmetic issue: the inner bash
command didn't quote the path, so on macOS where temp paths contain
spaces ("Application Support") the [ -f ] test produced a "too many
arguments" error and the loop exited with code 0. Added double-quote
escaping around the path inside the nested bash command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimmyp
Copy link
Copy Markdown
Author

jimmyp commented May 22, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

I'm Claude, leaving this review on Jim's behalf.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

try
{
SilentProcessRunnerExtended.ExecuteCommand(
// Sync boundary: prerequisite check runs from the WPF installer
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean its illegible

Copy link
Copy Markdown
Author

@jimmyp jimmyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to address these comments

try
{
SilentProcessRunnerExtended.ExecuteCommand(
// Sync boundary: prerequisite check runs from the WPF installer
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is illegible

var stdOut = new List<string>();
var stdErr = new List<string>();
var exitCode = silentProcessRunner.ExecuteCommand("du", $"-s -B 1 {directoryPath}", "/", stdOut.Add, stdErr.Add);
// Sync boundary: called from IMemoryCache.GetOrCreate factory which is synchronous.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brainstorm with me what we should do in each instance of these


//non-kubernetes agent tentacles only support the standard script services
return new CapabilitiesResponseV2(new List<string> { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2) });
return new CapabilitiesResponseV2(new List<string> { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2), "AbandonScriptV2" });
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the nameof operator on the method we added

return bashScript.ToString();
}

public ScriptBuilder AppendRaw(string bash, string windows)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have to create this why couldnt we use ther existing primitives

exitCode = workspace.ShouldMonitorPowerShellStartup()
? await RunPowershellScriptWithMonitoring(shellPath, writer, runningScriptToken)
: RunScript(shellPath, writer, runningScriptToken);
: await RunScriptAsync(shellPath, writer, runningScriptToken, abandonToken);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where we catch the canceled exception below, do we want to differentiate abandoned scripts and log appropriately?

return new LegacyTentacleClientBuilder(halibutRuntime, ServiceEndPoint);
}

public IAsyncClientScriptServiceV2 CreateScriptServiceV2Client()
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you create this?

var exitCode = await task;
sw.Stop();

sw.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2), "abandon should return promptly");
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to assert on this?

}

// Latest tentacles also advertise AbandonScriptV2 (added in EFT-3295). Older versioned builds do not.
if (version == null)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this null?


// Load-bearing: second FullIsolation script with the same mutex name must now be able to start,
// proving the mutex was released by the abandon
var secondStartFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "second-start");
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a seperate test

var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "ml-start");
var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "ml-release");

// Multi-level chain: outer shell (bootstrap) → child shell → grandchild polls for the release file.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to test this directly, does it matter if its multiple levels deep where the stuck thing occurs

jimmyp and others added 6 commits May 22, 2026 15:11
GetDriveBytesUsingDu was blocking on ExecuteCommandAsync via
.GetAwaiter().GetResult() because GetPathUsedBytes was called from a
sync IMemoryCache.GetOrCreate factory. The chain is now fully async:

  GetDriveBytesUsingDuAsync → GetPathUsedBytesAsync (GetOrCreateAsync)
  → GetStorageInformationAsync → awaited by CreateScriptContainer

EnsureDiskHasEnoughFreeSpace still blocks (it overrides a sync
IOctopusFileSystem method called from config writes and script workspace
readiness checks), but the sync boundary is now at the override where
the constraint is obvious, not buried inside the du runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AbstractCommand gets a StartAsync() virtual hook (defaults to calling
Start() so all existing sync commands need no changes). ServiceCommand
overrides StartAsync() instead of Start(), allowing the whole chain
below it to be properly async:

  ServiceCommand.StartAsync()
    -> IServiceConfigurator.Configure*Async()
    -> LinuxServiceConfigurator / WindowsServiceConfigurator (async)
    -> SystemCtlHelper.RunServiceCommandAsync()
    -> await ExecuteCommandAsync()

Six .GetAwaiter().GetResult() calls removed (2 in SystemCtlHelper,
3 in LinuxServiceConfigurator, 1 in WindowsServiceConfigurator).
Replaced by one sync boundary in AbstractCommand's dispatcher where
the CLI host (OctopusProgram) requires a sync Action<ICommandRuntime>.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds StartAsync(string[], ICommandRuntime, OptionSet) to ICommand and
ICommandHost.RunAsync(Func<ICommandRuntime, Task>, Action) to all four
host implementations. OctopusProgram.Run() delegates to RunAsync();
Program.Main becomes async Task<int>.

The single remaining sync boundary is AbstractCommand.Start(string[]...)
which calls StartAsync(...).GetAwaiter().GetResult() — kept for backwards
compatibility with any callers that hold an ICommand and call Start()
synchronously, but OctopusProgram no longer goes through it.

HelpCommand migrated to override StartAsync(string[]...) instead of
Start(string[]...) for consistency.

WindowsServiceHost.RunAsync uses Task.Run for ServiceBase.Run because
that Windows API must block a thread by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ICommand.void Start(string[],...) and ICommandHost.void Run(Action,...)
are now deleted — no production or test code calls them. All callers
have been migrated to StartAsync / RunAsync.

NoninteractiveHost.ManualResetEvent removed (only served the sync Run
path). Three test fixtures (ServerCommsCommandTest, CheckServicesCommand
Fixture, WatchdogCommandFixture) migrated to async Task tests using
StartAsync and Assert.ThrowsAsync.

Two Kubernetes integration test calls also updated to await StartAsync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ICommandLineRunner.Execute → ExecuteAsync (Task<bool>). CommandLineRunner
now awaits ExecuteCommandAsync directly — no .GetAwaiter().GetResult().

Callers updated:
  - ReviewAndRunScriptTabViewModel: already async Task<bool>; Rollback
    becomes async Task RollbackAsync
  - RunProcessDialog: ThreadPool.QueueUserWorkItem → Task.Run(async)
  - SetupTentacleWizardModelBuilder test: NSubstitute mock updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five commands (PollCommand, RegisterMachineCommandBase, ShowConfiguration,
DeregisterMachine, DeregisterWorker) used a pre-existing pattern where
protected Start() wrapped a private async Task StartAsync(). Now that
AbstractCommand.StartAsync(string[],...) calls the protected virtual
StartAsync() directly, they override that instead — no GetAwaiter needed.

Also remove OctopusProgram.Run() (dead code — Main now calls RunAsync()).

Remaining GetAwaiter().GetResult() sites are all pre-existing forced sync
boundaries: IWritableKeyValueStore, IMachineKeyEncryptor (Kubernetes config),
IOctopusFileSystem override, and WindowsServiceHost (ServiceBase.Run
must block). All have been documented in prior commits or were pre-existing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant