- Pull Request: (TBD — not yet created)
- Discussion Thread(s):
Description
Add a monitoring mechanism for Environment templates that lets an Environment periodically health-check processes running within its Session. An Environment currently defines onEnter and onExit Actions for setup and teardown, and RFC 0008 proposes onWrapTaskRun for intercepting task execution. However, once an Action is running, the runtime has no visibility into whether the process inside the Environment is still healthy — the only escape hatch is the Action timeout, which can waste hours of compute on stalled workloads.
This is especially relevant for container-based Environments (Docker, Apptainer) where the wrapped process can stall silently — the container process hangs, the GPU driver crashes, or the application deadlocks — while the outer docker exec remains waiting. But the problem is general: any long-running Environment Action (container or not) benefits from periodic health checks.
This RFC proposes a way for Environment templates to define a monitoring Action that the runtime invokes periodically alongside the running Action. Possible designs include an onMonitor Action on <EnvironmentActions>, a monitorCommand field on individual Actions, or a Task.Monitor template variable. The goal is to detect stalled workloads early and fail the task immediately rather than waiting for the timeout.
Depends on RFC 0008 (onWrapTaskRun).
Concrete example: detecting a crashed container
A render task is running inside a Docker container via onWrapTaskRun. The GPU driver crashes and the container exits silently. The docker exec process in the wrap script hangs waiting on a dead container. Without monitoring, the runtime waits for the Action timeout (potentially hours).
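With onMonitor, the runtime periodically runs a health check. docker ps exits zero even when the filter matches no container, so the check pipes its output through grep to turn empty output into a non-zero exit code:
```shell
$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}' | grep .
Up 12 minutes    # healthy: container is running
```
When the container crashes:
```shell
$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}' | grep .
$ echo $?        # no output: the container is gone, so grep exits non-zero
1
```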
The monitor returns non-zero → onFailure: terminate fires → the runtime cancels the wrap action and fails the Session immediately instead of burning hours of compute.
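The flow above (run a check on a fixed period and, on a non-zero exit, cancel the wrap action) can be sketched in Python. This is a minimal illustration under stated assumptions, not the proposed runtime implementation: `run_with_monitor`, `period_s`, and `grace_s` are hypothetical names, and the terminate-then-kill escalation is only one possible answer to the cancelation question under Open Points.

```python
import subprocess
import sys
import time


def run_with_monitor(action_cmd, monitor_cmd, period_s, grace_s=5.0):
    """Run action_cmd and run monitor_cmd alongside it every period_s seconds.

    Returns ("completed", returncode) if the action exits on its own, or
    ("terminated", returncode) if a failing monitor made us stop it.
    All names here are hypothetical; the RFC does not define this API.
    """
    action = subprocess.Popen(action_cmd)
    next_check = time.monotonic() + period_s
    while True:
        try:
            # Wait for the action, but only until the next health check is due.
            remaining = max(0.0, next_check - time.monotonic())
            return ("completed", action.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            pass

        # Exit code is the health signal: zero is healthy, non-zero is not.
        healthy = subprocess.run(monitor_cmd).returncode == 0
        next_check = time.monotonic() + period_s

        if action.poll() is not None:
            # The action finished while the monitor was running.
            return ("completed", action.returncode)
        if not healthy:
            # Assumed behaviour for onFailure: terminate.
            # Ask the action to stop, then force-kill after a grace period.
            action.terminate()
            try:
                rc = action.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                action.kill()
                rc = action.wait()
            return ("terminated", rc)


if __name__ == "__main__":
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_monitor(stalled, failing_check, period_s=0.5))
```

In a real runtime the two commands would come from the Environment template (the wrap Action and the onMonitor command), and the grace period would presumably follow the Action's configured cancelation settings rather than a fixed default.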
Roles
Workflow
Please close this tracking issue when the proposal enters the Released stage of the process.
Open Points
- Monitor as a structured definition: Rather than a single command field, the monitor could be a structure on <EnvironmentActions>:

  ```yaml
  onMonitor:
    command: <Action>                  # The health-check command to run periodically
    period: <posinteger>               # How often to run, in seconds (e.g., 30)
    onFailure: <MonitorFailureAction>  # What to do if command returns non-zero
  ```

  Where <MonitorFailureAction> is one of:
  - log: Emit the monitor's stdout/stderr as an openjd_status: message but let the Action continue. Useful for observability without hard-killing on transient blips.
  - terminate: Fail the currently running Action immediately (terminal for the Session, consistent with existing failure semantics).

  Open sub-questions:
  - Should onFailure support a threshold (e.g., terminate after N consecutive failures) to tolerate transient blips?
  - Should there be a warn action that emits openjd_status: for the first N failures then escalates to terminate?
  - Should period have a minimum floor to prevent runaway monitoring overhead?
- Scope: which Actions does the monitor apply to? The monitor should run alongside any Action in the Environment's lifecycle: onEnter, onWrapTaskRun, and onExit. A long image pull in onEnter or a stuck docker stop in onExit are just as susceptible to stalling as the task execution itself. The open question is whether the monitor definition is a single onMonitor on the Environment (active for all Actions), or whether each Action can override the monitor config (e.g., different period or failure action for onEnter vs. onWrapTaskRun).
- Health signal semantics: The monitor command's exit code is the signal: zero means healthy, non-zero triggers onFailure. Should the spec also define stdout conventions (e.g., openjd_status: lines from the monitor are forwarded to the runtime)? This would let docker stats output flow through as status messages even when onFailure is set to log.
- Interaction with cancelation: When onFailure: terminate fires, should the runtime use the same cancelation mechanism as user-initiated cancels (sending the Action's configured cancelation signal), or immediately kill the process? Using the cancelation mechanism gives the wrap script a chance to clean up (e.g., docker stop), but adds latency.
- Template variable access: Should the monitor command have access to the same template variables as the Action it monitors? For onWrapTaskRun, this would mean Task.Command, Task.Args, Task.Environment, and Env.Action.Timeout are available. For containers, the monitor likely needs $DOCKER_CONTAINER_ID (set via openjd_env in onEnter), which is already available as a session environment variable.
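To make the onFailure sub-questions more concrete, here is one way the decision logic could look if a consecutive-failure threshold and a minimum period were adopted. This is a sketch under assumptions, not part of the proposal: `MonitorPolicy`, `failure_threshold`, and `MIN_PERIOD_S` (including its value of 5 seconds) are hypothetical, and only the log and terminate actions from the structure above are modelled.

```python
from dataclasses import dataclass, field

MIN_PERIOD_S = 5  # hypothetical floor; the RFC leaves this value open


@dataclass
class MonitorPolicy:
    period: int                 # seconds between health checks
    on_failure: str             # "log" or "terminate"
    failure_threshold: int = 1  # hypothetical: consecutive failures before terminating
    _consecutive: int = field(default=0, init=False, repr=False)

    def __post_init__(self):
        if self.on_failure not in ("log", "terminate"):
            raise ValueError(f"unknown onFailure action: {self.on_failure!r}")
        if self.period < MIN_PERIOD_S:
            raise ValueError(f"period must be at least {MIN_PERIOD_S} seconds")
        if self.failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")

    def record(self, exit_code: int) -> str:
        """Feed one monitor exit code; return "ok", "log", or "terminate"."""
        if exit_code == 0:
            # A healthy check resets the streak, so transient blips are tolerated.
            self._consecutive = 0
            return "ok"
        self._consecutive += 1
        if self.on_failure == "terminate" and self._consecutive >= self.failure_threshold:
            return "terminate"
        # Below the threshold, or onFailure is "log": report but keep running.
        return "log"


if __name__ == "__main__":
    policy = MonitorPolicy(period=30, on_failure="terminate", failure_threshold=3)
    for code in (1, 1, 0, 1, 1, 1):
        print(code, policy.record(code))
```

With a threshold of 3, two failures followed by a healthy check only produce log decisions, and terminate is returned on the third consecutive failure. A separate warn action would be unnecessary in this shape, since failures below the threshold already behave like log.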
The author is responsible for progressing the RFC according to this checklist and for applying the relevant labels to this issue.