RFC: [WIP] Environment [Monitor] extension #133

@leongdl

Description

Add a monitoring mechanism for Environment templates that lets an Environment periodically health-check processes running within its Session. An Environment currently defines onEnter and onExit Actions for setup and teardown, and RFC 0008 proposes onWrapTaskRun for intercepting task execution. However, once an Action is running, the runtime has no visibility into whether the process inside the Environment is still healthy — the only escape hatch is the Action timeout, which can waste hours of compute on stalled workloads.

This is especially relevant for container-based Environments (Docker, Apptainer) where the wrapped process can stall silently — the container process hangs, the GPU driver crashes, or the application deadlocks — while the outer docker exec remains waiting. But the problem is general: any long-running Environment Action (container or not) benefits from periodic health checks.

This RFC proposes a way for Environment templates to define a monitoring Action that the runtime invokes periodically alongside the running Action. Possible designs include an onMonitor Action on <EnvironmentActions>, a monitorCommand field on individual Actions, or a Task.Monitor template variable. The goal is to detect stalled workloads early and fail the task immediately rather than waiting for the timeout.
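
To make the mechanism concrete, here is a minimal Python sketch of a runtime loop that runs an Action and invokes a monitor command once per period. It is an illustration only, not the Open Job Description runtime: `run_with_monitor`, its parameters, the `grace_seconds` default, and the returned strings are all hypothetical names chosen for this example. It also assumes the monitor command itself returns promptly.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_with_monitor(
    action_cmd: Sequence[str],
    monitor_cmd: Sequence[str],
    period_seconds: float,
    timeout_seconds: float,
    grace_seconds: float = 5.0,  # hypothetical grace period before a hard kill
) -> str:
    """Run action_cmd, invoking monitor_cmd once per period_seconds.

    Returns "succeeded", "failed", "monitor_failed", or "timed_out".
    """
    action = subprocess.Popen(action_cmd)
    deadline = time.monotonic() + timeout_seconds
    try:
        while True:
            try:
                # Wait one period, waking early if the Action exits on its own.
                exit_code = action.wait(timeout=period_seconds)
                return "succeeded" if exit_code == 0 else "failed"
            except subprocess.TimeoutExpired:
                pass  # Action is still running; fall through to the checks.
            if time.monotonic() >= deadline:
                return "timed_out"
            # Exit code zero means healthy; anything else fails the Action.
            if subprocess.run(monitor_cmd).returncode != 0:
                return "monitor_failed"
    finally:
        if action.poll() is None:
            action.terminate()  # polite request first (SIGTERM on POSIX)
            try:
                action.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                action.kill()  # escalate if the Action ignores the request
                action.wait()
```

With a 30-second period, a stalled Action would be stopped within roughly one period of the first failed check, rather than at the Action timeout.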

Depends on RFC 0008 (onWrapTaskRun).

Concrete example: detecting a crashed container

A render task is running inside a Docker container via onWrapTaskRun. The GPU driver crashes and the container exits silently. The docker exec process in the wrap script hangs waiting on a dead container. Without monitoring, the runtime waits for the Action timeout (potentially hours).

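With onMonitor, the runtime periodically runs a health check. Because docker ps exits 0 even when no container matches the filter, the output is piped through grep . so that an empty listing becomes a non-zero exit:

$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}' | grep .
Up 12 minutes        # healthy: container is running, grep exits 0

When the container crashes:

$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}' | grep .
                     # empty: container is gone, grep exits non-zero
$ echo $?
1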

The monitor returns non-zero → onFailure: terminate fires → the runtime cancels the wrap action and fails the Session immediately instead of burning hours of compute.
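
The health check in this example could also be written as a small script. The Python sketch below is one possible shape, under the assumption that an empty `docker ps` listing means the container is gone; `docker ps` reports success even when the filter matches nothing, so the script has to map empty output to a non-zero exit code itself. The function names and the injectable `status_of` parameter are hypothetical and exist only so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable


def docker_status(container_id: str) -> str:
    """Return the Status column from `docker ps`, or "" if nothing matched."""
    result = subprocess.run(
        ["docker", "ps", "--filter", f"id={container_id}", "--format", "{{.Status}}"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def monitor_exit_code(
    container_id: str, status_of: Callable[[str], str] = docker_status
) -> int:
    """Exit code for the monitor: 0 if the container is listed, 1 otherwise."""
    status = status_of(container_id)
    # Print something either way so the monitor's output is useful in logs.
    print(status or f"container {container_id} is not running")
    return 0 if status else 1
```

A monitor script built on this would read `DOCKER_CONTAINER_ID` from the session environment and pass the result of `monitor_exit_code` to `sys.exit`.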

Roles

Role         User
Proposed By  @leongdl
Author(s)    @leongdl

Workflow

  • Tracking issue created (label: rfc/proposed)
  • RFC pull request submitted and ready for discussion (label: rfc/exploring)
  • Last call for comments (labels: rfc/exploring and rfc/final-comments)
  • Accepted and merged RFC pull request (label: rfc/accepted-future)
  • Green-light for inclusion in a draft specification, and the author is creating and iterating on pull requests (label: rfc/accepted-draft)
  • Pull requests are merged into a draft specification (label: rfc/accepted-staged)
  • Officially published in a non-draft revision of the specification (label: rfc/released)

Please close this tracking issue when the proposal enters the Released stage of the process.

Open Points

  1. Monitor as a structured definition — Rather than a single command field, the monitor could be a structure on <EnvironmentActions>:

    onMonitor:
      command: <Action>       # The health-check command to run periodically
      period: <posinteger>    # How often to run, in seconds (e.g., 30)
      onFailure: <MonitorFailureAction>  # What to do if command returns non-zero

    Where <MonitorFailureAction> is one of:

    • log — Emit the monitor's stdout/stderr as an openjd_status: message but let the Action continue. Useful for observability without hard-killing on transient blips.
    • terminate — Fail the currently running Action immediately (terminal for the Session, consistent with existing failure semantics).

    Open sub-questions:

    • Should onFailure support a threshold (e.g., terminate after N consecutive failures) to tolerate transient blips?
    • Should there be a warn action that emits openjd_status: for the first N failures then escalates to terminate?
    • Should period have a minimum floor to prevent runaway monitoring overhead?
  2. Scope: which Actions does the monitor apply to? — The monitor should run alongside any Action in the Environment's lifecycle — onEnter, onWrapTaskRun, and onExit. A long image pull in onEnter or a stuck docker stop in onExit are just as susceptible to stalling as the task execution itself. The open question is whether the monitor definition is a single onMonitor on the Environment (active for all Actions), or whether each Action can override the monitor config (e.g., different period or failure action for onEnter vs. onWrapTaskRun).

  3. Health signal semantics — The monitor command's exit code is the signal: zero means healthy, non-zero triggers onFailure. Should the spec also define stdout conventions (e.g., openjd_status: lines from the monitor are forwarded to the runtime)? This would let docker stats output flow through as status messages even when the action is log.

  4. Interaction with cancelation — When onFailure: terminate fires, should the runtime use the same cancelation mechanism as user-initiated cancels (sending the Action's configured cancelation signal), or immediately kill the process? Using the cancelation mechanism gives the wrap script a chance to clean up (e.g., docker stop), but adds latency.

  5. Template variable access — Should the monitor command have access to the same template variables as the Action it monitors? For onWrapTaskRun, this would mean Task.Command, Task.Args, Task.Environment, and Env.Action.Timeout are available. For containers, the monitor likely needs $DOCKER_CONTAINER_ID (set via openjd_env in onEnter), which is already available as a session environment variable.
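
Open Points 1 and 3 can be explored with a small sketch of how a runtime might evaluate each monitor run. Everything here is an assumption made for discussion: `failure_threshold`, the 5-second `MIN_PERIOD_SECONDS` floor, and the `evaluate` method are hypothetical and are not part of the proposed schema. The sketch treats exit code zero as healthy, forwards `openjd_status: ` lines regardless of the outcome, and escalates to terminate only after the configured number of consecutive failures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MIN_PERIOD_SECONDS = 5  # hypothetical floor; the RFC leaves the value open
STATUS_PREFIX = "openjd_status: "


@dataclass
class MonitorPolicy:
    period: int
    on_failure: str  # "log" or "terminate"
    failure_threshold: int = 1  # hypothetical: consecutive failures tolerated
    _consecutive_failures: int = field(default=0, init=False)

    def effective_period(self) -> int:
        """Clamp the configured period to the minimum floor."""
        return max(self.period, MIN_PERIOD_SECONDS)

    def evaluate(self, exit_code: int, stdout: str) -> Tuple[str, List[str]]:
        """Return (decision, status_messages) for one monitor run.

        decision is "continue" or "terminate".
        """
        # Forward status lines whether or not the check passed.
        statuses = [
            line[len(STATUS_PREFIX):]
            for line in stdout.splitlines()
            if line.startswith(STATUS_PREFIX)
        ]
        if exit_code == 0:
            self._consecutive_failures = 0  # a healthy run resets the streak
            return "continue", statuses
        self._consecutive_failures += 1
        if (
            self.on_failure == "terminate"
            and self._consecutive_failures >= self.failure_threshold
        ):
            return "terminate", statuses
        return "continue", statuses
```

With `failure_threshold=2`, a single failed check followed by a healthy one never terminates the Action, which is the transient-blip tolerance raised in the sub-questions. With `on_failure="log"`, the decision is always "continue" and only the status messages flow through.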


The author is responsible for progressing the RFC according to the Workflow checklist above and for applying the relevant labels to this issue.
