RFC: [WIP] Environment [Monitor] extension #133

* **Pull Request**: (TBD — not yet created)
* **Discussion Thread(s)**:
  * Originated from the monitoring open question in [RFC 0008 (Environment Wrap Task Run)](https://github.com/OpenJobDescription/openjd-specifications/pull/130)

## Description

Add a monitoring mechanism for Environment templates that lets an Environment periodically health-check processes running within its Session. An Environment currently defines `onEnter` and `onExit` Actions for setup and teardown, and RFC 0008 proposes `onWrapTaskRun` for intercepting task execution. However, once an Action is running, the runtime has no visibility into whether the process inside the Environment is still healthy — the only escape hatch is the Action timeout, which can waste hours of compute on stalled workloads.

This is especially relevant for container-based Environments (Docker, Apptainer) where the wrapped process can stall silently — the container process hangs, the GPU driver crashes, or the application deadlocks — while the outer `docker exec` remains waiting. But the problem is general: any long-running Environment Action (container or not) benefits from periodic health checks.

This RFC proposes a way for Environment templates to define a monitoring Action that the runtime invokes periodically alongside the running Action. Possible designs include an `onMonitor` Action on `<EnvironmentActions>`, a `monitorCommand` field on individual Actions, or a `Task.Monitor` template variable. The goal is to detect stalled workloads early and fail the task immediately rather than waiting for the timeout.

Depends on RFC 0008 (`onWrapTaskRun`).

### Concrete example: detecting a crashed container

A render task is running inside a Docker container via `onWrapTaskRun`. The GPU driver crashes and the container exits silently. The `docker exec` process in the wrap script hangs waiting on a dead container. Without monitoring, the runtime waits for the Action timeout (potentially hours).

With `onMonitor`, the runtime periodically runs a health check:

```bash
$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}'
Up 12 minutes        # healthy — container is running
```

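When the container crashes, `docker ps` prints nothing but still exits with status 0, so the monitor command has to turn the empty output into a failure itself, for example by piping it through `grep -q .`:

```bash
$ docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}' | grep -q .
$ echo $?
1                    # no output, so grep exits non-zero: container is gone
```

The monitor returns non-zero, so `onFailure: terminate` fires: the runtime cancels the wrap Action and fails the Session immediately instead of burning hours of compute.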
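A monitor command along these lines could be packaged as a small script. This is a minimal sketch, not part of the proposal: the function names `status_is_healthy` and `check_container` are hypothetical, and it assumes that `DOCKER_CONTAINER_ID` is already set in the Session environment and that any status not starting with `Up` should count as unhealthy.

```bash
#!/usr/bin/env bash
# Hypothetical health-check helpers; the names are illustrative only.

# Succeeds (exit 0) only when the status text reported by `docker ps`
# starts with "Up". Empty text means no running container matched.
status_is_healthy() {
  case "$1" in
    Up*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Queries Docker for the container's status and converts it to an exit code.
# Assumes DOCKER_CONTAINER_ID was exported earlier in the Session.
check_container() {
  local status
  # Treat a failing docker CLI (e.g. daemon unreachable) as unhealthy too.
  status="$(docker ps --filter "id=$DOCKER_CONTAINER_ID" --format '{{.Status}}')" || return 1
  status_is_healthy "$status"
}
```

Calling `check_container` as the last line of the script makes its return value the script's exit code, which is the only signal the monitor needs to give the runtime in this example.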

## Roles

| Role | User
| ---- | ----
| Proposed By | @leongdl
| Author(s)   | @leongdl

## Workflow

- [x] Tracking issue created (label: `rfc/proposed`)
- [ ] RFC pull request submitted and ready for discussion (label: `rfc/exploring`)
- [ ] Last call for comments (labels: `rfc/exploring` and `rfc/final-comments`)
- [ ] Accepted and merged RFC pull request (label: `rfc/accepted-future`)
- [ ] Green-light for inclusion in a draft specification, and the author is creating and iterating on pull requests (label: `rfc/accepted-draft`)
- [ ] Pull requests are merged in to a draft specification (label: `rfc/accepted-staged`)
- [ ] Officially published in a non-draft revision of the specification (label: `rfc/released`)

Please close this tracking issue when the proposal enters the `Released` stage of the process.

## Open Points

1. **Monitor as a structured definition** — Rather than a single command field, the monitor could be a structure on `<EnvironmentActions>`:

   ```yaml
   onMonitor:
     command: <Action>       # The health-check command to run periodically
     period: <posinteger>    # How often to run, in seconds (e.g., 30)
     onFailure: <MonitorFailureAction>  # What to do if command returns non-zero
   ```

   Where `<MonitorFailureAction>` is one of:
   - `log` — Emit the monitor's stdout/stderr as an `openjd_status:` message but let the Action continue. Useful for observability without hard-killing on transient blips.
   - `terminate` — Fail the currently running Action immediately (terminal for the Session, consistent with existing failure semantics).

   Open sub-questions:
   - Should `onFailure` support a threshold (e.g., terminate after N consecutive failures) to tolerate transient blips?
   - Should there be a `warn` action that emits `openjd_status:` for the first N failures then escalates to `terminate`?
   - Should `period` have a minimum floor to prevent runaway monitoring overhead?

2. **Scope: which Actions does the monitor apply to?** — The monitor should run alongside any Action in the Environment's lifecycle — `onEnter`, `onWrapTaskRun`, and `onExit`. A long image pull in `onEnter` or a stuck `docker stop` in `onExit` are just as susceptible to stalling as the task execution itself. The open question is whether the monitor definition is a single `onMonitor` on the Environment (active for all Actions), or whether each Action can override the monitor config (e.g., different period or failure action for `onEnter` vs. `onWrapTaskRun`).

3. **Health signal semantics** — The monitor command's exit code is the signal: zero means healthy, non-zero triggers `onFailure`. Should the spec also define stdout conventions (e.g., `openjd_status:` lines from the monitor are forwarded to the runtime)? This would let `docker stats` output flow through as status messages even when the action is `log`.

4. **Interaction with cancelation** — When `onFailure: terminate` fires, should the runtime use the same cancelation mechanism as user-initiated cancels (sending the Action's configured cancelation signal), or immediately kill the process? Using the cancelation mechanism gives the wrap script a chance to clean up (e.g., `docker stop`), but adds latency.

5. **Template variable access** — Should the monitor command have access to the same template variables as the Action it monitors? For `onWrapTaskRun`, this would mean `Task.Command`, `Task.Args`, `Task.Environment`, and `Env.Action.Timeout` are available. For containers, the monitor likely needs `$DOCKER_CONTAINER_ID` (set via `openjd_env` in `onEnter`), which is already available as a session environment variable.
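To make open points 1 and 4 more concrete, here is a rough sketch of how a runtime might drive the monitor: it runs the health check once per period, tolerates transient failures up to a consecutive-failure threshold, and on `terminate` sends SIGTERM first and SIGKILL only after a grace period. Everything in it is an assumption for illustration: the function names, the argument order, the threshold and grace parameters, and the choice of SIGTERM as the cancelation signal are not defined by this RFC or by any OpenJD runtime.

```bash
#!/usr/bin/env bash
# Hypothetical runtime-side sketch; none of these names come from OpenJD.

# Ask the Action to stop, wait up to <grace> seconds, then force-kill it.
# Usage: terminate_with_grace <pid> <grace_seconds>
terminate_with_grace() {
  local pid="$1" grace="$2" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$((waited + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid" 2>/dev/null             # grace period expired
  fi
  return 0
}

# Usage: run_monitor <action_pid> <period> <log|terminate> <threshold> <grace> <command...>
# Returns 1 if the monitor terminated the Action, 0 if the Action ended on its own.
run_monitor() {
  local action_pid="$1" period="$2" on_failure="$3" threshold="$4" grace="$5"
  shift 5
  local failures=0
  # `kill -0` is only a stand-in for "is the Action still running";
  # a real runtime would own and reap the child process directly.
  while kill -0 "$action_pid" 2>/dev/null; do
    sleep "$period"
    if "$@"; then
      failures=0                              # a healthy check resets the streak
      continue
    fi
    failures=$((failures + 1))
    echo "openjd_status: monitor check failed ($failures consecutive)"
    if [ "$on_failure" = "terminate" ] && [ "$failures" -ge "$threshold" ]; then
      terminate_with_grace "$action_pid" "$grace"
      return 1
    fi
  done
  return 0
}
```

For example, `run_monitor "$action_pid" 30 terminate 3 10 ./check.sh` (where `./check.sh` is a hypothetical health-check script) would check every 30 seconds, terminate after three consecutive failures, and give the Action 10 seconds to clean up before killing it. With `log` as the failure action, the loop only emits status lines and never terminates the Action.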

---

> The author is responsible for progressing the RFC according to this checklist and
applying the relevant labels to this issue.

