RFC: Environment Wrap Actions#130
Conversation
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
| intercept and wrap the lifecycle actions of *inner* environments and tasks. The runtime | ||
| supplies each wrap action with the wrapped action's command, args, timeout, cancelation | ||
| method, and environment variables as template variables. A companion opt-out, | ||
| `runOnHost: true` on `<Action>`, lets individual actions bypass wrapping when they |
There was a problem hiding this comment.
Could we make this runOnHost also be controlled by the wrapping environment? Then an author of an external environment could have more control over this. If we leave this as is we might want to find a different name, because the distinguishing factor is that it skips the wrapping - both wrapped and non-wrapped actions are running on the host.
There was a problem hiding this comment.
Interesting, unless the wrapping environment inspects and has logic for every line in the wrapped action?
Something like exporting environment variables on the host would be one case that is hard to detect
(And agree on the naming is hard)
| `runOnHost: true` on `<Action>`, lets individual actions bypass wrapping when they | ||
| must run on the host (credential fetching, mount setup, cleanup that must always run). | ||
|
|
||
| The primary motivation is container support: a Docker or Apptainer environment template |
There was a problem hiding this comment.
This is the primary motivation, but it's worth mentioning that we want this to be a general composable feature to the extent possible. If we can avoid all references to containers in the structure of the extension (vs the docs and examples which should have many), I think that will be useful.
| onWrapTaskRun: | ||
| command: "bash" | ||
| args: ["{{Env.File.WrapTaskRun}}"] | ||
| timeout: "{{Task.Timeout}}" |
There was a problem hiding this comment.
The difference between Env.Wrapped.Timeout and Task.Timeout feels unnecessary. Switching to either Task.Wrapped.Timeout, or Env.Timeout would make it consistent. Since the timeout is on the action itself, maybe it could be Task.Action.Timeout and Env.Action.Timeout?
We should also think about the general forwarding mechanism. How will we forward #118 when we make that? Can there be a general mechanism, or will we give the template author responsibility to plumb through the features?
There was a problem hiding this comment.
Interesting, I did not see 118, let me read and incorporate it.
| ### Apptainer environment template | ||
|
|
||
| The same pattern applies to Apptainer (daemonless, each wrap hook invokes | ||
| `apptainer exec` directly rather than exec'ing into a running container). See |
There was a problem hiding this comment.
Can't it run as a daemon? This seems necessary for many use cases.
There was a problem hiding this comment.
I think daemonless was meaning the "apptainer" daemon, like the persistent "docker daemon", not necessarily referring to the container as a daemon. Maybe a framing difference here.
| [Appendix A: Apptainer environment template](#appendix-a-apptainer-environment-template) | ||
| for the full example. | ||
|
|
||
| ### Job template that works with any environment |
There was a problem hiding this comment.
It should just be able to point to our existing samples. Maybe add a technical requirement that the way we've been writing jobs already must work with or without wrapping to the extent we can do that.
I guess the purpose of this is to show the runOnHost option?
There was a problem hiding this comment.
Yeah - exactly. The RunOnHost was an escape hatch.
Lets chat offline how we can offer the escape hatch. Either within the hook, or as an explicit declaration.
| often where performance problems hide. | ||
|
|
||
| 5. **Privilege isolation.** Run inner actions as a different user or with reduced | ||
| capabilities by wrapping the command with `sudo -u`, `unshare`, or a jailed shell. |
There was a problem hiding this comment.
If the inner environments and task runs can select runOnHost then this use case is weaker.
There was a problem hiding this comment.
True, the runOnHost escape hatch would change this. But nonetheless, examples such as running a container would be able to use the sudo -u.
I'll re-frame this as a addendum use case, not necessarily an argument.
| 4. **Cross-OS wrapping.** The same-path bind-mount requirement assumes the host and | ||
| the wrapped execution context share path-separator conventions (Linux host with | ||
| Linux container, or Windows with Windows). Cross-OS wrapping (e.g., a Windows host | ||
| launching a Linux container) is not supported by this RFC. |
There was a problem hiding this comment.
The responsibility of the RFC would be to provide all information necessary to do this Cross-OS wrapping, not to assist in any way. I think it does that?
There was a problem hiding this comment.
True, let me delete it.
Along the way I was thinking - why wrap windows containers in linux, and vice versa. Although linux containers on windows absolutely work but is terrible thorugh WSL.
Was originall thinking to cut some scope or defer this family of cases for later.
|
|
||
| ### Security: container isolation boundaries | ||
|
|
||
| Each container instance should be scoped to a single security boundary. In the |
There was a problem hiding this comment.
The spec isn't about containers, it's about the ability to perform action wrapping reliably. I think the scheduler doesn't have any responsibility about "container," its responsibility is purely about providing all the specified arguments to the wrap* actions. The RFC should focus exclusively on the wrapping part where it is about the OpenJD behavior, and talk about containers in example contexts that are about containers.
There was a problem hiding this comment.
Sounds good - I'll do a pass and move this into my separate design doc instead.
| 2. The container does not run with elevated privileges (`--privileged`) unless explicitly | ||
| configured by the environment template author. | ||
| 3. Bind mounts are scoped to the session working directory and explicitly declared paths, | ||
| not the entire host filesystem. |
There was a problem hiding this comment.
None of these are scheduler responsibilities - they are the responsibility of the environment implementer writing support for containers. The spec should ensure that the environment has sufficient capabilities to do what it needs to do.
There was a problem hiding this comment.
True, I think here, AI here thought "deadline" overall as the scheduler, I'll delete these.
| ``` | ||
|
|
||
| 1. *onEnter* — The action to run when entering the environment. | ||
| 2. *onWrapEnter* — If provided, this action is run instead of the `onEnter` action of |
There was a problem hiding this comment.
Consider adding an all or nothing rule. Either all onWrap* must be included, or none of them. This would help against accidentally implementing just part of the wrapping, then getting hard to debug results because not everything is running where expected.
There was a problem hiding this comment.
Good suggestion! I'll add that into the spec.
While the template yaml / json is slightly larger it is much harder to do the wrong thing.
PR for RFC
Tracking Issue: #132
This is a request for comments about RFC 0008 — Environment Wrap Actions, which
extends
<Environment>with three new session actions —onWrapEnter,onWrapTaskRun, andonWrapExit— that let an environment template intercept andwrap the lifecycle actions of inner environments and tasks. A companion opt-out,
runOnHost: trueon<Action>, lets individual actions bypass wrapping when theymust run on the host (credential fetching, mount setup, cleanup that must always
run).
The primary motivation is portable container support: a Docker or Apptainer
environment template can start a container in
onEnter, route every innerenvironment's
onEnter/onExitand every task'sonRuninto the container via thethree wrap hooks, and stop the container in
onExit. Job templates and innerenvironments remain portable across Conda, Rez, Docker, and Apptainer. The design
also generalizes to remote execution, session-wide instrumentation, and privilege
isolation.
This RFC is gated by the new
WRAP_ACTIONSextension and depends on RFC 0002(Model Extensions), RFC 0005 (Expression Language), and RFC 0006 (Expression
Function Library — for
repr_sh()and friends).By submitting this pull request, I confirm that you can use, modify, copy, and
redistribute this contribution, under the terms of your choice.