feat: support custom job working directory via annotation#137
Conversation
Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
… naming Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
…clean-workdir annotation Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4b5c9be5-429f-4d38-9041-25ede0beb360 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
Addressed in commit Memory cache recovery: All three handlers now recover the workdir even when the JIDs cache is wiped:
No-clean-workdir annotation: Added Removal logic fix: Removed the double-delete of |
|
@Bianco95 comsider for 0.6.2 |
There was a problem hiding this comment.
Pull request overview
Adds support for configuring a per-job working directory via the slurm-job.vk.io/job-workdir pod annotation, while keeping plugin metadata under DataRootFolder for restart recovery and consistent lifecycle operations (status/logs/delete).
Changes:
- Introduces
JidStruct.WorkDir,getJobWorkDir(...)helper, and persistence of the resolved workdir viaWorkDir.pathfor restart recovery. - Threads the resolved
workDirthrough job creation (mounts/envs/script generation) and updates status/log resolution to prefer cached/persisted workdir. - Adds deletion behavior control via
slurm-job.vk.io/no-clean-workdirand documents new annotations; includes unit tests forgetJobWorkDir.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents job-workdir and no-clean-workdir annotations and intended behavior. |
| pkg/slurm/Create.go | Computes workDir from annotation and uses it for script/log/env placement; persists metadata via handleJidAndPodUid. |
| pkg/slurm/Delete.go | Resolves workDir for deletion and supports preserving job output dir via no-clean-workdir. |
| pkg/slurm/GetLogs.go | Resolves log base path from in-memory cache or WorkDir.path on disk. |
| pkg/slurm/Status.go | Resolves status base path from annotation fallback and in-memory cache when present. |
| pkg/slurm/prepare.go | Adds WorkDir to JID struct, getJobWorkDir, loads/persists WorkDir.path, and updates env generation to target workDir. |
| pkg/slurm/prepare_test.go | Adds TestGetJobWorkDir coverage for default/override and validation cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| metadata := data.Pod.ObjectMeta | ||
| filesPath := h.Config.DataRootFolder + data.Pod.Namespace + "-" + string(data.Pod.UID) | ||
| workDir := getJobWorkDir(h.Config, metadata.Annotations, data.Pod.Namespace, string(data.Pod.UID)) | ||
|
|
There was a problem hiding this comment.
When a custom workDir is used, the code no longer ensures that the metadata directory (filesPath under DataRootFolder) exists. prepareMounts/produceSLURMScript now create workDir, but handleJidAndPodUid still writes JobID.jid, PodUID.uid, etc. into filesPath, which will fail if filesPath was never MkdirAll'd. Create the metadata directory unconditionally (e.g., os.MkdirAll(filesPath, ...)) before submitting the job so on-disk recovery via LoadJIDs works reliably.
| if err = os.MkdirAll(filesPath, 0o755); err != nil { | |
| statusCode = http.StatusInternalServerError | |
| h.handleError(spanCtx, w, statusCode, err) | |
| return | |
| } |
| func handleJidAndPodUid(Ctx context.Context, pod v1.Pod, JIDs *map[string]*JidStruct, output string, filesPath string, workDir string) (string, error) { | ||
| r := regexp.MustCompile(`Submitted batch job (?P<jid>\d+)`) | ||
| jid := r.FindStringSubmatch(output) | ||
| fJID, err := os.Create(path + "/JobID.jid") | ||
| fJID, err := os.Create(filesPath + "/JobID.jid") | ||
| if err != nil { | ||
| log.G(Ctx).Error("Can't create jid_file") | ||
| return "", err | ||
| } |
There was a problem hiding this comment.
handleJidAndPodUid assumes filesPath already exists, but with the new separation between filesPath (metadata) and workDir (job output) that may not be true. Consider creating filesPath inside this function (or returning a clearer error) before os.Create(filesPath+"/JobID.jid"), otherwise custom job-workdir submissions can fail and prevent WorkDir.path from being persisted.
| if strings.Contains(customBase, "..") { | ||
| return config.DataRootFolder + namespace + "-" + podUID | ||
| } | ||
| clean := filepath.Clean(customBase) | ||
| if filepath.IsAbs(clean) { | ||
| return clean + "/" + namespace + "-" + podUID |
There was a problem hiding this comment.
getJobWorkDir rejects any annotation value containing the substring "..". This is stricter than “reject traversal components” and will incorrectly reject valid directory names that simply contain ".." (e.g. "/scratch/v1..backup"). To match the comment/intent, validate path elements (split on the path separator and reject elements equal to ".."), and consider using filepath.Join(clean, namespace+"-"+podUID) instead of manual string concatenation.
| if strings.Contains(customBase, "..") { | |
| return config.DataRootFolder + namespace + "-" + podUID | |
| } | |
| clean := filepath.Clean(customBase) | |
| if filepath.IsAbs(clean) { | |
| return clean + "/" + namespace + "-" + podUID | |
| for _, elem := range strings.Split(customBase, string(filepath.Separator)) { | |
| if elem == ".." { | |
| return config.DataRootFolder + namespace + "-" + podUID | |
| } | |
| } | |
| clean := filepath.Clean(customBase) | |
| if filepath.IsAbs(clean) { | |
| return filepath.Join(clean, namespace+"-"+podUID) |
| JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt} | ||
| workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path") | ||
| if err == nil && len(workDirBytes) > 0 { | ||
| JIDEntry.WorkDir = string(workDirBytes) |
There was a problem hiding this comment.
When reading WorkDir.path, the raw file contents are used directly as a filesystem path. It would be safer to strings.TrimSpace (or at least trim trailing newlines) before assigning to JIDEntry.WorkDir, to avoid hard-to-debug failures if the file ever contains a newline (manual edits, future format changes, etc.).
| JIDEntry.WorkDir = string(workDirBytes) | |
| workDir := strings.TrimSpace(string(workDirBytes)) | |
| if workDir != "" { | |
| JIDEntry.WorkDir = workDir | |
| } |
| if jid, ok := (*h.JIDs)[req.PodUID]; ok && jid.WorkDir != "" { | ||
| path = jid.WorkDir | ||
| } else if workDirBytes, rdErr := os.ReadFile(metadataPath + "/WorkDir.path"); rdErr == nil && len(workDirBytes) > 0 { | ||
| path = string(workDirBytes) |
There was a problem hiding this comment.
WorkDir.path is read from disk and used verbatim as path. Consider strings.TrimSpace before assigning so any trailing newline/whitespace in the pointer file doesn’t break log path resolution.
| path = string(workDirBytes) | |
| if workDirPath := strings.TrimSpace(string(workDirBytes)); workDirPath != "" { | |
| path = workDirPath | |
| } |
| // Pass the appropriate path to deleteContainer so it can cancel the SLURM job | ||
| // and clean up the job directory. When no-clean-workdir is set, only the | ||
| // metadata directory (filesPath) is removed; the job workdir is kept. | ||
| containerDeletePath := workDir | ||
| if noCleanWorkDir { | ||
| containerDeletePath = filesPath | ||
| } | ||
|
|
||
| err = deleteContainer(spanCtx, h.Config, string(pod.UID), h.JIDs, containerDeletePath) | ||
|
|
There was a problem hiding this comment.
StopHandler now explicitly supports the case where the in-memory JIDs cache was wiped, but it still unconditionally calls deleteContainer(...). deleteContainer dereferences (*JIDs)[podUID].JID even when the map entry is missing, which will panic when the cache is empty. Guard this call (e.g., check map membership and skip scancel/JID attributes when absent) or make deleteContainer robust to a missing JID entry so deletion remains safe after restarts.
| err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir) | ||
| if err != nil { | ||
| log.G(h.Ctx).Error(err) |
There was a problem hiding this comment.
On handleJidAndPodUid error, the code calls deleteContainer(..., workDir). If handleJidAndPodUid failed before registering the JID in h.JIDs (e.g., because filesPath didn't exist), deleteContainer will panic when it dereferences (*JIDs)[podUID].JID. Consider checking that a JID entry exists before calling deleteContainer, or update deleteContainer to handle missing entries safely.
| err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir) | |
| if err != nil { | |
| log.G(h.Ctx).Error(err) | |
| podUID := string(data.Pod.UID) | |
| if h.JIDs != nil { | |
| if _, ok := (*h.JIDs)[podUID]; ok { | |
| err = deleteContainer(spanCtx, h.Config, podUID, h.JIDs, workDir) | |
| if err != nil { | |
| log.G(h.Ctx).Error(err) | |
| } | |
| } else { | |
| log.G(h.Ctx).Info("Skipping deleteContainer cleanup because no JID was registered for pod UID " + podUID) | |
| } |
| JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt} | ||
| workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path") | ||
| if err == nil && len(workDirBytes) > 0 { | ||
| JIDEntry.WorkDir = string(workDirBytes) | ||
| } | ||
| (*h.JIDs)[string(podUID)] = &JIDEntry |
There was a problem hiding this comment.
LoadJIDs trusts the contents of WorkDir.path and stores it in-memory without validation. Since that value later influences filesystem operations (e.g., log reads and potentially os.RemoveAll when deleting jobs), it should be validated/sanitized on load (absolute path, no .. elements, and ideally ensure it matches the expected <base>/<namespace>-<uid> shape) before using it.
Adds a
slurm-job.vk.io/job-workdirannotation to override the plugin-configuredDataRootFolderas the base path for job execution. Intended for use with--gidso jobs can write to a group-accessible directory. All subsequent operations (status, logs, delete) resolve the correct path via the stored working directory, with robust fallback recovery when the in-memory cache is wiped.Design
DataRootFolder/<ns>-<uid>): always created; storesJobID.jid,PodUID.uid,PodNamespace.ns, timing files, and aWorkDir.pathpointer when a custom dir is set — enabling restart recovery viaLoadJIDs.<annotation>/<ns>-<uid>): receives job scripts, SLURM output, container logs, env files, and mount data.Changes
prepare.go:JidStruct.WorkDirfield;getJobWorkDir()helper (validates absolute path, rejects..traversal); updatedhandleJidAndPodUid()(acceptsfilesPath/workDir, writesWorkDir.path);LoadJIDs()readsWorkDir.pathon restart;createEnvFile()/prepareEnvs()accept explicitworkDir.Create.go: computesworkDirfrom annotation, creates the directory if separate fromfilesPath, threads it throughprepareMounts,prepareEnvs,produceSLURMScript, andhandleJidAndPodUid; cleans up both dirs on error.Status.go: resolvespathfromJidStruct.WorkDirwhen set; falls back to recomputing from pod annotations when the in-memory JIDs cache is empty.GetLogs.go: resolvespathfromJidStruct.WorkDirwhen set; falls back to readingWorkDir.pathfrom the metadata directory on disk when the cache is empty.Delete.go: resolvesworkDirfrom pod annotations when cache is wiped; introducesslurm-job.vk.io/no-clean-workdirannotation to preserve the job output directory on pod deletion; fixes removal logic (no double-delete —deleteContainerhandlesworkDirremoval,StopHandleronly removes the separatefilesPathmetadata dir when needed).prepare_test.go:TestGetJobWorkDircovering default path, annotation override, trailing slash, empty value, relative path rejection, and traversal rejection.README.md: documents both new annotations.Usage
Job files land in
/scratch/mygroup/<namespace>-<podUID>/; plugin metadata stays inDataRootFolder/<namespace>-<podUID>/. Whenno-clean-workdiris"true", only the metadata directory is removed on deletion — the job output directory is preserved.