Skip to content

feat: support custom job working directory via annotation#137

Open
Copilot wants to merge 4 commits into
mainfrom
copilot/add-job-working-dir-annotation
Open

feat: support custom job working directory via annotation#137
Copilot wants to merge 4 commits into
mainfrom
copilot/add-job-working-dir-annotation

Conversation

Copilot AI commented Apr 5, 2026

Copy link
Copy Markdown
Contributor

Adds a slurm-job.vk.io/job-workdir annotation to override the plugin-configured DataRootFolder as the base path for job execution. Intended for use with --gid so jobs can write to a group-accessible directory. All subsequent operations (status, logs, delete) resolve the correct path via the stored working directory, with robust fallback recovery when the in-memory cache is wiped.

Design

  • Metadata directory (DataRootFolder/<ns>-<uid>): always created; stores JobID.jid, PodUID.uid, PodNamespace.ns, timing files, and a WorkDir.path pointer when a custom dir is set — enabling restart recovery via LoadJIDs.
  • Working directory (<annotation>/<ns>-<uid>): receives job scripts, SLURM output, container logs, env files, and mount data.

Changes

  • prepare.go: JidStruct.WorkDir field; getJobWorkDir() helper (validates absolute path, rejects .. traversal); updated handleJidAndPodUid() (accepts filesPath/workDir, writes WorkDir.path); LoadJIDs() reads WorkDir.path on restart; createEnvFile()/prepareEnvs() accept explicit workDir.
  • Create.go: computes workDir from annotation, creates the directory if separate from filesPath, threads it through prepareMounts, prepareEnvs, produceSLURMScript, and handleJidAndPodUid; cleans up both dirs on error.
  • Status.go: resolves path from JidStruct.WorkDir when set; falls back to recomputing from pod annotations when the in-memory JIDs cache is empty.
  • GetLogs.go: resolves path from JidStruct.WorkDir when set; falls back to reading WorkDir.path from the metadata directory on disk when the cache is empty.
  • Delete.go: resolves workDir from pod annotations when cache is wiped; introduces slurm-job.vk.io/no-clean-workdir annotation to preserve the job output directory on pod deletion; fixes removal logic (no double-delete — deleteContainer handles workDir removal, StopHandler only removes the separate filesPath metadata dir when needed).
  • prepare_test.go: TestGetJobWorkDir covering default path, annotation override, trailing slash, empty value, relative path rejection, and traversal rejection.
  • README.md: documents both new annotations.

Usage

metadata:
  annotations:
    slurm-job.vk.io/job-workdir: /scratch/mygroup
    slurm-job.vk.io/flags: "--gid=mygroup"
    # Optional: keep job output after pod deletion
    slurm-job.vk.io/no-clean-workdir: "true"

Job files land in /scratch/mygroup/<namespace>-<podUID>/; plugin metadata stays in DataRootFolder/<namespace>-<podUID>/. When no-clean-workdir is "true", only the metadata directory is removed on deletion — the job output directory is preserved.

Copilot AI linked an issue Apr 5, 2026 that may be closed by this pull request
Copilot AI changed the title [WIP] Add job working directory in annotation for --gid option feat: support custom job working directory via annotation Apr 5, 2026
Copilot AI requested a review from dciangot April 5, 2026 17:10

@dciangot dciangot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot if the memory cache of the job is wiped, can the status call recover the workdir from annotations?

Also check the workdir removal logic, and introduce an annotation to disable the cleanup of the directory

Copilot AI commented Apr 5, 2026

Copy link
Copy Markdown
Contributor Author

@copilot if the memory cache of the job is wiped, can the status call recover the workdir from annotations?

Also check the workdir removal logic, and introduce an annotation to disable the cleanup o...

Addressed in commit e31201d:

Memory cache recovery: All three handlers now recover the workdir even when the JIDs cache is wiped:

  • Status.go: falls back to getJobWorkDir() computed from the pod's annotations (pod is available in the request body)
  • GetLogs.go: reads WorkDir.path from the metadata directory on disk (the metadata dir is always at the predictable DataRootFolder/<ns>-<uid> path, so it's findable even without the cache)
  • Delete.go: falls back to getJobWorkDir() from pod annotations (pod is in the request body)

No-clean-workdir annotation: Added slurm-job.vk.io/no-clean-workdir: "true". When set (and a custom job-workdir is in use), deleteContainer is called with filesPath instead of workDir, so only the plugin metadata directory is removed — the job output directory is preserved.

Removal logic fix: Removed the double-delete of workDir that was previously happening (deleteContainer already removes it; the extra os.RemoveAll(workDir) in StopHandler was redundant). Now the logic is clean: deleteContainer(containerDeletePath) → then remove filesPath separately if it differs from workDir.

Copilot AI requested a review from dciangot April 5, 2026 17:24

@dciangot dciangot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@dciangot

dciangot commented Apr 5, 2026

Copy link
Copy Markdown
Member

@Bianco95 comsider for 0.6.2

@dciangot dciangot marked this pull request as ready for review April 5, 2026 17:31
@dciangot dciangot requested a review from Copilot April 5, 2026 17:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for configuring a per-job working directory via the slurm-job.vk.io/job-workdir pod annotation, while keeping plugin metadata under DataRootFolder for restart recovery and consistent lifecycle operations (status/logs/delete).

Changes:

  • Introduces JidStruct.WorkDir, getJobWorkDir(...) helper, and persistence of the resolved workdir via WorkDir.path for restart recovery.
  • Threads the resolved workDir through job creation (mounts/envs/script generation) and updates status/log resolution to prefer cached/persisted workdir.
  • Adds deletion behavior control via slurm-job.vk.io/no-clean-workdir and documents new annotations; includes unit tests for getJobWorkDir.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
README.md Documents job-workdir and no-clean-workdir annotations and intended behavior.
pkg/slurm/Create.go Computes workDir from annotation and uses it for script/log/env placement; persists metadata via handleJidAndPodUid.
pkg/slurm/Delete.go Resolves workDir for deletion and supports preserving job output dir via no-clean-workdir.
pkg/slurm/GetLogs.go Resolves log base path from in-memory cache or WorkDir.path on disk.
pkg/slurm/Status.go Resolves status base path from annotation fallback and in-memory cache when present.
pkg/slurm/prepare.go Adds WorkDir to JID struct, getJobWorkDir, loads/persists WorkDir.path, and updates env generation to target workDir.
pkg/slurm/prepare_test.go Adds TestGetJobWorkDir coverage for default/override and validation cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/slurm/Create.go
metadata := data.Pod.ObjectMeta
filesPath := h.Config.DataRootFolder + data.Pod.Namespace + "-" + string(data.Pod.UID)
workDir := getJobWorkDir(h.Config, metadata.Annotations, data.Pod.Namespace, string(data.Pod.UID))

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a custom workDir is used, the code no longer ensures that the metadata directory (filesPath under DataRootFolder) exists. prepareMounts/produceSLURMScript now create workDir, but handleJidAndPodUid still writes JobID.jid, PodUID.uid, etc. into filesPath, which will fail if filesPath was never MkdirAll'd. Create the metadata directory unconditionally (e.g., os.MkdirAll(filesPath, ...)) before submitting the job so on-disk recovery via LoadJIDs works reliably.

Suggested change
if err = os.MkdirAll(filesPath, 0o755); err != nil {
statusCode = http.StatusInternalServerError
h.handleError(spanCtx, w, statusCode, err)
return
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/prepare.go
Comment on lines +1439 to 1446
func handleJidAndPodUid(Ctx context.Context, pod v1.Pod, JIDs *map[string]*JidStruct, output string, filesPath string, workDir string) (string, error) {
r := regexp.MustCompile(`Submitted batch job (?P<jid>\d+)`)
jid := r.FindStringSubmatch(output)
fJID, err := os.Create(path + "/JobID.jid")
fJID, err := os.Create(filesPath + "/JobID.jid")
if err != nil {
log.G(Ctx).Error("Can't create jid_file")
return "", err
}

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleJidAndPodUid assumes filesPath already exists, but with the new separation between filesPath (metadata) and workDir (job output) that may not be true. Consider creating filesPath inside this function (or returning a clearer error) before os.Create(filesPath+"/JobID.jid"), otherwise custom job-workdir submissions can fail and prevent WorkDir.path from being persisted.

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/prepare.go
Comment on lines +61 to +66
if strings.Contains(customBase, "..") {
return config.DataRootFolder + namespace + "-" + podUID
}
clean := filepath.Clean(customBase)
if filepath.IsAbs(clean) {
return clean + "/" + namespace + "-" + podUID

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getJobWorkDir rejects any annotation value containing the substring "..". This is stricter than “reject traversal components” and will incorrectly reject valid directory names that simply contain ".." (e.g. "/scratch/v1..backup"). To match the comment/intent, validate path elements (split on the path separator and reject elements equal to ".."), and consider using filepath.Join(clean, namespace+"-"+podUID) instead of manual string concatenation.

Suggested change
if strings.Contains(customBase, "..") {
return config.DataRootFolder + namespace + "-" + podUID
}
clean := filepath.Clean(customBase)
if filepath.IsAbs(clean) {
return clean + "/" + namespace + "-" + podUID
for _, elem := range strings.Split(customBase, string(filepath.Separator)) {
if elem == ".." {
return config.DataRootFolder + namespace + "-" + podUID
}
}
clean := filepath.Clean(customBase)
if filepath.IsAbs(clean) {
return filepath.Join(clean, namespace+"-"+podUID)

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/prepare.go
JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt}
workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path")
if err == nil && len(workDirBytes) > 0 {
JIDEntry.WorkDir = string(workDirBytes)

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading WorkDir.path, the raw file contents are used directly as a filesystem path. It would be safer to strings.TrimSpace (or at least trim trailing newlines) before assigning to JIDEntry.WorkDir, to avoid hard-to-debug failures if the file ever contains a newline (manual edits, future format changes, etc.).

Suggested change
JIDEntry.WorkDir = string(workDirBytes)
workDir := strings.TrimSpace(string(workDirBytes))
if workDir != "" {
JIDEntry.WorkDir = workDir
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/GetLogs.go
if jid, ok := (*h.JIDs)[req.PodUID]; ok && jid.WorkDir != "" {
path = jid.WorkDir
} else if workDirBytes, rdErr := os.ReadFile(metadataPath + "/WorkDir.path"); rdErr == nil && len(workDirBytes) > 0 {
path = string(workDirBytes)

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkDir.path is read from disk and used verbatim as path. Consider strings.TrimSpace before assigning so any trailing newline/whitespace in the pointer file doesn’t break log path resolution.

Suggested change
path = string(workDirBytes)
if workDirPath := strings.TrimSpace(string(workDirBytes)); workDirPath != "" {
path = workDirPath
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/Delete.go
Comment on lines +64 to 73
// Pass the appropriate path to deleteContainer so it can cancel the SLURM job
// and clean up the job directory. When no-clean-workdir is set, only the
// metadata directory (filesPath) is removed; the job workdir is kept.
containerDeletePath := workDir
if noCleanWorkDir {
containerDeletePath = filesPath
}

err = deleteContainer(spanCtx, h.Config, string(pod.UID), h.JIDs, containerDeletePath)

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StopHandler now explicitly supports the case where the in-memory JIDs cache was wiped, but it still unconditionally calls deleteContainer(...). deleteContainer dereferences (*JIDs)[podUID].JID even when the map entry is missing, which will panic when the cache is empty. Guard this call (e.g., check map membership and skip scancel/JID attributes when absent) or make deleteContainer robust to a missing JID entry so deletion remains safe after restarts.

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/Create.go
Comment on lines +296 to 298
err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir)
if err != nil {
log.G(h.Ctx).Error(err)

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On handleJidAndPodUid error, the code calls deleteContainer(..., workDir). If handleJidAndPodUid failed before registering the JID in h.JIDs (e.g., because filesPath didn't exist), deleteContainer will panic when it dereferences (*JIDs)[podUID].JID. Consider checking that a JID entry exists before calling deleteContainer, or update deleteContainer to handle missing entries safely.

Suggested change
err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir)
if err != nil {
log.G(h.Ctx).Error(err)
podUID := string(data.Pod.UID)
if h.JIDs != nil {
if _, ok := (*h.JIDs)[podUID]; ok {
err = deleteContainer(spanCtx, h.Config, podUID, h.JIDs, workDir)
if err != nil {
log.G(h.Ctx).Error(err)
}
} else {
log.G(h.Ctx).Info("Skipping deleteContainer cleanup because no JID was registered for pod UID " + podUID)
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/slurm/prepare.go
Comment on lines 498 to 503
JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt}
workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path")
if err == nil && len(workDirBytes) > 0 {
JIDEntry.WorkDir = string(workDirBytes)
}
(*h.JIDs)[string(podUID)] = &JIDEntry

Copilot AI Apr 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadJIDs trusts the contents of WorkDir.path and stores it in-memory without validation. Since that value later influences filesystem operations (e.g., log reads and potentially os.RemoveAll when deleting jobs), it should be validated/sanitized on load (absolute path, no .. elements, and ideally ensure it matches the expected <base>/<namespace>-<uid> shape) before using it.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pass job working dir in annotation

3 participants