feat: support custom job working directory via annotation by Copilot · Pull Request #137 · interlink-hq/interlink-slurm-plugin

Copilot · 2026-04-05T16:53:13Z

Adds a slurm-job.vk.io/job-workdir annotation to override the plugin-configured DataRootFolder as the base path for job execution. Intended for use with --gid so jobs can write to a group-accessible directory. All subsequent operations (status, logs, delete) resolve the correct path via the stored working directory, with robust fallback recovery when the in-memory cache is wiped.

Design

Metadata directory (DataRootFolder/<ns>-<uid>): always created; stores JobID.jid, PodUID.uid, PodNamespace.ns, timing files, and a WorkDir.path pointer when a custom dir is set — enabling restart recovery via LoadJIDs.
Working directory (<annotation>/<ns>-<uid>): receives job scripts, SLURM output, container logs, env files, and mount data.

Changes

prepare.go: JidStruct.WorkDir field; getJobWorkDir() helper (validates absolute path, rejects .. traversal); updated handleJidAndPodUid() (accepts filesPath/workDir, writes WorkDir.path); LoadJIDs() reads WorkDir.path on restart; createEnvFile()/prepareEnvs() accept explicit workDir.
Create.go: computes workDir from annotation, creates the directory if separate from filesPath, threads it through prepareMounts, prepareEnvs, produceSLURMScript, and handleJidAndPodUid; cleans up both dirs on error.
Status.go: resolves path from JidStruct.WorkDir when set; falls back to recomputing from pod annotations when the in-memory JIDs cache is empty.
GetLogs.go: resolves path from JidStruct.WorkDir when set; falls back to reading WorkDir.path from the metadata directory on disk when the cache is empty.
Delete.go: resolves workDir from pod annotations when cache is wiped; introduces slurm-job.vk.io/no-clean-workdir annotation to preserve the job output directory on pod deletion; fixes removal logic (no double-delete — deleteContainer handles workDir removal, StopHandler only removes the separate filesPath metadata dir when needed).
prepare_test.go: TestGetJobWorkDir covering default path, annotation override, trailing slash, empty value, relative path rejection, and traversal rejection.
README.md: documents both new annotations.

Usage

metadata:
  annotations:
    slurm-job.vk.io/job-workdir: /scratch/mygroup
    slurm-job.vk.io/flags: "--gid=mygroup"
    # Optional: keep job output after pod deletion
    slurm-job.vk.io/no-clean-workdir: "true"

Job files land in /scratch/mygroup/<namespace>-<podUID>/; plugin metadata stays in DataRootFolder/<namespace>-<podUID>/. When no-clean-workdir is "true", only the metadata directory is removed on deletion — the job output directory is preserved.

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

… naming Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

dciangot

@copilot if the memory cache of the job is wiped, can the status call recover the workdir from annotations?

Also check the workdir removal logic, and introduce an annotation to disable the cleanup of the directory

…clean-workdir annotation Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4b5c9be5-429f-4d38-9041-25ede0beb360 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

Copilot · 2026-04-05T17:23:49Z

@copilot if the memory cache of the job is wiped, can the status call recover the workdir from annotations?

Also check the workdir removal logic, and introduce an annotation to disable the cleanup o...

Addressed in commit e31201d:

Memory cache recovery: All three handlers now recover the workdir even when the JIDs cache is wiped:

Status.go: falls back to getJobWorkDir() computed from the pod's annotations (pod is available in the request body)
GetLogs.go: reads WorkDir.path from the metadata directory on disk (the metadata dir is always at the predictable DataRootFolder/<ns>-<uid> path, so it's findable even without the cache)
Delete.go: falls back to getJobWorkDir() from pod annotations (pod is in the request body)

No-clean-workdir annotation: Added slurm-job.vk.io/no-clean-workdir: "true". When set (and a custom job-workdir is in use), deleteContainer is called with filesPath instead of workDir, so only the plugin metadata directory is removed — the job output directory is preserved.

Removal logic fix: Removed the double-delete of workDir that was previously happening (deleteContainer already removes it; the extra os.RemoveAll(workDir) in StopHandler was redundant). Now the logic is clean: deleteContainer(containerDeletePath) → then remove filesPath separately if it differs from workDir.

dciangot

Lgtm

dciangot · 2026-04-05T17:30:53Z

@Bianco95 comsider for 0.6.2

Copilot

Pull request overview

Adds support for configuring a per-job working directory via the slurm-job.vk.io/job-workdir pod annotation, while keeping plugin metadata under DataRootFolder for restart recovery and consistent lifecycle operations (status/logs/delete).

Changes:

Introduces JidStruct.WorkDir, getJobWorkDir(...) helper, and persistence of the resolved workdir via WorkDir.path for restart recovery.
Threads the resolved workDir through job creation (mounts/envs/script generation) and updates status/log resolution to prefer cached/persisted workdir.
Adds deletion behavior control via slurm-job.vk.io/no-clean-workdir and documents new annotations; includes unit tests for getJobWorkDir.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
README.md	Documents `job-workdir` and `no-clean-workdir` annotations and intended behavior.
pkg/slurm/Create.go	Computes `workDir` from annotation and uses it for script/log/env placement; persists metadata via `handleJidAndPodUid`.
pkg/slurm/Delete.go	Resolves `workDir` for deletion and supports preserving job output dir via `no-clean-workdir`.
pkg/slurm/GetLogs.go	Resolves log base path from in-memory cache or `WorkDir.path` on disk.
pkg/slurm/Status.go	Resolves status base path from annotation fallback and in-memory cache when present.
pkg/slurm/prepare.go	Adds `WorkDir` to JID struct, `getJobWorkDir`, loads/persists `WorkDir.path`, and updates env generation to target `workDir`.
pkg/slurm/prepare_test.go	Adds `TestGetJobWorkDir` coverage for default/override and validation cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-05T17:37:38Z

 	metadata := data.Pod.ObjectMeta
 	filesPath := h.Config.DataRootFolder + data.Pod.Namespace + "-" + string(data.Pod.UID)
+	workDir := getJobWorkDir(h.Config, metadata.Annotations, data.Pod.Namespace, string(data.Pod.UID))



When a custom workDir is used, the code no longer ensures that the metadata directory (filesPath under DataRootFolder) exists. prepareMounts/produceSLURMScript now create workDir, but handleJidAndPodUid still writes JobID.jid, PodUID.uid, etc. into filesPath, which will fail if filesPath was never MkdirAll'd. Create the metadata directory unconditionally (e.g., os.MkdirAll(filesPath, ...)) before submitting the job so on-disk recovery via LoadJIDs works reliably.

Suggested change

if err = os.MkdirAll(filesPath, 0o755); err != nil {

statusCode = http.StatusInternalServerError

h.handleError(spanCtx, w, statusCode, err)

return

}

Copilot · 2026-04-05T17:37:39Z

+func handleJidAndPodUid(Ctx context.Context, pod v1.Pod, JIDs *map[string]*JidStruct, output string, filesPath string, workDir string) (string, error) {
 	r := regexp.MustCompile(`Submitted batch job (?P<jid>\d+)`)
 	jid := r.FindStringSubmatch(output)
-	fJID, err := os.Create(path + "/JobID.jid")
+	fJID, err := os.Create(filesPath + "/JobID.jid")
 	if err != nil {
 		log.G(Ctx).Error("Can't create jid_file")
 		return "", err
 	}


handleJidAndPodUid assumes filesPath already exists, but with the new separation between filesPath (metadata) and workDir (job output) that may not be true. Consider creating filesPath inside this function (or returning a clearer error) before os.Create(filesPath+"/JobID.jid"), otherwise custom job-workdir submissions can fail and prevent WorkDir.path from being persisted.

Copilot · 2026-04-05T17:37:39Z

+		if strings.Contains(customBase, "..") {
+			return config.DataRootFolder + namespace + "-" + podUID
+		}
+		clean := filepath.Clean(customBase)
+		if filepath.IsAbs(clean) {
+			return clean + "/" + namespace + "-" + podUID


getJobWorkDir rejects any annotation value containing the substring "..". This is stricter than “reject traversal components” and will incorrectly reject valid directory names that simply contain ".." (e.g. "/scratch/v1..backup"). To match the comment/intent, validate path elements (split on the path separator and reject elements equal to ".."), and consider using filepath.Join(clean, namespace+"-"+podUID) instead of manual string concatenation.

Suggested change

if strings.Contains(customBase, "..") {

return config.DataRootFolder + namespace + "-" + podUID

}

clean := filepath.Clean(customBase)

if filepath.IsAbs(clean) {

return clean + "/" + namespace + "-" + podUID

for _, elem := range strings.Split(customBase, string(filepath.Separator)) {

if elem == ".." {

return config.DataRootFolder + namespace + "-" + podUID

}

}

clean := filepath.Clean(customBase)

if filepath.IsAbs(clean) {

return filepath.Join(clean, namespace+"-"+podUID)

Copilot · 2026-04-05T17:37:39Z

 			JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt}
+			workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path")
+			if err == nil && len(workDirBytes) > 0 {
+				JIDEntry.WorkDir = string(workDirBytes)


When reading WorkDir.path, the raw file contents are used directly as a filesystem path. It would be safer to strings.TrimSpace (or at least trim trailing newlines) before assigning to JIDEntry.WorkDir, to avoid hard-to-debug failures if the file ever contains a newline (manual edits, future format changes, etc.).

Suggested change

JIDEntry.WorkDir = string(workDirBytes)

workDir := strings.TrimSpace(string(workDirBytes))

if workDir != "" {

JIDEntry.WorkDir = workDir

}

Copilot · 2026-04-05T17:37:39Z

+	if jid, ok := (*h.JIDs)[req.PodUID]; ok && jid.WorkDir != "" {
+		path = jid.WorkDir
+	} else if workDirBytes, rdErr := os.ReadFile(metadataPath + "/WorkDir.path"); rdErr == nil && len(workDirBytes) > 0 {
+		path = string(workDirBytes)


WorkDir.path is read from disk and used verbatim as path. Consider strings.TrimSpace before assigning so any trailing newline/whitespace in the pointer file doesn’t break log path resolution.

Suggested change

path = string(workDirBytes)

if workDirPath := strings.TrimSpace(string(workDirBytes)); workDirPath != "" {

path = workDirPath

}

Copilot · 2026-04-05T17:37:40Z

+	// Pass the appropriate path to deleteContainer so it can cancel the SLURM job
+	// and clean up the job directory.  When no-clean-workdir is set, only the
+	// metadata directory (filesPath) is removed; the job workdir is kept.
+	containerDeletePath := workDir
+	if noCleanWorkDir {
+		containerDeletePath = filesPath
+	}
+
+	err = deleteContainer(spanCtx, h.Config, string(pod.UID), h.JIDs, containerDeletePath)



StopHandler now explicitly supports the case where the in-memory JIDs cache was wiped, but it still unconditionally calls deleteContainer(...). deleteContainer dereferences (*JIDs)[podUID].JID even when the map entry is missing, which will panic when the cache is empty. Guard this call (e.g., check map membership and skip scancel/JID attributes when absent) or make deleteContainer robust to a missing JID entry so deletion remains safe after restarts.

Copilot · 2026-04-05T17:37:40Z

+		err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir)
 		if err != nil {
 			log.G(h.Ctx).Error(err)


On handleJidAndPodUid error, the code calls deleteContainer(..., workDir). If handleJidAndPodUid failed before registering the JID in h.JIDs (e.g., because filesPath didn't exist), deleteContainer will panic when it dereferences (*JIDs)[podUID].JID. Consider checking that a JID entry exists before calling deleteContainer, or update deleteContainer to handle missing entries safely.

Suggested change

err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir)

if err != nil {

log.G(h.Ctx).Error(err)

podUID := string(data.Pod.UID)

if h.JIDs != nil {

if _, ok := (*h.JIDs)[podUID]; ok {

err = deleteContainer(spanCtx, h.Config, podUID, h.JIDs, workDir)

if err != nil {

log.G(h.Ctx).Error(err)

}

} else {

log.G(h.Ctx).Info("Skipping deleteContainer cleanup because no JID was registered for pod UID " + podUID)

}

Copilot · 2026-04-05T17:37:40Z

 			JIDEntry := JidStruct{PodUID: string(podUID), PodNamespace: string(podNamespace), JID: string(JID), StartTime: StartedAt, EndTime: FinishedAt}
+			workDirBytes, err := os.ReadFile(path + entry.Name() + "/" + "WorkDir.path")
+			if err == nil && len(workDirBytes) > 0 {
+				JIDEntry.WorkDir = string(workDirBytes)
+			}
 			(*h.JIDs)[string(podUID)] = &JIDEntry


LoadJIDs trusts the contents of WorkDir.path and stores it in-memory without validation. Since that value later influences filesystem operations (e.g., log reads and potentially os.RemoveAll when deleting jobs), it should be validated/sanitized on load (absolute path, no .. elements, and ideally ensure it matches the expected <base>/<namespace>-<uid> shape) before using it.

Initial plan

309e5ea

Copilot AI assigned Copilot and dciangot Apr 5, 2026

Copilot started work on behalf of dciangot April 5, 2026 16:53 View session

Copilot AI linked an issue Apr 5, 2026 that may be closed by this pull request

pass job working dir in annotation #136

Open

Copilot AI and others added 2 commits April 5, 2026 17:03

feat: pass job working dir via annotation slurm-job.vk.io/job-workdir

81409d5

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

fix: address code review - path validation, error handling, parameter…

c2de872

… naming Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/12e0a146-7ebc-4bc8-9e42-87b0eef0e7f5 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add job working directory in annotation for --gid option~~ feat: support custom job working directory via annotation Apr 5, 2026

Copilot AI requested a review from dciangot April 5, 2026 17:10

Copilot finished work on behalf of dciangot April 5, 2026 17:10

dciangot reviewed Apr 5, 2026

View reviewed changes

Copilot started work on behalf of dciangot April 5, 2026 17:18 View session

feat: recover workdir from annotations/disk when cache wiped, add no-…

e31201d

…clean-workdir annotation Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4b5c9be5-429f-4d38-9041-25ede0beb360 Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>

Copilot AI requested a review from dciangot April 5, 2026 17:24

Copilot finished work on behalf of dciangot April 5, 2026 17:24

dciangot approved these changes Apr 5, 2026

View reviewed changes

dciangot marked this pull request as ready for review April 5, 2026 17:31

dciangot requested a review from Copilot April 5, 2026 17:32

Copilot started reviewing on behalf of dciangot April 5, 2026 17:32 View session

Copilot AI reviewed Apr 5, 2026

View reviewed changes

+	if err = os.MkdirAll(filesPath, 0o755); err != nil {
+		statusCode = http.StatusInternalServerError
+		h.handleError(spanCtx, w, statusCode, err)
+		return
+	}

-				JIDEntry.WorkDir = string(workDirBytes)
+				workDir := strings.TrimSpace(string(workDirBytes))
+				if workDir != "" {
+					JIDEntry.WorkDir = workDir
+				}

-		path = string(workDirBytes)
+		if workDirPath := strings.TrimSpace(string(workDirBytes)); workDirPath != "" {
+			path = workDirPath
+		}

-		err = deleteContainer(spanCtx, h.Config, string(data.Pod.UID), h.JIDs, workDir)
-		if err != nil {
-			log.G(h.Ctx).Error(err)
+		podUID := string(data.Pod.UID)
+		if h.JIDs != nil {
+			if _, ok := (*h.JIDs)[podUID]; ok {
+				err = deleteContainer(spanCtx, h.Config, podUID, h.JIDs, workDir)
+				if err != nil {
+					log.G(h.Ctx).Error(err)
+				}
+			} else {
+				log.G(h.Ctx).Info("Skipping deleteContainer cleanup because no JID was registered for pod UID " + podUID)
+			}

Conversation

Copilot AI commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design

Changes

Usage

Uh oh!

dciangot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Apr 5, 2026

Uh oh!

dciangot left a comment

Choose a reason for hiding this comment

Uh oh!

dciangot commented Apr 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Apr 5, 2026 •

edited

Loading