Skip to content

interlink-hq/interlink-htcondor-plugin

InterLink HTCondor Sidecar Plugin

InterLink Compatible License

This repository contains an InterLink HTCondor sidecar plugin — a container manager that interfaces with InterLink instances to deploy Kubernetes pod containers on HTCondor batch systems using Singularity/Apptainer.

Features

  • Full Kubernetes Pod Support: Handles containers, volumes, secrets, configMaps, and resource requests
  • Multi-Container Pods: All containers in a pod run concurrently via the runCtn/waitCtns/endScript pattern; each container runs in the background and the job exits with the highest exit code
  • Kubernetes Probe Support: Translates livenessProbe, readinessProbe, and startupProbe specs to bash scripts; supports httpGet (curl-based) and exec (Singularity exec) probe types
  • Dual Execution Modes: Singularity containers and host-based script execution
  • InterLink API v0.6.1+ Compatible: Correct HTTP status codes, multi-pod /status responses, and initContainers field matching the PodStatus type
  • HTCondor Health Check: /system-info endpoint reports cluster connectivity via condor_status -totals
  • Comprehensive Logging: Aggregated stdout, stderr, and HTCondor job logs
  • Real-time Status: Live job status with actual timestamps from HTCondor
  • Robust Error Handling: Detailed error responses and validation

Quick Start

Prerequisites

Before you begin, make sure you have the following installed and configured:

Requirement Notes
Python 3.6+ python3 --version
pip pip3 --version
HTCondor command-line tools condor_submit, condor_q, condor_rm must be on your $PATH
Singularity or Apptainer Required to pull and run container images
Grid proxy certificate Only for GSI authentication — run voms-proxy-init or grid-proxy-init
curl For the quick test below

Tip for beginners: If you only want to verify that the plugin server starts and its REST API is reachable — without a real HTCondor cluster — you can still follow every step here. The /status health-check and the request-validation logic work without an active scheduler.


1 — Clone and install

git clone https://github.com/interlink-hq/interlink-htcondor-plugin.git
cd interlink-htcondor-plugin

# Install Python dependencies
pip3 install flask pyyaml

2 — Configure the plugin

Open SidecarConfig.yaml and adjust the values to your environment:

CommandPrefix: ""            # Optional shell prefix prepended to every job command
                             # e.g. "source /cvmfs/cms.cern.ch/cmsset_default.sh;"
ExportPodData: true          # Mount ConfigMaps and Secrets into the Singularity job
DataRootFolder: ".interlink/" # Directory used to store job scripts and tracking files

Create the directories expected by HTCondor and the plugin:

mkdir -p .interlink out err log

3 — Start the plugin server

The plugin exposes a small Flask HTTP server. The simplest way to start it — no authentication required — is:

python3 handles.py \
  --schedd-host    scheduler.example.com \
  --collector-host collector.example.com \
  --port           4000

No authentication needed for local / development setups. If your HTCondor pool uses CLAIMTOBE or no authentication, no extra flags are required.

All available flags:

Flag Description Default
--port TCP port the Flask server listens on 8000
--schedd-host Hostname of the HTCondor scheduler (empty)
--schedd-name HTCondor schedd name (empty)
--collector-host Hostname of the HTCondor collector (empty)
--condor-config Path to a custom condor_config file (empty)
--auth-method HTCondor authentication method (e.g. GSI, SCITOKENS) (empty)
--proxy Path to the X.509 proxy / SciToken file (empty)
--cadir Directory with trusted CA certificates (empty)
--certfile Path to the SSL client certificate (empty)
--keyfile Path to the SSL private key (empty)
--debug HTCondor tool debug level (e.g. D_FULLDEBUG) (empty)
--dummy-job Submit a placeholder sleep job instead of the real workload (false)

Once running you should see output similar to:

 * Running on http://0.0.0.0:4000

The server exposes five REST endpoints:

Method Path Description
POST /create Submit a new pod as an HTCondor job
POST /delete Cancel and remove a running job
GET /status Query the current status of one or more pods
GET /getLogs Retrieve job stdout / stderr
GET /system-info Report HTCondor cluster connectivity

4 — Quick plugin test with curl

The examples below target localhost:4000. Adjust the port if you used a different --port value.

4a — Health check (ping)

Send an empty status array to confirm the server is alive and the proxy file is accessible:

curl -s -X GET http://localhost:4000/status \
  -H "Content-Type: application/json" \
  -d '[]' | python3 -m json.tool

Expected response (healthy):

{
    "message": "HTCondor sidecar is alive",
    "status": "healthy"
}

4b — Submit a test pod (POST /create)

The request body follows the InterLink API v0.6.1 RetrievedPodData format: a pod object (standard Kubernetes pod spec) and a container array with resolved volumes.

curl -s -X POST http://localhost:4000/create \
  -H "Content-Type: application/json" \
  -d '{
    "pod": {
      "metadata": {
        "name":      "test-pod",
        "namespace": "default",
        "uid":       "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
      },
      "spec": {
        "containers": [
          {
            "name":    "busybox",
            "image":   "docker://busybox:latest",
            "command": ["sleep", "30"],
            "resources": {
              "requests": {
                "cpu":    "1",
                "memory": "100Mi"
              }
            }
          }
        ]
      }
    },
    "container": []
  }' | python3 -m json.tool

On success (HTTP 200) the plugin returns the HTCondor cluster job ID:

{
    "PodUID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "PodJID": "12345"
}

Note: The PodJID value (12345 in the example) is the HTCondor cluster ID. You can verify the job with condor_q 12345.

4c — Check pod status (GET /status)

Use the same pod metadata to query the current state:

curl -s -X GET http://localhost:4000/status \
  -H "Content-Type: application/json" \
  -d '[{
    "metadata": {
      "name":      "test-pod",
      "namespace": "default",
      "uid":       "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    },
    "spec": {
      "containers": [
        {"name": "busybox", "image": "docker://busybox:latest"}
      ]
    }
  }]' | python3 -m json.tool

Example response while the job is running:

[
    {
        "name": "test-pod",
        "UID":  "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
        "namespace": "default",
        "JID": "12345",
        "containers": [
            {
                "name":  "busybox",
                "state": {"running": {"startedAt": "2024-01-15T10:30:00Z"}},
                "ready": true,
                "restartCount": 0,
                "image": "docker://busybox:latest",
                "imageID": "docker://busybox:latest"
            }
        ],
        "initContainers": []
    }
]

HTCondor job status codes map to Kubernetes container states as follows:

HTCondor status Code Kubernetes state
Idle (queued) 1 waitingContainerCreating
Running 2 running
Completed 4 terminatedCompleted
Removed 3 terminatedCancelled
Held 5 waitingJobHeld

4d — HTCondor connectivity check (GET /system-info)

Returns the cluster connectivity status using condor_status -totals:

curl -s http://localhost:4000/system-info | python3 -m json.tool

Example response when the collector is reachable:

{
    "status": "ok",
    "timestamp": "2024-01-15T10:30:00Z",
    "htcondor_connected": true,
    "condor_status_output": "Total Owner Claimed Unclaimed Matched Preempting Drain\n  10     0       8         2       0          0     0\n"
}

If the collector is unreachable, htcondor_connected is false and status is "warning".

4e — Delete a pod (POST /delete)

curl -s -X POST http://localhost:4000/delete \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "name":      "test-pod",
      "namespace": "default",
      "uid":       "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    }
  }' | python3 -m json.tool

Expected response (HTTP 200):

{
    "message": "Pod successfully deleted",
    "podUID":  "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "podName": "test-pod"
}

5 — Test with Kubernetes (requires InterLink + Virtual Kubelet)

To set up a full InterLink deployment with a Virtual Kubelet node, follow the official guide: 👉 InterLink in-cluster setup

Once your Virtual Kubelet node is registered and ready, you can test end-to-end by applying the manifests in the tests/ directory:

# Apply supporting resources first
kubectl apply -f ./tests/test_configmap.yaml
kubectl apply -f ./tests/test_secret.yaml

# Submit the test pod (mounts the ConfigMap and Secret via volume binds)
kubectl apply -f ./tests/busyecho_k8s.yaml

# Watch the pod status
kubectl get pod test-pod -w

The test pod runs sleep 10 inside a busybox Singularity container and mounts both the ConfigMap and Secret as volumes.

Multi-container pods

When a pod spec contains more than one container, each container runs concurrently in the background via the runCtn/waitCtns/endScript helper pattern (matching the SLURM plugin). The HTCondor job exits with the highest exit code of all containers. Per-container stdout/stderr is saved to ${workingPath}/run-<container-name>.out.

Kubernetes probes

livenessProbe, readinessProbe, and startupProbe specs are translated to bash scripts and run in background sub-shells alongside the container. Supported probe types:

Probe type Implementation
httpGet curl -sf http://localhost:<port><path>
exec singularity exec <image> <command>
tcpSocket Not supported (warning emitted, probe skipped)

Example pod spec with probes:

containers:
  - name: my-app
    image: docker://myimage:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
        initialDelaySeconds: 5

Singularity options for exec probes can be set via the slurm-job.vk.io/singularity-options pod annotation.

Host-based script execution

A special execution mode is triggered when the container image name starts with the literal string host. In that case the plugin extracts the script from the container arguments and submits it directly — bypassing Singularity entirely. This is useful for site-specific scripts that must run in the host environment:

kubectl apply -f ./tests/production_deployment_LNL.yaml

6 — Running with Docker

A pre-built image based on htcondor/mini:23.0.25-el9 is available via the Dockerfile:

# Build the image
docker build -t interlink-htcondor-plugin -f docker/Dockerfile .

# Run the container (adjust environment variables for your site)
docker run --rm -p 4000:8000 \
  -v /etc/grid-security:/etc/grid-security:ro \
  -v /tmp:/tmp \
  interlink-htcondor-plugin

Inside the container the plugin starts automatically (python3 handles.py) alongside the HTCondor mini-schedd (/start.sh).


Troubleshooting

Symptom Likely cause Fix
condor_submit: command not found HTCondor CLI not on $PATH Install HTCondor or add its bin/ to $PATH
HTTP 503 on health check Proxy file missing or expired For GSI: run voms-proxy-init; for SciTokens: refresh the JWT. Pass the path via --proxy
HTTP 500 on /create Job submission to HTCondor failed Check err/ and log/ directories for HTCondor error output
FileNotFoundError: .interlink/… Required directories missing Run mkdir -p .interlink out err log
Pod stuck in ContainerCreating HTCondor job is queued (Idle) Run condor_q to inspect the queue; check for hold reasons
Flask import error Python dependencies not installed Run pip3 install flask pyyaml
/system-info returns htcondor_connected: false Collector unreachable Check HTCondor collector host/port settings; verify condor_status works from the shell
Probe type tcpSocket not working Not supported Use httpGet or exec probe types; tcpSocket probes are skipped with a warning

Appendix: Authentication

By default the plugin connects to HTCondor without any special authentication (suitable for local or development clusters). For production deployments that require authentication, pass --auth-method and the relevant credential flags.

GSI (X.509 proxy certificates)

python3 handles.py \
  --schedd-host    scheduler.example.com \
  --collector-host collector.example.com \
  --auth-method    GSI \
  --proxy          /tmp/x509up_u$(id -u) \
  --port           4000

Certificates must be in /etc/grid-security/certificates and a valid proxy must exist at the path provided to --proxy. Generate a proxy with:

voms-proxy-init --voms <your-vo>
# or
grid-proxy-init

SciTokens

python3 handles.py \
  --schedd-host    scheduler.example.com \
  --collector-host collector.example.com \
  --auth-method    SCITOKENS \
  --proxy          /path/to/scitoken.jwt \
  --port           4000

Pass the SciToken JWT file via --proxy.

SSL (mutual TLS)

python3 handles.py \
  --schedd-host    scheduler.example.com \
  --collector-host collector.example.com \
  --auth-method    SSL \
  --certfile       /path/to/client.crt \
  --keyfile        /path/to/client.key \
  --cadir          /etc/grid-security/certificates \
  --port           4000

Repository management

All changes should go through Pull Requests.

Merge management

  • Only squash should be enforced in the repository settings.
  • Update commit message for the squashed commits as needed.

Protection on main branch

To be configured on the repository settings.

  • Require pull request reviews before merging
    • Dismiss stale pull request approvals when new commits are pushed
    • Require review from Code Owners
  • Require status checks to pass before merging
    • GitHub actions if available
    • Other checks as available and relevant
    • Require branches to be up to date before merging
  • Include administrators

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors