This repository contains an InterLink HTCondor sidecar plugin — a container manager that interfaces with InterLink instances to deploy Kubernetes pod containers on HTCondor batch systems using Singularity/Apptainer.
- Full Kubernetes Pod Support: Handles containers, volumes, secrets, configMaps, and resource requests
- Multi-Container Pods: All containers in a pod run concurrently via the runCtn/waitCtns/endScript pattern; each container runs in the background and the job exits with the highest exit code
- Kubernetes Probe Support: Translates livenessProbe, readinessProbe, and startupProbe specs to bash scripts; supports httpGet (curl-based) and exec (Singularity exec) probe types
- Dual Execution Modes: Singularity containers and host-based script execution
- InterLink API v0.6.1+ Compatible: Correct HTTP status codes, multi-pod /status responses, and initContainers field matching the PodStatus type
- HTCondor Health Check: /system-info endpoint reports cluster connectivity via condor_status -totals
- Comprehensive Logging: Aggregated stdout, stderr, and HTCondor job logs
- Real-time Status: Live job status with actual timestamps from HTCondor
- Robust Error Handling: Detailed error responses and validation
Before you begin, make sure you have the following installed and configured:
| Requirement | Notes |
|---|---|
| Python 3.6+ | python3 --version |
| pip | pip3 --version |
| HTCondor command-line tools | condor_submit, condor_q, condor_rm must be on your $PATH |
| Singularity or Apptainer | Required to pull and run container images |
| Grid proxy certificate | Only for GSI authentication — run voms-proxy-init or grid-proxy-init |
| curl | For the quick test below |
Tip for beginners: If you only want to verify that the plugin server starts and its REST API is reachable, you can follow every step here without a real HTCondor cluster. The /status health check and the request-validation logic work without an active scheduler.
git clone https://github.com/interlink-hq/interlink-htcondor-plugin.git
cd interlink-htcondor-plugin
# Install Python dependencies
pip3 install flask pyyaml

Open SidecarConfig.yaml and adjust the values to your environment:
CommandPrefix: "" # Optional shell prefix prepended to every job command
# e.g. "source /cvmfs/cms.cern.ch/cmsset_default.sh;"
ExportPodData: true # Mount ConfigMaps and Secrets into the Singularity job
DataRootFolder: ".interlink/" # Directory used to store job scripts and tracking files

Create the directories expected by HTCondor and the plugin:
mkdir -p .interlink out err log

The plugin exposes a small Flask HTTP server. The simplest way to start it, with no authentication required, is:
python3 handles.py \
--schedd-host scheduler.example.com \
--collector-host collector.example.com \
--port 4000

No authentication is needed for local or development setups: if your HTCondor pool uses CLAIMTOBE or no authentication, no extra flags are required.
All available flags:
| Flag | Description | Default |
|---|---|---|
| --port | TCP port the Flask server listens on | 8000 |
| --schedd-host | Hostname of the HTCondor scheduler | (empty) |
| --schedd-name | HTCondor schedd name | (empty) |
| --collector-host | Hostname of the HTCondor collector | (empty) |
| --condor-config | Path to a custom condor_config file | (empty) |
| --auth-method | HTCondor authentication method (e.g. GSI, SCITOKENS) | (empty) |
| --proxy | Path to the X.509 proxy / SciToken file | (empty) |
| --cadir | Directory with trusted CA certificates | (empty) |
| --certfile | Path to the SSL client certificate | (empty) |
| --keyfile | Path to the SSL private key | (empty) |
| --debug | HTCondor tool debug level (e.g. D_FULLDEBUG) | (empty) |
| --dummy-job | Submit a placeholder sleep job instead of the real workload | (false) |
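The flag table above could map onto a command-line parser along the following lines. This is a minimal sketch using Python's argparse, not the plugin's actual code: the function name build_parser and the choice to treat --dummy-job as a value-less switch are assumptions, and only the flag names and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="InterLink HTCondor sidecar (sketch)")
    parser.add_argument("--port", type=int, default=8000,
                        help="TCP port the Flask server listens on")
    # String-valued flags that default to empty, as listed in the table.
    for flag in ("--schedd-host", "--schedd-name", "--collector-host",
                 "--condor-config", "--auth-method", "--proxy",
                 "--cadir", "--certfile", "--keyfile", "--debug"):
        parser.add_argument(flag, default="")
    # Assumption: --dummy-job is a switch that takes no value.
    parser.add_argument("--dummy-job", action="store_true")
    return parser


# Parse the same flags used in the start-up command shown earlier.
args = build_parser().parse_args(
    ["--schedd-host", "scheduler.example.com",
     "--collector-host", "collector.example.com",
     "--port", "4000"]
)
```

Flags that are not passed keep the defaults from the table, so args.auth_method stays empty and args.dummy_job stays False in this example.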
Once running you should see output similar to:
* Running on http://0.0.0.0:4000
The server exposes five REST endpoints:
| Method | Path | Description |
|---|---|---|
| POST | /create | Submit a new pod as an HTCondor job |
| POST | /delete | Cancel and remove a running job |
| GET | /status | Query the current status of one or more pods |
| GET | /getLogs | Retrieve job stdout / stderr |
| GET | /system-info | Report HTCondor cluster connectivity |
The examples below target localhost:4000. Adjust the port if you used a different --port value.
Send an empty status array to confirm the server is alive and the proxy file is accessible:
curl -s -X GET http://localhost:4000/status \
-H "Content-Type: application/json" \
-d '[]' | python3 -m json.tool

Expected response (healthy):
{
"message": "HTCondor sidecar is alive",
"status": "healthy"
}

The request body follows the InterLink API v0.6.1 RetrievedPodData format: a pod object (standard Kubernetes pod spec) and a container array with resolved volumes.
curl -s -X POST http://localhost:4000/create \
-H "Content-Type: application/json" \
-d '{
"pod": {
"metadata": {
"name": "test-pod",
"namespace": "default",
"uid": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
},
"spec": {
"containers": [
{
"name": "busybox",
"image": "docker://busybox:latest",
"command": ["sleep", "30"],
"resources": {
"requests": {
"cpu": "1",
"memory": "100Mi"
}
}
}
]
}
},
"container": []
}' | python3 -m json.tool

On success (HTTP 200) the plugin returns the HTCondor cluster job ID:
{
"PodUID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"PodJID": "12345"
}

Note: The PodJID value (12345 in the example) is the HTCondor cluster ID. You can verify the job with condor_q 12345.
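For scripted testing, the /create call can also be assembled in Python instead of curl. The sketch below uses only the standard library and builds the request without sending it; the helper names create_payload and build_request are hypothetical, and the payload simply mirrors the pod/container layout of the curl example.

```python
import json
import urllib.request


def create_payload(name: str, namespace: str, uid: str, containers: list) -> dict:
    """Assemble the pod/container body used by the /create example."""
    return {
        "pod": {
            "metadata": {"name": name, "namespace": namespace, "uid": uid},
            "spec": {"containers": containers},
        },
        "container": [],  # resolved volumes; empty as in the curl example
    }


def build_request(base_url: str, path: str, method: str, body) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the sidecar API."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


payload = create_payload(
    "test-pod", "default", "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    [{"name": "busybox", "image": "docker://busybox:latest",
      "command": ["sleep", "30"]}],
)
req = build_request("http://localhost:4000", "/create", "POST", payload)
# With the server running, send it via urllib.request.urlopen(req).
```

Keeping the request construction separate from the network call makes it possible to inspect the body before anything is submitted to HTCondor.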
Use the same pod metadata to query the current state:
curl -s -X GET http://localhost:4000/status \
-H "Content-Type: application/json" \
-d '[{
"metadata": {
"name": "test-pod",
"namespace": "default",
"uid": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
},
"spec": {
"containers": [
{"name": "busybox", "image": "docker://busybox:latest"}
]
}
}]' | python3 -m json.tool

Example response while the job is running:
[
{
"name": "test-pod",
"UID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"namespace": "default",
"JID": "12345",
"containers": [
{
"name": "busybox",
"state": {"running": {"startedAt": "2024-01-15T10:30:00Z"}},
"ready": true,
"restartCount": 0,
"image": "docker://busybox:latest",
"imageID": "docker://busybox:latest"
}
],
"initContainers": []
}
]

HTCondor job status codes map to Kubernetes container states as follows:
| HTCondor status | Code | Kubernetes state |
|---|---|---|
| Idle (queued) | 1 | waiting — ContainerCreating |
| Running | 2 | running |
| Completed | 4 | terminated — Completed |
| Removed | 3 | terminated — Cancelled |
| Held | 5 | waiting — JobHeld |
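The mapping in the table can be expressed as a small lookup. The following is an illustrative sketch rather than the plugin's implementation: the names CONDOR_TO_K8S and container_state are hypothetical, and the fallback for codes not listed in the table is an assumption.

```python
# HTCondor JobStatus code -> (Kubernetes state, reason), copied from the table.
CONDOR_TO_K8S = {
    1: ("waiting", "ContainerCreating"),  # Idle (queued)
    2: ("running", None),                 # Running
    3: ("terminated", "Cancelled"),       # Removed
    4: ("terminated", "Completed"),       # Completed
    5: ("waiting", "JobHeld"),            # Held
}


def container_state(job_status: int, started_at: str = "") -> dict:
    """Return a Kubernetes-style container state for an HTCondor status code."""
    if job_status not in CONDOR_TO_K8S:
        # Assumption: codes outside the table are reported as waiting/Unknown.
        return {"waiting": {"reason": "Unknown"}}
    state, reason = CONDOR_TO_K8S[job_status]
    if state == "running":
        return {"running": {"startedAt": started_at}}
    return {state: {"reason": reason}}
```

For example, container_state(2, "2024-01-15T10:30:00Z") produces the same state object as the running response shown earlier.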
The /system-info endpoint returns the cluster connectivity status using condor_status -totals:
curl -s http://localhost:4000/system-info | python3 -m json.tool

Example response when the collector is reachable:
{
"status": "ok",
"timestamp": "2024-01-15T10:30:00Z",
"htcondor_connected": true,
"condor_status_output": "Total Owner Claimed Unclaimed Matched Preempting Drain\n 10 0 8 2 0 0 0\n"
}

If the collector is unreachable, htcondor_connected is false and status is "warning".
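A monitoring script might reduce this response to a single yes/no answer. The sketch below assumes only the two fields described above (status and htcondor_connected); the function name cluster_reachable is hypothetical.

```python
import json


def cluster_reachable(system_info: dict) -> bool:
    """True only when the response reports a reachable collector."""
    return (system_info.get("status") == "ok"
            and system_info.get("htcondor_connected") is True)


# Trimmed-down versions of the healthy and unreachable responses.
healthy = json.loads('{"status": "ok", "htcondor_connected": true}')
unreachable = json.loads('{"status": "warning", "htcondor_connected": false}')
```

Here cluster_reachable(healthy) is True and cluster_reachable(unreachable) is False; a missing field is treated as unreachable.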
curl -s -X POST http://localhost:4000/delete \
-H "Content-Type: application/json" \
-d '{
"metadata": {
"name": "test-pod",
"namespace": "default",
"uid": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}' | python3 -m json.tool

Expected response (HTTP 200):
{
"message": "Pod successfully deleted",
"podUID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"podName": "test-pod"
}

To set up a full InterLink deployment with a Virtual Kubelet node, follow the official guide: 👉 InterLink in-cluster setup
Once your Virtual Kubelet node is registered and ready, you can test end-to-end by applying the
manifests in the tests/ directory:
# Apply supporting resources first
kubectl apply -f ./tests/test_configmap.yaml
kubectl apply -f ./tests/test_secret.yaml
# Submit the test pod (mounts the ConfigMap and Secret via volume binds)
kubectl apply -f ./tests/busyecho_k8s.yaml
# Watch the pod status
kubectl get pod test-pod -w

The test pod runs sleep 10 inside a busybox Singularity container and mounts both the ConfigMap and Secret as volumes.
When a pod spec contains more than one container, each container runs concurrently in the background
via the runCtn/waitCtns/endScript helper pattern (matching the SLURM plugin). The HTCondor job
exits with the highest exit code of all containers. Per-container stdout/stderr is saved to
${workingPath}/run-<container-name>.out.
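The plugin implements this pattern in the generated bash job script. Purely as an illustration of the same semantics (start everything in the background, wait for all, exit with the highest code), here is a Python sketch; run_containers is a hypothetical name and the two Python one-liners stand in for real container commands.

```python
import subprocess
import sys


def run_containers(commands: list) -> int:
    """Start every command in the background, wait for all of them,
    and return the highest exit code."""
    procs = [subprocess.Popen(cmd) for cmd in commands]  # runCtn analogue
    exit_codes = [proc.wait() for proc in procs]         # waitCtns analogue
    return max(exit_codes, default=0)                    # endScript analogue


# Stand-in "containers": two one-liners exiting with 0 and 3.
job_exit = run_containers([
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
])
print(job_exit)  # 3
```

Taking the maximum means a single failing container is enough to mark the whole job as failed, even if every other container succeeded.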
livenessProbe, readinessProbe, and startupProbe specs are translated to bash scripts and run
in background sub-shells alongside the container. Supported probe types:
| Probe type | Implementation |
|---|---|
| httpGet | curl -sf http://localhost:<port><path> |
| exec | singularity exec <image> <command> |
| tcpSocket | Not supported (warning emitted, probe skipped) |
Example pod spec with probes:
containers:
  - name: my-app
    image: docker://myimage:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 5

Singularity options for exec probes can be set via the slurm-job.vk.io/singularity-options pod annotation.
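The probe translation described in the table can be sketched as a function that turns one probe spec into the shell command it would run. This is an assumption-laden illustration, not the plugin's code: probe_command is a hypothetical name, the default path of / is an assumption, and timing fields such as initialDelaySeconds and periodSeconds are deliberately ignored.

```python
import shlex
import warnings
from typing import Optional


def probe_command(probe: dict, image: str,
                  singularity_options: str = "") -> Optional[str]:
    """Translate one probe spec into a shell command, or None if unsupported."""
    if "httpGet" in probe:
        http = probe["httpGet"]
        # Assumption: a missing path defaults to "/".
        return f"curl -sf http://localhost:{http['port']}{http.get('path', '/')}"
    if "exec" in probe:
        cmd = " ".join(shlex.quote(part) for part in probe["exec"]["command"])
        opts = f"{singularity_options} " if singularity_options else ""
        return f"singularity exec {opts}{image} {cmd}"
    if "tcpSocket" in probe:
        warnings.warn("tcpSocket probes are not supported; skipping")
    return None


liveness = probe_command({"httpGet": {"path": "/healthz", "port": 8080}},
                         "docker://myimage:latest")
readiness = probe_command({"exec": {"command": ["cat", "/tmp/ready"]}},
                          "docker://myimage:latest")
```

With the example pod spec, liveness becomes a curl call against port 8080 and readiness becomes a singularity exec of cat /tmp/ready inside the same image.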
A special execution mode is triggered when the container image name starts with the literal string
host. In that case the plugin extracts the script from the container arguments and submits it
directly — bypassing Singularity entirely. This is useful for site-specific scripts that must run
in the host environment:
kubectl apply -f ./tests/production_deployment_LNL.yaml

A pre-built image based on htcondor/mini:23.0.25-el9 is available via the Dockerfile:
# Build the image
docker build -t interlink-htcondor-plugin -f docker/Dockerfile .
# Run the container (adjust environment variables for your site)
docker run --rm -p 4000:8000 \
-v /etc/grid-security:/etc/grid-security:ro \
-v /tmp:/tmp \
interlink-htcondor-plugin

Inside the container the plugin starts automatically (python3 handles.py) alongside the HTCondor mini-schedd (/start.sh).
| Symptom | Likely cause | Fix |
|---|---|---|
| condor_submit: command not found | HTCondor CLI not on $PATH | Install HTCondor or add its bin/ to $PATH |
| HTTP 503 on health check | Proxy file missing or expired | For GSI: run voms-proxy-init; for SciTokens: refresh the JWT. Pass the path via --proxy |
| HTTP 500 on /create | Job submission to HTCondor failed | Check err/ and log/ directories for HTCondor error output |
| FileNotFoundError: .interlink/… | Required directories missing | Run mkdir -p .interlink out err log |
| Pod stuck in ContainerCreating | HTCondor job is queued (Idle) | Run condor_q to inspect the queue; check for hold reasons |
| Flask import error | Python dependencies not installed | Run pip3 install flask pyyaml |
| /system-info returns htcondor_connected: false | Collector unreachable | Check HTCondor collector host/port settings; verify condor_status works from the shell |
| Probe type tcpSocket not working | Not supported | Use httpGet or exec probe types; tcpSocket probes are skipped with a warning |
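Two of the failures in the table (missing HTCondor tools and missing working directories) can be caught before the server is started. The following pre-flight sketch is not part of the plugin; the function names are hypothetical, while the tool and directory names come from the prerequisites and setup steps in this README.

```python
import os
import shutil


def missing_tools(tools=("condor_submit", "condor_q", "condor_rm")) -> list:
    """Return the HTCondor CLI tools that are not on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_dirs(base: str = ".",
                 names=(".interlink", "out", "err", "log")) -> list:
    """Return the working directories that do not exist under base."""
    return [name for name in names
            if not os.path.isdir(os.path.join(base, name))]


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"missing tool: {tool}")
    for name in missing_dirs():
        print(f"missing directory: {name}")
```

Run it from the repository root; no output means both checks passed.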
By default the plugin connects to HTCondor without any special authentication (suitable for local or
development clusters). For production deployments that require authentication, pass --auth-method
and the relevant credential flags.
python3 handles.py \
--schedd-host scheduler.example.com \
--collector-host collector.example.com \
--auth-method GSI \
--proxy /tmp/x509up_u$(id -u) \
--port 4000

Certificates must be in /etc/grid-security/certificates and a valid proxy must exist at the path provided to --proxy. Generate a proxy with:
voms-proxy-init --voms <your-vo>
# or
grid-proxy-init

python3 handles.py \
--schedd-host scheduler.example.com \
--collector-host collector.example.com \
--auth-method SCITOKENS \
--proxy /path/to/scitoken.jwt \
--port 4000

Pass the SciToken JWT file via --proxy.

python3 handles.py \
--schedd-host scheduler.example.com \
--collector-host collector.example.com \
--auth-method SSL \
--certfile /path/to/client.crt \
--keyfile /path/to/client.key \
--cadir /etc/grid-security/certificates \
--port 4000

All changes should go through Pull Requests.
- Only squash should be enforced in the repository settings.
- Update commit message for the squashed commits as needed.
Branch protection rules, to be configured in the repository settings:
- Require pull request reviews before merging
  - Dismiss stale pull request approvals when new commits are pushed
  - Require review from Code Owners
- Require status checks to pass before merging
  - GitHub Actions if available
  - Other checks as available and relevant
  - Require branches to be up to date before merging
- Include administrators