WebSocket communication plan #527

@dciangot

Goal

Add WebSocket transport for InterLink server ↔ plugin communication while keeping the existing REST API working.

Phase 0 — Confirm the desired direction and roles

Recommended topology: plugin connects to server.

  • Plugin (apptainer) opens wss://interlink-server/ws
  • Keeps one long-lived connection
  • Server can send requests to plugin (bidirectional control) and plugin can push events back without polling

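This typically works better with NAT/firewalls than server-initiated connections, because the plugin only needs outbound connectivity and no inbound port has to be opened at the plugin site.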


Phase 1 — Introduce a transport layer abstraction (server + plugin)

1A) Define a common internal “service interface”

On the server, refactor REST handlers so they call a set of internal methods like:

  • CreatePod(...)
  • DeletePod(...)
  • GetPodStatus(...)
  • GetLogs(...) (even if currently implemented as polling)

The key point: HTTP handlers become thin adapters over these methods, so a WS dispatcher can later call exactly the same code.
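A minimal Go sketch of this refactor follows. Everything in it is illustrative: `PodService`, `PodSpec`, `PodStatus`, and the in-memory `memService` are hypothetical names, not InterLink's actual types, and the `/v1/pods` route is borrowed from the envelope example in this plan. The point is only that the HTTP handler decodes, delegates, and encodes, with no business logic of its own.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// PodSpec and PodStatus are placeholder payloads (assumptions for this sketch).
type PodSpec struct {
	Name string `json:"name"`
}

type PodStatus struct {
	ID    string `json:"id"`
	Phase string `json:"phase"`
}

// PodService is a hypothetical transport-neutral service interface.
type PodService interface {
	CreatePod(ctx context.Context, spec PodSpec) (PodStatus, error)
	GetPodStatus(ctx context.Context, id string) (PodStatus, error)
}

// memService is an in-memory stand-in used only to make the demo runnable.
type memService struct{ pods map[string]PodStatus }

func (m *memService) CreatePod(_ context.Context, spec PodSpec) (PodStatus, error) {
	if spec.Name == "" {
		return PodStatus{}, errors.New("pod name is required")
	}
	st := PodStatus{ID: spec.Name, Phase: "Pending"}
	m.pods[st.ID] = st
	return st, nil
}

func (m *memService) GetPodStatus(_ context.Context, id string) (PodStatus, error) {
	st, ok := m.pods[id]
	if !ok {
		return PodStatus{}, errors.New("pod not found")
	}
	return st, nil
}

// createPodHandler is the thin adapter: decode, call the service, encode.
func createPodHandler(svc PodService) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var spec PodSpec
		if err := json.NewDecoder(r.Body).Decode(&spec); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		st, err := svc.CreatePod(r.Context(), spec)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(st)
	}
}

// postPod drives the handler in-process and returns status code and body.
func postPod(svc PodService, body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/pods", strings.NewReader(body))
	createPodHandler(svc)(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	svc := &memService{pods: map[string]PodStatus{}}
	code, body := postPod(svc, `{"name":"demo"}`)
	fmt.Println(code, body)
}
```

Because the handler depends only on the interface, a WS dispatcher can be handed the same `PodService` value without touching the REST path.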

1B) Define a transport-neutral message envelope

Start with JSON (easy to debug). Example:

{
  "v": 1,
  "id": "uuid-or-monotonic",
  "kind": "request",
  "op": "POST",
  "path": "/v1/pods",
  "headers": { "authorization": "Bearer …" },
  "body": { "...": "..." },
  "deadline_ms": 30000
}

Response:

{
  "v": 1,
  "id": "same-id",
  "kind": "response",
  "status": 200,
  "body": { "...": "..." },
  "error": null
}

And define a few control messages:

  • hello / welcome (handshake + protocol version)
  • ping / pong (keepalive)
  • register (plugin type, capabilities, plugin instance ID)
  • optionally, event for asynchronous notifications in either direction (server→plugin or plugin→server), to be added later

This is the “REST-over-WS” bridge that avoids rewriting all payloads immediately.
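The envelope above maps naturally onto a single Go struct. The sketch below is one possible shape, not a settled schema: the type name `Envelope`, the helpers `Decode` and `NewResponse`, and the choice to reject any version other than 1 are assumptions made for illustration. `body` is kept as `json.RawMessage` so existing REST payload structs can be decoded later without change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProtocolVersion matches the "v": 1 field in the example envelope.
const ProtocolVersion = 1

// Envelope covers both request and response messages (hypothetical type).
type Envelope struct {
	V          int               `json:"v"`
	ID         string            `json:"id"`
	Kind       string            `json:"kind"`
	Op         string            `json:"op,omitempty"`
	Path       string            `json:"path,omitempty"`
	Headers    map[string]string `json:"headers,omitempty"`
	Status     int               `json:"status,omitempty"`
	Body       json.RawMessage   `json:"body,omitempty"`
	Error      *string           `json:"error,omitempty"`
	DeadlineMS int64             `json:"deadline_ms,omitempty"`
}

// Decode parses one message and applies minimal validation.
// Unknown JSON fields are ignored, which is encoding/json's default.
func Decode(data []byte) (Envelope, error) {
	var e Envelope
	if err := json.Unmarshal(data, &e); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	if e.V != ProtocolVersion {
		return Envelope{}, fmt.Errorf("unsupported protocol version %d", e.V)
	}
	if e.ID == "" || e.Kind == "" {
		return Envelope{}, fmt.Errorf("envelope is missing id or kind")
	}
	return e, nil
}

// NewResponse builds a response that echoes the request id.
func NewResponse(req Envelope, status int, body any) (Envelope, error) {
	raw, err := json.Marshal(body)
	if err != nil {
		return Envelope{}, fmt.Errorf("encode response body: %w", err)
	}
	return Envelope{V: ProtocolVersion, ID: req.ID, Kind: "response", Status: status, Body: raw}, nil
}

func main() {
	in := []byte(`{"v":1,"id":"abc","kind":"request","op":"POST","path":"/v1/pods","body":{"name":"demo"},"deadline_ms":30000}`)
	req, err := Decode(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	resp, err := NewResponse(req, 200, map[string]string{"result": "ok"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := json.Marshal(resp)
	fmt.Println(string(out))
}
```

With `omitempty` on `Error`, a successful response omits the field rather than sending `"error": null`; either convention works as long as both sides agree.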


Phase 2 — WebSocket endpoint on the InterLink server

2A) Add WS endpoint

Expose a new endpoint (example):

  • GET /ws (HTTP upgrade)

On upgrade:

  1. Authenticate (see below)
  2. Read register message (plugin identifies itself: apptainer, slurm, etc.)
  3. Store the connection in a connection manager keyed by plugin instance ID
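Step 3 can be sketched as a small mutex-guarded map. The names `ConnManager`, `Registration`, and the `Conn` interface are hypothetical; `Conn` stands in for whatever WebSocket library is chosen. Closing the previous connection when the same instance ID registers again is one possible policy, assumed here because a reconnecting plugin would otherwise leave a stale entry behind.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Conn is a minimal, hypothetical view of a WebSocket connection.
// Implementations should be pointer types so values are comparable.
type Conn interface {
	Send(msg []byte) error
	Close() error
}

// Registration carries what the plugin sends in its register message.
type Registration struct {
	InstanceID string
	PluginType string
}

// ConnManager tracks live connections keyed by plugin instance ID.
type ConnManager struct {
	mu    sync.RWMutex
	conns map[string]Conn
}

func NewConnManager() *ConnManager {
	return &ConnManager{conns: map[string]Conn{}}
}

// Register stores c and closes any previous connection for the same ID.
func (m *ConnManager) Register(reg Registration, c Conn) error {
	if reg.InstanceID == "" {
		return errors.New("register: empty plugin instance ID")
	}
	m.mu.Lock()
	old := m.conns[reg.InstanceID]
	m.conns[reg.InstanceID] = c
	m.mu.Unlock()
	if old != nil && old != c {
		_ = old.Close()
	}
	return nil
}

// Unregister removes the entry only if it still points at c, so a late
// cleanup of an old connection cannot evict its replacement.
func (m *ConnManager) Unregister(id string, c Conn) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conns[id] == c {
		delete(m.conns, id)
	}
}

func (m *ConnManager) Get(id string) (Conn, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.conns[id]
	return c, ok
}

func (m *ConnManager) Count() int {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return len(m.conns)
}

// fakeConn is a test double; it only records whether Close was called.
type fakeConn struct{ closed bool }

func (f *fakeConn) Send([]byte) error { return nil }
func (f *fakeConn) Close() error      { f.closed = true; return nil }

func main() {
	m := NewConnManager()
	old, cur := &fakeConn{}, &fakeConn{}
	reg := Registration{InstanceID: "apptainer-1", PluginType: "apptainer"}
	_ = m.Register(reg, old)
	_ = m.Register(reg, cur)        // reconnect replaces the old entry
	m.Unregister("apptainer-1", old) // stale cleanup is a no-op
	fmt.Println(m.Count(), old.closed, cur.closed)
}
```

`Count` also gives the "connected plugins" figure that a metrics endpoint could expose.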

2B) Authentication options

Pick one (in order of simplicity):

  1. Bearer token in Authorization header during WS upgrade
  2. Token as query param (works, but less clean: URLs tend to end up in access logs and proxy logs)
  3. Mutual TLS if you already have that model

For parity with REST, keep the same token validation logic.
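Option 1 can be sketched as ordinary `net/http` middleware that runs before the upgrade. This is an assumption-laden sketch: `requireToken`, `bearerToken`, and `validToken` are hypothetical helpers, and the static-token comparison is only a placeholder for whatever validation the REST path already performs. The wrapped handler below returns 200 where a real server would perform the WebSocket upgrade.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// bearerToken extracts the token from an "Authorization: Bearer ..." header.
func bearerToken(r *http.Request) (string, bool) {
	const prefix = "Bearer "
	h := r.Header.Get("Authorization")
	if len(h) <= len(prefix) || !strings.EqualFold(h[:len(prefix)], prefix) {
		return "", false
	}
	return strings.TrimSpace(h[len(prefix):]), true
}

// validToken is a placeholder check against one static token.
// A real server would call its existing REST validation here instead.
func validToken(got, want string) bool {
	return want != "" && subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// requireToken rejects the request before next (the upgrade) ever runs.
func requireToken(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tok, ok := bearerToken(r)
		if !ok || !validToken(tok, want) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-process GET /ws with the given Authorization header.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, "/ws", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// demoHandler wraps a stand-in for the upgrade handler.
func demoHandler() http.Handler {
	upgrade := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // a real handler would upgrade here
	})
	return requireToken("s3cret", upgrade)
}

func main() {
	h := demoHandler()
	fmt.Println(statusFor(h, "Bearer s3cret"), statusFor(h, "Bearer wrong"), statusFor(h, ""))
}
```

Rejecting before the upgrade keeps unauthenticated clients from ever holding an open socket.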

2C) Routing incoming WS requests

Implement a dispatcher that:

  • parses the envelope
  • maps op+path to the same internal handler logic you use for REST
  • returns a response message with the same id

This allows you to reuse existing request structs and validation.
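One way to sketch such a dispatcher is a map from "op path" to a handler function. The names `Dispatcher`, `HandlerFunc`, and `fail` are hypothetical, the envelope struct is trimmed to the fields used here, and routes are matched by exact string, so paths with parameters would need the server's real router. The example handler is a stand-in for the shared internal logic.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// Envelope is a trimmed version of the message envelope (assumed shape).
type Envelope struct {
	V          int             `json:"v"`
	ID         string          `json:"id"`
	Kind       string          `json:"kind"`
	Op         string          `json:"op,omitempty"`
	Path       string          `json:"path,omitempty"`
	Status     int             `json:"status,omitempty"`
	Body       json.RawMessage `json:"body,omitempty"`
	Error      string          `json:"error,omitempty"`
	DeadlineMS int64           `json:"deadline_ms,omitempty"`
}

// HandlerFunc is the transport-neutral handler signature (hypothetical).
type HandlerFunc func(ctx context.Context, body json.RawMessage) (status int, resp any, err error)

type Dispatcher struct{ routes map[string]HandlerFunc }

func NewDispatcher() *Dispatcher { return &Dispatcher{routes: map[string]HandlerFunc{}} }

func (d *Dispatcher) Handle(op, path string, h HandlerFunc) { d.routes[op+" "+path] = h }

// Dispatch parses one request and returns a response with the same id.
func (d *Dispatcher) Dispatch(ctx context.Context, raw []byte) Envelope {
	var req Envelope
	if err := json.Unmarshal(raw, &req); err != nil {
		return fail("", 400, "malformed envelope: "+err.Error())
	}
	h, ok := d.routes[req.Op+" "+req.Path]
	if !ok {
		return fail(req.ID, 404, "no route for "+req.Op+" "+req.Path)
	}
	if req.DeadlineMS > 0 { // honour the per-request deadline
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(req.DeadlineMS)*time.Millisecond)
		defer cancel()
	}
	status, out, err := h(ctx, req.Body)
	if err != nil {
		if status == 0 {
			status = 500
		}
		return fail(req.ID, status, err.Error())
	}
	body, err := json.Marshal(out)
	if err != nil {
		return fail(req.ID, 500, err.Error())
	}
	return Envelope{V: 1, ID: req.ID, Kind: "response", Status: status, Body: body}
}

func fail(id string, status int, msg string) Envelope {
	return Envelope{V: 1, ID: id, Kind: "response", Status: status, Error: msg}
}

// demoDispatcher registers one stand-in handler for POST /v1/pods.
func demoDispatcher() *Dispatcher {
	d := NewDispatcher()
	d.Handle("POST", "/v1/pods", func(_ context.Context, body json.RawMessage) (int, any, error) {
		var spec struct {
			Name string `json:"name"`
		}
		if err := json.Unmarshal(body, &spec); err != nil {
			return 400, nil, err
		}
		return 201, map[string]string{"id": spec.Name}, nil
	})
	return d
}

func dispatchString(d *Dispatcher, raw string) Envelope {
	return d.Dispatch(context.Background(), []byte(raw))
}

func main() {
	resp := dispatchString(demoDispatcher(),
		`{"v":1,"id":"42","kind":"request","op":"POST","path":"/v1/pods","body":{"name":"demo"}}`)
	out, _ := json.Marshal(resp)
	fmt.Println(string(out))
}
```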


Phase 3 — Plugin client (interlink-plugin-apptainer) changes

3A) Implement Transport interface in the plugin

Create an interface like:

  • Do(ctx, method, path, body) -> (status, respBody, err)

Implementations:

  • HTTPTransport (existing)
  • WSTransport (new)

The plugin chooses based on config:

  • INTERLINK_TRANSPORT=ws|http|auto
  • INTERLINK_WS_URL=wss://.../ws
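The interface and the config switch might look like the sketch below. `INTERLINK_TRANSPORT` and `INTERLINK_WS_URL` come from this plan; `INTERLINK_URL` for the HTTP base URL, the `NewTransport` constructor, and the stubbed `WSTransport` are assumptions. The `auto` value is deliberately rejected here because it needs a probe-and-fallback step that this sketch does not cover.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Transport is the plugin-side abstraction described above.
type Transport interface {
	Do(ctx context.Context, method, path string, body []byte) (status int, respBody []byte, err error)
}

// HTTPTransport wraps the existing REST client behaviour.
type HTTPTransport struct {
	BaseURL string
	Client  *http.Client
}

func (t *HTTPTransport) Do(ctx context.Context, method, path string, body []byte) (int, []byte, error) {
	req, err := http.NewRequestWithContext(ctx, method, t.BaseURL+path, bytes.NewReader(body))
	if err != nil {
		return 0, nil, err
	}
	resp, err := t.Client.Do(req)
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	return resp.StatusCode, data, err
}

// WSTransport is a stub; the real one would hold the long-lived connection.
type WSTransport struct{ URL string }

func (t *WSTransport) Do(context.Context, string, string, []byte) (int, []byte, error) {
	return 0, nil, errors.New("WSTransport: not implemented in this sketch")
}

// NewTransport picks an implementation from the configured mode.
// An empty mode means HTTP, so existing deployments keep working.
func NewTransport(mode, httpURL, wsURL string) (Transport, error) {
	switch mode {
	case "", "http":
		return &HTTPTransport{BaseURL: httpURL, Client: http.DefaultClient}, nil
	case "ws":
		if wsURL == "" {
			return nil, errors.New("INTERLINK_WS_URL must be set when INTERLINK_TRANSPORT=ws")
		}
		return &WSTransport{URL: wsURL}, nil
	case "auto":
		return nil, errors.New("auto needs a WS probe with HTTP fallback, not covered in this sketch")
	default:
		return nil, fmt.Errorf("unknown INTERLINK_TRANSPORT %q", mode)
	}
}

func main() {
	// INTERLINK_URL is a hypothetical variable name for the HTTP base URL.
	tr, err := NewTransport(os.Getenv("INTERLINK_TRANSPORT"), os.Getenv("INTERLINK_URL"), os.Getenv("INTERLINK_WS_URL"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("selected transport: %T\n", tr)
}
```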

3B) WS client features to implement

Minimum viable:

  • connect + handshake/register
  • request/response correlation via id
  • timeouts
  • reconnect (exponential backoff)
  • on reconnect: re-register
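The reconnect and re-register items can be sketched without any WebSocket library by passing `dial` and `register` in as functions. The names `backoffDelay` and `connectWithRetry`, the 100 ms base delay, and the 5 s cap are illustrative assumptions. The `sleep` function is injected so the demo runs instantly; production code would pass `time.Sleep` and would normally add random jitter so that many plugins do not reconnect in lockstep.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns base * 2^attempt, capped at limit.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// connectWithRetry dials and then registers, backing off between attempts.
// register runs after every successful dial, which covers "on reconnect:
// re-register". It returns the number of attempts used.
func connectWithRetry(maxAttempts int, dial, register func() error, sleep func(time.Duration)) (int, error) {
	lastErr := errors.New("no attempts made")
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			sleep(backoffDelay(attempt-1, 100*time.Millisecond, 5*time.Second))
		}
		if lastErr = dial(); lastErr != nil {
			continue
		}
		if lastErr = register(); lastErr != nil {
			continue
		}
		return attempt + 1, nil
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// demoReconnect simulates two refused connections followed by success and
// records the delays instead of actually sleeping.
func demoReconnect() (attempts, registers int, delays []time.Duration, err error) {
	dials := 0
	dial := func() error {
		dials++
		if dials < 3 {
			return errors.New("connection refused")
		}
		return nil
	}
	register := func() error { registers++; return nil }
	sleep := func(d time.Duration) { delays = append(delays, d) }
	attempts, err = connectWithRetry(5, dial, register, sleep)
	return
}

func main() {
	attempts, registers, delays, err := demoReconnect()
	fmt.Println(attempts, registers, delays, err)
}
```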

Edge-case policy (pick one):

  • fail in-flight requests on disconnect (simpler)
  • or retry idempotent ones
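Request/response correlation and the simpler of the two policies (fail in-flight requests on disconnect) fit in one small structure. In the sketch below, `Pending`, `Response`, and `ErrDisconnected` are hypothetical names, and IDs are a monotonic counter rather than UUIDs. Each waiter gets a channel with a buffer of one, so the read loop never blocks when it resolves or fails a request.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strconv"
	"sync"
)

// ErrDisconnected is delivered to every in-flight request on disconnect.
var ErrDisconnected = errors.New("connection lost")

type Response struct {
	ID     string
	Status int
	Body   []byte
}

type result struct {
	resp Response
	err  error
}

// Pending tracks in-flight requests by id.
type Pending struct {
	mu      sync.Mutex
	next    uint64
	waiters map[string]chan result
}

func NewPending() *Pending { return &Pending{waiters: map[string]chan result{}} }

// Add allocates a fresh id and the channel its response will arrive on.
func (p *Pending) Add() (string, chan result) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.next++
	id := strconv.FormatUint(p.next, 10)
	ch := make(chan result, 1)
	p.waiters[id] = ch
	return id, ch
}

// Resolve is called by the read loop; it reports whether the id was known.
func (p *Pending) Resolve(resp Response) bool {
	p.mu.Lock()
	ch, ok := p.waiters[resp.ID]
	delete(p.waiters, resp.ID)
	p.mu.Unlock()
	if ok {
		ch <- result{resp: resp}
	}
	return ok
}

// FailAll implements "fail in-flight requests on disconnect".
func (p *Pending) FailAll(err error) {
	p.mu.Lock()
	ws := p.waiters
	p.waiters = map[string]chan result{}
	p.mu.Unlock()
	for _, ch := range ws {
		ch <- result{err: err}
	}
}

// Wait blocks until the response arrives or ctx expires (the timeout case).
func (p *Pending) Wait(ctx context.Context, id string, ch chan result) (Response, error) {
	select {
	case r := <-ch:
		return r.resp, r.err
	case <-ctx.Done():
		p.mu.Lock()
		delete(p.waiters, id)
		p.mu.Unlock()
		return Response{}, ctx.Err()
	}
}

func main() {
	p := NewPending()
	id, ch := p.Add()
	go p.Resolve(Response{ID: id, Status: 200}) // stands in for the read loop
	resp, err := p.Wait(context.Background(), id, ch)
	fmt.Println(resp.Status, err)

	id2, ch2 := p.Add()
	p.FailAll(ErrDisconnected)
	_, err = p.Wait(context.Background(), id2, ch2)
	fmt.Println(err)
}
```

Retrying idempotent requests instead would mean keeping the original message alongside each waiter and resending it after re-register, which is why failing fast is the simpler starting point.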

Phase 4 — Dual-stack rollout and feature flags

4A) Server

  • Keep existing REST endpoints unchanged
  • Add WS in parallel
  • Add metrics: connected plugins, reconnect count, WS request latency

4B) Plugin

  • Default stays HTTP
  • Enable WS only in testing environments first
  • Add verbose logging of WS connect/register/errors

4C) “Auto” mode

auto tries WS first and falls back to HTTP if any of the following holds:

  • WS endpoint not reachable
  • handshake fails
  • server replies “unsupported”
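The three fallback conditions can be modelled as sentinel errors returned by a probe function. This is a sketch under assumptions: `SelectAuto`, `ProbeFunc`, and the three error values are hypothetical, and treating any other probe error as fatal (rather than silently falling back) is a design choice made here so that genuine misconfiguration stays visible.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type Transport interface {
	Do(ctx context.Context, method, path string, body []byte) (int, []byte, error)
}

// Sentinel errors for the three fallback conditions (hypothetical names).
var (
	ErrUnreachable = errors.New("WS endpoint not reachable")
	ErrHandshake   = errors.New("WS handshake failed")
	ErrUnsupported = errors.New("server replied unsupported")
)

// ProbeFunc tries to connect and handshake, returning a ready WS transport.
type ProbeFunc func(ctx context.Context) (Transport, error)

// SelectAuto tries WS first and falls back to HTTP only for the three
// listed conditions; any other error is returned to the caller.
func SelectAuto(ctx context.Context, probeWS ProbeFunc, fallback Transport, logf func(string, ...any)) (Transport, error) {
	ws, err := probeWS(ctx)
	switch {
	case err == nil:
		return ws, nil
	case errors.Is(err, ErrUnreachable), errors.Is(err, ErrHandshake), errors.Is(err, ErrUnsupported):
		logf("WS unavailable (%v), falling back to HTTP", err)
		return fallback, nil
	default:
		return nil, err
	}
}

// stubTransport is a test double that only carries a name.
type stubTransport struct{ name string }

func (s stubTransport) Do(context.Context, string, string, []byte) (int, []byte, error) {
	return 200, nil, nil
}

var errOther = errors.New("invalid configuration")

// autoName runs SelectAuto with a probe that fails with probeErr (or
// succeeds when probeErr is nil) and returns the chosen transport's name.
func autoName(probeErr error) (string, error) {
	probe := func(context.Context) (Transport, error) {
		if probeErr != nil {
			return nil, probeErr
		}
		return stubTransport{name: "ws"}, nil
	}
	tr, err := SelectAuto(context.Background(), probe, stubTransport{name: "http"}, func(string, ...any) {})
	if err != nil {
		return "", err
	}
	return tr.(stubTransport).name, nil
}

func main() {
	for _, e := range []error{nil, ErrUnreachable, ErrUnsupported, errOther} {
		name, err := autoName(e)
		fmt.Printf("probe error %v -> transport %q, err %v\n", e, name, err)
	}
}
```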

Phase 5 — Streaming enhancements (the real reason to use WS)

Once REST-over-WS is stable, you can add native streaming message types without breaking existing operations.

5A) Log streaming

Instead of repeated REST polling, add:

  • request: StreamLogs { podID, sinceTime, follow: true }
  • server sends multiple logChunk messages (same stream ID)
  • client can cancel stream with CancelStream { streamID }
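On the receiving side, log chunks need to be routed to the right consumer by stream ID. The sketch below shows one way to do that with a channel per stream. `Streams`, the JSON field names, and the `EOF` flag on `LogChunk` are assumptions; the plan above only names the message types. Dropping chunks when a consumer's buffer is full is a simplification that keeps the read loop from stalling, and a real implementation would need an explicit backpressure policy.

```go
package main

import (
	"fmt"
	"sync"
)

// StreamLogs mirrors the request fields listed above.
type StreamLogs struct {
	PodID     string `json:"podID"`
	SinceTime string `json:"sinceTime,omitempty"`
	Follow    bool   `json:"follow"`
}

// LogChunk is one piece of a stream; EOF is a hypothetical end marker.
type LogChunk struct {
	StreamID string `json:"streamID"`
	Data     string `json:"data"`
	EOF      bool   `json:"eof,omitempty"`
}

// Streams demultiplexes incoming chunks onto one channel per stream.
type Streams struct {
	mu   sync.Mutex
	subs map[string]chan LogChunk
}

func NewStreams() *Streams { return &Streams{subs: map[string]chan LogChunk{}} }

// Open registers a consumer for streamID with the given buffer size.
func (s *Streams) Open(streamID string, buf int) <-chan LogChunk {
	ch := make(chan LogChunk, buf)
	s.mu.Lock()
	s.subs[streamID] = ch
	s.mu.Unlock()
	return ch
}

// Deliver routes one chunk. It returns false if the stream is unknown or
// the consumer's buffer is full (the chunk is dropped in this sketch).
// The send happens under the lock so it cannot race with Cancel's close.
func (s *Streams) Deliver(c LogChunk) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch, ok := s.subs[c.StreamID]
	if !ok {
		return false
	}
	if c.EOF {
		close(ch)
		delete(s.subs, c.StreamID)
		return true
	}
	select {
	case ch <- c:
		return true
	default:
		return false
	}
}

// Cancel is the local half of CancelStream: it closes the consumer channel.
// The caller is still expected to send CancelStream to the peer.
func (s *Streams) Cancel(streamID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ch, ok := s.subs[streamID]; ok {
		close(ch)
		delete(s.subs, streamID)
	}
}

func main() {
	req := StreamLogs{PodID: "pod-a", Follow: true}
	fmt.Printf("request: %+v\n", req)

	s := NewStreams()
	ch := s.Open("s1", 8)
	s.Deliver(LogChunk{StreamID: "s1", Data: "line 1\n"})
	s.Deliver(LogChunk{StreamID: "s1", Data: "line 2\n"})
	s.Cancel("s1")
	for c := range ch { // drains buffered chunks, then ends on close
		fmt.Print(c.Data)
	}
}
```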

5B) Pod lifecycle events

Have plugin push events:

  • podStarted, podFinished, podFailed, etc.

This reduces server polling and improves UI responsiveness.

Phase 6 — Testing plan

  1. Protocol unit tests: encode/decode envelope, unknown fields, version mismatch
  2. Integration tests:
    • start server
    • connect WS client
    • run a set of “golden” calls that are currently REST calls and validate identical outputs
  3. Fault injection:
    • disconnect mid-request
    • server restart
    • auth failure
  4. Load tests: concurrent requests over one connection, or multiple connections

Phase 7 — Long-term cleanup

After a few releases:

  • make WS the default transport for capable plugins
  • keep REST for backward compatibility
  • optionally move from JSON → protobuf for performance and strict schemas

What I need from you to tailor this to InterLink exactly

To make this plan file-level actionable (exact packages, structs, endpoints), paste either:

  1. the list of current plugin-facing REST endpoints + where they’re implemented (paths), or
  2. links to the router/handler files in interlink-hq/interLink and the HTTP client code in interlink-plugin-apptainer.

If you share those, I can return:

  • a mapping table “REST endpoint → WS message op/path → internal method”
  • a minimal set of new Go packages/files to add
  • and a migration sequence that avoids breaking existing plugins.
