A Kubernetes operator for managing Flower Federated Learning federations.
- Overview
- Architecture
- Features
- Prerequisites
- Installation
- Quick Start
- Deployment Modes
- Isolation Modes
- SuperNode Pools
- GPU Configuration
- Node Configuration
- Configuration Reference
- Examples
- Usage Workflow
- Status and Monitoring
- Troubleshooting
- Uninstallation
- Development
- Contributing
- License
The flower-operator manages Flower Federated Learning federations on Kubernetes. It automates the deployment and lifecycle management of federated learning infrastructure, allowing you to focus on your ML workloads.
A Federation consists of:
- SuperLink: The control plane that coordinates federated learning runs
- SuperNodes: Client nodes that participate in training (organized into pools)
- SuperExec (optional): Processes that execute ServerApps and ClientApps
The operator provisions and manages this infrastructure, exposes a stable SuperLink endpoint, and lets users submit training jobs externally using flwr run.
For more information see the Flower Documentation.
Note: The operator does NOT start training jobs automatically. Training is triggered by users via the Flower CLI (flwr).
┌─────────────────────────────────────────────────────────────────┐
│ Federation CR │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ flower-operator │
│ (Controller Manager) │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌─────────────┐ ┌─────────────────────────┐
│ SuperLink │ │ Service │ │ SuperNode Pools │
│ Deployment │ │ (ClusterIP/ │ │ (DaemonSet or │
│ + SuperExec │ │ NodePort/ │ │ StatefulSet per pool) │
│ sidecar* │ │ LoadBal.) │ │ + SuperExec sidecars* │
└───────────────────┘ └─────────────┘ └─────────────────────────┘
* SuperExec sidecars are only created in "process" isolation mode
The controller reconciles a single Federation CR into:
- One SuperLink Deployment + Service
- One workload (DaemonSet or StatefulSet) per SuperNode pool
Resources are named deterministically:
- <federation>-superlink for SuperLink components
- <federation>-supernode-<pool> for each pool's workload
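For example, a Federation named my-federation with a single pool called default yields resources along these lines (an illustrative sketch; it assumes the flower.flwr.ai/federation label used later in this README):

```bash
# List everything the operator manages for one Federation
kubectl get deployment,daemonset,statefulset,service \
  -l flower.flwr.ai/federation=my-federation

# Expected names, following the convention above:
#   deployment.apps/my-federation-superlink
#   service/my-federation-superlink
#   statefulset.apps/my-federation-supernode-default   (a DaemonSet in DaemonSet mode)
```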
For more information see Flower Architecture.
| Feature | Description |
|---|---|
| Deployment Modes | DaemonSet (one pod per node) or StatefulSet (fixed replicas) |
| Isolation Modes | Process (sidecar containers) or subprocess (internal execution) |
| Multi-Pool Support | Heterogeneous hardware with different images, resources, and scheduling per pool |
| GPU Support | NVIDIA and AMD GPUs with per-pool configuration |
| Resource Management | Fine-grained resource requests/limits for all containers |
| Flexible Scheduling | nodeSelector, affinity, and tolerations via podTemplate |
| Service Exposure | ClusterIP, NodePort, or LoadBalancer for SuperLink |
| Status Reporting | Per-pool readiness, conditions, and SuperLink endpoint |
- Kubernetes v1.25+ (for CEL validation support)
- kubectl v1.11.3+
- Access to a Kubernetes cluster with cluster-admin privileges
- (Optional) GPU device plugins if using GPU features (NVIDIA or AMD)
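A quick way to sanity-check these prerequisites before installing (a sketch; the GPU check only applies if a device plugin is deployed):

```bash
# Kubernetes v1.25+ is required for CEL validation
kubectl version

# Confirm cluster-admin level access
kubectl auth can-i '*' '*' --all-namespaces

# Optional: verify that GPU extended resources are advertised by the device plugin
kubectl describe nodes | grep -E 'nvidia.com/gpu|amd.com/gpu'
```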
Install the operator using Helm:
# Install using Helm
helm install flower-operator ./dist/chart/ \
--namespace flower-operator-system \
--create-namespace \
--set manager.image.repository=<registry>/flower-operator \
  --set manager.image.tag=<tag>
Or with a custom values file:
helm install flower-operator ./dist/chart/ \
--namespace flower-operator-system \
--create-namespace \
  -f my-values.yaml
Key Helm values:
- manager.image.repository: Controller manager image repository
- manager.image.tag: Controller manager image tag
- manager.replicas: Number of controller replicas (default: 1)
- crd.enable: Install CRDs with chart (default: true)
- crd.keep: Keep CRDs on uninstall (default: true)
- metrics.enable: Enable metrics endpoint (default: true)
See dist/chart/values.yaml for all available options.
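As a starting point, a my-values.yaml covering the keys above might look like this (illustrative; the image repository and tag are placeholders for your own build):

```yaml
# my-values.yaml (illustrative)
manager:
  image:
    repository: registry.example.com/flower-operator   # placeholder registry
    tag: v0.1.0                                        # placeholder tag
  replicas: 1
crd:
  enable: true   # install CRDs with the chart
  keep: true     # keep CRDs when the release is uninstalled
metrics:
  enable: true   # expose the metrics endpoint
```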
- Create a minimal Federation:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: my-federation
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
        replicas: 2
- Apply the Federation:
kubectl apply -f my-federation.yaml
- Check the status:
kubectl get federation my-federation
Output:
NAME MODE ISOLATION SUPERNODES READY AGE
my-federation StatefulSet subprocess 2/2 True 1m
- Port-forward the SuperLink service (Port 9093):
kubectl port-forward service/my-federation-superlink 9093:9093
- Add the remote federation configuration to your Flower app's pyproject.toml:
[tool.flwr.federations.remote-federation]
address = "127.0.0.1:9093"
insecure = true
- Run your Flower app on the remote federation:
flwr run . remote-federation --stream
The spec.mode field controls how SuperNode pools are deployed.
Creates one SuperNode pod per eligible Kubernetes node. Use this when you want node-local clients that scale automatically with your cluster.
| Characteristic | Description |
|---|---|
| Scaling | Automatic - one pod per matching node |
| Replica count | Not configurable (replicas must be unset) |
| Node targeting | Use nodeSelector, affinity, tolerations in podTemplate |
| Use cases | Edge deployments, node-local data, GPU nodes |
Example:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-daemonset
spec:
mode: DaemonSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
images: {}
supernodeResources:
requests:
memory: "512Mi"
cpu: "200m"
limits:
memory: "1Gi"
cpu: "1000m"
# Target specific nodes:
# podTemplate:
# spec:
# nodeSelector:
        #       flower.dev/role: client
Creates a fixed number of SuperNode replicas per pool. Use this when you need a predictable client count or stable pod identities.
| Characteristic | Description |
|---|---|
| Scaling | Manual - controlled by replicas field |
| Replica count | Required for each pool (replicas >= 1) |
| Pod identity | Stable ordinal naming (e.g., fed-supernode-pool-0, fed-supernode-pool-1) |
| Use cases | Testing, predictable workloads, stateful clients |
Example:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-statefulset
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
replicas: 2 # Required in StatefulSet mode
images: {}
supernodeResources:
requests:
memory: "512Mi"
cpu: "200m"
limits:
memory: "1Gi"
            cpu: "1000m"
The spec.isolationMode field controls how Flower executes ServerApps and ClientApps. It is a global setting for the entire Federation and maps directly to the isolation modes supported by the Flower framework.
SuperExec runs as sidecar containers alongside SuperLink and SuperNodes. This separates the control plane from application execution.
Architecture:
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ SuperLink Pod │ │ SuperNode Pod │
│ ┌───────────┐ ┌─────────────┐ │ │ ┌───────────┐ ┌─────────────┐ │
│ │ superlink │ │ superexec- │ │ │ │ supernode │ │ superexec- │ │
│ │ container │ │ serverapp │ │ │ │ container │ │ clientapp │ │
│ │ │ │ sidecar │ │ │ │ │ │ sidecar │ │
│ └───────────┘ └─────────────┘ │ │ └───────────┘ └─────────────┘ │
│ control dependencies │ │ control dependencies │
│ logic + ServerApp │ │ logic + ClientApp │
└─────────────────────────────────┘ └─────────────────────────────────┘
Requirements:
- spec.superlink.superexecImage must be set
- pool.images.superexecClientApp must be set for each pool
Benefits:
- Separation of concerns between control and execution
- Independent image management for dependencies
- Smaller base images for SuperLink/SuperNode
Example:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-process
spec:
mode: StatefulSet
isolationMode: process
superlink:
image: flwr/superlink:1.25.0
superexecImage: flwr/superexec:1.25.0 # Required
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
replicas: 2
images:
          superexecClientApp: flwr/superexec:1.25.0 # Required
SuperLink and SuperNode containers spawn SuperExec internally. No sidecar containers are created.
Architecture:
┌─────────────────────┐ ┌─────────────────────┐
│ SuperLink Pod │ │ SuperNode Pod │
│ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ superlink │ │ │ │ supernode │ │
│ │ container │ │ │ │ container │ │
│ │ (includes │ │ │ │ (includes │ │
│ │ ServerApp │ │ │ │ ClientApp │ │
│ │ deps) │ │ │ │ deps) │ │
│ └───────────────┘ │ │ └───────────────┘ │
└─────────────────────┘ └─────────────────────┘
Requirements:
- spec.superlink.superexecImage must be empty/unset
- pool.images.superexecClientApp must be empty/unset for all pools
- Images must include all application dependencies
Benefits:
- Simpler deployment with fewer containers
- Single image to manage per component
- Lower resource overhead
Example:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-subprocess
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: my-registry/superlink-with-deps:1.25.0 # Includes ServerApp deps
service:
type: ClusterIP
supernodes:
image: my-registry/supernode-with-deps:1.25.0 # Includes ClientApp deps
pools:
- name: default
replicas: 2
        images: {} # No superexecClientApp
| Aspect | Process | Subprocess |
|---|---|---|
| Containers per pod | 2 (main + sidecar) | 1 |
| Image management | Separate control/exec images | Single image with all deps |
| Resource overhead | Higher (two containers) | Lower |
| Dependency isolation | Better (sidecar handles deps) | All deps in main image |
| Configuration complexity | Higher (more images to specify) | Lower |
Pools are the only way to define SuperNodes. Each pool becomes its own Kubernetes workload (DaemonSet or StatefulSet), enabling heterogeneous configurations within a single Federation.
- Heterogeneous hardware: Different nodes may have different capabilities
- Different images: Each pool can use different container images
- Different resources: Customize CPU/memory/GPU per pool
- Independent scheduling: Target specific nodes per pool
- Mixed environments: CPU, NVIDIA GPU, and AMD GPU pools in one Federation
supernodes:
image: flwr/supernode:1.25.0 # Default image for all pools
pools:
- name: cpu-pool
replicas: 4 # StatefulSet mode only
images: {} # Uses default image
supernodeResources:
requests:
cpu: "1"
memory: "2Gi"
- name: gpu-pool
replicas: 2
images:
supernode: my-registry/supernode-gpu:1.25.0 # Docker image with dependencies for the GPU training
gpu:
enabled: true
vendor: nvidia
resourceName: nvidia.com/gpu
        count: 1 # use one nvidia GPU in the SuperNode pod
See examples/federation-multipools.yaml for a complete multi-pool configuration:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-multipools
spec:
mode: DaemonSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: pool-a
images: {}
# podTemplate:
# spec:
# nodeSelector:
# flower.ai/pool: a
- name: pool-b
images:
supernode: flwr/supernode:1.24.0 # Override with different version
# podTemplate:
# spec:
# nodeSelector:
        #       flower.ai/pool: b
GPU support is opt-in and configured per pool. The operator does not model GPU inventory; it relies on Kubernetes device plugins to provide GPU resources.
| Field | Description | Default |
|---|---|---|
| enabled | Toggle GPU requests for this pool | false |
| vendor | GPU vendor hint: nvidia or amd | - |
| resourceName | Extended resource name (e.g., nvidia.com/gpu) | - |
| count | Number of GPUs per pod | 1 |
| mountAll | Expose all GPUs on each node | false |
pools:
- name: nvidia-pool
replicas: 2
images: {}
gpu:
enabled: true
vendor: nvidia
resourceName: nvidia.com/gpu
      count: 1
The equivalent configuration for AMD GPUs:
pools:
- name: amd-pool
replicas: 2
images: {}
gpu:
enabled: true
vendor: amd
resourceName: amd.com/gpu
      count: 1
When mountAll: true, all GPUs on each node are exposed to the pod:
- NVIDIA: Sets runtimeClassName and NVIDIA_VISIBLE_DEVICES=all
- AMD: Mounts /dev/kfd and /dev/dri, sets ROCR_VISIBLE_DEVICES=all
pools:
- name: all-gpus
images: {}
gpu:
enabled: true
vendor: nvidia
      mountAll: true # Requires vendor to be set
Use separate pools for different GPU vendors:
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: cpu-only
replicas: 4
images: {}
- name: nvidia-gpus
replicas: 2
images:
supernode: my-registry/supernode-nvidia:1.25.0 # Image with NVIDIA dependencies
gpu:
enabled: true
vendor: nvidia
resourceName: nvidia.com/gpu
count: 1
- name: amd-gpus
replicas: 2
images:
supernode: my-registry/supernode-amd:1.25.0 # Image with AMD dependencies
gpu:
enabled: true
vendor: amd
resourceName: amd.com/gpu
        count: 1
The nodeConfig field allows passing key-value pairs to the --node-config flag on SuperNodes. This is useful for data partitioning, client identification, and other per-replica configuration.
Values can contain placeholders that are expanded at runtime:
| Placeholder | Description | StatefulSet | DaemonSet |
|---|---|---|---|
| {index} | Replica ordinal (0-based) | Yes | No |
| {replicas} | Total replicas in the pool | Yes | No |
| {pool} | Pool name | Yes | Yes |
| {node} | Kubernetes node name | Yes | Yes |
In StatefulSet mode, use {index} and {replicas} to implement data partitioning:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: partitioned-training
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
supernodes:
image: my-app/client:latest
pools:
- name: default
replicas: 4
images: {}
nodeConfig:
# Each replica gets a unique partition (0, 1, 2, 3)
partition-id: "{index}"
# Total partitions matches replica count (4)
num-partitions: "{replicas}"
# Unique client name
          name: "client_{pool}_{index}"
This generates pods with the following --node-config values (in TOML format):
- Pod 0: --node-config 'name="client_default_0" num-partitions=4 partition-id=0'
- Pod 1: --node-config 'name="client_default_1" num-partitions=4 partition-id=1'
- Pod 2: --node-config 'name="client_default_2" num-partitions=4 partition-id=2'
- Pod 3: --node-config 'name="client_default_3" num-partitions=4 partition-id=3'
Note: String values are automatically quoted for TOML compatibility. Integer placeholders like {index} and {replicas} remain unquoted since they expand to integers.
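To verify what a specific replica actually received, you can inspect the rendered arguments on a running pod. This is a sketch: it assumes the StatefulSet ordinal naming shown above and a container named supernode, as in the architecture diagram.

```bash
# Show the args of the supernode container in the first replica of the "default" pool
kubectl get pod partitioned-training-supernode-default-0 \
  -o jsonpath='{.spec.containers[?(@.name=="supernode")].args}'
```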
In DaemonSet mode, use {node} for unique identification since {index} is not available:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: edge-federation
spec:
mode: DaemonSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
supernodes:
image: my-app/client:latest
pools:
- name: edge
images: {}
nodeConfig:
# Use Kubernetes node name for unique identification
name: "client_{node}"
node-id: "{node}"
          pool: "{pool}"
You can also use nodeConfig for static values (same for all replicas):
nodeConfig:
dataset: "cifar10"
batch-size: "32"
  learning-rate: "0.01"
The operator supports fine-grained resource configuration for all containers in a Federation. Resource requirements help Kubernetes schedule pods appropriately and ensure fair resource allocation.
| Field | Component | Isolation Mode | Description |
|---|---|---|---|
| spec.superlink.resources | SuperLink container | All | CPU/memory for SuperLink control plane |
| spec.superlink.superexecResources | SuperExec ServerApp sidecar | process only | CPU/memory for ServerApp execution |
| spec.supernodes.pools[].supernodeResources | SuperNode container | All | CPU/memory for SuperNode control plane |
| spec.supernodes.pools[].superexecResources | SuperExec ClientApp sidecar | process only | CPU/memory for ClientApp execution |
Configure resources for the SuperLink container and optionally for the SuperExec ServerApp sidecar (in process mode):
superlink:
image: flwr/superlink:1.25.0
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
# Process mode only: SuperExec ServerApp sidecar resources
superexecResources:
requests:
memory: "128Mi"
cpu: "50m"
limits:
memory: "256Mi"
cpu: "200m"
service:
    type: ClusterIP
Configure resources per pool for the SuperNode container and optionally for the SuperExec ClientApp sidecar (in process mode):
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
replicas: 3
supernodeResources:
requests:
memory: "512Mi"
cpu: "200m"
limits:
memory: "1Gi"
cpu: "1000m"
# Process mode only: SuperExec ClientApp sidecar resources
superexecResources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
          cpu: "500m"
Different pools can have different resource requirements to match their workloads:
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: cpu-pool
replicas: 3
supernodeResources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "1Gi"
- name: memory-pool
replicas: 2
supernodeResources:
requests:
cpu: "200m"
memory: "2Gi"
limits:
cpu: "1000m"
memory: "4Gi"
- name: gpu-pool
replicas: 2
supernodeResources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "4000m"
memory: "8Gi"
gpu:
enabled: true
vendor: nvidia
resourceName: nvidia.com/gpu
        count: 1
Requests vs Limits:
- Requests: Minimum resources guaranteed to the container. Used by the scheduler to place pods.
- Limits: Maximum resources the container can use. Prevents resource exhaustion.
GPU Workloads:
- When using GPUs, allocate sufficient CPU and memory for GPU operations
- GPU pools typically need more CPU (1000m+) and memory (2Gi+) than CPU-only pools
Resource Planning:
- Start with conservative requests and adjust based on monitoring
- Set limits to prevent runaway processes from consuming cluster resources
- Consider node capacity when setting pool resource requirements
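If metrics-server is available in the cluster, you can compare actual usage against your requests before tightening limits (a sketch using the Federation label shown later in this README):

```bash
# Live CPU/memory usage for all pods belonging to the Federation
kubectl top pod -l flower.flwr.ai/federation=my-federation
```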
Resource configuration works alongside GPU settings. When GPUs are enabled, ensure sufficient CPU and memory are allocated:
pools:
- name: nvidia-gpu-pool
replicas: 2
supernodeResources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "4000m"
memory: "8Gi"
gpu:
enabled: true
vendor: nvidia
resourceName: nvidia.com/gpu
      count: 1
| Field | Type | Required | Description |
|---|---|---|---|
| mode | DaemonSet \| StatefulSet | Yes | How SuperNode pools are deployed |
| isolationMode | process \| subprocess | Yes | How ServerApp/ClientApp execution is handled |
| version | string | No | Convenience field for Flower version (for future use) |
| insecure | bool | No | Use insecure connections (default: true) |
| superlink | SuperLinkSpec | Yes | SuperLink configuration |
| supernodes | SuperNodesSpec | Yes | SuperNode pools configuration |
| Field | Type | Required | Description |
|---|---|---|---|
| image | string | Yes | SuperLink container image |
| superexecImage | string | process mode | SuperExec sidecar image for ServerApp |
| resources | ResourceRequirements | No | Resources for SuperLink container |
| superexecResources | ResourceRequirements | No | Resources for SuperExec sidecar |
| env | []EnvVar | No | Environment variables |
| extraArgs | []string | No | Extra command-line arguments |
| podTemplate | PodTemplateSpec | No | Pod customization |
| service | SuperLinkServiceSpec | No | Service configuration |
| Field | Type | Default | Description |
|---|---|---|---|
| type | ClusterIP \| NodePort \| LoadBalancer | ClusterIP | Service type |
| ports.grpc | int32 | 9092 | gRPC port for SuperNode connections |
| ports.serverAppIO | int32 | 9091 | ServerAppIO port (process mode) |
| ports.extra | int32 | 9093 | Extra port (metrics, etc.) |
| labels | map[string]string | - | Additional service labels |
| annotations | map[string]string | - | Service annotations |
| Field | Type | Required | Description |
|---|---|---|---|
| image | string | Yes | Default SuperNode image |
| env | []EnvVar | No | Default environment variables for all pools |
| pools | []SuperNodePoolSpec | Yes | List of SuperNode pools (1-16) |
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Pool name (unique, DNS-compatible) |
| replicas | int32 | StatefulSet mode | Number of replicas (must be >= 1) |
| images.supernode | string | No | Override SuperNode image |
| images.superexecClientApp | string | process mode | SuperExec ClientApp sidecar image |
| gpu | GPUConfig | No | GPU configuration |
| supernodeResources | ResourceRequirements | No | Resources for SuperNode container |
| superexecResources | ResourceRequirements | No | Resources for SuperExec sidecar |
| podTemplate | PodTemplateSpec | No | Pod customization |
| extraArgs | []string | No | Extra command-line arguments |
| env | []EnvVar | No | Environment variables (merged with defaults) |
| nodeConfig | map[string]string | No | Key-value pairs for the --node-config flag (see Node Configuration) |
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable GPU requests |
| vendor | nvidia \| amd | - | GPU vendor |
| resourceName | string | - | Extended resource name |
| count | int32 | 1 | GPUs per pod |
| mountAll | bool | false | Mount all GPUs on node |
The examples/ directory contains ready-to-use Federation configurations:
examples/federation-daemonset-process.yaml
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-daemonset-process
spec:
mode: DaemonSet
isolationMode: process
superlink:
image: flwr/superlink:1.25.0
superexecImage: flwr/superexec:1.25.0
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
images:
          superexecClientApp: flwr/superexec:1.25.0
examples/federation-statefulset-subprocess.yaml
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-statefulset-subprocess
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
service:
type: ClusterIP
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
replicas: 2
        images: {}
examples/federation-resources.yaml demonstrates comprehensive resource configuration:
- SuperLink with resources (process and subprocess modes)
- SuperExec sidecar resources (process mode)
- Multiple pools with different resource requirements (CPU, memory, GPU)
- Both DaemonSet and StatefulSet modes
- Resource configuration combined with GPU settings
examples/federation-nodeconfig.yaml - StatefulSet with data partitioning:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-nodeconfig
spec:
mode: StatefulSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: default
replicas: 4
images: {}
nodeConfig:
partition-id: "{index}"
num-partitions: "{replicas}"
          name: "client_{pool}_{index}"
examples/federation-nodeconfig-daemonset.yaml - DaemonSet with node identification:
apiVersion: flwr.exalsius.ai/v1
kind: Federation
metadata:
name: federation-nodeconfig-daemonset
spec:
mode: DaemonSet
isolationMode: subprocess
superlink:
image: flwr/superlink:1.25.0
supernodes:
image: flwr/supernode:1.25.0
pools:
- name: edge
images: {}
nodeConfig:
name: "client_{node}"
node-id: "{node}"
          pool: "{pool}"
kubectl apply -f my-federation.yaml
# Check Federation status
kubectl get federation my-federation
# Detailed status
kubectl describe federation my-federation
# SuperLink deployment
kubectl get deployment -l app.kubernetes.io/managed-by=flower-operator
# SuperNode workloads (DaemonSets or StatefulSets)
kubectl get daemonset,statefulset -l app.kubernetes.io/managed-by=flower-operator
# Services
kubectl get service -l app.kubernetes.io/managed-by=flower-operator
# From status
kubectl get federation my-federation -o jsonpath='{.status.endpoints.superlinkService}'
# Or from service directly
kubectl get service my-federation-superlink -o jsonpath='{.spec.clusterIP}'
# Create the Flower PyTorch quickstart example
flwr new @flwrlabs/quickstart-pytorch
cd quickstart-pytorch
# Adjust the pyproject.toml with the remote federation
[tool.flwr.federations.remote-federation]
address = "127.0.0.1:9093"
insecure = true
# Using port-forward for local development
kubectl port-forward service/my-federation-superlink 9093:9093
# In another terminal, start the flower run
flwr run . remote-federation --stream
# Watch Federation status
kubectl get federation my-federation -w
# Check SuperNode pod logs
kubectl logs -l flower.flwr.ai/component=supernode -f
# Check SuperLink pod logs
kubectl logs -l flower.flwr.ai/component=superlink -f
| Field | Description |
|---|---|
| observedGeneration | Generation last processed by controller |
| supernodes | Aggregate availability (e.g., "2/3") |
| readySupernodes | Total ready SuperNodes across all pools |
| desiredSupernodes | Total desired SuperNodes across all pools |
| endpoints.superlinkService | SuperLink service address for flwr run |
| pools | Per-pool status (name, kind, desired, ready) |
| conditions | Standard Kubernetes conditions |
$ kubectl get federation
NAME MODE ISOLATION SUPERNODES READY AGE
my-federation   StatefulSet   subprocess   2/2          True    5m
| Condition | Description |
|---|---|
| Ready | Federation is fully operational |
| Reconciling | Controller is processing changes |
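Because the Federation exposes a standard Ready condition, kubectl wait can block until it becomes operational, which is useful in scripts and CI (sketch):

```bash
# Wait up to 5 minutes for the Federation to report Ready
kubectl wait federation/my-federation --for=condition=Ready --timeout=5m
```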
$ kubectl get federation my-federation -o jsonpath='{.status.pools}' | jq
[
{
"name": "default",
"kind": "StatefulSet",
"desired": 2,
"ready": 2
}
]
# Check conditions
kubectl describe federation my-federation
# Check operator logs
kubectl logs -n flower-operator-system deployment/flower-operator-controller-manager -c manager
# Check pod events
kubectl describe pod -l flower.flwr.ai/federation=my-federation
# Common causes:
# - nodeSelector doesn't match any nodes
# - Insufficient resources
# - Taints without tolerations
# Verify GPU device plugin is running
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check node GPU capacity
kubectl describe node <node-name> | grep -A5 "Allocatable"
The CRD enforces these rules via CEL validation:
| Error | Cause | Fix |
|---|---|---|
| "replicas must not be set for pools when mode is DaemonSet" | Set replicas in DaemonSet mode | Remove replicas from pools |
| "replicas must be set (>=1) for every pool when mode is StatefulSet" | Missing replicas in StatefulSet mode | Add replicas to each pool |
| "spec.superlink.superexecImage is required when isolationMode=process" | Process mode without SuperExec image | Set superexecImage |
| "spec.supernodes.pools[*].images.superexecClientApp must be empty when isolationMode=subprocess" | Subprocess mode with SuperExec images | Remove superexecClientApp |
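For example, the following fragment would be rejected at admission time by the first rule above (illustrative snippet):

```yaml
# Invalid: replicas must not be set for pools when mode is DaemonSet
spec:
  mode: DaemonSet
  supernodes:
    pools:
      - name: default
        replicas: 2   # remove this, or switch mode to StatefulSet
```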
# Federation details
kubectl describe federation my-federation
# Operator logs
kubectl logs -n flower-operator-system deployment/flower-operator-controller-manager -c manager -f
# Pod events
kubectl get events --field-selector involvedObject.kind=Pod -l flower.flwr.ai/federation=my-federation
# Resource status
kubectl get all -l flower.flwr.ai/federation=my-federation
# Delete specific Federation
kubectl delete federation my-federation
# Or delete all Federations
kubectl delete federation --all
# Delete CRDs (this also deletes all Federation resources!)
make uninstall
# Undeploy the controller
make undeploy
kubectl delete namespace flower-operator-system
# Build the operator binary
make build
# Build the container image
make docker-build IMG=<registry>/flower-operator:tag
# Run unit tests
make test
# Run linter
make lint
# Auto-fix lint issues
make lint-fix
# Install CRDs
make install
# Run the controller locally (uses current kubeconfig)
make run
After modifying api/v1/federation_types.go:
# Regenerate CRDs and RBAC
make manifests
# Regenerate DeepCopy methods
make generate
For detailed development guidelines, see AGENTS.md.
Contributions are welcome! Please follow these guidelines:
- Follow the Kubebuilder conventions
- Use Conventional Commits for commit messages
- Ensure all tests pass: make test
- Run linter before submitting: make lint
See AGENTS.md for detailed development practices and code structure.
Copyright 2026.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.