diff --git a/.ai/spec/README.md b/.ai/spec/README.md new file mode 100644 index 000000000..061821c72 --- /dev/null +++ b/.ai/spec/README.md @@ -0,0 +1,69 @@ +# OpenShift Lightspeed Operator -- Specifications + +Machine-readable behavioral and architectural specifications for the OpenShift Lightspeed Operator. + +## Structure + +This specification uses a two-layer structure: + +| Layer | Path | Purpose | +|---|---|---| +| **what/** | `.ai/spec/what/` | Behavioral rules. Defines what the operator must do, its invariants, and its configuration surface. Implementation-agnostic. | +| **how/** | `.ai/spec/how/` | Architecture. Defines how the codebase is organized, how reconciliation is implemented, and how resources are generated. Implementation-specific. | + +The separation exists so that behavioral rules remain stable across refactors. An agent fixing a reconciliation bug reads both layers; an agent answering "what happens when X" reads only `what/`. + +## Scope + +These specs cover the **operator** only. The following are separate projects with their own repositories and specifications: + +- **lightspeed-service** -- the Python/FastAPI backend application +- **lightspeed-console** -- the OpenShift Console plugin UI code +- **RAG content pipeline** -- the retrieval-augmented generation data pipeline +- **Jira project data** -- issue tracking lives in the service repo's Jira project (OLS) + +## Audience + +AI agents (Claude). Content is optimized for precision and machine consumption over human readability. 
+ +## Quick Start + +| Task | Start here | +|---|---| +| Understand what the operator does | `what/system-overview.md` | +| Fix a reconciliation bug | `what/reconciliation.md` + `how/reconciliation.md` | +| Add a new managed component | `what/system-overview.md` + `how/project-structure.md` | +| Understand the CRD | `what/crd-api.md` | +| Navigate the codebase | `how/project-structure.md` | +| Understand TLS configuration | `what/tls.md` | +| Understand security constraints | `what/security.md` | +| Debug external resource watching | `what/resource-lifecycle.md` + `how/reconciliation.md` | +| Add metrics or alerts | `what/observability.md` | + +## Conventions + +### Planned changes + +Unimplemented behavior is marked with `[PLANNED: OLS-XXXX]` where `OLS-XXXX` is the Jira ticket. These markers appear inline next to the behavioral rule they affect. A summary table of all planned changes appears at the end of each `what/` spec that contains them. + +### Configuration field references + +User-configurable values are referenced by their CRD field path (e.g., `spec.ols.defaultModel`). Operator startup flags are referenced by their flag name (e.g., `--namespace`). + +### Internal constants + +Behavioral rules state the rule without embedding the numeric value. For example: "the finalizer cleanup waits for owned resources to be deleted before removing the finalizer" rather than "waits for 3 minutes". The actual value lives in code and may change. + +### Rule numbering + +Behavioral rules are numbered sequentially within each section. Numbers are stable within a spec version but may be renumbered across major revisions. + +## Project History + +| Phase | Period | Operator milestones | +|---|---|---| +| Prototype | Q4 2023 | Initial operator scaffold with kubebuilder. Basic OLSConfig CRD. AppServer deployment reconciliation. | +| Early Access | Q1-Q2 2024 | PostgreSQL conversation cache. Console UI plugin integration. LLM secret management. Redis replaced by PostgreSQL. 
| +| Tech Preview | Q3 2024 | TLS hardening (service-ca integration, custom certs). Prometheus monitoring. Status conditions. Air-gap support (image overrides). | +| GA | Q4 2024 - Q1 2025 | Finalizer-based cleanup. ResourceVersion-based change detection. External resource watcher system. OCP version detection for console plugin image selection. | +| Post-GA | 2025-2026 | MCP server integration. RAG support with vector database. Event-driven reconciliation (removed timer-based). Dataverse exporter. PatternFly 5/6 console image selection. LCore/Llama Stack backend (added then removed). | diff --git a/.ai/spec/how/README.md b/.ai/spec/how/README.md new file mode 100644 index 000000000..bd5d7c4ec --- /dev/null +++ b/.ai/spec/how/README.md @@ -0,0 +1,32 @@ +# Architecture Specifications + +Defines how the operator is implemented. Each spec maps behavioral rules from `what/` to code locations, patterns, and structural decisions. + +## Spec Index + +| Spec | Description | +|---|---| +| `project-structure.md` | Codebase layout: package responsibilities, file naming conventions, import graph, key entry points. Map from concept to file path. | +| `reconciliation.md` | Reconciliation implementation: task registration pattern, error propagation, status update mechanics, watcher configuration, finalizer implementation. | +| `deployment-generation.md` | How Kubernetes resources (Deployments, Services, ConfigMaps, Secrets, PVCs) are generated: builder functions, volume/mount assembly, container spec construction, owner references. | +| `config-generation.md` | How CRD fields are transformed into operand configuration: OLS config YAML generation, PostgreSQL configuration, MCP server configuration, environment variable mapping. 
| + +## When to Read + +| Situation | Read | +|---|---| +| Need to find where something is implemented | `project-structure.md` | +| Debugging reconciliation ordering or error handling | `reconciliation.md` | +| Modifying a deployment, service, or volume | `deployment-generation.md` | +| Changing how CRD fields map to operand config | `config-generation.md` | +| Adding a new reconciliation task | `reconciliation.md` + `deployment-generation.md` | +| Understanding watcher behavior | `reconciliation.md` | + +## Relationship to what/ + +The `how/` specs implement the behavioral rules defined in `what/`. Each `how/` spec references the `what/` rules it implements. + +- `how/` specs describe code structure, function signatures, and file locations. +- `what/` specs describe invariants, ordering constraints, and expected behavior. +- When implementing a change, read the `what/` spec first to understand the required behavior, then read the `how/` spec to find the implementation location. +- If a `how/` spec contradicts a `what/` spec, the `what/` spec is authoritative and the implementation should be updated to match. 
diff --git a/.ai/spec/how/config-generation.md b/.ai/spec/how/config-generation.md new file mode 100644 index 000000000..8b0aa3ffa --- /dev/null +++ b/.ai/spec/how/config-generation.md @@ -0,0 +1,218 @@ +# Config Generation + +## Module Map + +| File | Key Functions | Responsibility | +|---|---|---| +| `internal/controller/appserver/assets.go` | `GenerateOLSConfigMap()`, `buildProviderConfigs()`, `buildOLSConfig()`, `generateMCPServerConfigs()`, `buildToolFilteringConfig()` | OLS config YAML (olsconfig.yaml) | +| `internal/controller/postgres/assets.go` | `GeneratePostgresConfigMap()`, `GeneratePostgresBootstrapSecret()`, `GeneratePostgresSecret()` | PostgreSQL config + bootstrap script + credentials | +| `internal/controller/console/assets.go` | `GenerateConsoleUIConfigMap()` | Nginx config for console plugin | +| `internal/controller/utils/mcp_server_config.go` | `GenerateOpenShiftMCPServerConfigMap()` | MCP server denied-resources config (TOML) | + +## Data Flow + +### OLS Config (olsconfig.yaml) +``` +CR spec -> GenerateOLSConfigMap() -> ConfigMap "olsconfig" +``` + +Generated YAML structure (marshaled from `utils.AppSrvConfigFile`): +```yaml +llm_providers: + - name: + type: # direct from CRD enum: openai, azure_openai, etc. + url: # non-Azure providers + credentials_path: /etc/apikeys/ # mount path to secret dir + models: + - name: + url: + context_window_size: + parameters: + max_tokens_for_response: + tool_budget_ratio: + # Azure-specific: + azure_openai_config: + url: + credentials_path: /etc/apikeys/ + azure_deployment_name: + api_version: + # Watsonx-specific: + project_id: + # Fake provider: + fake_provider_config: + url: "http://example.com" + response: "This is a preconfigured fake response." 
+ chunks: 30 + sleep: 0.1 + stream: false + mcp_tool_call: + +ols_config: + default_model: + default_provider: + max_iterations: + logging: + app_log_level: + lib_log_level: + uvicorn_log_level: + conversation_cache: + type: postgres + postgres: + host: lightspeed-postgres-server.<namespace>.svc + port: 5432 + user: postgres + db: postgres + password_path: /etc/credentials/lightspeed-postgres-secret/password + ssl_mode: require + ca_cert_path: /etc/certs/postgres-ca/service-ca.crt + tls_config: + tls_certificate_path: /etc/certs/lightspeed-tls/tls.crt + tls_key_path: /etc/certs/lightspeed-tls/tls.key + reference_content: + indexes: + - path: /app-root/rag/rag-0 # BYOK first (one per spec.ols.rag entry) + index_id: + origin: + - path: /app-root/vector_db/ocp_product_docs/<major>.<minor> # OCP docs (unless byokRAGOnly) + index_id: ocp-product-docs-<major>_<minor> + origin: "Red Hat OpenShift <major>.<minor> documentation" + embeddings_model_path: /app-root/embeddings_model + user_data_collection: + feedback_disabled: + feedback_storage: /app-root/ols-user-data/feedback + transcripts_disabled: + transcripts_storage: /app-root/ols-user-data/transcripts + extra_cas: [] + certificate_directory: /etc/certs/cert-bundle + proxy_config: + proxy_url: + proxy_ca_cert_path: /etc/certs/cm-proxycacert/ + query_filters: [{name, pattern, replace_with}] # if spec.ols.queryFilters set + system_prompt_path: /etc/ols/system_prompt # if spec.ols.querySystemPrompt set + quota_handlers_config: # if spec.ols.quotaHandlersConfig set + storage: + scheduler: {period: 300} + limiters_config: [{name, type, initial_quota, quota_increase, period}] + enable_token_history: + tool_filtering: # if ToolFiltering gate + MCP servers exist + alpha: + top_k: + threshold: + tools_approval: # always present + approval_type: + approval_timeout: + +mcp_servers: # if any MCP servers configured + - name: openshift # if introspectionEnabled + url: http://localhost:<port> + timeout: + headers: + x-kube-auth: "{{KUBERNETES_TOKEN}}" + - name: # if MCPServer feature gate + 
url: + timeout: + headers: + <header-name>: # kubernetes -> "{{KUBERNETES_TOKEN}}" + # client -> "{{CLIENT_TOKEN}}" + # secret -> /etc/mcp/headers/<server-name>/header + +user_data_collector_config: # if dataCollectorEnabled + data_storage: /app-root/ols-user-data + log_level: +``` + +### PostgreSQL Bootstrap Script +Content is in `utils.PostgresBootStrapScriptContent` constant. Deployed as a Secret (not ConfigMap) named `lightspeed-postgres-bootstrap`. + +```bash +#!/bin/bash +cat /var/lib/pgsql/data/userdata/postgresql.conf + +_psql () { psql --set ON_ERROR_STOP=1 "$@" ; } + +# Create pg_trgm extension in default database (for OLS conversation cache) +echo "CREATE EXTENSION IF NOT EXISTS pg_trgm;" | _psql -d $POSTGRESQL_DATABASE + +# Create schemas for isolating different components' data +echo "CREATE SCHEMA IF NOT EXISTS quota;" | _psql -d $POSTGRESQL_DATABASE +echo "CREATE SCHEMA IF NOT EXISTS conversation_cache;" | _psql -d $POSTGRESQL_DATABASE +``` + +### PostgreSQL Config (postgresql.conf.sample) +Content is in `utils.PostgresConfigMapContent` constant. Deployed as ConfigMap. +``` +huge_pages = off +ssl = on +ssl_cert_file = '/etc/certs/tls.crt' +ssl_key_file = '/etc/certs/tls.key' +ssl_ca_file = '/etc/certs/cm-olspostgresca/service-ca.crt' +``` + +### PostgreSQL Password Secret +Generated via `GeneratePostgresSecret()`: 12 random bytes, base64 encoded, stored in secret key `password` (`utils.PostgresSecretKeyName`). 
+ +### Nginx Config (Console UI) +Inline in `GenerateConsoleUIConfigMap()`: +- PID file: `/tmp/nginx/nginx.pid` +- Temp paths: `/tmp/nginx/{client_body,proxy,fastcgi,uwsgi,scgi}` (for read-only root filesystem) +- Serves static files from `/usr/share/nginx/html` on port 9443 with SSL +- TLS cert/key from `/var/cert/tls.crt` and `/var/cert/tls.key` + +### MCP Server Config (TOML) +Inline in `utils.OpenShiftMCPServerConfigTOML` constant: +```toml +[[denied_resources]] +group = "" +version = "v1" +kind = "Secret" + +[[denied_resources]] +group = "rbac.authorization.k8s.io" +version = "v1" +``` + +## Key Abstractions + +### Credential Injection Pattern +Provider credentials are mounted as files at `/etc/apikeys/<secret-name>/`. The OLS config references the directory path as `credentials_path`. The secret key used is `apitoken` by default, overridable by `credentialKey` in the CR. + +### External Resource Iteration +`utils.ForEachExternalSecret(cr, callback)` and `utils.ForEachExternalConfigMap(cr, callback)` provide consistent iteration over CR-referenced external resources. Each callback receives `(name, source)` where `source` identifies the reference origin: +- `"llm-provider-<provider-name>"` for LLM credential secrets +- `"mcp-<server-name>"` for MCP header secrets +- `"additional-ca"` for additional CA configmaps +- `"proxy-ca"` for proxy CA configmaps + +### Config Building Pattern +Config is built programmatically using typed Go structs from the `utils/` package (e.g., `utils.AppSrvConfigFile`) and marshaled with `yaml.Marshal()`. No templates are used. + +### PostgreSQL Schema Isolation +PostgreSQL schemas isolate data from different components within the same database: +- `conversation_cache` schema: conversation history +- `quota` schema: token quota tracking +These schemas are created by the bootstrap script. 
+ +## Integration Points + +| Config Section | Source | Notes | +|---|---|---| +| Provider credentials | CR `spec.llm.providers[].credentialsSecretRef` | File mount at `/etc/apikeys/<secret-name>/` | +| Default model/provider | CR `spec.ols.defaultModel`, `spec.ols.defaultProvider` | Required fields | +| Log level | CR `spec.ols.logLevel` | Enum: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default: INFO | +| PostgreSQL connection | `utils/constants.go` | Host built from service name + namespace + ".svc" | +| TLS certs | Service-ca operator or user-provided secret | Path: `/etc/certs/lightspeed-tls/` | +| RAG indexes | CR `spec.ols.rag[]` | File paths in config YAML | +| OpenShift version | Reconciler options | Used for OCP docs RAG index path | +| MCP servers | CR `spec.mcpServers[]` + `spec.ols.introspectionEnabled` | Feature gated by `MCPServer` gate | +| Tool filtering | CR `spec.ols.toolFilteringConfig` | Feature gated by `ToolFiltering` gate; requires MCP servers | +| Proxy config | CR `spec.ols.proxyConfig` | Proxy URL + optional CA cert configmap | +| Query filters | CR `spec.ols.queryFilters[]` | Regex patterns for content filtering | +| Quota config | CR `spec.ols.quotaHandlersConfig` | Rate limiting with scheduler period fixed at 300s | + +## Implementation Notes + +- Config YAML is built programmatically using Go structs and marshaled with `yaml.Marshal()`, not templates. +- The fake provider config is hardcoded with test response values (`"This is a preconfigured fake response."`). +- PostgreSQL uses `POSTGRESQL_ADMIN_PASSWORD` env var for the admin password (mapped from the generated secret in the deployment spec, not shown in config files). +- Exporter config for data collector uses a separate ConfigMap (`utils.ExporterConfigCmName`) with collection interval of 300 seconds, cleanup after send, and ingress URL to `console.redhat.com`. +- The `OLSSystemPromptFileName` is stored as a separate key in the OLS config ConfigMap when `querySystemPrompt` is set. 
diff --git a/.ai/spec/how/deployment-generation.md b/.ai/spec/how/deployment-generation.md new file mode 100644 index 000000000..1ff911dc6 --- /dev/null +++ b/.ai/spec/how/deployment-generation.md @@ -0,0 +1,111 @@ +# Deployment Generation + +## Module Map + +| File | Key Functions | Responsibility | +|---|---|---| +| `internal/controller/appserver/deployment.go` | `GenerateOLSDeployment()`, `updateOLSDeployment()`, `RestartAppServer()`, `dataCollectorEnabled()` | AppServer deployment spec, change detection, restart | +| `internal/controller/postgres/deployment.go` | `GeneratePostgresDeployment()`, `UpdatePostgresDeployment()` | PostgreSQL deployment spec | +| `internal/controller/console/deployment.go` | `GenerateConsoleUIDeployment()` | Console UI deployment spec | + +## Data Flow + +### AppServer Deployment Construction +``` +GenerateOLSDeployment(r, cr) + 1. Check dataCollectorEnabled (requires both user config AND telemetry pull secret) + 2. Build LLM provider credential volumes + mounts (via ForEachExternalSecret, source "llm-provider-*") + 3. Build postgres secret volume + mount + 4. Build TLS volume + mount (user-provided KeyCertSecretRef OR service-ca generated OLSCertsSecretName) + 5. Build OLS config configmap volume + mount + 6. Conditionally add data collector volumes (user-data emptyDir, exporter config CM) + 7. Add kube-root-ca.crt configmap volume + cert-bundle emptyDir volume + 8. Add user-provided CA volumes (additional-ca CM, proxy-ca CM via ForEachExternalConfigMap) + 9. Add RAG emptyDir volume (if spec.ols.rag configured) + 10. Add postgres-ca configmap volume + tmp emptyDir volume + 11. Add MCP header secret volumes (via ForEachExternalSecret, source "mcp-*") + 12. Build init containers: + a. PostgreSQL wait init container (polls pg service) + b. RAG init containers (one per RAG entry, copies data to shared emptyDir) + 13. Get ConfigMap ResourceVersions for tracking annotations + 14. Get proxy CA cert hash for tracking annotation + 15. 
Assemble Deployment: + - Container: "lightspeed-service-api", image: r.GetAppServerImage(), port: 8443 + - Env: OLS_CONFIG_FILE path + proxy vars (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) + - Probes: HTTPS GET on /readiness, /liveness (initial: 30s, period: 30s, timeout: 30s, failure: 15) + - Default resources: 500m CPU request, 1Gi-4Gi memory + 16. Apply pod-level config (replicas, nodeSelector, tolerations, affinity, topologySpreadConstraints) + 17. Set ImageStream triggers annotation (if RAG configured) + 18. Set owner reference to OLSConfig CR + 19. Conditionally add data collector sidecar container ("lightspeed-to-dataverse-exporter") + 20. Conditionally add OpenShift MCP server sidecar container ("openshift-mcp-server") +``` + +### Change Detection Pattern +All deployments use the same pattern in their update functions: +1. Compare desired vs existing deployment spec using `DeploymentSpecEqual()` (from `utils/`) +2. Compare ConfigMap ResourceVersions via deployment annotations (one per tracked CM) +3. Compare content hashes (proxy CA cert hash) via annotations +4. If any differ: update spec + annotations, call RestartX() function + - RestartX() sets `ols.openshift.io/force-reload` annotation to `time.Now().Format(time.RFC3339Nano)` + - This triggers a rolling restart by changing the pod template + +**AppServer tracks:** OLS config CM version, MCP server config CM version, proxy CA cert hash + +## Key Abstractions + +### Resource Requirement Defaults +Each component defines default CPU/memory requests and limits in local `get*Resources()` functions. User-provided values from the CR override defaults via `utils.GetResourcesOrDefault()` which returns user values if non-nil, otherwise defaults. 
+ +Default resources by container: +| Container | CPU Request | CPU Limit | Memory Request | Memory Limit | +|---|---|---|---|---| +| AppServer `lightspeed-service-api` | 500m | - | 1Gi | 4Gi | +| Data collector | 50m | - | 64Mi | 200Mi | +| MCP server | 50m | - | 64Mi | 200Mi | + +### Volume/Mount Construction +Volumes and mounts are built as slices and conditionally appended using inline append patterns. + +### Init Container Generation +- **PostgreSQL wait:** `utils.GeneratePostgresWaitInitContainer()` generates a container that polls the PostgreSQL service until it responds. +- **RAG (AppServer only):** `GenerateRAGInitContainers()` creates one init container per RAG entry, each copying data from the RAG image to the shared emptyDir volume at `/app-root/rag/rag-<index>`. + +### ImageStream Triggers (AppServer only) +RAG images use OpenShift ImageStreams for automatic updates. The deployment is annotated with `image.openshift.io/triggers` JSON that maps ImageStreamTag changes to init container image fields. This allows RAG content updates without operator intervention. + +### Data Collector Enablement +Computed from two inputs: +1. User data collection config: `!FeedbackDisabled || !TranscriptsDisabled` +2. Telemetry pull secret: `openshift-config/pull-secret` has `.auths."cloud.openshift.com"` entry in `.dockerconfigjson` + +Both must be true. The service ID is `"ols"` unless the CR has the `openstack.org/lightspeed-owner-id` label, in which case it is `"rhos-lightspeed"`. 
+ +### Pod Scheduling Configuration +`utils.ApplyPodDeploymentConfig()` applies scheduling from `cr.Spec.OLSConfig.DeploymentConfig.APIContainer`: +- Replicas (configurable for API container; forced to 1 for postgres and console) +- NodeSelector +- Tolerations +- Affinity +- TopologySpreadConstraints + +## Integration Points + +| Consumer | Provider | Data | +|---|---|---| +| Deployment spec | `utils/constants.go` | Resource names, ports, mount paths | +| Container resources | CR `spec.ols.deployment.api.resources` | User-overridable CPU/memory | +| Pod scheduling | CR `spec.ols.deployment.api` | Tolerations, nodeSelector, affinity, topology | +| Volume secrets | Kubernetes Secrets | LLM credentials, TLS certs, PostgreSQL password, MCP header values | +| Volume configmaps | Generated ConfigMaps | OLS config, nginx config, MCP server config | +| Proxy env vars | `utils.GetProxyEnvVars()` | HTTP_PROXY, HTTPS_PROXY, NO_PROXY from cluster | +| RAG images | CR `spec.ols.rag[].image` | Container images for init containers | + +## Implementation Notes + +- `RevisionHistoryLimit` is set to 1 for all deployments to minimize stored ReplicaSets. +- All sidecar containers use `utils.RestrictedContainerSecurityContext()` which sets: `RunAsNonRoot: true`, `ReadOnlyRootFilesystem: true`, `AllowPrivilegeEscalation: false`, Drop ALL capabilities, RuntimeDefault seccomp profile. +- The force-reload annotation (`ols.openshift.io/force-reload`) is set to `time.Now().Format(time.RFC3339Nano)` to guarantee uniqueness and trigger pod replacement. +- The OpenShift MCP server always uses `PullIfNotPresent`. +- The `VolumeDefaultMode` is `int32(420)` (0644 octal), defined in `utils/constants.go`. +- AppServer deployment name is `utils.OLSAppServerDeploymentName` (`"lightspeed-app-server"`). 
diff --git a/.ai/spec/how/project-structure.md b/.ai/spec/how/project-structure.md new file mode 100644 index 000000000..805cf531b --- /dev/null +++ b/.ai/spec/how/project-structure.md @@ -0,0 +1,221 @@ +# Project Structure + +## Module Map + +| Path | Key Symbols | Responsibility | +|---|---|---| +| `api/v1alpha1/olsconfig_types.go` | `OLSConfig`, `OLSConfigSpec`, `OLSConfigStatus`, `ProviderSpec`, `ModelSpec` | CRD type definitions, validation markers, defaults | +| `api/v1alpha1/groupversion_info.go` | `SchemeBuilder`, `GroupVersion` | API group/version registration | +| `api/v1alpha1/zz_generated.deepcopy.go` | Generated `DeepCopyObject()` methods | Auto-generated deep copy | +| `cmd/main.go` | `main()`, `overrideImages()` | Operator entry point, flag parsing, manager setup | +| `internal/controller/olsconfig_controller.go` | `OLSConfigReconciler`, `Reconcile()`, `SetupWithManager()` | Main reconciler, orchestration, watcher registration | +| `internal/controller/olsconfig_helpers.go` | `UpdateStatusCondition()`, `checkDeploymentStatus()`, `annotateExternalResources()`, `shouldWatchSecret()` | Status management, diagnostics, annotation, watcher predicates | +| `internal/controller/operator_assets.go` | `ReconcileServiceMonitorForOperator()`, `ReconcileNetworkPolicyForOperator()` | Operator-level resources | +| `internal/controller/appserver/reconciler.go` | `ReconcileAppServerResources()`, `ReconcileAppServerDeployment()` | AppServer Phase 1 + Phase 2 orchestration | +| `internal/controller/appserver/deployment.go` | `GenerateOLSDeployment()`, `updateOLSDeployment()` | AppServer deployment generation, update detection | +| `internal/controller/appserver/assets.go` | `GenerateOLSConfigMap()`, service/RBAC/ServiceMonitor/PrometheusRule generators | AppServer resource generation, OLS config YAML | +| `internal/controller/appserver/rag.go` | `GenerateRAGInitContainers()`, `reconcileImageStreams()` | RAG init container and ImageStream management | +| 
`internal/controller/postgres/reconciler.go` | `ReconcilePostgresResources()`, `ReconcilePostgresDeployment()` | PostgreSQL Phase 1 + Phase 2 | +| `internal/controller/postgres/deployment.go` | `GeneratePostgresDeployment()` | PostgreSQL deployment generation | +| `internal/controller/postgres/assets.go` | `GeneratePostgresConfigMap()`, `GeneratePostgresBootstrapSecret()`, `GeneratePostgresSecret()` | PostgreSQL config, bootstrap script, credentials | +| `internal/controller/console/reconciler.go` | `ReconcileConsoleUIResources()`, `ReconcileConsoleUIDeploymentAndPlugin()`, `RemoveConsoleUI()` | Console UI Phase 1 + Phase 2 + cleanup | +| `internal/controller/console/deployment.go` | `GenerateConsoleUIDeployment()` | Console UI deployment generation | +| `internal/controller/console/assets.go` | ConsolePlugin CR generator, nginx config, service, network policy | Console UI resource generation | +| `internal/controller/reconciler/interface.go` | `Reconciler` interface | Dependency injection interface for component packages | +| `internal/controller/utils/constants.go` | ~200 constants | Resource names, ports, paths, annotation keys, defaults | +| `internal/controller/utils/errors.go` | ~80 error message constants | Structured error messages for all operations | +| `internal/controller/utils/mcp_server_config.go` | `GenerateOpenShiftMCPServerConfigMap()`, TOML config | MCP server configuration with denied resources | +| `internal/controller/utils/postgres_wait.go` | `GeneratePostgresWaitInitContainer()` | PostgreSQL readiness init container | +| `internal/controller/watchers/watchers.go` | `SecretUpdateHandler`, `ConfigMapUpdateHandler`, `SecretWatcherFilter()`, `ConfigMapWatcherFilter()` | External resource change handlers, deployment restart logic | +| `internal/tls/` | `GetTLSProfileSpec()`, `FetchAPIServerTlsProfile()` | TLS profile resolution | +| `config/crd/` | CRD YAML manifests | Generated CRD definitions | +| `config/rbac/` | RBAC YAML manifests | Generated 
RBAC rules | +| `config/manager/` | Deployment manifest | Operator deployment | +| `test/e2e/` | E2E test suites | End-to-end integration tests | + +## Startup Sequence + +``` +main() + 1. Parse flags (images, namespace, leader election, secure metrics) + 2. Get Kubernetes config and client + 3. Detect OpenShift version (major, minor) + 4. Select console image: if minor < 19 -> PF5, else -> PF6 + 5. Check Prometheus Operator availability (probe CRD existence) + 6. Configure metrics TLS (if --secure-metrics-server): + a. Read client CA from openshift-monitoring/metrics-client-ca + b. Read TLS profile from OLSConfig CR or API server + 7. Create controller manager with: + - Multi-namespace cache (operator ns + openshift-config for secrets) + - TLS metrics server + - Health/readiness probes (ping) + - Leader election (if enabled) + 8. Build WatcherConfig (system secrets + configmaps) + 9. Create OLSConfigReconciler with all options + 10. Register with manager via SetupWithManager() + 11. Start manager (blocking) +``` + +## Data Flow + +### Reconciliation Flow +``` +OLSConfigReconciler.Reconcile() + 1. getAndValidateCR() -- Only processes CR named "cluster" + 2. handleFinalizer() -- Add finalizer or run deletion cleanup + 3. reconcileOperatorResources() -- ServiceMonitor, NetworkPolicy (operator-level) + 4. annotateExternalResources() -- Mark external secrets/configmaps for watching + 5. reconcileIndependentResources() -- Phase 1: ConfigMaps, Secrets, ServiceAccounts, RBAC, NetworkPolicies + +-- console.ReconcileConsoleUIResources() + +-- postgres.ReconcilePostgresResources() + +-- appserver.ReconcileAppServerResources() + 6. 
reconcileDeploymentsAndStatus() -- Phase 2: Deployments, Services, TLS certs, status + +-- console.ReconcileConsoleUIDeploymentAndPlugin() + +-- postgres.ReconcilePostgresDeployment() + +-- appserver.ReconcileAppServerDeployment() + +-- checkDeploymentStatus() per deployment -> build newStatus + +-- UpdateStatusCondition() +``` + +Phase 1 uses continue-on-error (reconciles all resources even if some fail). +Phase 2 uses fail-fast per step but collects status for all steps. + +### Watcher-Triggered Restart Flow +``` +External secret/configmap changes + -> Watches() with custom predicate (shouldWatchSecret/shouldWatchConfigMap) + -> SecretUpdateHandler.Update() / ConfigMapUpdateHandler.Update() + -> Compare old vs new Data (DeepEqual) + -> If changed: SecretWatcherFilter() / ConfigMapWatcherFilter() + -> Match against SystemResources list (by name+namespace) + -> OR match against WatcherAnnotationKey annotation + -> Resolve "ACTIVE_BACKEND" to appserver deployment name + -> Call RestartAppServer() / RestartPostgres() / RestartConsoleUI() + -> Set force-reload annotation with current timestamp +``` + +## Key Abstractions + +### Image Management +Default images are stored in a `defaultImages` map in `cmd/main.go` keyed by logical name (e.g., `"lightspeed-service"`, `"postgres-image"`, `"console-plugin"`). Default values come from `internal/relatedimages/` which reads `related_images.json` at build time. Command-line flags override individual images. The map is passed to the reconciler via `OLSConfigReconcilerOptions` as individual named fields (e.g., `LightspeedServiceImage`, `ConsoleUIImage`). + +### WatcherConfig +Declarative configuration for external resource watching. 
Contains: +- `Secrets.SystemResources`: Fixed list of system secrets with affected deployment names (telemetry pull secret, console TLS cert, postgres TLS cert) +- `ConfigMaps.SystemResources`: Fixed list of system configmaps (kube-root-ca.crt, service-ca bundle) +- `AnnotatedSecretMapping`: Dynamic map populated from CR spec at runtime (maps secret name to deployment names) +- `AnnotatedConfigMapMapping`: Dynamic map populated from CR spec at runtime (maps configmap name to deployment names) +The special deployment name `"ACTIVE_BACKEND"` resolves to the AppServer deployment name (`lightspeed-app-server`). + +### Component Package Pattern +Each component (appserver, postgres, console) follows the same package structure: +- `reconciler.go`: Phase 1 (resources) and Phase 2 (deployment) entry points +- `deployment.go`: Deployment spec generation and update detection +- `assets.go` and/or `config.go`: Resource and config generation +The packages receive `reconciler.Reconciler` interface, never import the controller package. + +### Reconciler Interface (`internal/controller/reconciler/interface.go`) +Embeds `client.Client` and adds getter methods for: +- `GetScheme()`, `GetLogger()`, `GetNamespace()` +- Image getters: `GetAppServerImage()`, `GetPostgresImage()`, `GetConsoleUIImage()`, `GetOpenShiftMCPServerImage()`, `GetDataverseExporterImage()` +- Version getters: `GetOpenShiftMajor()`, `GetOpenshiftMinor()` +- Config getters: `IsPrometheusAvailable()`, `GetWatcherConfig()` + +### Finalizer Pattern +The OLSConfig CR uses finalizer `ols.openshift.io/finalizer` (defined in `utils.OLSConfigFinalizer`). On deletion: +1. Remove Console UI (deactivate plugin, delete ConsolePlugin CR) +2. List all owned resources via owner references +3. Explicitly delete owned resources +4. Wait up to 3 minutes for deletion (poll every 5 seconds) +5. 
Remove finalizer (proceeds even if cleanup times out) + +## Integration Points + +| Component | External Dependency | Mechanism | +|---|---|---| +| Manager cache | `openshift-config` namespace | Multi-namespace cache config for telemetry pull secret | +| Console image selection | OpenShift version | API call to `clusterversions.config.openshift.io` | +| Metrics TLS | `openshift-monitoring/metrics-client-ca` | ConfigMap read at startup | +| TLS profile | OLSConfig CR or API server | CR field or `apiservers.config.openshift.io` | +| Prometheus resources | Prometheus Operator CRDs | CRD existence check at startup; skips if unavailable | +| External secret watching | User-provided LLM secrets, MCP header secrets | Annotation-based (`watchers.openshift.io/watch`) | +| External configmap watching | Additional CA, proxy CA configmaps | Annotation-based (`watchers.openshift.io/watch`) | + +## Testing + +### Unit Tests + +Unit tests are co-located with source files (`*_test.go`). They use envtest (a local Kubernetes API server) with Ginkgo v2/Gomega. `make test` is required instead of `go test` because the Makefile handles envtest binary download, CRD installation, and build flags. + +### E2E Tests + +E2E tests live in `test/e2e/` and run against a real OpenShift cluster with the operator deployed. + +**Framework:** Ginkgo v2 with Gomega. All suites use `Ordered` for serial execution. Tests prone to transient failures use `FlakeAttempts(5)`. 
+ +**Suite setup** (`suite_test.go` `BeforeSuite`): +- Registers OLSConfig API, creates Kubernetes client +- Waits for operator deployment to be ready +- Creates LLM provider credential secrets (from `LLM_TOKEN` env var) +- `AfterSuite` runs `oc adm must-gather` for diagnostics and cleans up secrets + +**Test suites by area:** + +| File | Area | What it validates | +|---|---|---| +| `reconciliation_test.go` | Reconciliation | Deployment creation, config changes (log level, model, secrets) trigger updates, CA certificate volume mounting | +| `autocorrection_test.go` | Auto-correction | Operator restores manually modified deployments, services, ConsolePlugin CRs, ConfigMaps | +| `tls_test.go` | TLS & RBAC | Service TLS activation, HTTPS endpoints, authorized vs unauthorized access (metrics, query API) | +| `proxy_test.go` | Proxy | Queries succeed through squid proxy with TLS | +| `database_test.go` | Database persistence | Conversation records survive postgres pod restart via PVC | +| `postgres_restart_test.go` | Postgres recovery | Operator restores postgres after scale-to-zero, queries resume | +| `metrics_test.go` | Prometheus metrics | Operator metrics scraped by Prometheus, reconcile metrics available | +| `byok_test.go` | BYOK RAG | Custom RAG image used, ByokRAGOnly prevents OCP docs fallback, image update propagation | +| `byok_auth_test.go` | BYOK auth | Authenticated registry access with pull secrets | +| `all_features_test.go` | All features combined | 2 replicas, multiple providers, quotas, MCP servers, tool filtering, proxy, BYOK, data collector -- all enabled simultaneously | +| `upgrade_test.go` | Operator upgrade | CR persists and queries continue after operator bundle upgrade | +| `rapidast_test.go` | Security scanning | Route creation for OWASP ZAP / RapiDAST scanning | + +**Test pattern:** Each suite creates its own OLSConfig CR in `BeforeAll`, runs ordered tests, then calls `mustGather()` and `DeleteAndWait()` in `AfterAll`. 
Port forwarding provides local HTTPS access to in-cluster services. + +**Supporting files:** + +| File | Purpose | +|---|---| +| `constants.go` | Namespace, deployment names, ports, LLM env var names, test CA certificate | +| `assets.go` | OLSConfig CR generation helpers (`generateBaseOLSConfig()`, `generateAllFeaturesOLSConfig()`) | +| `client.go` | Kubernetes client wrapper with wait/poll helpers, port forwarding, image registry operations, storage class management | +| `utils.go` | `OLSTestEnvironment` setup/teardown, HTTPS query helpers, must-gather, route creation | +| `http_client.go` | HTTPS client with custom CA, polling GET/POST helpers | +| `prometheus_client.go` | Prometheus query wrapper via thanos-querier route | + +**How to run:** + +| Command | Scope | Timeout | +|---|---|---| +| `make test-e2e` | Standard tests (excludes AllFeatures, Upgrade, Rapidast) | 2h | +| `make test-e2e-local` | Local tests (excludes Database-Persistency, Rapidast) | 2h | +| `make test-e2e-all-features` | Comprehensive all-features test | 3h | +| `make test-upgrade` | Upgrade scenario only (requires `BUNDLE_IMAGE`) | 2h | + +**Required environment variables:** + +| Variable | Required | Description | +|---|---|---| +| `KUBECONFIG` | Yes | Path to cluster kubeconfig | +| `LLM_TOKEN` | Yes | API token for LLM provider | +| `LLM_PROVIDER` | No | Provider name (default: `openai`) | +| `LLM_MODEL` | No | Model name (default: `gpt-4o-mini`) | +| `BUNDLE_IMAGE` | For upgrade | Operator bundle image for upgrade test | +| `CONDITION_TIMEOUT` | No | Custom timeout in seconds for condition checks | +| `ARTIFACT_DIR` | No | Directory for must-gather diagnostics output | + +## Implementation Notes + +- The operator uses kubebuilder v3 markers for CRD generation and RBAC. +- The `cmd/check-isa-level/` package is a build-time utility for AMD64 ISA level checking. +- All generated files (deepcopy, CRD YAML) should be regenerated after API type changes using `make generate manifests`. 
+- The OLSConfig CRD is cluster-scoped and validated to require `.metadata.name == "cluster"`. +- `SetupWithManager()` registers `Owns()` watches for: Deployment, ServiceAccount, ClusterRole, ClusterRoleBinding, Service, ConfigMap, Secret, PersistentVolumeClaim, ConsolePlugin, ServiceMonitor, PrometheusRule, ImageStream. +- Controller-runtime handles retry with exponential backoff; the operator does not use periodic reconciliation. +- `LOCAL_DEV_MODE=true` env var skips ServiceMonitor creation for local development with `make run-local`. diff --git a/.ai/spec/how/reconciliation.md b/.ai/spec/how/reconciliation.md new file mode 100644 index 000000000..088ce7dec --- /dev/null +++ b/.ai/spec/how/reconciliation.md @@ -0,0 +1,81 @@ +# Reconciliation Architecture + +## Module Map + +| File | Key Symbols | Responsibility | +|---|---|---| +| `internal/controller/olsconfig_controller.go` | `OLSConfigReconciler`, `Reconcile()`, `SetupWithManager()` | Main reconciler, orchestration, watcher setup | +| `internal/controller/olsconfig_helpers.go` | `UpdateStatusCondition()`, `checkDeploymentStatus()`, `annotateExternalResources()` | Status management, diagnostics, resource annotation | +| `internal/controller/operator_assets.go` | `ReconcileServiceMonitorForOperator()`, `ReconcileNetworkPolicyForOperator()` | Operator-level resources | +| `internal/controller/reconciler/interface.go` | `Reconciler` interface | Dependency injection for component packages | + +## Data Flow + +Main reconciliation loop: +``` +Reconcile(ctx, req) + -> getAndValidateCR() # Fetch CR, validate name == "cluster" + -> handleFinalizer() # Add/remove finalizer, run cleanup + -> reconcileOperatorResources() # ServiceMonitor, NetworkPolicy (operator-level) + -> annotateExternalResources() # Validate secrets, annotate for watching + -> reconcileIndependentResources() # Phase 1: console, postgres, backend resources + | |-- console.ReconcileConsoleUIResources() + | |-- postgres.ReconcilePostgresResources() + | 
+-- appserver.ReconcileAppServerResources() + -> reconcileDeploymentsAndStatus() # Phase 2: deployments + status update + |-- console.ReconcileConsoleUIDeploymentAndPlugin() + |-- postgres.ReconcilePostgresDeployment() + |-- appserver.ReconcileAppServerDeployment() + |-- checkDeploymentStatus() for each # Collect diagnostics + +-- UpdateStatusCondition() # Single status update +``` + +## Key Abstractions + +### Reconciler Interface +The `reconciler.Reconciler` interface breaks the circular dependency between the main controller and component packages. Component packages (appserver, postgres, console) receive this interface instead of importing the controller package directly. It embeds `client.Client` and adds getter methods for images, namespace, and OpenShift version. + +### ReconcileSteps Pattern +Both phases use a slice of `ReconcileSteps` structs, each containing a Name, reconcile function, and (for Phase 2) a ConditionType and Deployment name. Phase 1 iterates with continue-on-error; Phase 2 iterates but tracks all conditions and diagnostics. + +### Resource Ownership +Two ownership models: +1. **Owned resources**: Controller-runtime Owns() declarations. Owner references set on creation. Changes trigger reconciliation automatically. +2. **External resources**: Watches() with custom predicates. Annotation-based filtering. Secret/ConfigMap handlers compare data and trigger deployment restarts. + +### Finalizer Cleanup +The `finalizeOLSConfig()` method uses `listOwnedResources()` which queries every resource type by owner reference UID (not labels). This is more reliable than label-based cleanup. The wait loop polls with a fixed interval and timeout, using `wait.PollUntilContextTimeout`. + +### Status Update Mechanics +`UpdateStatusCondition()` uses `retry.RetryOnConflict` with `client.MergeFrom` patch. It preserves `LastTransitionTime` for conditions whose status hasn't changed. It re-fetches the CR before each update attempt to get the latest ResourceVersion. 
+ +### Deployment Health Check +`checkDeploymentStatus()` returns one of three states: +- "Ready": `DeploymentAvailable` condition is True +- "Failed": Terminal pod failures detected (CrashLoopBackOff, ImagePullBackOff, etc.) +- "Progressing": Not ready but no terminal failures + +`collectDeploymentDiagnostics()` lists pods matching the deployment's selector and inspects: +- Container statuses (Waiting with reason, Terminated with non-zero exit) +- Last termination state (for CrashLoopBackOff context) +- Init container statuses +- Pod scheduling conditions (Unschedulable) +- Pod readiness conditions +- Pod phase (Failed, Unknown) + +## Integration Points + +| Consumer | Provider | Mechanism | +|---|---|---| +| Component packages | Main controller | `reconciler.Reconciler` interface | +| Watcher handlers | Component restart functions | `watchers.SecretUpdateHandler`, `watchers.ConfigMapUpdateHandler` | +| Status updates | Kubernetes API | `retry.RetryOnConflict` with `client.MergeFrom` patch | +| Finalizer cleanup | Kubernetes API | Owner reference UID matching + explicit delete | + +## Implementation Notes + +- `SetupWithManager()` registers Owns() for 12 resource types and Watches() for Secrets and ConfigMaps with custom predicates. +- Secret watch predicates: Create events allowed for all secrets in operator namespace (handles recreated secrets); Update events filtered by watcher annotation; Delete events ignored. +- ConfigMap watch predicates: Same pattern as secrets. +- The `LOCAL_DEV_MODE` environment variable skips ServiceMonitor creation when running locally. +- Phase 1 failures update status with `ResourceReconciliation` condition type (not the component-specific types used in Phase 2). diff --git a/.ai/spec/what/README.md b/.ai/spec/what/README.md new file mode 100644 index 000000000..ce1d2f627 --- /dev/null +++ b/.ai/spec/what/README.md @@ -0,0 +1,38 @@ +# Behavioral Specifications + +Defines what the operator must do. 
Each spec contains numbered behavioral rules, configuration surface tables, constraints, and planned changes. + +## Spec Index + +| Spec | Description | +|---|---| +| `system-overview.md` | Operator role, component inventory, lifecycle, deployment model, integration points, and top-level constraints. Start here. | +| `crd-api.md` | OLSConfig CRD field-by-field specification. Field paths, types, defaults, validation rules, and status conditions. | +| `reconciliation.md` | Reconciliation behavioral rules: ordering, idempotency, error handling, status updates, finalizer semantics. | +| `app-server.md` | Application server backend behavioral rules: deployment shape, configuration generation, health checks, resource requirements. | +| `postgres.md` | PostgreSQL component behavioral rules: deployment, secret management, TLS, connection parameters, PVC lifecycle. | +| `console-ui.md` | Console UI plugin behavioral rules: ConsolePlugin CR, service proxy, OCP version-based image selection, enable/disable lifecycle. | +| `tls.md` | TLS behavioral rules: service-ca integration, custom certificate support, TLS profile inheritance, CA bundle management. | +| `security.md` | Security behavioral rules: RBAC, network policies, secret handling, security contexts, pod security standards. | +| `resource-lifecycle.md` | Resource lifecycle: owned resources (OwnerReferences, Owns()), external resources (annotation-based watching, change detection, restart mapping). | +| `observability.md` | Observability behavioral rules: ServiceMonitor, PrometheusRule, metrics endpoints, status conditions, diagnostic info. | + +## How to Use + +1. Start with `system-overview.md` for orientation. +2. Read the spec for the component you are working on. +3. Cross-reference with `reconciliation.md` for ordering and error handling rules. +4. Check `tls.md` and `security.md` for cross-cutting constraints that apply to all components. 
+ +## Relationship to how/ + +Each `what/` spec has a corresponding implementation guide in `how/` where applicable: + +| what/ | how/ | +|---|---| +| `reconciliation.md` | `how/reconciliation.md` -- implementation patterns, code locations, task registration | +| `app-server.md`, `postgres.md`, `console-ui.md` | `how/deployment-generation.md` -- how deployments/services/configmaps are generated | +| `crd-api.md` | `how/config-generation.md` -- how CRD fields map to generated configuration | +| `system-overview.md` | `how/project-structure.md` -- codebase layout, package responsibilities | + +The `what/` specs are authoritative for behavior. The `how/` specs are authoritative for implementation. When they conflict, the `what/` spec wins and the `how/` spec should be updated. diff --git a/.ai/spec/what/app-server.md b/.ai/spec/what/app-server.md new file mode 100644 index 000000000..32d749db6 --- /dev/null +++ b/.ai/spec/what/app-server.md @@ -0,0 +1,80 @@ +# App Server + +The App Server is the backend deployment for OpenShift Lightspeed. It runs the lightspeed-service Python/FastAPI application that handles LLM queries, RAG retrieval, conversation management, and tool execution. + +## Behavioral Rules + +### Deployment Composition +1. The deployment contains a primary API container and up to two optional sidecar containers. +2. The primary container (lightspeed-service-api) runs the OLS service, listening on HTTPS. +3. The data collector sidecar (lightspeed-to-dataverse-exporter) is added when data collection is enabled AND the telemetry pull secret exists in the openshift-config namespace with a cloud.openshift.com auth entry. +4. The OpenShift MCP server sidecar is added when `spec.ols.introspectionEnabled` is true. It provides Kubernetes resource access via MCP protocol. +5. A PostgreSQL wait init container always runs before the main containers to ensure database readiness. +6. 
When `spec.ols.rag` is configured, additional init containers copy RAG data from container images into a shared volume. + +### Configuration Mapping +7. The operator generates an OLS config file (olsconfig.yaml) from the CR spec. This ConfigMap is the primary interface between the operator and the service. +8. LLM provider credentials are mounted as files from their respective secrets, at a path derived from the secret name. +9. The default credential key read from each provider's secret is "apitoken", overridable by `spec.llm.providers[].credentialKey`. +10. PostgreSQL connection settings are hardcoded to point to the operator-managed PostgreSQL service within the same namespace. +11. If `spec.ols.querySystemPrompt` is set, the custom prompt is written as a second key in the config ConfigMap and referenced by file path in the config. +12. RAG reference content indexes are ordered: user-provided (BYOK) indexes first, then the OCP documentation index (unless `spec.ols.byokRAGOnly` is true). +13. The OCP documentation RAG index path is derived from the detected OpenShift cluster version. + +### MCP Server Integration +14. When `spec.ols.introspectionEnabled` is true, an "openshift" MCP server entry is added to the config pointing to localhost on the sidecar port. +15. When the MCPServer feature gate is enabled, user-defined servers from `spec.mcpServers` are added to the config. +16. MCP header values of type "secret" are mounted as files from the referenced secret. Types "kubernetes" and "client" use placeholder strings that the service resolves at runtime. + +### Service and Networking +17. The service exposes HTTPS on the configured port. +18. The network policy allows ingress from: Prometheus (openshift-monitoring), OpenShift Console (openshift-console), and ingress controllers. +19. Egress is unrestricted (empty egress rules). + +### RBAC +20. The service account is granted SubjectAccessReview and TokenReview permissions for user authorization. +21. 
The service account can read the cluster version and the telemetry pull secret. + +### Change Detection +22. Deployment updates are triggered when: the deployment spec changes, the config ConfigMap resource version changes, the MCP config ConfigMap resource version changes, or the proxy CA certificate hash changes. +23. When any of these change, the operator forces a rolling restart by updating a pod template annotation with the current timestamp. + +### Observability +24. The operator creates a ServiceMonitor for Prometheus scraping of the /metrics endpoint. +25. The operator creates a PrometheusRule with recording rules aggregating query call counts by status code class (2xx, 4xx, 5xx) and provider/model configuration. + +## Configuration Surface + +| Field path | Description | +|---|---| +| `spec.ols.deployment.api.replicas` | Number of API server replicas | +| `spec.ols.deployment.api.resources` | API container resource requirements | +| `spec.ols.deployment.api.tolerations` | Pod tolerations | +| `spec.ols.deployment.api.nodeSelector` | Node selector constraints | +| `spec.ols.deployment.api.affinity` | Pod affinity rules | +| `spec.ols.deployment.api.topologySpreadConstraints` | Topology spread constraints | +| `spec.ols.deployment.dataCollector.resources` | Data collector container resources | +| `spec.ols.deployment.mcpServer.resources` | MCP server container resources | +| `spec.ols.defaultModel` | Default LLM model name | +| `spec.ols.defaultProvider` | Default LLM provider name | +| `spec.ols.logLevel` | Logging level for all service components | +| `spec.ols.maxIterations` | Maximum agent execution iterations | +| `spec.ols.querySystemPrompt` | Custom system prompt for LLM queries | +| `spec.ols.byokRAGOnly` | Skip OCP documentation RAG index | +| `spec.ols.introspectionEnabled` | Enable OpenShift MCP server sidecar | +| `spec.ols.userDataCollection.feedbackDisabled` | Disable feedback collection | +| `spec.ols.userDataCollection.transcriptsDisabled` | 
Disable transcript collection | +| `spec.ols.queryFilters` | Query text pattern replacements | +| `spec.ols.rag` | RAG database image references | +| `spec.ols.imagePullSecrets` | Pull secrets for RAG images | +| `spec.ols.quotaHandlersConfig` | Token quota limiter configuration | +| `spec.ols.toolFilteringConfig` | Tool filtering parameters (requires ToolFiltering feature gate) | +| `spec.ols.toolsApprovalConfig` | Tool execution approval settings | +| `spec.mcpServers` | External MCP server definitions (requires MCPServer feature gate) | + +## Constraints + +1. Data collection requires both: at least one of feedback/transcripts enabled, AND the telemetry pull secret present with cloud.openshift.com credentials. +2. Tool filtering requires MCP servers to be configured (either introspection or user-defined). +3. The service always connects to PostgreSQL via the internal cluster service DNS. +4. RAG init containers run in index order, copying data to subdirectories of the shared RAG volume. diff --git a/.ai/spec/what/console-ui.md b/.ai/spec/what/console-ui.md new file mode 100644 index 000000000..b0426c587 --- /dev/null +++ b/.ai/spec/what/console-ui.md @@ -0,0 +1,44 @@ +# Console UI + +The operator deploys the OpenShift Lightspeed console plugin, which integrates the Lightspeed chat interface into the OpenShift web console. + +## Behavioral Rules + +### Deployment +1. The console plugin always runs as a single replica. +2. The operator selects the console plugin image based on the OpenShift cluster version: PatternFly 5 for OCP < 4.19, PatternFly 6 for OCP >= 4.19. +3. The container serves static content via nginx, listening on HTTPS. +4. TLS certificates are generated by the OpenShift service-ca operator. +5. The nginx configuration is generated by the operator as a ConfigMap. + +### Console Integration +6. The operator creates a ConsolePlugin CR that registers the plugin with the OpenShift console. +7. 
The ConsolePlugin CR configures a proxy backend that routes API requests from the console to the Lightspeed backend service. +8. The proxy alias is "ols" and uses UserToken authorization (the logged-in user's token is forwarded). +9. If custom TLS is configured (`spec.ols.tlsConfig`), the ConsolePlugin CR includes the CA certificate for proxy trust. +10. The operator activates the plugin by adding its name to the Console CR's spec.plugins array. +11. Activation uses retry-on-conflict to handle concurrent Console CR modifications. + +### Cleanup +12. On CR deletion, the operator deactivates the plugin (removes from Console CR plugins array), then deletes the ConsolePlugin CR. +13. Console CR modifications during cleanup are handled gracefully (NotFound errors are ignored for non-OpenShift test environments). + +### Networking +14. The network policy allows ingress only from the OpenShift Console pods (app=console in openshift-console namespace). + +## Configuration Surface + +| Field path | Description | +|---|---| +| `spec.ols.deployment.console.replicas` | Ignored; always 1 | +| `spec.ols.deployment.console.resources` | Console container resource requirements | +| `spec.ols.deployment.console.tolerations` | Pod tolerations | +| `spec.ols.deployment.console.nodeSelector` | Node selector constraints | +| `spec.ols.tlsConfig.keyCertSecretRef` | Custom TLS cert (affects ConsolePlugin proxy CA trust) | + +## Constraints + +1. Replicas are always 1 regardless of configuration. +2. The ConsolePlugin CR is cluster-scoped and cannot have namespace-scoped owner references in the standard way. +3. Image selection happens at operator startup and is based on the detected OCP version, not runtime configuration. +4. The plugin name in the Console CR must exactly match the ConsolePlugin CR name for activation. 
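The retry-on-conflict activation in rule 11 can be sketched without cluster dependencies. `errConflict` stands in for a Kubernetes 409 error (the operator uses `retry.RetryOnConflict` from client-go), and the plugin name shown is illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes 409 Conflict error.
var errConflict = errors.New("conflict")

// retryOnConflict re-runs fn while it returns a conflict error,
// up to maxRetries attempts.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// addPlugin appends name to the plugins slice if absent
// (the update made to the Console CR's spec.plugins array).
func addPlugin(plugins []string, name string) []string {
	for _, p := range plugins {
		if p == name {
			return plugins
		}
	}
	return append(plugins, name)
}

func main() {
	console := []string{"monitoring-plugin"}
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			// Simulate another controller modifying the Console CR concurrently.
			return errConflict
		}
		console = addPlugin(console, "lightspeed-plugin") // name illustrative
		return nil
	})
	fmt.Println(err, console)
}
```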
diff --git a/.ai/spec/what/crd-api.md b/.ai/spec/what/crd-api.md new file mode 100644 index 000000000..f44e8e36b --- /dev/null +++ b/.ai/spec/what/crd-api.md @@ -0,0 +1,430 @@ +# CRD API + +Specification of the OLSConfig Custom Resource Definition. Source of truth: `api/v1alpha1/olsconfig_types.go`. + +## Behavioral Rules + +### Resource Identity + +1. API group: `ols.openshift.io`, version: `v1alpha1`, kind: `OLSConfig`. +2. Cluster-scoped (not namespaced). Marker: `+kubebuilder:resource:scope=Cluster`. +3. `.metadata.name` must be `"cluster"`. Enforced by XValidation rule on the OLSConfig type: `self.metadata.name == 'cluster'`. +4. Has a status subresource (`+kubebuilder:subresource:status`). +5. Finalizer: `ols.openshift.io/finalizer` (constant `OLSConfigFinalizer` in `internal/controller/utils/constants.go`). +6. `spec` is required on the OLSConfig object. + +### Top-Level Spec Structure + +Field path | JSON key | Go type | Required | Description +---|---|---|---|--- +`spec.llm` | `llm` | `LLMSpec` | Yes | LLM provider configuration +`spec.ols` | `ols` | `OLSSpec` | Yes | OLS service settings +`spec.olsDataCollector` | `olsDataCollector` | `OLSDataCollectorSpec` | No | Data collector settings (logLevel only) +`spec.mcpServers` | `mcpServers` | `[]MCPServerConfig` | No | External MCP server configurations. MaxItems=20 +`spec.featureGates` | `featureGates` | `[]FeatureGate` | No | Feature gates. Enum values: `MCPServer`, `ToolFiltering` + +### LLM Provider Configuration (spec.llm) + +7. `spec.llm.providers` is required. Type: `[]ProviderSpec`. MaxItems=10. + +#### ProviderSpec Fields + +Field path (relative to each provider) | JSON key | Go type | Required | Description +---|---|---|---|--- +`name` | `name` | `string` | Yes | Provider name +`url` | `url` | `string` | No | Provider API URL. 
Pattern: `^https?://.*$` +`credentialsSecretRef` | `credentialsSecretRef` | `corev1.LocalObjectReference` | Yes | Secret containing API credentials +`models` | `models` | `[]ModelSpec` | Yes | Provider models. MaxItems=50 +`type` | `type` | `string` | Yes | Provider type enum: `azure_openai`, `bam`, `openai`, `watsonx`, `rhoai_vllm`, `rhelai_vllm`, `fake_provider` +`deploymentName` | `deploymentName` | `string` | No | Azure OpenAI deployment name +`apiVersion` | `apiVersion` | `string` | No | Azure OpenAI API version +`projectID` | `projectID` | `string` | No | Watsonx project ID +`fakeProviderMCPToolCall` | `fakeProviderMCPToolCall` | `bool` | No | Fake provider MCP tool call flag +`tlsSecurityProfile` | `tlsSecurityProfile` | `*configv1.TLSSecurityProfile` | No | TLS profile for provider connection +`credentialKey` | `credentialKey` | `string` | No | Key name within `credentialsSecretRef` to read credential from. Defaults to `"apitoken"` if unset + +#### Provider XValidation Rules + +8. Azure OpenAI requires `deploymentName`: when `type == "azure_openai"`, `deploymentName` must not be empty. +9. Watsonx requires `projectID`: when `type == "watsonx"`, `projectID` must not be empty. +10. `credentialKey` must not be empty or whitespace: if set, it must not match `^[ \t\n\r\v\f]*$`. + +#### ModelSpec Fields + +Field path (relative to each model) | JSON key | Go type | Required | Description +---|---|---|---|--- +`name` | `name` | `string` | Yes | Model name +`url` | `url` | `string` | No | Model API URL. Pattern: `^https?://.*$` +`contextWindowSize` | `contextWindowSize` | `uint` | No | Context window in tokens. 
Minimum=1024 +`parameters` | `parameters` | `ModelParametersSpec` | No | Model parameters + +#### ModelParametersSpec Fields + +Field path (relative to parameters) | JSON key | Go type | Required | Default | Validation +---|---|---|---|---|--- +`maxTokensForResponse` | `maxTokensForResponse` | `int` | No | (unset; application default is 2048) | None +`toolBudgetRatio` | `toolBudgetRatio` | `float64` | No | `0.25` | Minimum=0.1, Maximum=0.5 + +### OLS Configuration (spec.ols) + +#### Core Fields + +14. `spec.ols.defaultModel` -- `string`, required. The default model name for usage. +15. `spec.ols.defaultProvider` -- `string`, required. The default provider name for usage. +16. `spec.ols.logLevel` -- `LogLevel` enum, optional. Values: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Default: `INFO`. + +#### Conversation Cache (spec.ols.conversationCache) + +17. `spec.ols.conversationCache.type` -- `CacheType` enum. Only valid value: `postgres`. Default: `postgres`. +18. `spec.ols.conversationCache.postgres.sharedBuffers` -- `string`, XIntOrString. Default: `"256MB"`. +19. `spec.ols.conversationCache.postgres.maxConnections` -- `int`. Default: `2000`. Minimum=1, Maximum=262143. + +#### Deployment Configuration (spec.ols.deployment) + +The deployment config uses two struct types: + +- **`Config`**: has `replicas`, `resources`, `tolerations`, `nodeSelector`, `affinity`, `topologySpreadConstraints`. +- **`ContainerConfig`**: has `resources` only. + +Field path (relative to `spec.ols.deployment`) | JSON key | Go type | Notes +---|---|---|--- +`api` | `api` | `Config` | API container. Replicas configurable (default 1, min 0) +`dataCollector` | `dataCollector` | `ContainerConfig` | Data collector container. Resources only +`mcpServer` | `mcpServer` | `ContainerConfig` | MCP server container. Resources only +`console` | `console` | `Config` | Console container. Has replicas field but operator forces 1 +`database` | `database` | `Config` | Database container. 
Has replicas field but operator forces 1 + +20. Replicas are only user-configurable for the API container (`spec.ols.deployment.api.replicas`). For console and database, the operator always overrides replicas to 1 regardless of spec value. + +##### Config Fields + +Field path (relative to Config) | JSON key | Go type | Default | Validation +---|---|---|---|--- +`replicas` | `replicas` | `*int32` | `1` | Minimum=0 +`resources` | `resources` | `*corev1.ResourceRequirements` | (none) | Standard k8s resource requirements +`tolerations` | `tolerations` | `[]corev1.Toleration` | (none) | Standard k8s tolerations +`nodeSelector` | `nodeSelector` | `map[string]string` | (none) | Key-value label selector +`affinity` | `affinity` | `*corev1.Affinity` | (none) | Standard k8s affinity rules +`topologySpreadConstraints` | `topologySpreadConstraints` | `[]corev1.TopologySpreadConstraint` | (none) | Standard k8s topology spread + +##### ContainerConfig Fields + +Field path (relative to ContainerConfig) | JSON key | Go type +---|---|--- +`resources` | `resources` | `*corev1.ResourceRequirements` + +#### Query Filters (spec.ols.queryFilters) + +21. Type: `[]QueryFiltersSpec`. Each entry has: + +Field | JSON key | Go type | Required +---|---|---|--- +`name` | `name` | `string` | No +`pattern` | `pattern` | `string` | No +`replaceWith` | `replaceWith` | `string` | No + +#### User Data Collection (spec.ols.userDataCollection) + +22. `spec.ols.userDataCollection.feedbackDisabled` -- `bool`, optional. Disables user feedback collection. +23. `spec.ols.userDataCollection.transcriptsDisabled` -- `bool`, optional. Disables transcript collection. + +#### TLS Configuration (spec.ols.tlsConfig) + +24. `spec.ols.tlsConfig` -- `*TLSConfig`, optional. Pointer type (nil when absent). +25. `spec.ols.tlsConfig.keyCertSecretRef` -- `corev1.LocalObjectReference`. Secret must contain keys: `tls.crt` (required), `tls.key` (required), `ca.crt` (optional, for console proxy trust). 
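The key requirements in rule 25 can be expressed as a minimal check. This is a hypothetical helper for illustration, not the operator's actual validation code:

```go
package main

import "fmt"

// validateTLSSecret applies rule 25: tls.crt and tls.key are required,
// ca.crt is optional (console proxy trust). Illustrative helper only.
func validateTLSSecret(data map[string][]byte) error {
	for _, key := range []string{"tls.crt", "tls.key"} {
		if len(data[key]) == 0 {
			return fmt.Errorf("keyCertSecretRef secret missing required key %q", key)
		}
	}
	return nil
}

func main() {
	complete := map[string][]byte{"tls.crt": []byte("..."), "tls.key": []byte("...")}
	missingKey := map[string][]byte{"tls.crt": []byte("...")}
	fmt.Println(validateTLSSecret(complete), validateTLSSecret(missingKey))
}
```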
+ +#### Additional CA (spec.ols.additionalCAConfigMapRef) + +26. `spec.ols.additionalCAConfigMapRef` -- `*corev1.LocalObjectReference`, optional. ConfigMap with additional CA certificates for LLM provider TLS. + +#### TLS Security Profile (spec.ols.tlsSecurityProfile) + +27. `spec.ols.tlsSecurityProfile` -- `*configv1.TLSSecurityProfile`, optional. OpenShift TLS security profile for API endpoints. + +#### Introspection (spec.ols.introspectionEnabled) + +28. `spec.ols.introspectionEnabled` -- `bool`, optional. Enables introspection features. + +#### MCP Kubernetes Server (spec.ols.mcpKubeServerConfig) + +29. `spec.ols.mcpKubeServerConfig.timeout` -- `int`. Default: `60`. Minimum=5. Timeout in seconds for the built-in MCP Kubernetes server. + +#### Proxy Configuration (spec.ols.proxyConfig) + +30. `spec.ols.proxyConfig.proxyURL` -- `string`, optional. Pattern: `^https?://.*$`. If unset, cluster-wide proxy is used via `https_proxy` env var. +31. `spec.ols.proxyConfig.proxyCACertificate` -- `*ProxyCACertConfigMapRef`, optional. Struct type `atomic`. + +`ProxyCACertConfigMapRef` fields: +- Inline `corev1.LocalObjectReference` (provides `name` field for the ConfigMap name) +- `key` -- `string`. Default: `"proxy-ca.crt"`. Key within the ConfigMap holding the proxy CA certificate. + +#### RAG Configuration (spec.ols.rag) + +32. Type: `[]RAGSpec`, optional. + +Field | JSON key | Go type | Required | Default +---|---|---|---|--- +`image` | `image` | `string` | Yes | (none) +`indexPath` | `indexPath` | `string` | No | `"/rag/vector_db"` +`indexID` | `indexID` | `string` | No | `""` + +#### Quota Handlers (spec.ols.quotaHandlersConfig) + +33. `spec.ols.quotaHandlersConfig` -- `*QuotaHandlersConfig`, optional. +34. `spec.ols.quotaHandlersConfig.limitersConfig` -- `[]LimiterConfig`. +35. `spec.ols.quotaHandlersConfig.enableTokenHistory` -- `bool`, optional. 
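The limiter `period` string is constrained by the CRD pattern given in the LimiterConfig table. A quick check of representative values against that pattern (transcribed from this spec -- the authoritative copy is the kubebuilder marker in the Go types):

```go
package main

import (
	"fmt"
	"regexp"
)

// periodPattern is transcribed from the documented CRD validation
// for LimiterConfig.period.
var periodPattern = regexp.MustCompile(`^(1\s+(second|minute|hour|day|month|year|s|min|h|d|m|y)|([2-9][0-9]*|[1-9][0-9]{2,})\s+(seconds|minutes|hours|days|months|years|s|min|h|d|m|y))$`)

func main() {
	// "1 hours" fails (quantity 1 requires singular);
	// "5 day" fails (quantity >= 2 requires plural or abbreviation).
	for _, p := range []string{"1 hour", "2 hours", "30 min", "1 hours", "5 day"} {
		fmt.Printf("%q -> %v\n", p, periodPattern.MatchString(p))
	}
}
```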
+ +`LimiterConfig` fields: + +Field | JSON key | Go type | Required | Validation +---|---|---|---|--- +`name` | `name` | `string` | Yes (by convention) | None +`type` | `type` | `string` | Yes (by convention) | Enum: `cluster_limiter`, `user_limiter` +`initialQuota` | `initialQuota` | `int` | Yes (by convention) | Minimum=0 +`quotaIncrease` | `quotaIncrease` | `int` | Yes (by convention) | Minimum=0 +`period` | `period` | `string` | Yes (by convention) | Pattern: `^(1\s+(second\|minute\|hour\|day\|month\|year\|s\|min\|h\|d\|m\|y)\|([2-9][0-9]*\|[1-9][0-9]{2,})\s+(seconds\|minutes\|hours\|days\|months\|years\|s\|min\|h\|d\|m\|y))$` + +36. Period pattern explanation: quantity 1 requires singular unit name or abbreviation; quantities >= 2 require plural unit name or abbreviation. Abbreviations (`s`, `min`, `h`, `d`, `m`, `y`) are accepted with any quantity. + +#### Storage (spec.ols.storage) + +37. `spec.ols.storage.size` -- `resource.Quantity`, optional. Size of the requested persistent volume. +38. `spec.ols.storage.class` -- `string`, optional. Storage class name. + +#### Boolean/String Fields + +39. `spec.ols.byokRAGOnly` -- `bool`, optional. When true, only BYOK RAG sources are used; built-in OpenShift documentation RAG is ignored. +40. `spec.ols.querySystemPrompt` -- `string`, optional. Custom system prompt for LLM queries. If unset, the default OpenShift Lightspeed prompt is used. +41. `spec.ols.maxIterations` -- `int`. Default: `5`. Minimum=1. Maximum number of iterations for agent execution. +42. `spec.ols.imagePullSecrets` -- `[]corev1.LocalObjectReference`, optional. Pull secrets for BYOK RAG images. + +#### Tool Filtering (spec.ols.toolFilteringConfig) + +43. `spec.ols.toolFilteringConfig` -- `*ToolFilteringConfig`, optional. Presence enables tool filtering; absence means all tools are used. + +Field | JSON key | Go type | Default | Validation +---|---|---|---|--- +`alpha` | `alpha` | `float64` | `0.8` | XValidation: must be >= 0.0 and <= 1.0. 
Weight for dense vs sparse retrieval (1.0 = full dense, 0.0 = full sparse) +`topK` | `topK` | `int` | `10` | Minimum=1, Maximum=50. Number of tools to retrieve +`threshold` | `threshold` | `float64` | `0.01` | XValidation: must be >= 0.0 and <= 1.0. Minimum similarity threshold + +44. Tool filtering requires the `ToolFiltering` feature gate to be enabled in `spec.featureGates`. + +#### Tools Approval (spec.ols.toolsApprovalConfig) + +45. `spec.ols.toolsApprovalConfig` -- `*ToolsApprovalConfig`, optional. + +Field | JSON key | Go type | Default | Validation +---|---|---|---|--- +`approvalType` | `approvalType` | `ApprovalType` | `tool_annotations` | Enum: `never`, `always`, `tool_annotations` +`approvalTimeout` | `approvalTimeout` | `int` | `600` | Minimum=1. Timeout in seconds for user approval + +46. `never`: all tools execute without approval. `always`: all tool calls require approval. `tool_annotations`: approval decision is per-tool based on annotations. + +### Data Collector Configuration (spec.olsDataCollector) + +47. `spec.olsDataCollector.logLevel` -- `LogLevel` enum. Default: `INFO`. Same enum as `spec.ols.logLevel`. + +### MCP Server Configuration (spec.mcpServers) + +48. Array of `MCPServerConfig`. MaxItems=20. 
+ +Field | JSON key | Go type | Required | Default | Validation +---|---|---|---|---|--- +`name` | `name` | `string` | Yes | (none) | None +`url` | `url` | `string` | Yes | (none) | Pattern: `^https?://.*$` +`timeout` | `timeout` | `int` | No | `5` | None (no min/max markers) +`headers` | `headers` | `[]MCPHeader` | No | (none) | MaxItems=20 + +#### MCPHeader Fields + +Field | JSON key | Go type | Required | Validation +---|---|---|---|--- +`name` | `name` | `string` | Yes | MinLength=1, Pattern: `^[A-Za-z0-9-]+$` +`valueFrom` | `valueFrom` | `MCPHeaderValueSource` | Yes | Discriminated union (see below) + +#### MCPHeaderValueSource Fields (discriminated union) + +Field | JSON key | Go type | Required | Validation +---|---|---|---|--- +`type` | `type` | `MCPHeaderSourceType` | Yes | Enum: `secret`, `kubernetes`, `client`. Union discriminator +`secretRef` | `secretRef` | `*corev1.LocalObjectReference` | Conditional | Required with non-empty `name` when `type == "secret"`. Must not be set when `type != "secret"` + +49. XValidation: when `type == "secret"`, `secretRef` must be present with a non-empty `name`. +50. XValidation: when `type != "secret"` (i.e., `kubernetes` or `client`), `secretRef` must not be set. + +### Status (status) + +#### Conditions (status.conditions) + +51. Type: `[]metav1.Condition`. Populated after first reconciliation. + +Condition types used by the operator: +- `ApiReady` -- API server deployment health +- `CacheReady` -- PostgreSQL cache deployment health +- `ConsolePluginReady` -- Console UI plugin deployment health +- `ResourceReconciliation` -- Overall resource reconciliation status (set directly, not deployment-based) + +#### Overall Status (status.overallStatus) + +52. `status.overallStatus` -- `OverallStatus` enum. Values: `Ready`, `NotReady`. Aggregation of all component conditions. `Ready` only when all components are healthy. + +#### Diagnostic Info (status.diagnosticInfo) + +53. Type: `[]PodDiagnostic`, optional. 
Auto-populated during deployment failures, cleared on recovery. + +`PodDiagnostic` fields: + +Field | JSON key | Go type | Required | Description +---|---|---|---|--- +`failedComponent` | `failedComponent` | `string` | Yes | Matches condition type (e.g., `"ApiReady"`, `"CacheReady"`) +`podName` | `podName` | `string` | Yes | Name of the failing pod +`containerName` | `containerName` | `string` | No | Container within the pod (empty for pod-level issues) +`reason` | `reason` | `string` | Yes | Failure reason (e.g., `ImagePullBackOff`, `CrashLoopBackOff`, `Unschedulable`, `OOMKilled`) +`message` | `message` | `string` | Yes | Detailed error from Kubernetes +`exitCode` | `exitCode` | `*int32` | No | Exit code for terminated containers only +`type` | `type` | `DiagnosticType` | Yes | Enum: `ContainerWaiting`, `ContainerTerminated`, `PodScheduling`, `PodCondition` +`lastUpdated` | `lastUpdated` | `metav1.Time` | Yes | Timestamp of diagnostic collection + +## Configuration Surface + +Complete field reference. All paths are relative to the OLSConfig object. 
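+
+A minimal valid CR, shown as a hedged example (the provider, model, and Secret names and the `ols.openshift.io/v1alpha1` API version are illustrative assumptions, not normative values):
+
+```yaml
+# Hypothetical minimal OLSConfig; only required fields are set.
+apiVersion: ols.openshift.io/v1alpha1
+kind: OLSConfig
+metadata:
+  name: cluster                  # must be "cluster" (see Constraints)
+spec:
+  llm:
+    providers:
+      - name: openai             # illustrative provider name
+        type: openai             # must be an allowed provider type enum value
+        credentialsSecretRef:
+          name: llm-credentials  # illustrative Secret name
+        models:
+          - name: gpt-4o-mini    # illustrative model name
+  ols:
+    defaultProvider: openai
+    defaultModel: gpt-4o-mini
+```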
+ +Path | Type | Default | Required | Validation | Description +---|---|---|---|---|--- +`spec` | `OLSConfigSpec` | -- | Yes | -- | Top-level spec +`spec.llm` | `LLMSpec` | -- | Yes | -- | LLM settings +`spec.llm.providers` | `[]ProviderSpec` | -- | Yes | MaxItems=10 | LLM providers +`spec.llm.providers[].name` | `string` | -- | Yes | -- | Provider name +`spec.llm.providers[].url` | `string` | -- | No | Pattern `^https?://.*$` | Provider API URL +`spec.llm.providers[].credentialsSecretRef` | `LocalObjectReference` | -- | Yes | -- | Secret with credentials +`spec.llm.providers[].models` | `[]ModelSpec` | -- | Yes | MaxItems=50 | Models +`spec.llm.providers[].models[].name` | `string` | -- | Yes | -- | Model name +`spec.llm.providers[].models[].url` | `string` | -- | No | Pattern `^https?://.*$` | Model API URL +`spec.llm.providers[].models[].contextWindowSize` | `uint` | -- | No | Min=1024 | Context window (tokens) +`spec.llm.providers[].models[].parameters` | `ModelParametersSpec` | -- | No | -- | Model parameters +`spec.llm.providers[].models[].parameters.maxTokensForResponse` | `int` | -- | No | -- | Max response tokens +`spec.llm.providers[].models[].parameters.toolBudgetRatio` | `float64` | `0.25` | No | Min=0.1, Max=0.5 | Tool token budget ratio +`spec.llm.providers[].type` | `string` | -- | Yes | Enum (see rule 7) | Provider type +`spec.llm.providers[].deploymentName` | `string` | -- | No | XValidation (rule 8) | Azure deployment name +`spec.llm.providers[].apiVersion` | `string` | -- | No | -- | Azure API version +`spec.llm.providers[].projectID` | `string` | -- | No | XValidation (rule 9) | Watsonx project ID +`spec.llm.providers[].fakeProviderMCPToolCall` | `bool` | -- | No | -- | Fake provider MCP flag +`spec.llm.providers[].tlsSecurityProfile` | `*TLSSecurityProfile` | -- | No | -- | Provider TLS profile +`spec.llm.providers[].credentialKey` | `string` | -- | No | XValidation (rule 10) | Secret key name +`spec.ols` | `OLSSpec` | -- | Yes | -- | OLS 
settings +`spec.ols.defaultModel` | `string` | -- | Yes | -- | Default model name +`spec.ols.defaultProvider` | `string` | -- | Yes | -- | Default provider name +`spec.ols.logLevel` | `LogLevel` | `INFO` | No | Enum: DEBUG/INFO/WARNING/ERROR/CRITICAL | Log level +`spec.ols.conversationCache` | `ConversationCacheSpec` | -- | No | -- | Cache config +`spec.ols.conversationCache.type` | `CacheType` | `postgres` | No | Enum: `postgres` | Cache type +`spec.ols.conversationCache.postgres` | `PostgresSpec` | -- | No | -- | Postgres settings +`spec.ols.conversationCache.postgres.sharedBuffers` | `string` | `"256MB"` | No | XIntOrString | Shared buffers +`spec.ols.conversationCache.postgres.maxConnections` | `int` | `2000` | No | Min=1, Max=262143 | Max connections +`spec.ols.deployment` | `DeploymentConfig` | -- | No | -- | Deployment overrides +`spec.ols.deployment.api` | `Config` | -- | No | -- | API container +`spec.ols.deployment.api.replicas` | `*int32` | `1` | No | Min=0 | API replicas (user-configurable) +`spec.ols.deployment.api.resources` | `*ResourceRequirements` | -- | No | -- | API resources +`spec.ols.deployment.api.tolerations` | `[]Toleration` | -- | No | -- | API tolerations +`spec.ols.deployment.api.nodeSelector` | `map[string]string` | -- | No | -- | API node selector +`spec.ols.deployment.api.affinity` | `*Affinity` | -- | No | -- | API affinity +`spec.ols.deployment.api.topologySpreadConstraints` | `[]TopologySpreadConstraint` | -- | No | -- | API topology spread +`spec.ols.deployment.dataCollector` | `ContainerConfig` | -- | No | -- | Data collector container +`spec.ols.deployment.dataCollector.resources` | `*ResourceRequirements` | -- | No | -- | Data collector resources +`spec.ols.deployment.mcpServer` | `ContainerConfig` | -- | No | -- | MCP server container +`spec.ols.deployment.mcpServer.resources` | `*ResourceRequirements` | -- | No | -- | MCP server resources +`spec.ols.deployment.console` | `Config` | -- | No | -- | Console container 
+`spec.ols.deployment.console.replicas` | `*int32` | `1` | No | Min=0 | Console replicas (operator forces 1) +`spec.ols.deployment.console.resources` | `*ResourceRequirements` | -- | No | -- | Console resources +`spec.ols.deployment.console.tolerations` | `[]Toleration` | -- | No | -- | Console tolerations +`spec.ols.deployment.console.nodeSelector` | `map[string]string` | -- | No | -- | Console node selector +`spec.ols.deployment.console.affinity` | `*Affinity` | -- | No | -- | Console affinity +`spec.ols.deployment.console.topologySpreadConstraints` | `[]TopologySpreadConstraint` | -- | No | -- | Console topology spread +`spec.ols.deployment.database` | `Config` | -- | No | -- | Database container +`spec.ols.deployment.database.replicas` | `*int32` | `1` | No | Min=0 | Database replicas (operator forces 1) +`spec.ols.deployment.database.resources` | `*ResourceRequirements` | -- | No | -- | Database resources +`spec.ols.deployment.database.tolerations` | `[]Toleration` | -- | No | -- | Database tolerations +`spec.ols.deployment.database.nodeSelector` | `map[string]string` | -- | No | -- | Database node selector +`spec.ols.deployment.database.affinity` | `*Affinity` | -- | No | -- | Database affinity +`spec.ols.deployment.database.topologySpreadConstraints` | `[]TopologySpreadConstraint` | -- | No | -- | Database topology spread +`spec.ols.queryFilters` | `[]QueryFiltersSpec` | -- | No | -- | Query filters +`spec.ols.queryFilters[].name` | `string` | -- | No | -- | Filter name +`spec.ols.queryFilters[].pattern` | `string` | -- | No | -- | Regex pattern +`spec.ols.queryFilters[].replaceWith` | `string` | -- | No | -- | Replacement text +`spec.ols.userDataCollection` | `UserDataCollectionSpec` | -- | No | -- | Data collection switches +`spec.ols.userDataCollection.feedbackDisabled` | `bool` | -- | No | -- | Disable feedback +`spec.ols.userDataCollection.transcriptsDisabled` | `bool` | -- | No | -- | Disable transcripts +`spec.ols.tlsConfig` | `*TLSConfig` | -- | No | 
-- | Backend HTTPS TLS config +`spec.ols.tlsConfig.keyCertSecretRef` | `LocalObjectReference` | -- | No | -- | Secret with tls.crt, tls.key, ca.crt +`spec.ols.additionalCAConfigMapRef` | `*LocalObjectReference` | -- | No | -- | Extra CA certs for LLM TLS +`spec.ols.tlsSecurityProfile` | `*TLSSecurityProfile` | -- | No | -- | API endpoint TLS profile +`spec.ols.introspectionEnabled` | `bool` | -- | No | -- | Enable introspection +`spec.ols.mcpKubeServerConfig` | `*MCPKubeServerConfiguration` | -- | No | -- | Built-in MCP kube server config +`spec.ols.mcpKubeServerConfig.timeout` | `int` | `60` | No | Min=5 | Timeout (seconds) +`spec.ols.proxyConfig` | `*ProxyConfig` | -- | No | -- | Proxy settings +`spec.ols.proxyConfig.proxyURL` | `string` | -- | No | Pattern `^https?://.*$` | Proxy URL +`spec.ols.proxyConfig.proxyCACertificate` | `*ProxyCACertConfigMapRef` | -- | No | -- | Proxy CA cert ref +`spec.ols.proxyConfig.proxyCACertificate.name` | `string` | -- | Yes (inline) | -- | ConfigMap name +`spec.ols.proxyConfig.proxyCACertificate.key` | `string` | `"proxy-ca.crt"` | No | -- | Key in ConfigMap +`spec.ols.rag` | `[]RAGSpec` | -- | No | -- | RAG databases +`spec.ols.rag[].image` | `string` | -- | Yes | -- | Container image URL +`spec.ols.rag[].indexPath` | `string` | `"/rag/vector_db"` | No | -- | Path in container +`spec.ols.rag[].indexID` | `string` | `""` | No | -- | Index ID +`spec.ols.quotaHandlersConfig` | `*QuotaHandlersConfig` | -- | No | -- | Token quota config +`spec.ols.quotaHandlersConfig.limitersConfig` | `[]LimiterConfig` | -- | No | -- | Limiter definitions +`spec.ols.quotaHandlersConfig.limitersConfig[].name` | `string` | -- | Yes | -- | Limiter name +`spec.ols.quotaHandlersConfig.limitersConfig[].type` | `string` | -- | Yes | Enum: cluster_limiter, user_limiter | Limiter type +`spec.ols.quotaHandlersConfig.limitersConfig[].initialQuota` | `int` | -- | Yes | Min=0 | Initial token quota +`spec.ols.quotaHandlersConfig.limitersConfig[].quotaIncrease` | 
`int` | -- | Yes | Min=0 | Quota increase step +`spec.ols.quotaHandlersConfig.limitersConfig[].period` | `string` | -- | Yes | Pattern (rule 36) | Time period +`spec.ols.quotaHandlersConfig.enableTokenHistory` | `bool` | -- | No | -- | Enable token history +`spec.ols.storage` | `*Storage` | -- | No | -- | Persistent storage +`spec.ols.storage.size` | `resource.Quantity` | -- | No | -- | Volume size +`spec.ols.storage.class` | `string` | -- | No | -- | Storage class +`spec.ols.byokRAGOnly` | `bool` | -- | No | -- | Use only BYOK RAG sources +`spec.ols.querySystemPrompt` | `string` | -- | No | -- | Custom system prompt +`spec.ols.maxIterations` | `int` | `5` | No | Min=1 | Max agent iterations +`spec.ols.imagePullSecrets` | `[]LocalObjectReference` | -- | No | -- | Image pull secrets +`spec.ols.toolFilteringConfig` | `*ToolFilteringConfig` | -- | No | -- | Tool filtering config +`spec.ols.toolFilteringConfig.alpha` | `float64` | `0.8` | No | XValidation: 0.0-1.0 | Dense/sparse weight +`spec.ols.toolFilteringConfig.topK` | `int` | `10` | No | Min=1, Max=50 | Tools to retrieve +`spec.ols.toolFilteringConfig.threshold` | `float64` | `0.01` | No | XValidation: 0.0-1.0 | Similarity threshold +`spec.ols.toolsApprovalConfig` | `*ToolsApprovalConfig` | -- | No | -- | Tool approval config +`spec.ols.toolsApprovalConfig.approvalType` | `ApprovalType` | `tool_annotations` | No | Enum: never/always/tool_annotations | Approval strategy +`spec.ols.toolsApprovalConfig.approvalTimeout` | `int` | `600` | No | Min=1 | Approval timeout (seconds) +`spec.olsDataCollector` | `OLSDataCollectorSpec` | -- | No | -- | Data collector settings +`spec.olsDataCollector.logLevel` | `LogLevel` | `INFO` | No | Enum: DEBUG/INFO/WARNING/ERROR/CRITICAL | Data collector log level +`spec.mcpServers` | `[]MCPServerConfig` | -- | No | MaxItems=20 | External MCP servers +`spec.mcpServers[].name` | `string` | -- | Yes | -- | Server name +`spec.mcpServers[].url` | `string` | -- | Yes | Pattern `^https?://.*$` 
| Server URL +`spec.mcpServers[].timeout` | `int` | `5` | No | -- | Timeout (seconds) +`spec.mcpServers[].headers` | `[]MCPHeader` | -- | No | MaxItems=20 | HTTP headers +`spec.mcpServers[].headers[].name` | `string` | -- | Yes | MinLen=1, Pattern `^[A-Za-z0-9-]+$` | Header name +`spec.mcpServers[].headers[].valueFrom` | `MCPHeaderValueSource` | -- | Yes | -- | Value source +`spec.mcpServers[].headers[].valueFrom.type` | `MCPHeaderSourceType` | -- | Yes | Enum: secret/kubernetes/client | Source type +`spec.mcpServers[].headers[].valueFrom.secretRef` | `*LocalObjectReference` | -- | Conditional | XValidation (rules 49-50) | Secret reference +`spec.featureGates` | `[]FeatureGate` | -- | No | Enum per item: MCPServer/ToolFiltering | Feature gates +`status.conditions` | `[]metav1.Condition` | -- | -- | -- | Component conditions +`status.overallStatus` | `OverallStatus` | -- | -- | Enum: Ready/NotReady | Aggregate health +`status.diagnosticInfo` | `[]PodDiagnostic` | -- | -- | -- | Pod failure diagnostics +`status.diagnosticInfo[].failedComponent` | `string` | -- | -- | -- | Component name +`status.diagnosticInfo[].podName` | `string` | -- | -- | -- | Pod name +`status.diagnosticInfo[].containerName` | `string` | -- | -- | -- | Container name +`status.diagnosticInfo[].reason` | `string` | -- | -- | -- | Failure reason +`status.diagnosticInfo[].message` | `string` | -- | -- | -- | Error message +`status.diagnosticInfo[].exitCode` | `*int32` | -- | -- | -- | Container exit code +`status.diagnosticInfo[].type` | `DiagnosticType` | -- | -- | Enum (see rule 53) | Diagnostic category +`status.diagnosticInfo[].lastUpdated` | `metav1.Time` | -- | -- | -- | Collection timestamp + +## Constraints + +1. `.metadata.name` must be `"cluster"` (XValidation on OLSConfig type). +2. Only `azure_openai` provider type uses `deploymentName`; it is required for that type and forbidden (by convention) for others. +3. Only `watsonx` provider type uses `projectID`; it is required for that type. 
+
+4. Replicas are only user-configurable for the API container (`spec.ols.deployment.api`). Console and database always run with 1 replica enforced by the operator.
+5. Period format for quota limiters must match the regex pattern in rule 36, enforcing human-readable duration strings with correct singular/plural agreement.
+6. `credentialKey`, if set, must contain at least one non-whitespace character.
+7. Tool filtering requires the `ToolFiltering` feature gate in `spec.featureGates`.
+8. MCP server functionality requires the `MCPServer` feature gate in `spec.featureGates`.
+9. There is exactly one allowed `CacheType` value: `postgres`.
+10. `ToolFilteringConfig.alpha` and `ToolFilteringConfig.threshold` are validated via XValidation (not kubebuilder min/max) to enforce the 0.0-1.0 range.
diff --git a/.ai/spec/what/observability.md b/.ai/spec/what/observability.md
new file mode 100644
index 000000000..f252b05bd
--- /dev/null
+++ b/.ai/spec/what/observability.md
@@ -0,0 +1,52 @@
+# Observability
+
+The operator configures monitoring, health probes, and status reporting for all components.
+
+## Behavioral Rules
+
+### Prometheus Metrics
+1. The operator creates ServiceMonitor resources for both the operator itself (`controller-manager-metrics-monitor`) and the backend (`lightspeed-app-server-monitor`) if Prometheus Operator CRDs are available. Availability is checked at startup via `IsPrometheusOperatorAvailable()`.
+2. ServiceMonitors are configured for HTTPS scraping with mTLS: CA from `/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt`, client cert from `/etc/prometheus/secrets/metrics-client-certs/tls.crt`, client key from the matching key file. `insecureSkipVerify` is set to `false`. The backend ServiceMonitor also includes Bearer token authorization from the `metrics-reader-token` Secret.
+3. 
The operator creates a PrometheusRule (`lightspeed-app-server-prometheus-rule`) with recording rules that aggregate query call counts by HTTP status code class (`ols:rest_api_query_calls_total:2xx`, `ols:rest_api_query_calls_total:4xx`, `ols:rest_api_query_calls_total:5xx`) and track provider/model configuration (`ols:provider_model_configuration`).
+4. Metrics are scraped at a fixed 30-second interval.
+5. If Prometheus Operator CRDs are not installed, ServiceMonitor and PrometheusRule creation is silently skipped. The `PrometheusAvailable` flag is set at operator startup and not re-checked.
+
+### Health Probes
+6. The application server backend uses HTTPS health probes: readiness at `/readiness` and liveness at `/liveness`, both on the `https` port (8443) with `URISchemeHTTPS`. Initial delay: 30s, period: 30s, timeout: 30s, failure threshold: 15.
+7. PostgreSQL uses the standard PostgreSQL health check mechanism via the postgres container image.
+8. All probe parameters are set as internal constants in the deployment generation code. They are not configurable via the CR.
+
+### Status Reporting
+9. The operator reports status via four condition types defined in `utils/types.go`: `ApiReady`, `CacheReady`, `ConsolePluginReady`, and `ResourceReconciliation` (used for Phase 1 failures).
+10. `status.overallStatus` aggregates all conditions: `Ready` when all deployment conditions are `True`, `NotReady` otherwise. The field is required (no `omitempty`).
+11. When deployments fail health checks, `status.diagnosticInfo` is populated with per-pod diagnostic entries. Diagnostics are collected by listing pods matching the deployment's selector labels and inspecting container and pod statuses.
+12. Each `PodDiagnostic` entry includes: `failedComponent` (matching the condition type, e.g., `ApiReady`), `podName`, `containerName` (empty string for pod-level issues), `reason`, `message`, `exitCode` (pointer, set for terminated containers), `type` (diagnostic category), and `lastUpdated` timestamp.
+13. Diagnostic types categorize the failure: `ContainerWaiting` (image pull issues, CrashLoopBackOff, pending states), `ContainerTerminated` (crashes, OOM, non-zero exit codes), `PodScheduling` (unschedulable pods), `PodCondition` (readiness failures for running pods without container-level diagnostics).
+14. Terminal/recurring failures (`CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`, `OOMKilled`, `PreviousCrash:*`) cause the deployment status to be marked as `Failed`. Other diagnostic entries result in `Progressing` status. Both trigger exponential backoff retries via returned errors.
+
+### Operator Metrics
+15. The operator exposes its own metrics endpoint, optionally secured with mTLS via the `--secure-metrics-server` flag.
+16. When mTLS is enabled, the operator reads the client CA from the `openshift-monitoring/metrics-client-ca` ConfigMap (key `client-ca.crt`).
+17. The operator's TLS profile for metrics follows the OLSConfig CR's `spec.ols.tlsSecurityProfile` or falls back to the cluster API server's profile.
+
+### Data Collection
+18. The data collector sidecar (`lightspeed-to-dataverse-exporter`) exports feedback and transcript data to the Red Hat data pipeline at `https://console.redhat.com/api/ingress/v1/upload`. It runs in `openshift` mode to use the cluster ID as identity.
+19. Data collection is enabled only when both conditions are met: (a) user data collection is not fully disabled (at least one of `spec.ols.userDataCollection.feedbackDisabled` or `spec.ols.userDataCollection.transcriptsDisabled` is false), AND (b) the telemetry pull secret (`openshift-config/pull-secret`) contains valid `cloud.openshift.com` credentials in its `.dockerconfigjson` data.
+20. The service ID for data collection is `ols` by default, or `rhos-lightspeed` if the OLSConfig CR has the `openstack.org/lightspeed-owner-id` label.
+21. The exporter config is generated as a ConfigMap (`lightspeed-exporter-config`) with a fixed 300-second collection interval.
+
+## Configuration Surface
+
+| Field path | Description |
+|---|---|
+| `spec.ols.logLevel` | Log level for backend service (app, lib, uvicorn levels all set to this value) |
+| `spec.olsDataCollector.logLevel` | Log level for data collector sidecar (defaults to `INFO`) |
+| `spec.ols.userDataCollection.feedbackDisabled` | Disable feedback collection |
+| `spec.ols.userDataCollection.transcriptsDisabled` | Disable transcript collection |
+
+## Constraints
+
+1. ServiceMonitor and PrometheusRule are only created when Prometheus Operator CRDs are detected at operator startup. There is no runtime re-check.
+2. Data collection requires the telemetry pull secret with `cloud.openshift.com` auth; removing the secret or the auth entry disables collection.
+3. Diagnostics are cleared from status when the corresponding deployment becomes healthy (the entire `diagnosticInfo` array is rebuilt from scratch on each status update).
+4. Health probe parameters are internal constants and cannot be customized via the CR. 
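+
+A hedged OLSConfig fragment exercising the observability configuration surface above (field values are illustrative, not recommendations):
+
+```yaml
+spec:
+  ols:
+    logLevel: DEBUG               # backend app, lib, and uvicorn log levels
+    userDataCollection:
+      feedbackDisabled: false     # feedback collection remains enabled
+      transcriptsDisabled: true   # transcript collection disabled
+  olsDataCollector:
+    logLevel: INFO                # data collector sidecar log level
+```
+
+Per the data-collection rules above, this configuration still counts as "not fully disabled" (feedback remains enabled), so the exporter runs provided the telemetry pull secret carries valid `cloud.openshift.com` credentials.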
diff --git a/.ai/spec/what/postgres.md b/.ai/spec/what/postgres.md new file mode 100644 index 000000000..dc392f3a2 --- /dev/null +++ b/.ai/spec/what/postgres.md @@ -0,0 +1,56 @@ +# PostgreSQL + +The operator deploys a single-replica PostgreSQL server that provides persistent storage for conversation cache and quota management. + +## Behavioral Rules + +### Deployment +1. PostgreSQL always runs as a single replica. The replica count is not configurable. +2. The deployment uses the operator-managed PostgreSQL container image. +3. TLS is always enabled. Certificates are generated by the OpenShift service-ca operator. +4. A bootstrap script runs on first startup to create databases, extensions, and schemas. + +### Database Initialization +5. The bootstrap script creates the pg_trgm extension in the default database. +6. The bootstrap script creates schemas in the default database for component isolation: quota and conversation_cache. + +### Credential Management +7. The operator generates a random password for PostgreSQL on first creation. +8. Once created, the password secret is never updated to protect database integrity. +9. Old postgres secrets with different naming conventions are cleaned up during reconciliation. + +### Storage +10. If `spec.ols.storage` is configured, a PersistentVolumeClaim is created for the data directory. +11. If `spec.ols.storage` is not configured, an EmptyDir volume is used (data does not survive pod restarts). +12. The default PVC size is applied if `spec.ols.storage.size` is not specified. +13. The PVC uses ReadWriteOnce access mode. +14. If `spec.ols.storage.class` is specified, it sets the PVC storage class. Otherwise, the cluster default storage class is used. + +### Configuration +15. PostgreSQL configuration (shared_buffers, max_connections) is written to a ConfigMap and mounted as the config file. +16. Changes to the PostgreSQL ConfigMap trigger a deployment restart via resource version annotation tracking. + +### Networking +17. 
The PostgreSQL service exposes the standard PostgreSQL port.
+18. The network policy allows ingress only from pods matching the application server labels on the PostgreSQL port.
+
+## Configuration Surface
+
+| Field path | Description |
+|---|---|
+| `spec.ols.deployment.database.replicas` | Ignored; always 1 |
+| `spec.ols.deployment.database.resources` | PostgreSQL container resource requirements |
+| `spec.ols.deployment.database.tolerations` | Pod tolerations |
+| `spec.ols.deployment.database.nodeSelector` | Node selector constraints |
+| `spec.ols.deployment.database.affinity` | Pod affinity rules |
+| `spec.ols.deployment.database.topologySpreadConstraints` | Pod topology spread constraints |
+| `spec.ols.conversationCache.type` | Cache type (only "postgres" supported) |
+| `spec.ols.conversationCache.postgres.sharedBuffers` | PostgreSQL shared_buffers setting |
+| `spec.ols.conversationCache.postgres.maxConnections` | PostgreSQL max_connections setting |
+| `spec.ols.storage.size` | PVC size for PostgreSQL data |
+| `spec.ols.storage.class` | PVC storage class |
+
+## Constraints
+
+1. Replicas are always 1 regardless of configuration.
+2. Password secrets are write-once; the operator never updates them after creation.
+3. TLS is always enabled with certificates from the service-ca operator.
diff --git a/.ai/spec/what/reconciliation.md b/.ai/spec/what/reconciliation.md
new file mode 100644
index 000000000..d22460b92
--- /dev/null
+++ b/.ai/spec/what/reconciliation.md
@@ -0,0 +1,60 @@
+# Reconciliation
+
+The operator reconciles the OLSConfig CR into Kubernetes resources through a two-phase process with finalizer-based lifecycle management.
+
+## Behavioral Rules
+
+### Reconciliation Trigger
+1. Reconciliation is triggered by changes to the OLSConfig CR, any owned resource, or annotated external resources. No periodic reconciliation.
+2. The controller handles error retries via controller-runtime exponential backoff. No custom retry logic.
+
+### Reconciliation Order
+3. 
Step 1: Fetch and validate CR (ignore if name != "cluster", return silently if not found) +4. Step 2: Handle finalizer (add if missing, run cleanup if CR being deleted) +5. Step 3: Reconcile operator-level resources (ServiceMonitor, NetworkPolicy) +6. Step 4: Annotate external resources for watching (validate LLM credentials and TLS secrets first) +7. Step 5 (Phase 1): Reconcile independent resources -- ConfigMaps, Secrets, ServiceAccounts, Roles, NetworkPolicies for all components. Uses continue-on-error: reconcile as many as possible, report all failures. +8. Step 6 (Phase 2): Reconcile deployments and dependent resources -- Deployments, Services, TLS certificates, ServiceMonitors, PrometheusRules. After reconciliation, check deployment health and update CR status. + +### Phase 1: Independent Resources +9. Three component groups are reconciled in Phase 1: Console UI, PostgreSQL, and the application server. +10. All Phase 1 resource groups are independent and can be reconciled in any order. +11. If any Phase 1 resource fails, the operator continues reconciling the remaining resources, then reports all failures in the CR status with ResourceReconciliation conditions. + +### Phase 2: Deployments and Status +12. Three deployments are reconciled in Phase 2: Console UI (condition: ConsolePluginReady), PostgreSQL (condition: CacheReady), and the active backend (condition: ApiReady). +13. After each deployment reconciliation, the operator checks the deployment's health status. +14. Deployment health has three states: Ready (Available condition true), Progressing (not yet available, no terminal failures), Failed (terminal pod failures detected). +15. Terminal pod failures include: CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled, and containers terminated with non-zero exit codes after CrashLoopBackOff. +16. If any deployment has pod failures, the operator returns an error to trigger exponential backoff retry. +17. 
If deployments are progressing, the operator returns an error to trigger retry, enabling early issue detection rather than relying solely on deployment watch events. +18. Status is updated once per reconciliation cycle, covering all component conditions. + +### Finalizer Lifecycle +19. On CR creation: add finalizer, return immediately (controller-runtime auto-requeues). +20. On CR deletion: run finalizer cleanup before removing finalizer. +21. Finalizer cleanup sequence: remove Console UI from Console CR, delete ConsolePlugin CR, list all owned resources by owner reference, explicitly delete them, wait for deletion (polling with timeout). +22. If cleanup times out, the finalizer is removed anyway to prevent the CR from being stuck in Terminating state. +23. Console UI removal errors during finalization are logged but do not block finalization. + +### Status Conditions +24. The operator sets these condition types: ApiReady, CacheReady, ConsolePluginReady, ResourceReconciliation. +25. OverallStatus is Ready only when all deployment conditions are True. +26. OverallStatus is NotReady if any condition is False. +27. When deployments are not ready, diagnosticInfo is populated with per-pod failure details including container name, reason, message, exit code, and diagnostic type. +28. Status updates preserve LastTransitionTime for unchanged conditions. +29. Status updates use retry-on-conflict to handle concurrent modifications. + +### Resource Lifecycle +30. The operator tracks resources through two mechanisms: owned resources (via OwnerReferences and `Owns()`) and external resources (via annotation-based watching). See `what/resource-lifecycle.md` for the full specification of both models, including change detection, restart triggers, and cleanup behavior. + +## Configuration Surface + +Reconciliation behavior is not directly user-configurable. It is driven by the OLSConfig CR spec (see `what/crd-api.md`) and operator startup flags (see `what/system-overview.md`). 
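+
+The status rules above produce a shape like this hedged example (condition reasons, pod and container names, and timestamps are illustrative):
+
+```yaml
+status:
+  overallStatus: NotReady
+  conditions:
+    - type: ApiReady
+      status: "False"
+      reason: DeploymentNotReady           # illustrative reason string
+      lastTransitionTime: "2024-01-01T00:00:00Z"
+  diagnosticInfo:
+    - failedComponent: ApiReady            # matches the failing condition type
+      podName: lightspeed-app-server-abc   # illustrative pod name
+      containerName: app-server            # illustrative container name
+      reason: ImagePullBackOff
+      message: "Back-off pulling image"
+      type: ContainerWaiting
+      lastUpdated: "2024-01-01T00:00:05Z"
+```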
+ +## Constraints + +1. Phase 1 must complete before Phase 2 begins (deployments depend on ConfigMaps, Secrets, etc.). +2. Finalizer removal must succeed even if cleanup partially fails, to prevent stuck CRs. +3. The operator must not create ServiceMonitor or PrometheusRule resources if Prometheus Operator CRDs are not installed. +4. Status updates must always set OverallStatus (required field after first reconciliation). diff --git a/.ai/spec/what/resource-lifecycle.md b/.ai/spec/what/resource-lifecycle.md new file mode 100644 index 000000000..51a8ec9a8 --- /dev/null +++ b/.ai/spec/what/resource-lifecycle.md @@ -0,0 +1,60 @@ +# Resource Lifecycle + +The operator manages two categories of Kubernetes resources: owned resources (created by the operator) and external resources (created by users or other controllers). Each category uses a different mechanism for change detection and reconciliation triggering. + +## Behavioral Rules + +### Owned Resources + +1. The operator creates resources with an OwnerReference pointing to the OLSConfig CR. Controller-runtime detects changes to these resources automatically via `Owns()` registrations and triggers reconciliation. +2. Owned resource types: Deployments, ServiceAccounts, ClusterRoles, ClusterRoleBindings, Services, ConfigMaps, Secrets, PersistentVolumeClaims, ConsolePlugins, ServiceMonitors, PrometheusRules, ImageStreams. +3. The ConsolePlugin CR is cluster-scoped and cannot use standard namespace-scoped owner references. It is cleaned up explicitly during finalizer processing. +4. On CR deletion, the finalizer lists all owned resources by matching OwnerReference UID (not labels), explicitly deletes them, and waits for deletion to complete before removing the finalizer. See `what/reconciliation.md` for finalizer sequencing. +5. Owned resource changes (e.g., someone manually edits a managed ConfigMap) trigger reconciliation, and the operator overwrites them with the desired state. + +### External Resources + +6. 
External resources fall into two categories: system resources (fixed, known at compile time) and user-provided resources (derived from the CR spec at runtime). +7. System secrets: the telemetry pull secret (`openshift-config/pull-secret`), console UI service cert (`lightspeed-console-plugin-cert`), PostgreSQL certs (`lightspeed-postgres-certs`). +8. System configmaps: the OpenShift root CA (`kube-root-ca.crt`), the service CA bundle (`openshift-service-ca.crt`). +9. User-provided secrets: LLM provider credential secrets (`spec.llm.providers[].credentialsSecretRef`), custom TLS secret (`spec.ols.tlsConfig.keyCertSecretRef`), MCP server header secrets (`spec.mcpServers[].headers[].valueFrom.secretRef`). +10. User-provided configmaps: additional CA ConfigMap (`spec.ols.additionalCAConfigMapRef`), proxy CA ConfigMap (`spec.ols.proxyConfig.proxyCACertificate`). + +### Annotation-Based Watching + +11. The operator annotates each user-provided external resource with `ols.openshift.io/watcher: cluster` to mark it for watching. +12. On each reconciliation, the operator clears the `AnnotatedSecretMapping` and `AnnotatedConfigMapMapping` in `WatcherConfig` and repopulates them from the current CR spec via `ForEachExternalSecret()` and `ForEachExternalConfigMap()`, then annotates any resources that lack the annotation. +13. The watcher predicate on Update events checks for two conditions: (a) the resource has the `ols.openshift.io/watcher` annotation, or (b) the resource is a configured system resource. Create events are allowed for all resources in the operator namespace (to handle recreated resources that have not been annotated yet). Create events also verify the resource is referenced in the CR before acting. Delete events are always ignored. + +### Change Detection and Restart + +14. 
When a watched secret's `.data` changes (compared via `apiequality.Semantic.DeepEqual`), the `SecretUpdateHandler` triggers restarts of affected deployments directly, without triggering a full reconciliation. +15. When a watched configmap's `.data` or `.binaryData` changes, the `ConfigMapUpdateHandler` triggers restarts of affected deployments directly. +16. Each external resource has a list of affected deployments configured in `WatcherConfig`. The special value `ACTIVE_BACKEND` resolves to the application server deployment name (`lightspeed-app-server`). +17. Restarts are triggered by updating the `ols.openshift.io/force-reload` annotation on the deployment's pod template with the current timestamp (RFC3339Nano), causing a rolling update. +18. TLS secrets are mapped to affect both `lightspeed-console-plugin` and `ACTIVE_BACKEND` deployments. All other user-provided secrets default to `ACTIVE_BACKEND` only. + +### Validation + +19. Before annotating resources, the operator validates LLM provider credential secrets via `ValidateLLMCredentials()` (secret must exist and contain expected key) and custom TLS secrets via `ValidateTLSSecret()` (must contain `tls.crt` and `tls.key`). +20. Missing secrets for user-provided resources during annotation are not treated as errors. If a secret does not exist, `annotateSecretIfNeeded()` returns nil, and the resource will be picked up on the next reconciliation when it appears. + +## Configuration Surface + +Resource lifecycle behavior is not directly user-configurable. 
External resources are derived from CRD fields: + +| CR field | Resulting external resource | +|---|---| +| `spec.llm.providers[].credentialsSecretRef` | Provider credential secret | +| `spec.ols.tlsConfig.keyCertSecretRef` | Custom TLS secret | +| `spec.ols.additionalCAConfigMapRef` | Additional CA ConfigMap | +| `spec.ols.proxyConfig.proxyCACertificate` | Proxy CA ConfigMap | +| `spec.mcpServers[].headers[].valueFrom.secretRef` | MCP header secret | + +## Constraints + +1. The operator can only watch resources in its own namespace and in fixed external namespaces (`openshift-config` for the pull secret, `openshift-monitoring` for the client CA). +2. Delete events on external resources do not trigger restarts or reconciliation. The operator detects the absence during the next reconciliation triggered by other events. +3. System resources are always watched regardless of CR configuration. They are defined in `WatcherConfig.Secrets.SystemResources` and `WatcherConfig.ConfigMaps.SystemResources`. +4. Owned resources with an OwnerReference are skipped by the external resource Create handler to avoid redundant processing; they are handled via the `Owns()` relationship. +5. Owned resources are not deleted individually during normal operation. They are only explicitly deleted during finalizer cleanup on CR deletion. diff --git a/.ai/spec/what/security.md b/.ai/spec/what/security.md new file mode 100644 index 000000000..e27cd385e --- /dev/null +++ b/.ai/spec/what/security.md @@ -0,0 +1,47 @@ +# Security + +The operator enforces security boundaries through RBAC, network policies, pod security contexts, and credential management. + +## Behavioral Rules + +### RBAC +1. The operator creates a ClusterRole (`lightspeed-app-server-sar-role`) and ClusterRoleBinding for the backend service account with permissions for: SubjectAccessReview (create), TokenReview (create), ClusterVersion (get, list), and pull-secret Secret (get by resourceName). +2. 
These permissions enable the backend service to authenticate users via Kubernetes TokenReview and authorize API access via SubjectAccessReview. +3. The operator controller itself requires RBAC including: managing deployments, services, configmaps, secrets, PVCs, network policies, RBAC resources (clusterroles, clusterrolebindings, roles, rolebindings), console plugins, image streams, and monitoring resources (servicemonitors, prometheusrules). It also has NonResourceURL permissions for `/ls-access` and `/ols-metrics-access`. +4. The backend service account also receives a NonResourceURL permission for `/ls-access` to control Lightspeed API access (declared via kubebuilder RBAC markers on the controller). + +### Network Policies +5. Each component has its own NetworkPolicy restricting ingress: + - Operator (`lightspeed-operator`): allows Prometheus scraping from `openshift-monitoring` namespace on port 8443. + - Backend/AppServer (`lightspeed-app-server`): allows Prometheus from `openshift-monitoring`, OpenShift Console pods from `openshift-console`, and ingress controllers (namespaces with `network.openshift.io/policy-group: ingress`), all on port 8443. + - PostgreSQL (`lightspeed-postgres-server`): allows only backend pods (matched by `app.kubernetes.io/name: lightspeed-service-api` label). + - Console UI (`lightspeed-console-plugin`): allows only OpenShift Console pods from `openshift-console` namespace. +6. Network policies use combined pod label selectors and namespace selectors for source filtering. +7. Egress is unrestricted for all components. PolicyTypes includes only `Ingress`; egress rules are empty (`[]`), meaning no egress restrictions. + +### Pod Security +8. All containers (main containers and sidecars) run with restricted security context: `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `runAsNonRoot: true`, `seccompProfile: RuntimeDefault`, `capabilities: {drop: [ALL]}`. 
This is enforced via `utils.RestrictedContainerSecurityContext()`. +9. Writable paths (`/tmp`, llama-cache, user-data) use `emptyDir` volumes to provide write access on an otherwise read-only root filesystem. + +### Credential Management +10. LLM provider credentials are validated during the annotation phase via `ValidateLLMCredentials()`. The operator verifies that each referenced secret exists and contains the expected key before proceeding with reconciliation. +11. Standard providers must have a secret with the `apitoken` key (or the key specified by `credentialKey`). Azure OpenAI providers must have either `apitoken` or all three of `client_id`, `tenant_id`, `client_secret`. +12. Custom TLS secrets are validated via `ValidateTLSSecret()` to ensure they contain `tls.crt` and `tls.key`. +13. Provider credentials are mounted as read-only volume files at `/etc/apikeys/<provider-name>/`, never exposed as environment variables. +14. PostgreSQL passwords are generated randomly on first creation (via the postgres reconciler) and never updated on subsequent reconciliations. +15. MCP server header secrets must contain a specific key `header` (constant `MCPSECRETDATAPATH`) and are mounted read-only at `/etc/mcp/headers/<server-name>/`. + +### OpenShift MCP Server Security +16. The shipped OpenShift MCP server runs with the `--read-only` flag and is configured via a TOML config file that blocks access to Secret and RBAC resources, preventing secret data from reaching the LLM. +17. The denied resources are configured in the `openshift-mcp-server-config` ConfigMap as a TOML config with entries blocking `core/v1/secrets`, `rbac.authorization.k8s.io/v1/roles`, `rbac.authorization.k8s.io/v1/rolebindings`, `rbac.authorization.k8s.io/v1/clusterroles`, and `rbac.authorization.k8s.io/v1/clusterrolebindings`. +18. User-defined MCP servers (via `spec.mcpServers`) are the user's responsibility to secure. 
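Rule 11's key requirements can be sketched as a standalone Go check. This is a hedged sketch, not the operator's actual `ValidateLLMCredentials` code: the helper name `validateCredentialKeys` and the `azure_openai` provider-type string are illustrative assumptions.

```go
package main

import "fmt"

// validateCredentialKeys mirrors rule 11 as a sketch (hypothetical helper).
// data is the secret's decoded Data map; credentialKey, when non-empty,
// overrides the default "apitoken" key.
func validateCredentialKeys(providerType, credentialKey string, data map[string][]byte) error {
	key := "apitoken"
	if credentialKey != "" {
		key = credentialKey
	}
	if _, ok := data[key]; ok {
		return nil // token-based auth satisfies any provider type
	}
	if providerType == "azure_openai" { // assumed type string for Azure OpenAI
		// Azure alternatively accepts the full credential triple.
		for _, k := range []string{"client_id", "tenant_id", "client_secret"} {
			if _, ok := data[k]; !ok {
				return fmt.Errorf("secret missing key %q", k)
			}
		}
		return nil
	}
	return fmt.Errorf("secret missing key %q", key)
}

func main() {
	err := validateCredentialKeys("openai", "", map[string][]byte{"apitoken": []byte("sk-...")})
	fmt.Println(err) // <nil>
}
```

Per rule 20, a missing secret (as opposed to a present secret with missing keys) is not an error during annotation; that case is handled before a check like this runs.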
+ +## Configuration Surface + +Security behavior is not directly user-configurable beyond the TLS and network-related fields documented in `what/tls.md`. RBAC, network policies, and pod security contexts are fixed by the operator implementation. + +## Constraints + +1. The operator must not store credentials in ConfigMaps or environment variables directly. Secrets are always file-mounted as read-only volumes. +2. Network policies require a CNI plugin that supports NetworkPolicy enforcement. +3. All containers must run as non-root with read-only root filesystems. diff --git a/.ai/spec/what/system-overview.md b/.ai/spec/what/system-overview.md new file mode 100644 index 000000000..3fe63f19c --- /dev/null +++ b/.ai/spec/what/system-overview.md @@ -0,0 +1,76 @@ +# System Overview + +The OpenShift Lightspeed Operator is a Kubernetes operator that manages the lifecycle of the OpenShift Lightspeed (OLS) AI assistant stack on an OpenShift cluster. It reconciles a single cluster-scoped OLSConfig custom resource into a set of Kubernetes resources that form the complete Lightspeed deployment. + +## Behavioral Rules + +### Operator Role + +1. OLSConfig is treated as a singleton per cluster: the operator only reconciles the cluster-scoped instance named "cluster". Any other OLSConfig objects are ignored. Reconciled workloads are created in the openshift-lightspeed namespace. +2. The operator deploys and manages three components: an application server backend, a PostgreSQL database, and a Console UI plugin, plus operator-level monitoring/networking resources. +3. The operator is fully event-driven. It does not use periodic/timer-based reconciliation. All changes are detected via Kubernetes watches on owned resources and annotated external resources. + +### Component Inventory + +4. Application server backend: Python/FastAPI application (lightspeed-service) that handles LLM queries, RAG retrieval, conversation management, and tool execution. Talks to LLM providers directly. +5. 
PostgreSQL: single-replica database providing conversation cache and quota state. Always deployed. +6. Console UI Plugin: OpenShift console extension that provides the Lightspeed chat interface. Integrates via ConsolePlugin CR and proxies requests to the backend. +7. Operator-level resources: ServiceMonitor for operator metrics, NetworkPolicy restricting operator pod access. + +### Lifecycle + +8. On CR creation: the operator adds a finalizer, then reconciles all component resources in two phases. +9. On CR update: the operator re-reconciles, detecting changes via resource version tracking and content hashing. +10. On CR deletion: the operator runs finalizer cleanup -- removes console UI from the Console CR, explicitly deletes all owned resources, waits for deletion to complete, then removes the finalizer. +11. The operator reports status via conditions (ApiReady, CacheReady, ConsolePluginReady, ResourceReconciliation) and an aggregate OverallStatus (Ready/NotReady). +12. When deployments are unhealthy, the operator collects pod-level diagnostics and populates status.diagnosticInfo with container failure details. + +### Deployment Model + +13. The operator runs as a single-instance deployment in the openshift-lightspeed namespace (configurable). +14. It supports leader election for HA deployments. +15. Images for all operands are configurable via command-line flags, with defaults embedded in the binary. + +### Integration Points + +16. The operator reads OpenShift cluster version to select the correct console plugin image (PatternFly 5 for OCP < 4.19, PatternFly 6 for OCP >= 4.19). +17. The operator detects Prometheus Operator availability and conditionally creates ServiceMonitor and PrometheusRule resources. +18. The operator uses the OpenShift service-ca operator for automatic TLS certificate generation (unless custom certificates are provided). +19. 
The operator watches the telemetry pull secret in openshift-config namespace to determine whether data collection is enabled. + +## Configuration Surface + +### Operator Startup Flags + +| Flag | Type | Default | Description | +|---|---|---|---| +| `--namespace` | string | `WATCH_NAMESPACE` env or `openshift-lightspeed` | Operator namespace | +| `--leader-elect` | bool | `false` | Enable leader election | +| `--secure-metrics-server` | bool | `false` | Enable mTLS for operator metrics | +| `--service-image` | string | built-in default | Override lightspeed-service container image | +| `--console-image` | string | built-in default | Override console plugin image (PatternFly 6) | +| `--console-image-pf5` | string | built-in default | Override console plugin image (PatternFly 5) | +| `--postgres-image` | string | built-in default | Override PostgreSQL image | +| `--openshift-mcp-server-image` | string | built-in default | Override OpenShift MCP server image | +| `--dataverse-exporter-image` | string | built-in default | Override dataverse exporter image | +| `--ocp-rag-image` | string | built-in default | Override OCP RAG database image | + +### Environment Variables + +| Variable | Description | +|---|---| +| `WATCH_NAMESPACE` | Fallback namespace when `--namespace` is not set | + +## Constraints + +1. Only one OLSConfig CR named "cluster" is processed. Others are silently ignored. +2. The operator must be able to run in disconnected (air-gapped) environments. All image references must be overridable. +3. The operator must function correctly with or without Prometheus Operator installed. 
+ +## Planned Changes + +| Jira | Summary | +|---|---| +| OLS-2322 | Streamline OLSConfig CR deployment configuration | +| OLS-2323 | Extend OLSConfig CR to report specific deployment errors | +| OLS-2325 | Create type-safe log-level definition in the operator CR | diff --git a/.ai/spec/what/tls.md b/.ai/spec/what/tls.md new file mode 100644 index 000000000..b95a2f611 --- /dev/null +++ b/.ai/spec/what/tls.md @@ -0,0 +1,49 @@ +# TLS + +The operator manages TLS certificates for all inter-component communication and external endpoints. + +## Behavioral Rules + +### Certificate Sources +1. The operator supports two TLS certificate sources for the backend API endpoint: OpenShift service-ca (default) and user-provided custom certificates. +2. When no custom TLS is configured, the operator annotates the backend Service with `service.beta.openshift.io/serving-cert-secret-name: lightspeed-tls`. The service-ca operator then generates and manages the certificate automatically. +3. When `spec.ols.tlsConfig.keyCertSecretRef` is set, the operator uses the referenced Secret containing tls.crt, tls.key, and optionally ca.crt. The service-ca annotation is omitted from the Service. +4. PostgreSQL TLS certificates are always generated by the service-ca operator (not configurable by the user). The PostgreSQL config file references `/etc/certs/tls.crt`, `/etc/certs/tls.key`, and the service-ca CA at `/etc/certs/cm-olspostgresca/service-ca.crt`. +5. Console UI TLS certificates are always generated by the service-ca operator via the `lightspeed-console-plugin-cert` Secret. + +### Certificate Mounting +6. Backend TLS certificates (custom or service-ca) are mounted at `/etc/certs/lightspeed-tls/` in the API container. Both tls.crt and tls.key are read from this path. +7. The TLS secret must contain at minimum `tls.key` and `tls.crt` keys. For service-ca secrets, the operator relies on the secret being created asynchronously by the service-ca operator. +8. 
PostgreSQL CA certificates are mounted from the `openshift-service-ca.crt` ConfigMap into the backend container at `/etc/certs/postgres-ca/` for verifying PostgreSQL connections. The SSL mode is fixed to `require`. +9. The OpenShift root CA (`kube-root-ca.crt`) is always mounted at `/etc/certs/ols-additional-ca/` in the backend container. + +### Additional CA Certificates +10. If `spec.ols.additionalCAConfigMapRef` is set, the operator mounts the referenced ConfigMap in the backend container at `/etc/certs/ols-user-ca/` and each certificate file path is added to the `extra_ca` list in the OLS config. +11. If `spec.ols.proxyConfig.proxyCACertificate` is set, the referenced ConfigMap key is mounted in the backend container at `/etc/certs/proxy-ca/`. The key defaults to `proxy-ca.crt` if not specified via `spec.ols.proxyConfig.proxyCACertificate.key`. + +### TLS Security Profiles +12. If `spec.ols.tlsSecurityProfile` is set, the operator applies the specified OpenShift TLS profile (minimum TLS version, cipher suites) to the backend API endpoints. +13. If no TLS security profile is set, the operator falls back to the cluster API server's TLS profile, retrieved via the `config.openshift.io/v1` APIServer resource. +14. Per-provider TLS security profiles can be configured via `spec.llm.providers[].tlsSecurityProfile` for outgoing connections to LLM providers. +15. The operator's own metrics endpoint TLS profile follows the same logic as the backend: CR-specified profile, or fallback to the cluster API server's profile. + +### Certificate Rotation +16. Service-ca certificates are automatically rotated by the service-ca operator. The operator's watchers detect the Secret data change and trigger a deployment rolling restart via the `ols.openshift.io/force-reload` pod template annotation. +17. Custom certificate rotation requires the user to update the referenced Secret. 
The operator detects the data change via the watcher and triggers a rolling restart using the same annotation mechanism. + +## Configuration Surface + +| Field path | Description | +|---|---| +| `spec.ols.tlsConfig.keyCertSecretRef` | Secret with custom TLS cert/key (tls.crt, tls.key, optional ca.crt) | +| `spec.ols.additionalCAConfigMapRef` | ConfigMap with additional CA certificates for LLM provider connections | +| `spec.ols.tlsSecurityProfile` | OpenShift TLS security profile for API endpoints | +| `spec.ols.proxyConfig.proxyCACertificate` | ConfigMap ref + key for proxy CA certificate | +| `spec.llm.providers[].tlsSecurityProfile` | Per-provider TLS security profile for outgoing connections | + +## Constraints + +1. Service-ca certificates are only available on OpenShift clusters. The operator requires the OpenShift service-ca operator. +2. Custom TLS secrets must be in the operator's namespace (`openshift-lightspeed` by default). +3. PostgreSQL always uses SSL mode `require` with the service-ca certificate. This is not configurable. +4. The proxy CA certificate ConfigMap key is validated as a valid X.509 PEM certificate during reconciliation. diff --git a/AGENTS.md b/AGENTS.md index a5bebb4e8..6283c4502 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -112,6 +112,32 @@ Available skills for code review: Invoke by typing `/skill-name` in chat. +## Git and PR Workflow + +### Commit Messages +- Start with the Jira ticket reference: `OLS-XXXX description` +- Keep the first line under 72 characters +- Use imperative mood + +### Pull Requests +This repo uses a **fork-based workflow**: + +1. **Push to your fork**, not to `origin` (openshift/lightspeed-operator) +2. **Create the PR** against `origin/main` using your fork's branch: + ```bash + git push + gh pr create --repo openshift/lightspeed-operator --head <user>:<branch> --base main + ``` +3. **PR title** must start with the Jira reference: `OLS-XXXX description` +4. 
**Squash commits** before pushing -- one logical commit per PR unless the PR explicitly tracks multiple independent changes + +### Branch Completion +When finishing a development branch: +1. Remove any process artifacts (design docs, plans in `docs/superpowers/`) +2. Squash commits with the Jira-prefixed message +3. Push to the contributor's fork remote (not `origin`) +4. Create the PR against `origin/main` using `--head <user>:<branch>` + ## Maintaining This Document + Always suggest AGENTS.md edits when architectural, structural, or conventional changes are made to the codebase.