12 changes: 10 additions & 2 deletions README.md
@@ -45,7 +45,7 @@ Copy skills to your agent's skills location:

| Plugin | Description |
|--------|-------------|
| [sapcc](plugins/sapcc/) | All SAP CC skills and MCP server configuration. Covers compute, networking, storage, identity, quota, audit, metrics, registry, and endpoint services. |
| [sapcc](plugins/sapcc/) | All SAP CC skills and MCP server configuration. Covers compute, networking, storage, identity, quota, audit, metrics, registry, endpoint services, DNS, secrets, object storage, file systems, load balancing, images, bare metal, and autoscaling. |

### Skills

@@ -60,6 +60,14 @@ Copy skills to your agent's skills location:
| [sapcc-metrics](plugins/sapcc/skills/sapcc-metrics/) | Maia | PromQL queries, metric discovery, monitoring |
| [sapcc-registry](plugins/sapcc/skills/sapcc-registry/) | Keppel | Container images, vulnerability status, federation |
| [sapcc-connectivity](plugins/sapcc/skills/sapcc-connectivity/) | Archer | Private endpoint services, service discovery |
| [sapcc-dns](plugins/sapcc/skills/sapcc-dns/) | Designate | DNS zones, recordsets, FQDN management |
| [sapcc-secrets](plugins/sapcc/skills/sapcc-secrets/) | Barbican | Secret metadata, certificate inventory, audit |
| [sapcc-object-storage](plugins/sapcc/skills/sapcc-object-storage/) | Swift | Containers, objects, storage inspection |
| [sapcc-filesystems](plugins/sapcc/skills/sapcc-filesystems/) | Manila | Shared NFS/CIFS file systems |
| [sapcc-loadbalancer](plugins/sapcc/skills/sapcc-loadbalancer/) | Octavia | Load balancers, listeners, pools |
| [sapcc-images](plugins/sapcc/skills/sapcc-images/) | Glance | VM images, snapshots, boot sources |
| [sapcc-baremetal](plugins/sapcc/skills/sapcc-baremetal/) | Ironic | Physical server nodes, provisioning |
| [sapcc-autoscaling](plugins/sapcc/skills/sapcc-autoscaling/) | Castellum | Automatic quota scaling, resize operations |
| [credential-setup](plugins/sapcc/skills/credential-setup/) | Keystone | Guided auth setup with keychain storage |

### Rules
@@ -85,7 +93,7 @@ Skills use progressive disclosure:
3. Reference files load on-demand for deep-dive content
4. Skill context releases when the task completes

10 skills installed = ~500 tokens at startup. Full context only when needed.
18 skills installed = ~900 tokens at startup. Full context only when needed.

## Security Philosophy

168 changes: 168 additions & 0 deletions plugins/sapcc/skills/sapcc-autoscaling/SKILL.md
@@ -0,0 +1,168 @@
---
name: sapcc-autoscaling
description: >
  Autoscaling operations via Castellum in SAP Converged Cloud.
  Triggers: autoscaling, castellum, auto-resize, quota scaling, capacity management, resize operation
version: 1.0.0
metadata:
  service: [castellum]
  task: [inspect, monitor, debug]
  persona: [platform-engineer, devops]
---

# SAP CC Autoscaling (Castellum)

Inspect Castellum autoscaling: check resource configurations, view pending operations, and diagnose failed resize attempts. Castellum automatically adjusts project quotas and resource sizes based on configured thresholds.

## MCP Tools

| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `castellum_get_project_resources` | Get autoscaling config and status for a project | `project_id` (UUID, required) |
| `castellum_list_pending_operations` | List scheduled but incomplete resize operations | `project_id`, `asset_type` |
| `castellum_list_recently_failed_operations` | List recent resize failures | `project_id`, `asset_type`, `max_age` (default: 1d) |

## What Castellum Does

Castellum watches resource usage and automatically resizes when thresholds are hit:

```
Usage crosses HIGH threshold → Castellum schedules UPSIZE
Usage drops below LOW threshold → Castellum schedules DOWNSIZE
```

Assets it manages:
- `project-quota:compute:cores` — vCPU quota
- `project-quota:compute:ram` — RAM quota
- `project-quota:compute:instances` — instance count quota
- `project-quota:block-storage:capacity` — volume storage
- NFS share sizes, server root disks, etc.
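
The rule is simple enough to state as code. A minimal sketch of the per-asset decision, assuming percentage-based usage and thresholds (the field names and units here are illustrative, not Castellum's actual API schema):

```python
def planned_action(usage_percent: float,
                   high_threshold: float,
                   low_threshold: float) -> str:
    """Return the resize Castellum would schedule for one asset.

    A restatement of the rule above; names and percentage units are
    assumptions for illustration only.
    """
    if usage_percent >= high_threshold:
        return "upsize"    # usage crossed the HIGH threshold
    if usage_percent <= low_threshold:
        return "downsize"  # usage dropped below the LOW threshold
    return "no-op"         # inside the configured band; nothing scheduled

# Example: 85% usage against an 80% HIGH / 20% LOW band schedules an upsize.
print(planned_action(85.0, high_threshold=80.0, low_threshold=20.0))
```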

## Gotchas

### 1. Castellum manages QUOTA, not individual resources

Castellum doesn't scale your application. It adjusts quotas and resource sizes. For example, it can increase your project's compute cores quota when usage exceeds 80%, but it doesn't create new servers.

### 2. project_id is required — you must know which project

Unlike other tools, Castellum requires an explicit project_id. Get it from `keystone_token_info` or `keystone_list_projects`.

### 3. Operations can be PENDING without issues

A pending operation means a resize is scheduled. This is normal — Castellum batches operations and may wait for cooldown periods between resizes.

### 4. Failed operations have a reason

`castellum_list_recently_failed_operations` shows WHY a resize failed. Common reasons:
- Quota exceeded at the domain level (project can't grow)
- Backend capacity exhausted
- Conflicting resize already in progress

### 5. max_age controls the failure lookback window

Default is `1d` (24 hours). Use `7d` for broader investigation, `12h` for recent issues only. Accepts Go-style duration strings with the unit suffixes `s`, `m`, `h`, and `d` (the `d` suffix for days is an extension beyond Go's standard `time.ParseDuration` units).
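
If you need to reason about such a window client-side, a small illustrative helper (not part of the MCP server) that converts these strings to seconds:

```python
def duration_to_seconds(value: str) -> int:
    """Convert strings like '30m', '12h', '7d' to seconds (illustrative helper)."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    unit = value[-1]
    if unit not in units:
        raise ValueError(f"unsupported duration unit in {value!r}")
    return int(value[:-1]) * units[unit]

assert duration_to_seconds("1d") == 86400
assert duration_to_seconds("12h") == 43200
```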

### 6. asset_type filters are specific strings

Format: `project-quota:<service>:<resource>`. Examples:
- `project-quota:compute:cores`
- `project-quota:compute:ram`
- `project-quota:block-storage:capacity`

### 7. Castellum only acts on configured resources

Not all resources have autoscaling configured. `castellum_get_project_resources` shows which resources ARE configured and their thresholds. No configuration = no autoscaling.

### 8. Cooldown prevents thrashing

After a resize, Castellum waits before acting again. If you see a resource at threshold but no pending operation, it may be in cooldown.

## Common Workflows

### Check Autoscaling Configuration

```
1. keystone_token_info → get current project_id
2. castellum_get_project_resources(project_id=<uuid>)
3. Review: which resources are configured, thresholds, current status
```
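
As tool calls, that workflow might look like the sketch below. `call_tool` is a stand-in for whatever invocation mechanism your MCP client or agent framework provides, and the response field names are assumptions — check the actual tool output:

```python
def call_tool(name: str, arguments: dict | None = None) -> dict:
    """Placeholder: wire this to your MCP client's tool-invocation call."""
    raise NotImplementedError

def check_autoscaling_config() -> dict:
    # 1. Resolve the current project from the token.
    token = call_tool("keystone_token_info")
    project_id = token["project"]["id"]          # assumed response shape

    # 2. Fetch autoscaling configuration and status for that project.
    resources = call_tool("castellum_get_project_resources",
                          {"project_id": project_id})

    # 3. Review configured resources, thresholds, and current status.
    return resources
```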

### Are There Pending Resizes?

```
1. castellum_list_pending_operations(project_id=<uuid>)
2. If empty: no scheduled resizes (normal)
3. If populated: review what's being resized and when
```

### Diagnose Autoscaling Failures

```
1. castellum_list_recently_failed_operations(project_id=<uuid>, max_age=7d)
2. Review failure reasons
3. Common: domain quota ceiling hit, backend capacity
4. Cross-reference with Limes: limes_get_project_quota → is project at domain cap?
```

### "Why didn't my quota grow?"

```
1. castellum_get_project_resources(project_id) → is the resource configured?
2. If not configured: autoscaling won't act
3. If configured: check thresholds — is usage actually above HIGH?
4. castellum_list_recently_failed_operations → did it try and fail?
5. Check cooldown — did it resize recently and is waiting?
```

### Correlate with Quota

```
1. castellum_get_project_resources(project_id) → autoscaling config
2. limes_get_project_quota → current quota and usage
3. Compare: is usage near threshold? Has quota been growing?
```
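
A sketch of that correlation, reusing the same `call_tool` placeholder; the comparison is deliberately loose because the exact response schemas vary:

```python
def call_tool(name: str, arguments: dict | None = None) -> dict:
    """Placeholder: wire this to your MCP client's tool-invocation call."""
    raise NotImplementedError

def correlate_autoscaling_with_quota(project_id: str) -> None:
    autoscaling = call_tool("castellum_get_project_resources",
                            {"project_id": project_id})
    quota = call_tool("limes_get_project_quota",
                      {"project_id": project_id})
    # Put the configured thresholds next to the live usage/quota and ask:
    # is usage near the HIGH threshold, and has quota been growing over time?
    print("Castellum config:", autoscaling)
    print("Limes quota/usage:", quota)
```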

## Troubleshooting

### No autoscaling configured for a resource

- Castellum configuration is per-resource, per-project
- Not all projects have autoscaling enabled
- Configuration requires admin or project-admin access (not via MCP tools)

### Operations keep failing

- Check `castellum_list_recently_failed_operations(max_age=7d)` for patterns
- If "domain quota exceeded": the project has hit its domain-level cap. Need domain admin to increase domain quota.
- If "backend capacity": physical capacity exhausted in the region/AZ
- If "conflicting operation": wait for the existing operation to complete

### Autoscaling seems too slow

- Castellum has deliberate cooldown periods (typically 5-15 minutes)
- It batches multiple threshold crossings
- For urgent needs: manual quota adjustment via Limes is faster

### Resource at threshold but no pending operation

- Cooldown period active (recent resize within last N minutes)
- Castellum may not have polled yet (typical interval: 5 minutes)
- Resource may not be configured for autoscaling

## Security Considerations

- Autoscaling configuration reveals capacity management strategy
- Failed operations reveal infrastructure limits and bottlenecks
- project_id in URLs is validated (UUID format) to prevent injection
- Autoscaling policies are read-only via MCP — no configuration changes possible

## Cross-Service References

| Need | Service | Tool |
|------|---------|------|
| Current quota and usage | Limes | `limes_get_project_quota(project_id=<uuid>)` |
| Domain-level quota cap | Limes | `limes_get_domain_quota(domain_id=<uuid>)` |
| Project identity | Keystone | `keystone_token_info`, `keystone_list_projects` |
| Who changed autoscaling config | Hermes | `hermes_list_events(target_type=castellum)` |
| Compute resources being scaled | Nova | `nova_list_servers` (to see actual usage) |
152 changes: 152 additions & 0 deletions plugins/sapcc/skills/sapcc-baremetal/SKILL.md
@@ -0,0 +1,152 @@
---
name: sapcc-baremetal
description: >
  Bare metal node management via Ironic in SAP Converged Cloud.
  Triggers: bare metal, ironic, physical server, baremetal, BMC, IPMI, redfish, hardware
version: 1.0.0
metadata:
  service: [ironic]
  task: [inspect, manage, debug]
  persona: [platform-engineer]
---

# SAP CC Bare Metal (Ironic)

Inspect Ironic bare metal nodes: list nodes, check provision/power states, and understand maintenance status. Ironic manages physical servers in the cloud, enabling bare metal provisioning alongside VMs.

## MCP Tools

| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `ironic_list_nodes` | List baremetal nodes | `provision_state`, `maintenance`, `driver`, `resource_class`, `instance_uuid`, `fault`, `owner` |
| `ironic_get_node` | Full detail for a single node | `node_id` (UUID or name) |

## Security Note

**BMC credentials (IPMI/Redfish passwords) are intentionally excluded from responses.** The MCP server strips `driver_info` and `properties` fields that may contain hardware management credentials. This is a security boundary.

## Gotchas

### 1. Provision state is the lifecycle — not power state

| provision_state | Meaning |
|----------------|---------|
| `available` | Ready to be provisioned (no instance) |
| `active` | Running an instance |
| `deploying` | Instance being deployed onto the node |
| `cleaning` | Being wiped between tenants |
| `error` | Failed operation — needs investigation |
| `manageable` | Enrolled but not yet made available |

### 2. Power state is separate from provision state

A node can be `provision_state=active` but `power_state=power off` (unexpected shutdown). Power states: `power on`, `power off`, `None` (unknown).
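
A quick anomaly check for exactly this case, assuming each entry returned by `ironic_list_nodes` carries `name`, `provision_state`, and `power_state` fields (assumed names — verify against the real output):

```python
def unexpectedly_off_nodes(nodes: list[dict]) -> list[str]:
    """Nodes that carry an instance (active) but report power off."""
    return [
        node.get("name") or node.get("uuid", "<unknown>")
        for node in nodes
        if node.get("provision_state") == "active"
        and node.get("power_state") == "power off"
    ]

# Hand-written sample shaped like an assumed node entry:
sample = [{"name": "bm-node-01", "provision_state": "active",
           "power_state": "power off"}]
print(unexpectedly_off_nodes(sample))  # ['bm-node-01']
```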

### 3. Maintenance mode = node excluded from scheduling

When `maintenance=true`, Nova will not schedule new instances to this node. Existing instances may still be running. Maintenance is set manually by operators or automatically on repeated failures.

### 4. instance_uuid links to Nova

When a node has an instance deployed, `instance_uuid` contains the Nova server UUID. Use `nova_get_server(instance_uuid)` to see the VM running on this hardware.

### 5. resource_class determines scheduling

Nodes declare their resource class (e.g., `baremetal`, `baremetal.large`). Nova flavors reference these classes to match workloads to appropriate hardware.

### 6. driver indicates management protocol

Common drivers: `ipmi` (legacy BMC), `redfish` (modern REST-based BMC). The driver determines how the node is powered on/off and booted.

### 7. fault indicates why a node is broken

When a node enters error/maintenance, the `fault` field explains why: `power failure`, `clean failure`, `deploy failure`, etc. This is the first thing to check for broken nodes.

### 8. Nodes are owned by projects

The `owner` field shows which project can provision instances on this node. Filter by owner to see nodes allocated to your project.

## Common Workflows

### Inventory Bare Metal Nodes

```
1. ironic_list_nodes()
2. Review: name/UUID, provision_state, power_state, maintenance
3. Flag any in error state or maintenance
```

### Find Available Nodes

```
1. ironic_list_nodes(provision_state=available, maintenance=false)
2. These nodes are ready for instance deployment
3. Check resource_class to match with desired flavor
```

### Check What Instance Runs on a Node

```
1. ironic_get_node(node_id=<uuid>) → note instance_uuid
2. nova_get_server(server_id=<instance_uuid>) → instance details
```
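
Chained as tool calls, using the same hypothetical `call_tool` placeholder; `instance_uuid` in the node detail is the link described in gotcha 4:

```python
def call_tool(name: str, arguments: dict | None = None) -> dict:
    """Placeholder: wire this to your MCP client's tool-invocation call."""
    raise NotImplementedError

def instance_on_node(node_id: str) -> dict | None:
    node = call_tool("ironic_get_node", {"node_id": node_id})
    instance_uuid = node.get("instance_uuid")
    if not instance_uuid:
        return None  # node has no instance deployed
    return call_tool("nova_get_server", {"server_id": instance_uuid})
```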

### Troubleshoot a Node in Error

```
1. ironic_get_node(node_id=<uuid>) → check fault field
2. Check maintenance flag — was it set automatically?
3. Check provision_state for last failed transition
4. hermes_list_events(target_type=baremetal/node, target_id=<uuid>)
```

### Find Nodes by Owner Project

```
1. ironic_list_nodes(owner=<project_uuid>)
2. Shows all nodes allocated to that project
3. Combine with provision_state filter for specific views
```

## Troubleshooting

### Node in error state

- Check `fault` field first — it describes the failure
- Common faults: `power failure` (BMC unreachable), `clean failure` (disk wipe failed), `deploy failure` (image deployment failed)
- Check Hermes for the triggering event

### Node stuck in "deploying"

- Deployment may have timed out
- Check if the node lost network connectivity during deploy
- BMC may be unresponsive — the node can't be power-cycled

### Node in maintenance unexpectedly

- May have been set automatically after repeated failures
- Check `maintenance_reason` in node details
- Requires operator intervention to clear maintenance flag

### Power state is "None"

- BMC is unreachable — can't determine actual power state
- Check network connectivity to the BMC/management network
- Driver may need reconfiguration

## Security Considerations

- Node listings reveal physical infrastructure topology
- resource_class and driver info reveal hardware types and management protocols
- instance_uuid mapping reveals which workloads run on which physical hardware
- Physical access to BMC = full control of hardware — BMC credential exposure is critical
- Maintenance patterns reveal infrastructure health/reliability

## Cross-Service References

| Need | Service | Tool |
|------|---------|------|
| Instance on a node | Nova | `nova_get_server(<instance_uuid>)` |
| Who modified node state | Hermes | `hermes_list_events(target_type=baremetal/node)` |
| Node network ports | Neutron | `neutron_list_ports(device_id=<node_uuid>)` |
| Compute quota for baremetal | Limes | `limes_get_project_quota(service=compute)` |