feat: add discovery + attacks for 11 GCP services (GKE, MIG, Cloud NAT, Cloud SQL, Spanner, Pub/Sub, Memorystore, Cloud Run, Persistent Disk)#334

Open
achoimet wants to merge 3 commits into main from feat/expand-gcp-targets-and-attacks

achoimet (Member) commented May 12, 2026

Summary

Expands the GCP extension with 11 new opt-in target types and 5 attacks. Mirrors the AWS / Azure expansion playbook: stable, GCP-field-style attribute names; per-module enable flag so existing installations stay byte-identical until toggled on.

New target types

| Module | Target ID | Notes |
| --- | --- | --- |
| GKE cluster | `com.steadybit.extension_gcp.gke.cluster` | Surfaces `k8s.cluster-name` + enrichment rules → joins extension-kubernetes targets. |
| GKE node pool | `com.steadybit.extension_gcp.gke.nodepool` | Surfaces underlying `gcp.gke.nodepool.instance-group-urls` (MIGs). |
| Managed Instance Group | `com.steadybit.extension_gcp.mig` | Zonal + regional via AggregatedList. |
| Cloud NAT | `com.steadybit.extension_gcp.cloud-nat` | One target per (router, nat) pair. |
| Persistent Disk | `com.steadybit.extension_gcp.persistent-disk` | Zonal + regional. Discovery only. |
| Cloud SQL | `com.steadybit.extension_gcp.cloud-sql.instance` | Surfaces `availability-type` so HA-only attacks gate cleanly. |
| Spanner | `com.steadybit.extension_gcp.spanner.instance` | Discovery only. |
| Pub/Sub topic | `com.steadybit.extension_gcp.pubsub.topic` | Uses pubsub v2 SDK TopicAdminClient. |
| Pub/Sub subscription | `com.steadybit.extension_gcp.pubsub.subscription` | Surfaces `delivery-type` (pull/push/bigquery/cloud-storage/bigtable) + DLQ + retry. |
| Memorystore Redis | `com.steadybit.extension_gcp.memorystore.redis` | Surfaces `tier` so failover gates STANDARD_HA only. |
| Cloud Run service | `com.steadybit.extension_gcp.cloudrun.service` | Multi-region via `locations/-` wildcard. |

New attacks — and their reversibility (read this!)

Only one of the five attacks is truly reversible. The rest are either destructive-but-self-healing or genuine fault injection. The descriptions, README, and parameter UI now reflect that honestly.

| Attack | Reversibility | Guardrail |
| --- | --- | --- |
| GKE node pool: terminate-instances | Destructive, self-healing. Deleted instances are gone forever; the MIG behind the pool creates new ones per its scaling/heal config. Recovery time depends on cluster-autoscaler and surge config; a misconfigured pool may stay undersized indefinitely. | `confirmHighImpact` flag required for percentages > 50%. |
| MIG: delete-instances | Destructive, self-healing. Same model: the MIG creates replacements. A MIG with autoscaling disabled stays undersized until someone fixes it. | `confirmHighImpact` flag required for percentages > 50%. |
| Cloud NAT: disassociate subnetworks | Truly reversible. ActionWithStop snapshots the original Subnetworks list at Prepare and restores it at Stop. Re-fetches the router on every patch so concurrent edits to other NATs on the same router survive. | If Stop never runs (agent crash, abandoned experiment), the NAT stays disassociated until an operator restores it. Other NATs on the same router are untouched. |
| Cloud SQL: failover | Not reversible. Promotes the REGIONAL standby; Cloud SQL rebuilds a new HA standby behind it. Exercises the same code path as a real zonal outage. | Gated on `availability-type=REGIONAL`. |
| Memorystore Redis: failover | Not reversible. Promotes the standby for STANDARD_HA instances; exercises a real primary-node outage code path. FORCE_DATA_LOSS may drop in-flight writes that have not yet been replicated. | Gated on `tier=STANDARD_HA`. Parameter selection is honored (was buggy in earlier commits; fixed). |
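
The snapshot-and-restore flow described for the Cloud NAT attack can be sketched roughly as follows. This is a minimal illustration, not the extension's code: `Nat`, `Router`, `RouterAPI`, `memoryRouter`, and `State` are hypothetical stand-ins for the Compute API types, and the in-memory fake exists only so the sketch runs without GCP access.

```go
package main

import "fmt"

// Nat and Router are simplified, hypothetical stand-ins for Cloud Router resources.
type Nat struct {
	Name        string
	Subnetworks []string
}

type Router struct {
	Nats []Nat
}

// RouterAPI is an assumed interface; the real extension talks to the Compute API.
type RouterAPI interface {
	Get() Router
	Patch(r Router)
}

// memoryRouter is an in-memory fake so the sketch is runnable on its own.
type memoryRouter struct{ current Router }

// Get returns a deep copy so callers cannot mutate the stored router directly.
func (m *memoryRouter) Get() Router {
	out := Router{Nats: make([]Nat, len(m.current.Nats))}
	for i, n := range m.current.Nats {
		out.Nats[i] = Nat{Name: n.Name, Subnetworks: append([]string(nil), n.Subnetworks...)}
	}
	return out
}

func (m *memoryRouter) Patch(r Router) { m.current = r }

// State is what Prepare hands to Start and Stop.
type State struct {
	NatName  string
	Original []string
}

// Prepare snapshots the target NAT's subnetwork list before anything is changed.
func Prepare(api RouterAPI, natName string) (State, error) {
	for _, n := range api.Get().Nats {
		if n.Name == natName {
			return State{NatName: natName, Original: append([]string(nil), n.Subnetworks...)}, nil
		}
	}
	return State{}, fmt.Errorf("nat %q not found", natName)
}

// setSubnetworks re-fetches the router and changes only the target NAT,
// so edits made to other NATs since the last fetch are carried along.
func setSubnetworks(api RouterAPI, natName string, subnets []string) error {
	r := api.Get()
	for i := range r.Nats {
		if r.Nats[i].Name == natName {
			r.Nats[i].Subnetworks = append([]string(nil), subnets...)
			api.Patch(r)
			return nil
		}
	}
	return fmt.Errorf("nat %q not found", natName)
}

// Start empties the list; Stop writes the snapshot back.
func Start(api RouterAPI, s State) error { return setSubnetworks(api, s.NatName, nil) }
func Stop(api RouterAPI, s State) error  { return setSubnetworks(api, s.NatName, s.Original) }

func main() {
	api := &memoryRouter{current: Router{Nats: []Nat{
		{Name: "nat-a", Subnetworks: []string{"subnet-1", "subnet-2"}},
		{Name: "nat-b", Subnetworks: []string{"subnet-3"}},
	}}}
	state, err := Prepare(api, "nat-a")
	if err != nil {
		panic(err)
	}
	if err := Start(api, state); err != nil {
		panic(err)
	}
	fmt.Println("during attack:", api.Get().Nats)
	if err := Stop(api, state); err != nil {
		panic(err)
	}
	fmt.Println("after stop:", api.Get().Nats)
}
```

Because the snapshot lives only in the action state, nothing in this sketch restores the NAT if `Stop` is never called, which matches the guardrail noted in the table.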

Opt-in by default

All new modules ship disabled. Operators flip them on per-module via STEADYBIT_EXTENSION_DISCOVERY_ENABLE_* (or Helm discovery.enable.*). See the README table for the full list. This keeps the IAM and API-call footprint zero for users upgrading from the previous chart version.
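
As a rough illustration of the opt-in default, a per-module flag check could look like the sketch below. The `moduleEnabled` helper and the `CLOUD_SQL` suffix are assumptions made for this example; the actual variable names are listed in the README table and the extension's config loading may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// enablePrefix comes from the PR description; the suffixes used below are hypothetical.
const enablePrefix = "STEADYBIT_EXTENSION_DISCOVERY_ENABLE_"

// moduleEnabled reports whether a module was explicitly switched on.
// An unset, empty, or unparsable value counts as disabled, which mirrors
// the "ships disabled" default described above.
func moduleEnabled(getenv func(string) string, suffix string) bool {
	enabled, err := strconv.ParseBool(getenv(enablePrefix + suffix))
	return err == nil && enabled
}

func main() {
	// CLOUD_SQL is an illustrative suffix, not a confirmed variable name.
	fmt.Println("cloud sql discovery enabled:", moduleEnabled(os.Getenv, "CLOUD_SQL"))
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the check easy to test without touching the process environment.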

Attribute design

  • Stable / config-only — no volatile operational counters that would update on every refresh and churn the platform.
  • Field names mirror the GCP API (e.g. gcp.cloudsql.availability-type, gcp.memorystore.tier, gcp.gke.cluster.private-cluster-config.enable-private-nodes).
  • Per-module STEADYBIT_EXTENSION_DISCOVERY_ATTRIBUTES_EXCLUDES_* lists for operators who want to drop attributes (e.g. labels) wholesale.
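
The exclude lists could be applied with a filter along these lines. This is a sketch under assumptions: the `filterAttributes` helper, the trailing-`*` prefix matching, and the `gcp.cloudsql.label.*` key are all hypothetical; only `gcp.cloudsql.availability-type` is taken from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// excluded reports whether key matches any pattern. A pattern ending in "*"
// is treated as a prefix match (an assumption); anything else must match exactly.
func excluded(key string, excludes []string) bool {
	for _, pattern := range excludes {
		if strings.HasSuffix(pattern, "*") {
			if strings.HasPrefix(key, strings.TrimSuffix(pattern, "*")) {
				return true
			}
			continue
		}
		if key == pattern {
			return true
		}
	}
	return false
}

// filterAttributes returns a new map without the excluded keys.
// The input map is left unmodified.
func filterAttributes(attrs map[string][]string, excludes []string) map[string][]string {
	kept := make(map[string][]string, len(attrs))
	for k, v := range attrs {
		if !excluded(k, excludes) {
			kept[k] = v
		}
	}
	return kept
}

func main() {
	attrs := map[string][]string{
		"gcp.cloudsql.availability-type": {"REGIONAL"},
		"gcp.cloudsql.label.env":         {"prod"},    // hypothetical label attribute
		"gcp.cloudsql.label.team":        {"billing"}, // hypothetical label attribute
	}
	fmt.Println(filterAttributes(attrs, []string{"gcp.cloudsql.label.*"}))
}
```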

IAM additions

See README "IAM Permissions" section. Discovery needs read-only *.list / *.get per module; attacks need the specific mutating verb (e.g. cloudsql.instances.failover, redis.instances.failover, compute.routers.patch).

Chart

  • Bumped chart version to 1.2.0 (no app-version bump).
  • Helm discovery.enable.* + per-module discovery.attributes.excludes.* exposed.
  • helm lint + helm unittest pass (existing 25 snapshots untouched, 26 test cases pass).

Test plan

  • gofmt -l . clean
  • go vet ./... clean
  • go build ./... clean
  • go test ./... — 33 unit tests + 4 e2e tests pass (Docker running locally).
  • go mod verify — all modules verified
  • helm lint charts/steadybit-extension-gcp — 1 chart linted, 0 failures
  • helm unittest charts/steadybit-extension-gcp — 26 tests / 25 snapshots pass
  • CI run on this branch (please confirm green before merging)
  • Manual smoke test of each new discovery against a real GCP project (operator responsibility)

Notes

  • No backward-incompatible changes: every new module defaults to disabled, existing VM discovery and reset/stop/suspend attacks are untouched.
  • The audit follow-up commit, "fix: honest reversibility framing on new GCP attacks", fixes a Memorystore parameter bug (FORCE_DATA_LOSS was silently downgraded to LIMITED_DATA_LOSS), adds the confirmHighImpact gate for >50% on GKE/MIG, and rewrites attack descriptions to match what the code actually does.
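
To make the Memorystore parameter bug concrete, here is a minimal sketch of carrying the selected mode through the action state. The string constants stand in for the SDK enum, and `FailoverState`, `prepare`, and `start` are hypothetical names; only `dataProtectionFromString` and the two mode names come from this PR.

```go
package main

import "fmt"

// DataProtectionMode is a string stand-in for the Memorystore SDK enum.
type DataProtectionMode string

const (
	LimitedDataLoss DataProtectionMode = "LIMITED_DATA_LOSS"
	ForceDataLoss   DataProtectionMode = "FORCE_DATA_LOSS"
)

// FailoverState is a hypothetical action state persisted between Prepare and Start.
type FailoverState struct {
	InstanceName       string
	DataProtectionMode string
}

// dataProtectionFromString falls back to LIMITED_DATA_LOSS for anything
// that is not FORCE_DATA_LOSS, including the empty string.
func dataProtectionFromString(s string) DataProtectionMode {
	if s == string(ForceDataLoss) {
		return ForceDataLoss
	}
	return LimitedDataLoss
}

// prepare stores the user's selection in the state so start can read it later.
func prepare(instance, selectedMode string) FailoverState {
	return FailoverState{InstanceName: instance, DataProtectionMode: selectedMode}
}

// start resolves the mode from the state. Passing "" here instead of
// state.DataProtectionMode would reproduce the silent downgrade.
func start(state FailoverState) DataProtectionMode {
	return dataProtectionFromString(state.DataProtectionMode)
}

func main() {
	state := prepare("my-redis", "FORCE_DATA_LOSS") // instance name is illustrative
	fmt.Println("mode used for failover:", start(state))
	fmt.Println("mode with empty input:", dataProtectionFromString(""))
}
```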

achoimet added 3 commits May 12, 2026 10:57
New targets (all opt-in via STEADYBIT_EXTENSION_DISCOVERY_ENABLE_* flags):
GKE cluster, GKE node pool, Managed Instance Group, Cloud NAT,
Persistent Disk, Cloud SQL, Spanner, Pub/Sub topic, Pub/Sub subscription,
Memorystore Redis, Cloud Run service.

New attacks (all reversible by design or self-recovering via GCP control loops):
- GKE node pool: terminate a percentage of running instances (MIG replaces them)
- MIG: delete a percentage of instances (zonal + regional)
- Cloud NAT: disassociate all subnetworks from a NAT, restored on stop
- Cloud SQL: failover (REGIONAL/HA instances only)
- Memorystore Redis: failover (STANDARD_HA tier only)

GKE cluster surfaces k8s.cluster-name plus enrichment rules so extension-kubernetes
attributes flow back onto the GKE cluster target. Discovery attributes mirror
the GCP field names (gcp.gke.cluster.private-cluster-config.enable-private-nodes,
gcp.cloud-sql.instance.availability-type, gcp.memorystore.redis.tier, etc.).
Volatile operational counters are excluded from discovery to keep the target
table stable.

- Memorystore failover: wire dataProtectionMode parameter through state.
  The Start function was passing an empty string to dataProtectionFromString,
  silently defaulting to LIMITED_DATA_LOSS regardless of the user's selection.
- GKE node pool terminate-instances + MIG delete-instances: add a
  confirmHighImpact boolean parameter. Percentages above 50% now require
  explicit acknowledgement that more than half the pool/MIG will be deleted
  at once.
- Tighten attack Descriptions and the README to drop the misleading
  "reversible / no automatic rollback" framing. The honest picture:
  only Cloud NAT disassociate is truly reversible (snapshot + restore on Stop);
  GKE/MIG delete-instances are destructive and self-healing via the MIG;
  Cloud SQL and Memorystore failovers are not reversible — they exercise
  the same code path as a real zonal/primary-node outage.
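
The confirmHighImpact gate described in the commit message amounts to a small validation step before any instance is deleted. The sketch below shows one possible shape; `validateImpact` and the 1 to 100 range check are assumptions, while the 50% threshold and the `confirmHighImpact` parameter name come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// validateImpact rejects requests that would delete more than half of the
// pool or MIG unless the caller has explicitly acknowledged the impact.
// The range check is an assumption added for completeness.
func validateImpact(percentage int, confirmHighImpact bool) error {
	if percentage < 1 || percentage > 100 {
		return fmt.Errorf("percentage must be between 1 and 100, got %d", percentage)
	}
	if percentage > 50 && !confirmHighImpact {
		return errors.New("percentages above 50 require confirmHighImpact=true")
	}
	return nil
}

func main() {
	fmt.Println(validateImpact(30, false)) // at or below the threshold: allowed
	fmt.Println(validateImpact(80, false)) // above the threshold without acknowledgement
	fmt.Println(validateImpact(80, true))  // above the threshold, acknowledged
}
```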