feat: add discovery + attacks for 11 GCP services (GKE, MIG, Cloud NAT, Cloud SQL, Spanner, Pub/Sub, Memorystore, Cloud Run, Persistent Disk) by achoimet · Pull Request #334 · steadybit/extension-gcp

achoimet · 2026-05-12T08:58:13Z

Summary

Expands the GCP extension with 11 new opt-in target types and 5 attacks. Mirrors the AWS / Azure expansion playbook: stable, GCP-field-style attribute names; per-module enable flag so existing installations stay byte-identical until toggled on.

New target types

| Module | Target ID | Notes |
| --- | --- | --- |
| GKE cluster | `com.steadybit.extension_gcp.gke.cluster` | Surfaces `k8s.cluster-name` + enrichment rules → joins extension-kubernetes targets. |
| GKE node pool | `com.steadybit.extension_gcp.gke.nodepool` | Surfaces underlying `gcp.gke.nodepool.instance-group-urls` (MIGs). |
| Managed Instance Group | `com.steadybit.extension_gcp.mig` | Zonal + regional via `AggregatedList`. |
| Cloud NAT | `com.steadybit.extension_gcp.cloud-nat` | One target per `(router, nat)` pair. |
| Persistent Disk | `com.steadybit.extension_gcp.persistent-disk` | Zonal + regional. Discovery only. |
| Cloud SQL | `com.steadybit.extension_gcp.cloud-sql.instance` | Surfaces `availability-type` so HA-only attacks gate cleanly. |
| Spanner | `com.steadybit.extension_gcp.spanner.instance` | Discovery only. |
| Pub/Sub topic | `com.steadybit.extension_gcp.pubsub.topic` | Uses pubsub v2 SDK `TopicAdminClient`. |
| Pub/Sub subscription | `com.steadybit.extension_gcp.pubsub.subscription` | Surfaces `delivery-type` (pull/push/bigquery/cloud-storage/bigtable) + DLQ + retry. |
| Memorystore Redis | `com.steadybit.extension_gcp.memorystore.redis` | Surfaces `tier` so failover gates STANDARD_HA only. |
| Cloud Run service | `com.steadybit.extension_gcp.cloudrun.service` | Multi-region via `locations/-` wildcard. |

New attacks — and their reversibility (read this!)

Only one of the five attacks is truly reversible. The rest are either destructive-but-self-healing or genuine fault injection. The descriptions, README, and parameter UI now reflect that honestly.

| Attack | Reversibility | Guardrail |
| --- | --- | --- |
| GKE node pool: terminate-instances | Destructive, self-healing. Deleted instances are gone forever; the MIG behind the pool creates new ones per its scaling/heal config. Recovery time depends on cluster-autoscaler and surge config; a misconfigured pool may stay undersized indefinitely. | `confirmHighImpact` flag required for percentages > 50%. |
| MIG: delete-instances | Destructive, self-healing. Same model: the MIG creates new replacements. A MIG with autoscaling disabled stays undersized until someone fixes it. | `confirmHighImpact` flag required for percentages > 50%. |
| Cloud NAT: disassociate subnetworks | Truly reversible. `ActionWithStop` snapshots the original `Subnetworks` list at Prepare and restores it at Stop. Re-fetches the router on every patch so concurrent edits to other NATs on the same router survive. If Stop never runs (agent crash, abandoned experiment), the NAT stays disassociated until an operator restores it. | Other NATs on the same router untouched. |
| Cloud SQL: failover | Not reversible. Promotes the REGIONAL standby; Cloud SQL rebuilds a new HA standby behind it. Exercises the same code path as a real zonal outage. | Gated on `availability-type=REGIONAL`. |
| Memorystore Redis: failover | Not reversible. Promotes the standby for STANDARD_HA instances; exercises a real primary-node outage code path. `FORCE_DATA_LOSS` may drop in-flight writes that have not yet been replicated. | Gated on `tier=STANDARD_HA`. Parameter selection is honored (was buggy in earlier commits; fixed). |

Opt-in by default

All new modules ship disabled. Operators enable them per module via `STEADYBIT_EXTENSION_DISCOVERY_ENABLE_*` (or Helm `discovery.enable.*`). See the README table for the full list. This keeps the IAM and API-call footprint at zero for users upgrading from the previous chart version.

Attribute design

- Stable, config-only attributes: no volatile operational counters that would change on every refresh and churn the platform.
- Field names mirror the GCP API (e.g. `gcp.cloudsql.availability-type`, `gcp.memorystore.tier`, `gcp.gke.cluster.private-cluster-config.enable-private-nodes`).
- Per-module `STEADYBIT_EXTENSION_DISCOVERY_ATTRIBUTES_EXCLUDES_*` lists for operators who want to drop attributes (e.g. labels) wholesale.

IAM additions

See README "IAM Permissions" section. Discovery needs read-only `*.list` / `*.get` per module; attacks need the specific mutating verb (e.g. `cloudsql.instances.failover`, `redis.instances.failover`, `compute.routers.patch`).

Chart

- Bumped chart version to 1.2.0 (no app-version bump).
- Helm `discovery.enable.*` + per-module `discovery.attributes.excludes.*` exposed.
- `helm lint` + `helm unittest` pass (existing 25 snapshots untouched, 26 test cases pass).

Test plan

- `gofmt -l .` clean
- `go vet ./...` clean
- `go build ./...` clean
- `go test ./...`: 33 unit tests + 4 e2e tests pass (Docker running locally).
- `go mod verify`: all modules verified
- `helm lint charts/steadybit-extension-gcp`: 1 chart linted, 0 failures
- `helm unittest charts/steadybit-extension-gcp`: 26 tests / 25 snapshots pass
- CI run on this branch (please confirm green before merging)
- Manual smoke test of each new discovery against a real GCP project (operator responsibility)

Notes

- No backward-incompatible changes: every new module defaults to disabled, and existing VM discovery and reset/stop/suspend attacks are untouched.
- The audit follow-up commit `fix: honest reversibility framing on new GCP attacks` fixes a Memorystore parameter bug (`FORCE_DATA_LOSS` was silently downgraded to `LIMITED_DATA_LOSS`), adds the `confirmHighImpact` gate for percentages above 50% on GKE/MIG, and rewrites attack descriptions to match what the code actually does.
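The `confirmHighImpact` gate can be sketched as a small validation step. The function name `validateImpact` and the accepted range of 1 to 100 are assumptions made for this illustration; only the rule that percentages above 50 need explicit confirmation comes from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// validateImpact rejects high-impact runs unless the operator has
// explicitly acknowledged them. The 1..100 range check is an assumption.
func validateImpact(percentage int, confirmHighImpact bool) error {
	if percentage < 1 || percentage > 100 {
		return fmt.Errorf("percentage must be between 1 and 100, got %d", percentage)
	}
	if percentage > 50 && !confirmHighImpact {
		return errors.New("percentages above 50 require confirmHighImpact=true")
	}
	return nil
}

func main() {
	fmt.Println(validateImpact(50, false)) // at the threshold, no confirmation needed
	fmt.Println(validateImpact(75, false)) // rejected: more than half the pool/MIG
	fmt.Println(validateImpact(75, true))  // accepted after explicit acknowledgement
}
```

Running the check before any instance is selected keeps a rejected request free of side effects.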

New targets (all opt-in via `STEADYBIT_EXTENSION_DISCOVERY_ENABLE_*` flags): GKE cluster, GKE node pool, Managed Instance Group, Cloud NAT, Persistent Disk, Cloud SQL, Spanner, Pub/Sub topic, Pub/Sub subscription, Memorystore Redis, Cloud Run service.

New attacks (all reversible by design or self-recovering via GCP control loops):
- GKE node pool: terminate a percentage of running instances (MIG replaces them)
- MIG: delete a percentage of instances (zonal + regional)
- Cloud NAT: disassociate all subnetworks from a NAT, restored on stop
- Cloud SQL: failover (REGIONAL/HA instances only)
- Memorystore Redis: failover (STANDARD_HA tier only)

GKE cluster surfaces `k8s.cluster-name` plus enrichment rules so extension-kubernetes attributes flow back onto the GKE cluster target.

Discovery attributes mirror the GCP field names (`gcp.gke.cluster.private-cluster-config.enable-private-nodes`, `gcp.cloud-sql.instance.availability-type`, `gcp.memorystore.redis.tier`, etc.). Volatile operational counters are excluded from discovery to keep the target table stable.

- Memorystore failover: wire `dataProtectionMode` parameter through state. The Start function was passing an empty string to `dataProtectionFromString`, silently defaulting to `LIMITED_DATA_LOSS` regardless of the user's selection.
- GKE node pool terminate-instances + MIG delete-instances: add a `confirmHighImpact` boolean parameter. Percentages above 50% now require explicit acknowledgement that more than half the pool/MIG will be deleted at once.
- Tighten attack Descriptions and the README to drop the misleading "reversible / no automatic rollback" framing. The honest picture: only Cloud NAT disassociate is truly reversible (snapshot + restore on Stop); GKE/MIG delete-instances are destructive and self-healing via the MIG; Cloud SQL and Memorystore failovers are not reversible: they exercise the same code path as a real zonal/primary-node outage.
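The Memorystore fix amounts to carrying the selected mode from Prepare to Start in the action state. The sketch below shows that shape under stated assumptions: `dataProtectionMode` and `dataProtectionFromString` are named in the commit message, but their bodies here, the `failoverState` struct, the `mode` type and the config map are hypothetical.

```go
package main

import "fmt"

// mode is a hypothetical stand-in for the Memorystore data protection enum.
type mode int

const (
	limitedDataLoss mode = iota
	forceDataLoss
)

// failoverState is the state Prepare persists for Start (assumed shape).
type failoverState struct {
	DataProtectionMode string
}

// dataProtectionFromString falls back to limitedDataLoss for unknown or
// empty input, which is why passing "" hid the user's choice before the fix.
func dataProtectionFromString(s string) mode {
	if s == "FORCE_DATA_LOSS" {
		return forceDataLoss
	}
	return limitedDataLoss
}

// prepare copies the user's selection into state.
func prepare(config map[string]any) failoverState {
	selected, _ := config["dataProtectionMode"].(string)
	return failoverState{DataProtectionMode: selected}
}

// start reads the mode from state instead of passing an empty string.
func start(s failoverState) mode {
	return dataProtectionFromString(s.DataProtectionMode)
}

func main() {
	state := prepare(map[string]any{"dataProtectionMode": "FORCE_DATA_LOSS"})
	fmt.Println(start(state) == forceDataLoss)
}
```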

sonarqubecloud · 2026-05-12T09:25:15Z

Quality Gate passed

Issues
66 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.7% Coverage on New Code
10.5% Duplication on New Code

See analysis details on SonarQube Cloud

achoimet added 3 commits May 12, 2026 10:57

achoimet wants to merge 3 commits into main from feat/expand-gcp-targets-and-attacks
