feat: add discovery + attacks for 11 GCP services (GKE, MIG, Cloud NAT, Cloud SQL, Spanner, Pub/Sub, Memorystore, Cloud Run, Persistent Disk) #334
Open
achoimet wants to merge 3 commits into
New targets (all opt-in via STEADYBIT_EXTENSION_DISCOVERY_ENABLE_* flags): GKE cluster, GKE node pool, Managed Instance Group, Cloud NAT, Persistent Disk, Cloud SQL, Spanner, Pub/Sub topic, Pub/Sub subscription, Memorystore Redis, Cloud Run service.

New attacks (all reversible by design or self-recovering via GCP control loops):
- GKE node pool: terminate a percentage of running instances (MIG replaces them)
- MIG: delete a percentage of instances (zonal + regional)
- Cloud NAT: disassociate all subnetworks from a NAT, restored on stop
- Cloud SQL: failover (REGIONAL/HA instances only)
- Memorystore Redis: failover (STANDARD_HA tier only)

GKE cluster surfaces k8s.cluster-name plus enrichment rules so extension-kubernetes attributes flow back onto the GKE cluster target.

Discovery attributes mirror the GCP field names (gcp.gke.cluster.private-cluster-config.enable-private-nodes, gcp.cloud-sql.instance.availability-type, gcp.memorystore.redis.tier, etc.). Volatile operational counters are excluded from discovery to keep the target table stable.
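The "restored on stop" behaviour of the Cloud NAT attack can be illustrated with a minimal in-memory sketch. It is not the extension's code: the `nat` and `natAttackState` types, the function names, and the subnetwork names are hypothetical, and no GCP API is called. It only shows the snapshot-at-prepare, restore-at-stop pattern.

```go
package main

import "fmt"

// nat is a simplified, hypothetical stand-in for a Cloud NAT entry on a router.
type nat struct {
	Name        string
	Subnetworks []string
}

// natAttackState is what the prepare step hands to the later steps.
type natAttackState struct {
	NatName             string
	OriginalSubnetworks []string
}

// prepareNat snapshots the current subnetwork list. The explicit copy keeps
// the snapshot independent of later changes to the live configuration.
func prepareNat(n *nat) natAttackState {
	snapshot := make([]string, len(n.Subnetworks))
	copy(snapshot, n.Subnetworks)
	return natAttackState{NatName: n.Name, OriginalSubnetworks: snapshot}
}

// startNat disassociates every subnetwork from the NAT.
func startNat(n *nat) {
	n.Subnetworks = nil
}

// stopNat restores the list captured by prepareNat.
func stopNat(n *nat, s natAttackState) {
	n.Subnetworks = append([]string(nil), s.OriginalSubnetworks...)
}

func main() {
	n := &nat{Name: "nat-1", Subnetworks: []string{"subnet-a", "subnet-b"}}
	state := prepareNat(n)

	startNat(n)
	fmt.Println("during attack:", len(n.Subnetworks), "subnetworks")

	stopNat(n, state)
	fmt.Println("after stop:", n.Subnetworks)
}
```

If the stop step never runs, nothing in this pattern restores the list, which is why the state has to survive until Stop is called.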
- Memorystore failover: wire dataProtectionMode parameter through state. The Start function was passing an empty string to dataProtectionFromString, silently defaulting to LIMITED_DATA_LOSS regardless of the user's selection.
- GKE node pool terminate-instances + MIG delete-instances: add a confirmHighImpact boolean parameter. Percentages above 50% now require explicit acknowledgement that more than half the pool/MIG will be deleted at once.
- Tighten attack Descriptions and the README to drop the misleading "reversible / no automatic rollback" framing. The honest picture: only Cloud NAT disassociate is truly reversible (snapshot + restore on Stop); GKE/MIG delete-instances are destructive and self-healing via the MIG; Cloud SQL and Memorystore failovers are not reversible, because they exercise the same code path as a real zonal/primary-node outage.
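The confirmHighImpact gate described above could look roughly like the following. This is a sketch under assumptions: the function name `validatePercentage` is hypothetical, and the 1 to 100 range check is an added assumption that the commit message does not state. Only the "above 50% requires confirmHighImpact" rule comes from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// validatePercentage rejects high-impact percentages unless the user has
// explicitly acknowledged them. The range check is an assumption.
func validatePercentage(percentage int, confirmHighImpact bool) error {
	if percentage < 1 || percentage > 100 {
		return fmt.Errorf("percentage must be between 1 and 100, got %d", percentage)
	}
	if percentage > 50 && !confirmHighImpact {
		return errors.New("percentage above 50 requires confirmHighImpact=true")
	}
	return nil
}

func main() {
	fmt.Println(validatePercentage(50, false)) // exactly 50 passes without confirmation
	fmt.Println(validatePercentage(75, false)) // rejected: more than half, not confirmed
	fmt.Println(validatePercentage(75, true))  // accepted: explicitly confirmed
}
```

Running the check before any instance is selected means a rejected request has no side effects.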



Summary
Expands the GCP extension with 11 new opt-in target types and 5 attacks. Mirrors the AWS / Azure expansion playbook: stable, GCP-field-style attribute names; per-module enable flag so existing installations stay byte-identical until toggled on.
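As an illustration of the per-module enable flag, a default-off check might be sketched as below. The helper name `moduleEnabled` and the `EXAMPLE_MODULE` suffix are hypothetical; only the `STEADYBIT_EXTENSION_DISCOVERY_ENABLE_` prefix comes from this PR, and how the extension really parses its configuration is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// enablePrefix is the flag prefix named in this PR; the suffix per module is
// not spelled out here, so callers pass it in.
const enablePrefix = "STEADYBIT_EXTENSION_DISCOVERY_ENABLE_"

// moduleEnabled reports whether a module's flag is set to a true value.
// A missing or unparsable value counts as disabled, so modules stay off
// until an operator turns them on.
func moduleEnabled(lookup func(string) (string, bool), suffix string) bool {
	raw, ok := lookup(enablePrefix + suffix)
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	// EXAMPLE_MODULE is a placeholder suffix, not a real flag name.
	if moduleEnabled(os.LookupEnv, "EXAMPLE_MODULE") {
		fmt.Println("discovery registered")
	} else {
		fmt.Println("discovery skipped")
	}
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the check testable without touching the process environment.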
New target types
- com.steadybit.extension_gcp.gke.cluster: k8s.cluster-name + enrichment rules → joins extension-kubernetes targets.
- com.steadybit.extension_gcp.gke.nodepool: gcp.gke.nodepool.instance-group-urls (MIGs).
- com.steadybit.extension_gcp.mig: AggregatedList.
- com.steadybit.extension_gcp.cloud-nat: (router, nat) pair.
- com.steadybit.extension_gcp.persistent-disk
- com.steadybit.extension_gcp.cloud-sql.instance: availability-type so HA-only attacks gate cleanly.
- com.steadybit.extension_gcp.spanner.instance
- com.steadybit.extension_gcp.pubsub.topic: TopicAdminClient.
- com.steadybit.extension_gcp.pubsub.subscription: delivery-type (pull/push/bigquery/cloud-storage/bigtable) + DLQ + retry.
- com.steadybit.extension_gcp.memorystore.redis: tier so failover gates STANDARD_HA only.
- com.steadybit.extension_gcp.cloudrun.service: locations/- wildcard.
New attacks and their reversibility (read this!)
Only one of the five attacks is truly reversible. The rest are either destructive-but-self-healing or genuine fault injection. The descriptions, README, and parameter UI now reflect that honestly.
- GKE node pool terminate-instances: confirmHighImpact flag required for percentages > 50%.
- MIG delete-instances: confirmHighImpact flag required for percentages > 50%.
- Cloud NAT disassociate: ActionWithStop snapshots the original Subnetworks list at Prepare and restores it at Stop. Re-fetches the router on every patch so concurrent edits to other NATs on the same router survive. If Stop never runs (agent crash, abandoned experiment), the NAT stays disassociated until an operator restores it.
- Cloud SQL failover: requires availability-type=REGIONAL.
- Memorystore Redis failover: FORCE_DATA_LOSS may drop in-flight writes that have not yet been replicated. Requires tier=STANDARD_HA. Parameter selection is honored (was buggy in earlier commits; fixed).
Opt-in by default
All new modules ship disabled. Operators flip them on per-module via STEADYBIT_EXTENSION_DISCOVERY_ENABLE_* (or Helm discovery.enable.*). See the README table for the full list. This keeps the IAM and API-call footprint zero for users upgrading from the previous chart version.
Attribute design
- Attribute names mirror the GCP field names (gcp.cloudsql.availability-type, gcp.memorystore.tier, gcp.gke.cluster.private-cluster-config.enable-private-nodes).
- STEADYBIT_EXTENSION_DISCOVERY_ATTRIBUTES_EXCLUDES_* lists for operators who want to drop attributes (e.g. labels) wholesale.
IAM additions
See README "IAM Permissions" section. Discovery needs read-only *.list / *.get per module; attacks need the specific mutating verb (e.g. cloudsql.instances.failover, redis.instances.failover, compute.routers.patch).
Chart
- Chart version 1.2.0 (no app-version bump).
- discovery.enable.* + per-module discovery.attributes.excludes.* exposed.
- helm lint + helm unittest pass (existing 25 snapshots untouched, 26 test cases pass).
Test plan
- gofmt -l . clean
- go vet ./... clean
- go build ./... clean
- go test ./...: 33 unit tests + 4 e2e tests pass (Docker running locally).
- go mod verify: all modules verified
- helm lint charts/steadybit-extension-gcp: 1 chart linted, 0 failures
- helm unittest charts/steadybit-extension-gcp: 26 tests / 25 snapshots pass
Notes
The commit "fix: honest reversibility framing on new GCP attacks" fixes a Memorystore parameter bug (FORCE_DATA_LOSS was silently downgraded to LIMITED_DATA_LOSS), adds the confirmHighImpact gate for >50% on GKE/MIG, and rewrites attack descriptions to match what the code actually does.
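The Memorystore bug class is easy to reproduce in miniature: if the selected mode is not stored in the state handed from the prepare step to the start step, the start step only ever sees an empty string. The sketch below is an assumption-laden illustration, not the PR's code. `failoverState`, `prepareFailover`, `startFailover`, and `parseDataProtectionMode` are hypothetical names, and treating an empty string as LIMITED_DATA_LOSS mirrors the defaulting behaviour described in the commit message.

```go
package main

import "fmt"

// failoverState is a hypothetical state struct passed from prepare to start.
type failoverState struct {
	InstanceName       string
	DataProtectionMode string // carried through state so start sees the user's choice
}

// parseDataProtectionMode maps the parameter to a known mode. Empty input
// falls back to LIMITED_DATA_LOSS; unknown values are an error.
func parseDataProtectionMode(s string) (string, error) {
	switch s {
	case "", "LIMITED_DATA_LOSS":
		return "LIMITED_DATA_LOSS", nil
	case "FORCE_DATA_LOSS":
		return "FORCE_DATA_LOSS", nil
	default:
		return "", fmt.Errorf("unknown data protection mode %q", s)
	}
}

// prepareFailover records the user's selection in the state.
func prepareFailover(instance, selectedMode string) failoverState {
	return failoverState{InstanceName: instance, DataProtectionMode: selectedMode}
}

// startFailover reads the mode from state instead of a hard-coded empty string.
func startFailover(s failoverState) (string, error) {
	return parseDataProtectionMode(s.DataProtectionMode)
}

func main() {
	state := prepareFailover("redis-1", "FORCE_DATA_LOSS")
	mode, err := startFailover(state)
	fmt.Println(mode, err)
}
```

Returning an error for unknown values, rather than defaulting, is a design choice of this sketch that makes a dropped or misspelled parameter visible instead of silent.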