feat: support topology-aware Service controls [sc-18128] #295

Merged: devkoriel merged 2 commits into main from sc-18128-erpc-traffic-distribution on May 14, 2026
@devkoriel devkoriel commented May 13, 2026

Summary

Adds the chart interfaces needed for the SC-18128 FinOps rollout:

  • eRPC Service trafficDistribution support from the original PR.
  • spire Service support for loadBalancerClass, externalTrafficPolicy, and trafficDistribution.
  • validator ghost/VAO Service support for loadBalancerClass, externalTrafficPolicy, and trafficDistribution.
  • spire private metricsService so metrics can stay on ClusterIP while public LoadBalancer Services expose only libp2p.
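
As a rough illustration of how these options might be combined for spire, the values fragment below mirrors the keys passed via `--set` in the Verification section. It is a sketch, not the chart's documented defaults, and any key beyond those shown in the `--set` flags would be an assumption.

```yaml
# Sketch of a values override for charts/spire.
# Keys are taken from the --set flags in the Verification section;
# the chart's actual defaults and any additional keys are not shown here.
service:
  type: LoadBalancer
  loadBalancerClass: service.k8s.aws/nlb
  externalTrafficPolicy: Local
  trafficDistribution: PreferSameZone

# Private metrics Service, so the public LoadBalancer exposes only libp2p.
metricsService:
  enabled: true

serviceMonitor:
  enabled: true
```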

This PR is the chart dependency for the app-of-apps rollout PR.

Before architecture

flowchart LR
  Internet[Internet clients] --> LB[Classic ELB / NLB instance target]
  LB --> Node[Any worker node in LB AZ]
  Node --> Proxy[kube-proxy / Cluster routing]
  Proxy --> Pod[spire / ghost pod, possibly another AZ]
  Pod --> Metrics[Metrics port on same public Service]
  Pod --> Svc[ClusterIP Services without topology hints]
  Svc --> RemotePod[Backend pod in another AZ]
  Pod --> NAT[NAT gateway egress]
  NAT --> External[External registries / IPFS / GitHub]

After architecture

flowchart LR
  Internet[Internet clients] --> NLB[NLB managed by AWS LBC]
  NLB --> PodIP[Pod IP target]
  PodIP --> Libp2p[libp2p port only]
  PodIP --> InternalSvc[ClusterIP Services with PreferSameZone]
  InternalSvc --> LocalPod[Same-zone backend when available]
  PodIP --> MetricsSvc[Private ClusterIP metrics Service]
  MetricsSvc --> Prom[Prometheus ServiceMonitor]
  PodIP -. follow-up infra PR .-> ECR[ECR API/DKR VPC endpoints]
  PodIP -. follow-up app/cache work .-> Cache[S3/CloudFront or registry cache]
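
To make the target state more concrete, the manifests below sketch what a public libp2p Service and a private metrics Service could look like once rendered. This is an illustration under assumptions, not output from the charts: the resource names, selector label, and port numbers are hypothetical, and the pod-IP target annotation is an AWS Load Balancer Controller setting that this PR does not itself show. `PreferSameZone` also needs a Kubernetes version that accepts that value for `trafficDistribution`.

```yaml
# Hypothetical rendered output; names, labels, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: spire-libp2p                      # hypothetical name
  annotations:
    # Assumption: ask AWS Load Balancer Controller to register pod IPs as targets.
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  loadBalancerClass: service.k8s.aws/nlb
  externalTrafficPolicy: Local
  trafficDistribution: PreferSameZone
  selector:
    app.kubernetes.io/name: spire         # hypothetical label
  ports:
    - name: libp2p
      port: 8000                          # hypothetical port
      targetPort: libp2p
      protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: spire-metrics                     # hypothetical name
spec:
  type: ClusterIP                         # metrics stay off the public load balancer
  selector:
    app.kubernetes.io/name: spire         # hypothetical label
  ports:
    - name: metrics
      port: 9090                          # hypothetical port
      targetPort: metrics
      protocol: TCP
```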

Expected cost reduction

Baseline from May 2026 investigation, account 609117668403.

| Driver | Measured baseline | This PR enables | Expected impact | Validation signal |
| --- | --- | --- | --- | --- |
| EU-DataTransfer-Regional-Bytes | US$7,787.27 MTD in EC2-Other, plus US$2,651.97 MTD in ELB regional transfer | AZ-local Service hints and pod-IP/load-balancer controls | Medium: app rollout should reduce cross-AZ forwarding for migrated Services | Cost Explorer usage type, Grafana regional transfer panels |
| EU-NatGateway-Bytes | US$9,231.92 MTD; NAT response side was about 224 TB over 14 days | Private metrics split and chart support for egress/cache rollout | Indirect here; direct NAT reduction lands in Terraform/app follow-ups | NAT BytesInFromDestination, BytesOutToSource |
| ELB LCU / public surface | US$2,279.93 MTD LCU | Public Services can expose libp2p only, metrics stay private | Low-to-medium cost impact; primary gain is security and cleaner traffic shape | ELB LCU, target health, ServiceMonitor scrape health |
| EC2 compute | US$3,729.27 MTD | No compute downsizing in this PR | No direct reduction; right-size only after network baseline stabilizes | Node CPU/memory and 7-14 day post-rollout baseline |

These are expected impact levels, not guarantees. The first measurable target is a lower NAT and regional-transfer cost slope with no increase in libp2p errors, p95/p99 latency, or pod restarts.

Verification

  • helm lint charts/erpc
  • helm lint charts/spire
  • helm lint charts/validator
  • helm template erpc-test charts/erpc --set service.trafficDistribution=PreferSameZone
  • helm template spire-test charts/spire --set service.type=LoadBalancer --set service.loadBalancerClass=service.k8s.aws/nlb --set service.externalTrafficPolicy=Local --set service.trafficDistribution=PreferSameZone --set metricsService.enabled=true --set serviceMonitor.enabled=true
  • helm template validator-test charts/validator --set ghost.service.loadBalancerClass=service.k8s.aws/nlb --set ghost.service.externalTrafficPolicy=Local --set ghost.service.trafficDistribution=PreferSameZone --set vao.service.loadBalancerClass=service.k8s.aws/nlb --set vao.service.externalTrafficPolicy=Local --set vao.service.trafficDistribution=PreferSameZone
  • ct lint --config ct.yaml
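
For repeatable checks, the long validator `--set` chain above could be captured in a values file and passed with `helm template -f`. The fragment below is a sketch that only restates the keys from that command; the file name is hypothetical.

```yaml
# validator-test-values.yaml (hypothetical file name)
# Equivalent to the --set flags used in the validator helm template check.
ghost:
  service:
    loadBalancerClass: service.k8s.aws/nlb
    externalTrafficPolicy: Local
    trafficDistribution: PreferSameZone

vao:
  service:
    loadBalancerClass: service.k8s.aws/nlb
    externalTrafficPolicy: Local
    trafficDistribution: PreferSameZone
```

It would then be used as `helm template validator-test charts/validator -f validator-test-values.yaml`.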

@devkoriel devkoriel added the enhancement label May 13, 2026
@devkoriel devkoriel self-assigned this May 13, 2026
@devkoriel devkoriel requested a review from a team May 13, 2026 21:59
@devkoriel devkoriel changed the title from "feat(erpc): support Service traffic distribution [sc-18128]" to "feat: support topology-aware Service controls [sc-18128]" May 13, 2026
@devkoriel devkoriel merged commit 21a34c1 into main May 14, 2026
2 checks passed
@devkoriel devkoriel deleted the sc-18128-erpc-traffic-distribution branch May 14, 2026 01:46
