NetShop — A Multi-Zone Kubernetes Microservice Platform

An event-driven e-commerce platform on multi-zone Kubernetes.

An online store built as a full platform: 16 services across four tiers, a service mesh with mutual TLS, a three-pillar observability stack, a highly-available replicated data tier, a GitOps delivery pipeline with supply-chain security, multi-zone high availability, multi-cloud infrastructure-as-code, 35+ diagrams generated from the code, and a detailed companion report.

The storefront UI (Next.js + Tailwind): live star ratings from the reviews-service, an animated hero band of live cluster metrics, and a real basket → checkout → saga flow.

What this project is

NetShop is an event-driven online store that exercises the full networking stack of a real Kubernetes platform. It is a working shop (browse, search, add to cart, check out) backed by sixteen services across an edge, application, worker and data tier, and it runs the way a production system does: multi-zone, autoscaled, and locked down with zero-trust network policies.

In a microservice system the network is the system. The calls that used to happen in memory now cross the network, so service discovery, load balancing, routing, access control and traffic observability decide whether it works at all. NetShop makes those mechanics explicit, then adds what a real platform needs on top: a service mesh, a three-pillar observability stack, a replicated data tier, GitOps delivery and supply-chain security.

The picture below shows the whole platform: all 16 services across the four tiers, the HA data stores, the observability and GitOps/policy control planes, all inside one mutually-authenticated mesh.

The complete platform on one canvas — edge, application, worker and data tiers, plus observability and delivery, inside the mTLS mesh.

The live UI

The web app has three views. The storefront is the shop (browse, search, recommendations, ratings, basket, checkout). The cluster console reads live state from the Kubernetes API and draws the service graph as traffic flows. The platform view lists every service and enterprise capability, with a working JWT sign-in against the auth-service.

The cluster console: a service-mesh graph of all 16 services coloured by availability zone, a control deck to send traces, place orders and drive the autoscaler, and live traffic-by-zone bars.

The platform view: a live JWT sign-in, a catalog of all 16 services with their tier and datastore, and a card for each enterprise capability — mesh, observability, HA data, GitOps, supply-chain security and zero-trust.

Demo videos

Fifteen short recordings of the running system, each highlighting a networking concept. The GIFs below autoplay; click any title for the full-quality MP4. Full write-ups and "how it works" explainers are in demos/README.md.

Platform & networking

Service mesh & multi-zone — service discovery + load balancing across zone-a / zone-b / zone-c

Order saga — east/west traffic — a checkout fans out across the microservices

Platform & identity — JWT sign-in against auth-service + the service & capability catalog

Load balancing — a traffic burst balanced across pods and zones

Search across services — synchronous gateway → search-service → products-service calls

Recommendations & ratings — recommendations-service composition + reviews-service ratings

Full tour — the whole system end to end

Live service mesh — the mesh graph pulsing under continuous traffic

Trace path — one request's path lighting up the graph, hop by hop

Control deck — the console controls — trace, order, load burst — in action

Observability & messaging

Grafana — golden signals — the shipped dashboard: request rate, p95 latency and the zone pie, live

Grafana — request rate — the rate-per-service panel ramping under a traffic burst

Prometheus — all 14 /metrics targets scraped + a live PromQL rate query

Jaeger — distributed tracing — one request's 55-span waterfall across 10 services, with timing

RabbitMQ — event bus — the netshop.events exchange fanning order.created to 3 worker queues, live

System architecture

The platform is organised into four layers. At the edge sit the browser-facing web app (Next.js), the frontend backend-for-frontend, and the api-gateway. The application layer holds the domain services. The worker layer reacts to events asynchronously and never sits on the request path. The data layer provides PostgreSQL, Redis, and a RabbitMQ event bus.

The full event-driven architecture: synchronous request paths (solid) and the asynchronous order.created fan-out to the worker tier (dashed).

The sixteen services

| Service | Tier | Role |
| --- | --- | --- |
| `web` | edge | Next.js storefront + live cluster console |
| `frontend` | edge | backend-for-frontend; aggregates + reads live cluster state |
| `api-gateway` | edge | north/south routing, JWT enforcement, trace endpoints |
| `users-service` | app | customer profiles (PostgreSQL) |
| `products-service` | app | product catalogue (PostgreSQL) |
| `cart-service` | app | shopping cart (Redis) |
| `orders-service` | app | order saga coordinator (PostgreSQL + RabbitMQ) |
| `payments-service` | app | synchronous payment authorisation (PostgreSQL) |
| `search-service` | app | product search |
| `recommendations-service` | app | recommendations from catalogue + event analytics |
| `auth-service` | app | issues & verifies JWTs (the identity authority) |
| `reviews-service` | app | product reviews & ratings (PostgreSQL) |
| `inventory-service` | worker | stock; consumes `order.created` |
| `notifications-service` | worker | notifications; consumes `order.created` |
| `analytics-service` | worker | revenue/sales aggregation (Redis) |
| `postgres` / `redis` / `rabbitmq` | data | database, cache, event bus |

Request & event flows

A request enters at the edge and is forwarded same-origin to the BFF and then the gateway, which routes it to the right service.

The edge path: browser → ingress → web → BFF → api-gateway, with the BFF also reading live cluster state from the Kubernetes API.

A synchronous order request as it traverses the gateway and the application services down to the datastore.

The most instructive flow is checkout, which runs a saga: the orders service validates the user, reads the basket, prices the items, authorises payment, and only then persists the order — after which it simply publishes an order.created event and returns. Three independent workers pick that event up on their own schedule.
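The ordering of those steps can be sketched in a few lines of Python. This is a minimal illustration, not the orders-service code: `SagaDeps`, `Order`, `checkout` and the in-memory stand-ins for the user, cart, pricing, payment, database and event-bus dependencies are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    user_id: str
    items: dict[str, int]          # sku -> quantity
    total: float


@dataclass
class SagaDeps:
    """Hypothetical in-memory stand-ins for the downstream services."""
    users: set[str]
    carts: dict[str, dict[str, int]]
    prices: dict[str, float]
    card_limit: float
    orders: list[Order] = field(default_factory=list)               # "database"
    events: list[tuple[str, Order]] = field(default_factory=list)   # "event bus"


def checkout(deps: SagaDeps, user_id: str) -> Order:
    # 1. validate the user
    if user_id not in deps.users:
        raise LookupError(f"unknown user {user_id}")
    # 2. read the basket
    items = deps.carts.get(user_id, {})
    if not items:
        raise ValueError("basket is empty")
    # 3. price the items
    total = sum(deps.prices[sku] * qty for sku, qty in items.items())
    # 4. authorise payment synchronously, before anything is stored
    if total > deps.card_limit:
        raise PermissionError("payment declined")
    # 5. persist the order, and only then publish the event and return
    order = Order(user_id, dict(items), total)
    deps.orders.append(order)
    deps.events.append(("order.created", order))
    return order
```

Because every failure raises before step 5, a declined payment leaves neither a stored order nor a published event behind, which is the property the synchronous half of the saga is meant to guarantee.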

The order saga: synchronous validation + persistence, then an asynchronous event fan-out so slow background work never blocks the customer.

The RabbitMQ topic exchange routing order.created to the inventory, notifications and analytics workers — each on its own queue, resilient to brief outages.
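To make the fan-out concrete, here is a small in-memory model of a topic exchange. It follows the AMQP topic rules (`*` matches exactly one word, `#` matches zero or more) but does not talk to RabbitMQ; the `TopicExchange` class and the queue names are assumptions made for the sketch, not the names the workers actually use.

```python
from collections import defaultdict


def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP topic rules: '*' matches exactly one word, '#' zero or more."""
    def match(p: list[str], k: list[str]) -> bool:
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb nothing, or one word and then try again
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if k and p[0] in ("*", k[0]):
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))


class TopicExchange:
    def __init__(self) -> None:
        self.bindings: list[tuple[str, str]] = []            # (pattern, queue)
        self.queues: dict[str, list[dict]] = defaultdict(list)

    def bind(self, queue: str, pattern: str) -> None:
        self.bindings.append((pattern, queue))

    def publish(self, routing_key: str, message: dict) -> None:
        # Each matching queue receives exactly one copy of the message.
        targets = {q for pattern, q in self.bindings
                   if topic_matches(pattern, routing_key)}
        for queue in targets:
            self.queues[queue].append(message)


exchange = TopicExchange()
for worker in ("inventory", "notifications", "analytics"):
    exchange.bind(f"{worker}.orders", "order.created")   # queue names are assumed
exchange.publish("order.created", {"order_id": 1})
```

Each worker drains its own queue, so a worker that is briefly down simply finds its messages waiting when it comes back, without delaying the other two.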

The networking story

Services find each other by name, not IP: CoreDNS resolves a service name to a stable ClusterIP and kube-proxy load-balances each connection across the ready pods behind it.

Service discovery: CoreDNS resolves svc.cluster.local names; kube-proxy spreads connections across pods (which live in different zones).
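The two halves of that mechanism, naming and backend selection, can be modelled as follows. This is a sketch under assumptions: the `netshop` namespace, the pod IPs and the `Endpoint` type are invented for illustration, and the random choice mirrors kube-proxy's iptables mode rather than every proxy mode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ip: str
    zone: str
    ready: bool


def service_fqdn(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    # The stable DNS name that CoreDNS resolves to the Service's ClusterIP.
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pick_backend(endpoints: list[Endpoint], rng: random.Random) -> Endpoint:
    # Only pods that pass their readiness probe receive new connections.
    ready = [e for e in endpoints if e.ready]
    if not ready:
        raise ConnectionError("no ready endpoints behind the service")
    return rng.choice(ready)


endpoints = [
    Endpoint("10.0.1.5", "zone-a", True),
    Endpoint("10.0.2.7", "zone-b", True),
    Endpoint("10.0.3.9", "zone-c", False),   # failing its readiness probe
]
rng = random.Random(0)
chosen = [pick_backend(endpoints, rng) for _ in range(100)]
```

Callers only ever see the name returned by `service_fqdn`; which pod, and therefore which zone, answers a given connection is decided per connection.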

Just as important as enabling traffic is restricting it. A default-deny policy covers the namespace, and only the specific edges that genuinely exist in the service graph are opened — so a compromised service cannot wander toward the database.

The layered, zero-trust NetworkPolicies — a default-deny baseline plus explicit allows that mirror the service graph exactly, with an egress lock.
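The default-deny idea reduces to a membership test against an explicit allow-list. The sketch below uses a small, illustrative subset of edges inferred from the service descriptions in this README; it is not the project's actual policy set, and `ALLOWED_EDGES`, `is_allowed` and `allowed_destinations` are hypothetical names.

```python
# Illustrative subset of (source, destination) edges; not the full policy set.
ALLOWED_EDGES: frozenset[tuple[str, str]] = frozenset({
    ("api-gateway", "search-service"),
    ("search-service", "products-service"),
    ("products-service", "postgres"),
    ("cart-service", "redis"),
    ("orders-service", "rabbitmq"),
})


def is_allowed(source: str, destination: str,
               allowed: frozenset[tuple[str, str]] = ALLOWED_EDGES) -> bool:
    # Default deny: anything not explicitly listed is dropped.
    return (source, destination) in allowed


def allowed_destinations(source: str,
                         allowed: frozenset[tuple[str, str]] = ALLOWED_EDGES) -> set[str]:
    # What a compromised pod of `source` could still connect to directly.
    return {dst for src, dst in allowed if src == source}
```

In this example a compromised `search-service` pod can still call `products-service`, but `is_allowed("search-service", "postgres")` is false, so it has no direct path to the database.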

To survive the loss of an entire zone, every workload is spread across zones and nodes with topology-spread constraints and anti-affinity.

Multi-zone topology: control plane and workers across three availability zones, with pods spread evenly so losing a zone removes only part of the capacity.
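A quick way to see what even spreading buys is to place replicas greedily into the least-loaded zone, which is the outcome a topology-spread constraint with a skew limit of 1 aims for, and then remove a zone. The function names and the replica counts are assumptions for the example; the real scheduler weighs many more factors.

```python
def spread(replicas: int, zones: list[str]) -> dict[str, int]:
    """Place each replica in the currently least-loaded zone."""
    placement = {zone: 0 for zone in zones}
    for _ in range(replicas):
        target = min(zones, key=lambda z: placement[z])
        placement[target] += 1
    return placement


def skew(placement: dict[str, int]) -> int:
    # Difference between the fullest and the emptiest zone.
    return max(placement.values()) - min(placement.values())


def surviving_replicas(placement: dict[str, int], lost_zone: str) -> int:
    return sum(n for zone, n in placement.items() if zone != lost_zone)


zones = ["zone-a", "zone-b", "zone-c"]
six = spread(6, zones)      # two replicas per zone
seven = spread(7, zones)    # one zone carries the extra replica
```

With six replicas across three zones, losing any single zone leaves four serving, so the remaining capacity is two thirds rather than zero.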

Namespace isolation boundaries that complement the NetworkPolicies and RBAC.

Scaling, storage & governance

Every application and worker service carries its own Horizontal Pod Autoscaler.

Horizontal pod autoscaling: metrics-server → HPA → deployment, with separate stabilisation windows for fast scale-up and cautious scale-down.
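The core of the HPA decision is a single ratio, documented by Kubernetes as `desired = ceil(current * currentMetric / targetMetric)`, with a tolerance band around 1.0 and a scale-down stabilisation window that acts on the highest recent recommendation. The sketch below models just that; the replica bounds and metric values are illustrative and are not taken from the chart.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


def stabilised_scale_down(recent_recommendations: list[int]) -> int:
    # Scale-down uses the highest recommendation seen inside the window,
    # so a short dip in load does not immediately remove pods.
    return max(recent_recommendations)
```

For example, four pods averaging 90% CPU against a 60% target give `desired_replicas(4, 90, 60) == 6`, while a window of recommendations `[6, 3, 4]` keeps the deployment at six until the higher value ages out.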

Going beyond CPU: autoscaling on custom/external metrics (KEDA / Prometheus adapter) plus the cluster autoscaler adding nodes.

Persistent storage: PVCs bound to PVs via StorageClasses, with per-cloud zonal disks.

Namespace governance: resource quotas, limit ranges, Pod Security Admission and least-privilege RBAC.

The enterprise platform

Five cross-cutting capabilities round out the platform. Each is independent, declarative, and validated.

Service mesh & zero-trust networking

L3/L4 NetworkPolicies leave two gaps: east/west traffic is plaintext, and identity is tied to IPs rather than cryptographic workload identity. A service mesh closes both. NetShop ships two equivalent meshes (pick one) under k8s/service-mesh/: Linkerd (auto-injection, automatic mTLS, Server/AuthorizationPolicy locking the data tier to in-mesh identities) and Istio (PeerAuthentication STRICT, deny-all + per-service authorization, an L7 rule on the gateway). The edge moves from classic Ingress to the Gateway API (HTTPS listener, HTTP→HTTPS redirect, weighted canary), and Cilium adds L7 HTTP allow-lists and DNS-aware egress.

Envoy/Linkerd sidecars give every pod mutual TLS and a cryptographic identity; the Gateway API replaces classic Ingress at the edge.

Observability — the three pillars

Under k8s/observability/ the platform gains metrics, logs and traces, all correlated. An OpenTelemetry Collector enriches spans, derives RED metrics, and exports to Jaeger; the shared library auto-instruments FastAPI and httpx so one trace spans the whole call chain. Loki + Promtail ship structured logs correlated by trace id, and Prometheus carries multi-window burn-rate SLO rules feeding a golden-signals Grafana dashboard.
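A multi-window burn-rate rule is easier to reason about with the arithmetic written out. The burn rate is the observed error ratio divided by the error budget `1 - SLO`, and an alert fires only when both a long and a short window exceed the threshold. The 99.9% objective and the 14.4 threshold below are commonly cited example values, assumed here for illustration; the rules shipped in the repository may use different ones.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo
    return (errors / total) / error_budget


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    # The long window shows the problem is significant; the short window
    # shows it is still happening, so the alert clears soon after recovery.
    long_burn = burn_rate(*long_window, slo)
    short_burn = burn_rate(*short_window, slo)
    return long_burn >= threshold and short_burn >= threshold
```

Each window is passed as `(errors, total)`. A 2% error ratio in both windows pages, whereas the same long-window figure paired with an error-free short window does not, because the incident has already stopped.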

Distributed tracing: every service exports OTLP spans to the OpenTelemetry Collector, which forwards them to Jaeger.

Metrics, logs and traces flowing into Prometheus, Loki/Tempo and Jaeger, unified in Grafana with metric↔trace↔log correlation.

The metrics pipeline: ServiceMonitors scrape every pod; alert rules and a Grafana dashboard surface the golden signals.

Highly-available data tier

The default single-instance datastores become a replicated, self-healing tier under k8s/data-ha/: CloudNativePG runs Postgres as 1 primary + 2 replicas with synchronous quorum replication (RPO≈0) fronted by PgBouncer; Redis runs with Sentinel; RabbitMQ runs as a 3-node quorum cluster; and Barman archives WAL continuously for point-in-time recovery.
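The quorum arithmetic behind those choices is small enough to write down. This sketch assumes a simple majority rule for the three-node RabbitMQ cluster and a configurable number of required standby acknowledgements for Postgres; the value of `required_acks` used in the repository is not stated here, so the numbers in the example are illustrative.

```python
def majority(nodes: int) -> int:
    # Smallest group that is more than half of the cluster.
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    # Nodes that can be lost while a majority is still reachable.
    return nodes - majority(nodes)


def commit_acknowledged(standby_acks: list[bool], required_acks: int) -> bool:
    # Synchronous quorum replication: the primary confirms a commit only
    # after enough standbys report the write as received.
    return sum(standby_acks) >= required_acks
```

A three-node cluster needs two nodes to agree and so survives one failure; with two standbys and one required acknowledgement, a commit still completes while one standby is down but blocks if both are.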

The HA data tier: CloudNativePG primary + replicas with a connection pooler, Redis + Sentinel, and a RabbitMQ quorum cluster — all spread across zones.

Backup & DR: continuous WAL archiving + scheduled backups to object storage, with point-in-time restore into a recovery cluster.

Identity & JWT authentication

auth-service is the platform's identity authority: it issues and verifies short-lived HS256 JWTs, with the signing secret injected from a Kubernetes Secret. The gateway passes /api/auth/* through and can verify tokens before forwarding protected calls.

Authentication / authorization: the gateway obtains and verifies JWTs via auth-service before granting access to downstream services.

The end-to-end JWT lifecycle: login issues a signed token; a later request carries it, the gateway verifies it with auth-service, and only then reaches the protected resource.

Reviews & ratings

reviews-service owns product reviews on PostgreSQL (with an in-memory fallback so it runs offline). The storefront shows live star ratings pulled straight from it.

The reviews read/write paths flowing through the HA Postgres pooler, with the products-service consulted for validity.

GitOps & supply-chain security

Delivery is declarative and self-healing under gitops/ and security/. Argo CD uses an app-of-apps pattern with an AppProject security boundary; Argo Rollouts drives canary/blue-green deploys with automated analysis; and the supply chain runs lint → smoke → build → Trivy scan → SBOM → cosign sign → push → GitOps deploy, with Kyverno enforcing signed images and policy.

GitOps: Git is the single source of truth; Argo CD's app-of-apps syncs and self-heals the cluster to match it.

The supply chain: build → Trivy scan → cosign sign → SBOM → Kyverno admission verify, so only signed, scanned images ever run.

Progressive delivery: Argo Rollouts shifts traffic to a canary while an AnalysisRun watches SLO metrics, promoting or rolling back automatically.
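The promote-or-roll-back decision can be modelled as a loop over traffic weights with a health check at each step. The sketch is a simplification of what an analysis-driven rollout does: the weight steps, the 99% success-rate floor and the `success_rate` callback (standing in for a metrics query) are all assumptions made for the example.

```python
from typing import Callable, Sequence


def run_canary(weights: Sequence[int],
               success_rate: Callable[[int], float],
               min_success_rate: float = 0.99) -> tuple[str, int]:
    """Return (outcome, final canary traffic weight in percent)."""
    for weight in weights:
        # Shift `weight` percent of traffic to the canary, then measure it.
        if success_rate(weight) < min_success_rate:
            # Any failed check aborts and sends all traffic back to stable.
            return ("rolled-back", 0)
    return ("promoted", 100)


steps = (10, 25, 50, 100)
healthy = run_canary(steps, lambda weight: 0.999)
# A canary that only degrades once it takes half of the traffic:
degrading = run_canary(steps, lambda weight: 0.999 if weight < 50 else 0.95)
```

Stepping the weight up gradually means a regression that only appears under load is caught while most customers are still served by the stable version.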

The full pipeline end to end: lint → smoke → build → scan → sign → push → GitOps deploy.

The CI/CD overview that the GitHub Actions workflow implements.

Security & defense in depth

Defense in depth: edge security, zero-trust NetworkPolicies, Pod Security, RBAC, and non-root, read-only containers.

The enterprise security layers (L1–L5), from the network edge down to workload identity and policy.

Multi-cluster DR & the control plane

Multi-cluster / disaster-recovery topology spanning regions for resilience to a full-region loss.

The platform control plane: GitOps bootstrapping the mesh, secrets, observability and policy add-ons onto a fleet of clusters.

Multi-cloud infrastructure

The entire cluster can be provisioned as code on four major clouds.

Terraform provisions a regional, multi-zone cluster on GKE, EKS, DOKS or AKS — so the multi-zone design is a configured fact, not a claim.

Rendered straight from the manifests

Two diagrams are drawn automatically from the actual rendered Kubernetes manifests by KubeDiagrams, so they match what really deploys.

Auto-generated from k8s/rendered/netshop.yaml — every Deployment, Service, HPA, PDB, Secret and NetworkPolicy the chart produces.

The same, auto-generated from the GKE kustomize overlay.

Deploying it

On a freshly provisioned single VM (Ubuntu 22.04/24.04, 4 vCPU / 16 GB), the whole demo is one command. It installs Docker, kubectl, Helm and kind, builds the images, creates a 3-zone cluster, deploys everything, and prints the public URL:

git clone <repo> && cd netshop
sudo ./scripts/vm-demo.sh            # fresh server → live demo in ~5–10 min

Or use the unified deploy wrapper:

./scripts/deploy.sh local            # build images + 3-zone kind cluster + deploy
./scripts/deploy.sh cloud gke        # helm/kustomize to your current kube-context
./scripts/deploy.sh enterprise       # add mesh + observability + HA data + GitOps
./scripts/deploy.sh status           # inspect
./scripts/deploy.sh down             # tear down

To run the whole mesh without any Kubernetes (pure uvicorn, simulated zones):

./scripts/run_local.sh start         # http://localhost:8080
docker compose up --build            # full stack incl. datastores + the Next.js UI

Three deployment representations are kept in lock-step: the Helm chart (helm/netshop, the templated source of truth, 93 objects), a Kustomize base with five hardened cloud overlays derived from it, and Ansible playbooks for the add-ons and rollout. The enterprise bundles layer on top and are applied by Argo CD in production.

Repository layout

| Path | What lives there |
| --- | --- |
| `apps/` | the 14 FastAPI microservices, the shared library, and the Next.js `web` UI |
| `helm/netshop/` | the templated chart that is the source of truth |
| `kustomize/` | the base, components, and five cloud overlays derived from the chart |
| `terraform/` | regional multi-zone cluster configs for GKE, EKS, DOKS, AKS |
| `ansible/` | deployment playbooks and add-on installation |
| `k8s/rendered/` | the rendered manifest, validated by kubeconform |
| `k8s/service-mesh/` | Linkerd + Istio meshes, Gateway API edge, Cilium L7 policies |
| `k8s/observability/` | OpenTelemetry + Jaeger, Loki + Promtail, Prometheus SLOs, Grafana |
| `k8s/data-ha/` | CloudNativePG, PgBouncer, Redis Sentinel, RabbitMQ quorum, backups |
| `gitops/` | Argo CD app-of-apps and Argo Rollouts |
| `security/` | Kyverno policies, Trivy, cosign, SBOM, sealed/external secrets, CI |
| `diagrams/` | the diagram-as-code scripts and their PNG output |
| `scripts/` | smoke test, local runner, `deploy.sh`, `vm-demo.sh` |

Verifying everything

The project is built to be checked:

make smoke        # exercise every service in-process (incl. JWT login/verify + circuit breaker)
make validate     # render the chart + kubeconform the output and all five overlays
make web-build    # compile the Next.js UI
make doc          # compile the Persian report (Tectonic)

Every standalone enterprise bundle (k8s/service-mesh, k8s/observability, k8s/data-ha) is kubeconform-clean with operator CRDs skipped. A GitHub Actions workflow runs the smoke tests, manifest validation, the web build, and terraform validate on every push.
