An event-driven e-commerce platform on multi-zone Kubernetes.
An online store built as a full platform: 16 services across four tiers, a service mesh with mutual TLS, a three-pillar observability stack, a highly-available replicated data tier, a GitOps delivery pipeline with supply-chain security, multi-zone high availability, multi-cloud infrastructure-as-code, 35+ diagrams generated from the code, and a detailed companion report.
The storefront UI (Next.js + Tailwind): live star ratings from the reviews-service, an animated hero band of live cluster metrics, and a real basket → checkout → saga flow.
- What this project is
- The live UI
- Demo videos
- System architecture
- The sixteen services
- Request & event flows
- The networking story
- Scaling, storage & governance
- The enterprise platform
- Multi-cloud infrastructure
- Rendered straight from the manifests
- Deploying it
- Repository layout
- Verifying everything
NetShop is an event-driven online store that exercises the full networking stack of a real Kubernetes platform. It is a working shop (browse, search, add to cart, check out) backed by sixteen services across an edge, application, worker and data tier, and it runs the way a production system does: multi-zone, autoscaled, and locked down with zero-trust network policies.
In a microservice system the network is the system. The calls that used to happen in memory now cross the network, so service discovery, load balancing, routing, access control and traffic observability decide whether it works at all. NetShop makes those mechanics explicit, then adds what a real platform needs on top: a service mesh, a three-pillar observability stack, a replicated data tier, GitOps delivery and supply-chain security.
The picture below shows the whole platform: all 16 services across the four tiers, the HA data stores, the observability and GitOps/policy control planes, all inside one mutually-authenticated mesh.
The complete platform on one canvas — edge, application, worker and data tiers, plus observability and delivery, inside the mTLS mesh.
The web app has three views. The storefront is the shop (browse, search,
recommendations, ratings, basket, checkout). The cluster console reads live
state from the Kubernetes API and draws the service graph as traffic flows. The
platform view lists every service and enterprise capability, with a working
JWT sign-in against the auth-service.
The cluster console: a service-mesh graph of all 16 services coloured by availability zone, a control deck to send traces, place orders and drive the autoscaler, and live traffic-by-zone bars.
The platform view: a live JWT sign-in, a catalog of all 16 services with their tier and datastore, and a card for each enterprise capability — mesh, observability, HA data, GitOps, supply-chain security and zero-trust.
Fifteen short recordings of the running system, each highlighting a networking
concept. The GIFs below autoplay; click any title for the full-quality MP4.
Full write-ups and "how it works" explainers are in demos/README.md.
Service mesh & multi-zone — service discovery + load balancing across zone-a / zone-b / zone-c
Order saga — east/west traffic — a checkout fans out across the microservices
Platform & identity — JWT sign-in against auth-service + the service & capability catalog
Load balancing — a traffic burst balanced across pods and zones
Search across services — synchronous gateway → search-service → products-service calls
Recommendations & ratings — recommendations-service composition + reviews-service ratings
Full tour — the whole system end to end
Live service mesh — the mesh graph pulsing under continuous traffic
Trace path — one request's path lighting up the graph, hop by hop
Control deck — the console controls — trace, order, load burst — in action
Grafana — golden signals — the shipped dashboard: request rate, p95 latency and the zone pie, live
Grafana — request rate — the rate-per-service panel ramping under a traffic burst
Prometheus — all 14 /metrics targets scraped + a live PromQL rate query
Jaeger — distributed tracing — one request's 55-span waterfall across 10 services, with timing
RabbitMQ — event bus — the netshop.events exchange fanning order.created to 3 worker queues, live
The platform is organised into four layers. At the edge sit the browser-facing
web app (Next.js), the frontend backend-for-frontend, and the api-gateway.
The application layer holds the domain services. The worker layer reacts to
events asynchronously and never sits on the request path. The data layer
provides PostgreSQL, Redis, and a RabbitMQ event bus.
The full event-driven architecture: synchronous request paths (solid) and the asynchronous order.created fan-out to the worker tier (dashed).
| Service | Tier | Role |
|---|---|---|
| web | edge | Next.js storefront + live cluster console |
| frontend | edge | backend-for-frontend; aggregates + reads live cluster state |
| api-gateway | edge | north/south routing, JWT enforcement, trace endpoints |
| users-service | app | customer profiles (PostgreSQL) |
| products-service | app | product catalogue (PostgreSQL) |
| cart-service | app | shopping cart (Redis) |
| orders-service | app | order saga coordinator (PostgreSQL + RabbitMQ) |
| payments-service | app | synchronous payment authorisation (PostgreSQL) |
| search-service | app | product search |
| recommendations-service | app | recommendations from catalogue + event analytics |
| auth-service | app | issues & verifies JWTs; the identity authority |
| reviews-service | app | product reviews & ratings (PostgreSQL) |
| inventory-service | worker | stock; consumes order.created |
| notifications-service | worker | notifications; consumes order.created |
| analytics-service | worker | revenue/sales aggregation (Redis) |
| postgres / redis / rabbitmq | data | database, cache, event bus |
A request enters at the edge and is forwarded same-origin to the BFF and then the gateway, which routes it to the right service.
The edge path: browser → ingress → web → BFF → api-gateway, with the BFF also reading live cluster state from the Kubernetes API.
A synchronous order request as it traverses the gateway and the application services down to the datastore.
The most instructive flow is checkout, which runs a saga: the orders service
validates the user, reads the basket, prices the items, authorises payment, and
only then persists the order — after which it simply publishes an order.created
event and returns. Three independent workers pick that event up on their own
schedule.
The order saga: synchronous validation + persistence, then an asynchronous event fan-out so slow background work never blocks the customer.
The RabbitMQ topic exchange routing order.created to the inventory, notifications and analytics workers — each on its own queue, resilient to brief outages.
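The checkout saga and the event fan-out described above can be sketched in a few lines of Python. This is an illustrative, in-memory model and not the orders-service code: `TopicExchange`, `checkout`, `CheckoutError` and every parameter name are hypothetical, the exchange only does exact-match routing (a real topic exchange also supports wildcards), and plain dicts and lists stand in for the users, cart, products and payments services.

```python
from dataclasses import dataclass, field


@dataclass
class TopicExchange:
    """In-memory stand-in for an exchange: routing key -> list of bound queues."""
    bindings: dict = field(default_factory=dict)

    def bind(self, routing_key):
        """Create a new queue bound to routing_key and return it."""
        queue = []
        self.bindings.setdefault(routing_key, []).append(queue)
        return queue

    def publish(self, routing_key, event):
        """Deliver a copy of the event to every queue bound to routing_key."""
        for queue in self.bindings.get(routing_key, []):
            queue.append(dict(event))


class CheckoutError(Exception):
    """Raised when a synchronous saga step fails; nothing is persisted or published."""


def checkout(user_id, users, carts, prices, authorise, orders, exchange):
    # 1. validate the user
    if user_id not in users:
        raise CheckoutError("unknown user")
    # 2. read the basket
    basket = carts.get(user_id, {})
    if not basket:
        raise CheckoutError("empty basket")
    # 3. price the items
    total = sum(prices[sku] * qty for sku, qty in basket.items())
    # 4. authorise payment (synchronous)
    if not authorise(user_id, total):
        raise CheckoutError("payment declined")
    # 5. only now persist the order, then publish the event and return
    order = {"id": len(orders) + 1, "user": user_id, "total": total}
    orders.append(order)
    exchange.publish("order.created", order)
    return order


# Three workers, each with its own queue bound to the same routing key.
exchange = TopicExchange()
inventory_q = exchange.bind("order.created")
notifications_q = exchange.bind("order.created")
analytics_q = exchange.bind("order.created")

orders = []
checkout("u1", {"u1"}, {"u1": {"sku-1": 2}}, {"sku-1": 5},
         lambda user, amount: True, orders, exchange)
```

Because each worker drains its own queue, a slow or briefly unavailable worker delays only its own processing and never the response to the customer.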
Services find each other by name, not IP: CoreDNS resolves a service name to a stable ClusterIP and kube-proxy load-balances each connection across the ready pods behind it.
Service discovery: CoreDNS resolves svc.cluster.local names; kube-proxy spreads connections across pods (which live in different zones).
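As a rough model of the two steps above, the following Python sketch builds the cluster-internal DNS name for a service and then picks one ready pod per connection. The namespace `netshop`, the pod names and both function names are assumptions made for illustration; the random pick approximates kube-proxy's per-connection behaviour in iptables mode and is not a reimplementation of it.

```python
import random


def service_fqdn(service, namespace="netshop", cluster_domain="cluster.local"):
    """Return the DNS name CoreDNS resolves to the service's ClusterIP.

    The default namespace is an assumption, not taken from the chart.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pick_backend(endpoints, rng=random):
    """Choose one ready pod for a new connection, ignoring unready pods."""
    ready = [e for e in endpoints if e["ready"]]
    if not ready:
        raise LookupError("no ready endpoints behind the service")
    return rng.choice(ready)


# Hypothetical endpoints for one service, spread over three zones.
endpoints = [
    {"pod": "cart-0", "zone": "zone-a", "ready": True},
    {"pod": "cart-1", "zone": "zone-b", "ready": False},
    {"pod": "cart-2", "zone": "zone-c", "ready": True},
]

name = service_fqdn("cart-service")
backend = pick_backend(endpoints, random.Random(0))
```

Callers only ever see the stable name; which pod (and therefore which zone) serves a given connection is decided behind the ClusterIP.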
Just as important as enabling traffic is restricting it. A default-deny policy covers the namespace, and only the specific edges that genuinely exist in the service graph are opened — so a compromised service cannot wander toward the database.
The layered, zero-trust NetworkPolicies — a default-deny baseline plus explicit allows that mirror the service graph exactly, with an egress lock.
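The default-deny-plus-explicit-allows model can be pictured as a set-membership check over the service graph. The sketch below is a simulation in Python, not the project's NetworkPolicy manifests: `ALLOWED_EDGES` is a small illustrative subset inferred from the service table (the gateway-to-orders edge is an assumption), and `is_allowed` is a hypothetical helper.

```python
# Illustrative subset of the service graph; the real policies live in the chart.
ALLOWED_EDGES = {
    ("api-gateway", "orders-service"),   # assumed north/south route
    ("orders-service", "postgres"),      # order persistence
    ("orders-service", "rabbitmq"),      # order.created publishing
    ("cart-service", "redis"),           # cart storage
}


def is_allowed(src, dst, allowed=ALLOWED_EDGES):
    """Default deny: traffic passes only if the exact (src, dst) edge is listed."""
    return (src, dst) in allowed


# In this illustrative set, a service with no listed edge to the database is blocked.
blocked = not is_allowed("notifications-service", "postgres")
```

The useful property is that the allow-list is directional and exact: opening `orders-service -> postgres` does not open the reverse direction, nor any other path to the database.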
To survive the loss of an entire zone, every workload is spread across zones and nodes with topology-spread constraints and anti-affinity.
Multi-zone topology: control plane and workers across three availability zones, with pods spread evenly so losing a zone removes only part of the capacity.
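To make "losing a zone removes only part of the capacity" concrete, here is a small Python sketch that spreads replicas evenly over zones and computes what survives a zone failure. The zone names match the demo labels above; the function names and the round-robin placement are simplifying assumptions, since the real scheduler weighs many more factors than zone skew.

```python
from collections import Counter

ZONES = ["zone-a", "zone-b", "zone-c"]


def spread(replicas, zones=ZONES):
    """Place replicas round-robin across zones, as an even spread would."""
    return Counter(zones[i % len(zones)] for i in range(replicas))


def max_skew(placement, zones=ZONES):
    """Difference between the fullest and emptiest zone."""
    counts = [placement.get(zone, 0) for zone in zones]
    return max(counts) - min(counts)


def surviving_fraction(placement, lost_zone):
    """Share of replicas still running after one zone disappears."""
    total = sum(placement.values())
    return (total - placement.get(lost_zone, 0)) / total


placement = spread(6)                        # two replicas per zone
remaining = surviving_fraction(placement, "zone-b")
```

With six replicas over three zones, one zone failing leaves four replicas, so two thirds of the capacity keeps serving while replacements are scheduled.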
Namespace isolation boundaries that complement the NetworkPolicies and RBAC.
Every application and worker carries a Horizontal Pod Autoscaler.

Horizontal pod autoscaling: metrics-server → HPA → deployment, with separate stabilisation windows for fast scale-up and cautious scale-down.
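The core of the autoscaler's decision is a single ratio, documented by Kubernetes as `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The Python sketch below reproduces only that arithmetic; the function name and the replica bounds are illustrative, and stabilisation windows are deliberately left out.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the HPA scaling ratio, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


# 4 pods at 90% average CPU against a 60% target.
scaled = desired_replicas(4, current_metric=90, target_metric=60)
```

Four pods at 90% against a 60% target give a ratio of 1.5, so the result is six pods; at 63% the ratio of 1.05 falls inside the tolerance and the replica count stays at four.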
Going beyond CPU: autoscaling on custom/external metrics (KEDA / Prometheus adapter) plus the cluster autoscaler adding nodes.
Persistent storage: PVCs bound to PVs via StorageClasses, with per-cloud zonal disks.
Namespace governance: resource quotas, limit ranges, Pod Security Admission and least-privilege RBAC.
Five cross-cutting capabilities round out the platform. Each is independent, declarative, and validated.
L3/L4 NetworkPolicies leave two gaps: east/west traffic is plaintext, and identity
is tied to IPs rather than cryptographic workload identity. A service mesh closes
both. NetShop ships two equivalent meshes (pick one) under
k8s/service-mesh/: Linkerd (auto-injection, automatic
mTLS, Server/AuthorizationPolicy locking the data tier to in-mesh identities)
and Istio (PeerAuthentication STRICT, deny-all + per-service authorization,
an L7 rule on the gateway). The edge moves from classic Ingress to the Gateway
API (HTTPS listener, HTTP→HTTPS redirect, weighted canary), and Cilium
adds L7 HTTP allow-lists and DNS-aware egress.
Envoy/Linkerd sidecars give every pod mutual TLS and a cryptographic identity; the Gateway API replaces classic Ingress at the edge.
Under k8s/observability/ the platform gains metrics, logs
and traces, all correlated. An OpenTelemetry Collector enriches spans, derives RED
metrics, and exports to Jaeger; the shared library auto-instruments FastAPI and
httpx so one trace spans the whole call chain. Loki + Promtail ship structured
logs correlated by trace id, and Prometheus carries multi-window burn-rate SLO
rules feeding a golden-signals Grafana dashboard.
Distributed tracing: every service exports OTLP spans to the OpenTelemetry Collector, which forwards them to Jaeger.
Metrics, logs and traces flowing into Prometheus, Loki/Tempo and Jaeger, unified in Grafana with metric↔trace↔log correlation.
The metrics pipeline: ServiceMonitors scrape every pod; alert rules and a Grafana dashboard surface the golden signals.
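The idea behind a multi-window burn-rate rule can be shown with plain arithmetic. In the Python sketch below, burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both a long and a short window exceed the threshold. The 99.9% target and the 14.4 threshold are commonly cited example values, assumed here for illustration and not read from the project's Prometheus rules.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_errors, short_window_errors,
                slo_target=0.999, threshold=14.4):
    """Fire only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# 2% errors over both windows against a 99.9% SLO: roughly a 20x burn.
paging = should_page(long_window_errors=0.02, short_window_errors=0.02)
```

Requiring both windows means an incident that has already recovered (high long-window rate, low short-window rate) stops alerting quickly.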
The default single-instance datastores become a replicated, self-healing tier
under k8s/data-ha/: CloudNativePG runs Postgres as 1 primary + 2
replicas with synchronous quorum replication (RPO≈0) fronted by PgBouncer; Redis
runs with Sentinel; RabbitMQ runs as a 3-node quorum cluster; and Barman archives
WAL continuously for point-in-time recovery.
The HA data tier: CloudNativePG primary + replicas with a connection pooler, Redis + Sentinel, and a RabbitMQ quorum cluster — all spread across zones.
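For the 3-node RabbitMQ quorum cluster, the availability arithmetic is that of any majority-based system: a write is confirmed once more than half of the members have it. The Python helpers below are a hypothetical illustration of that rule only; they say nothing about the PostgreSQL replication settings, which use their own quorum configuration.

```python
def majority(members):
    """Smallest number of nodes that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many nodes can be lost while a majority remains."""
    return members - majority(members)


def can_confirm_writes(members, nodes_up):
    """True while the surviving nodes still form a majority."""
    return nodes_up >= majority(members)


# A 3-node cluster keeps confirming writes with one node down, not with two.
one_down = can_confirm_writes(3, nodes_up=2)
two_down = can_confirm_writes(3, nodes_up=1)
```

This is why three nodes spread over three zones is the smallest layout that survives the loss of a whole zone.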
Backup & DR: continuous WAL archiving + scheduled backups to object storage, with point-in-time restore into a recovery cluster.
auth-service is the platform's identity authority: it issues and verifies
short-lived HS256 JWTs, with the signing secret injected from a Kubernetes
Secret. The gateway passes /api/auth/* through and can verify tokens before
forwarding protected calls.
Authentication / authorization: the gateway obtains and verifies JWTs via auth-service before granting access to downstream services.
The end-to-end JWT lifecycle: login issues a signed token; a later request carries it, the gateway verifies it with auth-service, and only then reaches the protected resource.
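The issue-then-verify lifecycle can be illustrated with nothing but the Python standard library. This is a minimal sketch of HS256 signing under stated assumptions, not the auth-service implementation (which may well use a JWT library): the function names, the five-minute lifetime and the claim set are all hypothetical, and the secret would come from the Kubernetes Secret rather than a literal.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, secret):
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(subject, secret, ttl_seconds=300, now=None):
    """Build header.payload.signature with a short expiry."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "iat": now, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token, secret, now=None):
    """Return the claims if the signature is valid and the token is unexpired."""
    now = int(time.time()) if now is None else now
    header_b64, payload_b64, signature_b64 = token.split(".")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims


secret = b"demo-secret-not-for-production"  # placeholder value
token = issue_token("alice", secret)
claims = verify_token(token, secret)
```

Because HS256 is symmetric, anything that can verify a token can also mint one, which is one reason to keep the secret in a single identity authority and have the gateway ask it to verify.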
reviews-service owns product reviews on PostgreSQL (with an in-memory fallback so
it runs offline). The storefront shows live star ratings pulled straight from it.
The reviews read/write paths flowing through the HA Postgres pooler, with the products-service consulted for validity.
Delivery is declarative and self-healing under gitops/ and
security/. Argo CD uses an app-of-apps pattern with an AppProject
security boundary; Argo Rollouts drives canary/blue-green deploys with automated
analysis; and the supply chain runs lint → smoke → build → Trivy scan → SBOM →
cosign sign → push → GitOps deploy, with Kyverno enforcing signed images and
policy.
GitOps: Git is the single source of truth; Argo CD's app-of-apps syncs and self-heals the cluster to match it.
The supply chain: build → Trivy scan → SBOM → cosign sign → Kyverno admission verify, so only signed, scanned images ever run.
Progressive delivery: Argo Rollouts shifts traffic to a canary while an AnalysisRun watches SLO metrics, promoting or rolling back automatically.
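The promote-or-roll-back loop can be sketched as a walk through traffic weights with a health check at each step. The Python below is a conceptual model only: the weights, the 1% error threshold and the names `run_canary` and `error_rate_at` are assumptions for illustration and do not mirror the project's Rollout or AnalysisTemplate definitions.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.01):
    """Shift traffic step by step; abort on the first unhealthy measurement.

    steps          -- canary traffic weights in percent, e.g. [10, 50, 100]
    error_rate_at  -- callable returning the observed error ratio at a weight
    """
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            # Roll back: all traffic returns to the stable version.
            return {"status": "rolled-back", "weight": 0, "failed_at": weight}
    return {"status": "promoted", "weight": 100, "failed_at": None}


# A hypothetical release that only misbehaves once it takes half the traffic.
result = run_canary([10, 50, 100], lambda weight: 0.002 if weight < 50 else 0.05)
```

The point of stepping is blast-radius control: a regression that appears at 50% is caught before the new version ever receives all traffic.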
The full pipeline end to end: lint → smoke → build → scan → sign → push → GitOps deploy.
The CI/CD overview that the GitHub Actions workflow implements.
Defense in depth: edge security, zero-trust NetworkPolicies, Pod Security, RBAC, and non-root, read-only containers.
The enterprise security layers (L1–L5), from the network edge down to workload identity and policy.
Multi-cluster / disaster-recovery topology spanning regions for resilience to a full-region loss.
The platform control plane: GitOps bootstrapping the mesh, secrets, observability and policy add-ons onto a fleet of clusters.
The entire cluster can be provisioned as code on four major clouds.
Terraform provisions a regional, multi-zone cluster on GKE, EKS, DOKS or AKS — so the multi-zone design is a configured fact, not a claim.
Two diagrams are drawn automatically from the actual rendered Kubernetes manifests by KubeDiagrams, so they match what really deploys.
Auto-generated from k8s/rendered/netshop.yaml — every Deployment, Service, HPA, PDB, Secret and NetworkPolicy the chart produces.
The same, auto-generated from the GKE kustomize overlay.
On a freshly provisioned single VM (Ubuntu 22.04/24.04, 4 vCPU / 16 GB), the whole demo is one command. It installs Docker, kubectl, Helm and kind, builds the images, creates a 3-zone cluster, deploys everything, and prints the public URL:

    git clone <repo> && cd netshop
    sudo ./scripts/vm-demo.sh        # fresh server → live demo in ~5–10 min

Or use the unified deploy wrapper:

    ./scripts/deploy.sh local        # build images + 3-zone kind cluster + deploy
    ./scripts/deploy.sh cloud gke    # helm/kustomize to your current kube-context
    ./scripts/deploy.sh enterprise   # add mesh + observability + HA data + GitOps
    ./scripts/deploy.sh status       # inspect
    ./scripts/deploy.sh down         # tear down

To run the whole mesh without any Kubernetes (pure uvicorn, simulated zones):

    ./scripts/run_local.sh start     # http://localhost:8080
    docker compose up --build        # full stack incl. datastores + the Next.js UI

Three deployment representations are kept in lock-step: the Helm chart
(helm/netshop, the templated source of truth, 93 objects), a Kustomize
base with five hardened cloud overlays derived from it, and Ansible playbooks
for the add-ons and rollout. The enterprise bundles layer on top and are applied
by Argo CD in production.
| Path | What lives there |
|---|---|
| apps/ | the 14 FastAPI microservices, the shared library, and the Next.js web UI |
| helm/netshop/ | the templated chart that is the source of truth |
| kustomize/ | the base, components, and five cloud overlays derived from the chart |
| terraform/ | regional multi-zone cluster configs for GKE, EKS, DOKS, AKS |
| ansible/ | deployment playbooks and add-on installation |
| k8s/rendered/ | the rendered manifest, validated by kubeconform |
| k8s/service-mesh/ | Linkerd + Istio meshes, Gateway API edge, Cilium L7 policies |
| k8s/observability/ | OpenTelemetry + Jaeger, Loki + Promtail, Prometheus SLOs, Grafana |
| k8s/data-ha/ | CloudNativePG, PgBouncer, Redis Sentinel, RabbitMQ quorum, backups |
| gitops/ | Argo CD app-of-apps and Argo Rollouts |
| security/ | Kyverno policies, Trivy, cosign, SBOM, sealed/external secrets, CI |
| diagrams/ | the diagram-as-code scripts and their PNG output |
| scripts/ | smoke test, local runner, deploy.sh, vm-demo.sh |
The project is built to be checked:

    make smoke       # exercise every service in-process (incl. JWT login/verify + circuit breaker)
    make validate    # render the chart + kubeconform the output and all five overlays
    make web-build   # compile the Next.js UI
    make doc         # compile the Persian report (Tectonic)

Every standalone enterprise bundle (k8s/service-mesh, k8s/observability,
k8s/data-ha) is kubeconform-clean with operator CRDs skipped. A GitHub Actions
workflow runs the smoke tests, manifest validation, the web build, and
terraform validate on every push.














