dwin-gharibi/netshop

NetShop — A Multi-Zone Kubernetes Microservice Platform

An event-driven e-commerce platform on multi-zone Kubernetes.

An online store built as a full platform: 16 services across four tiers, a service mesh with mutual TLS, a three-pillar observability stack, a highly-available replicated data tier, a GitOps delivery pipeline with supply-chain security, multi-zone high availability, multi-cloud infrastructure-as-code, 35+ diagrams generated from the code, and a detailed companion report.

NetShop storefront The storefront UI (Next.js + Tailwind): live star ratings from the reviews-service, an animated hero band of live cluster metrics, and a real basket → checkout → saga flow.


What this project is

NetShop is an event-driven online store that exercises the full networking stack of a real Kubernetes platform. It is a working shop (browse, search, add to cart, check out) backed by sixteen services across an edge, application, worker and data tier, and it runs the way a production system does: multi-zone, autoscaled, and locked down with zero-trust network policies.

In a microservice system the network is the system. The calls that used to happen in memory now cross the network, so service discovery, load balancing, routing, access control and traffic observability decide whether it works at all. NetShop makes those mechanics explicit, then adds what a real platform needs on top: a service mesh, a three-pillar observability stack, a replicated data tier, GitOps delivery and supply-chain security.

The picture below shows the whole platform: all 16 services across the four tiers, the HA data stores, the observability and GitOps/policy control planes, all inside one mutually-authenticated mesh.

Full integrated platform The complete platform on one canvas — edge, application, worker and data tiers, plus observability and delivery, inside the mTLS mesh.


The live UI

The web app has three views. The storefront is the shop (browse, search, recommendations, ratings, basket, checkout). The cluster console reads live state from the Kubernetes API and draws the service graph as traffic flows. The platform view lists every service and enterprise capability, with a working JWT sign-in against the auth-service.

Cluster console The cluster console: a service-mesh graph of all 16 services coloured by availability zone, a control deck to send traces, place orders and drive the autoscaler, and live traffic-by-zone bars.

Platform overview The platform view: a live JWT sign-in, a catalog of all 16 services with their tier and datastore, and a card for each enterprise capability — mesh, observability, HA data, GitOps, supply-chain security and zero-trust.


Demo videos

Fifteen short recordings of the running system, each highlighting one networking concept. The GIFs below autoplay; click any title for the full-quality MP4. Full write-ups and "how it works" explainers are in demos/README.md.

Platform & networking

Service mesh & multi-zone: service discovery + load balancing across zone-a / zone-b / zone-c

Order saga (east/west traffic): a checkout fans out across the microservices

Platform & identity: JWT sign-in against auth-service + the service & capability catalog

Load balancing: a traffic burst balanced across pods and zones

Search across services: synchronous gateway → search-service → products-service calls

Recommendations & ratings: recommendations-service composition + reviews-service ratings

Full tour: the whole system end to end

Live service mesh: the mesh graph pulsing under continuous traffic

Trace path: one request's path lighting up the graph, hop by hop

Control deck: the console controls (trace, order, load burst) in action

Observability & messaging

Grafana (golden signals): the shipped dashboard with request rate, p95 latency and the zone pie, live

Grafana (request rate): the rate-per-service panel ramping under a traffic burst

Prometheus: all 14 /metrics targets scraped + a live PromQL rate query

Jaeger (distributed tracing): one request's 55-span waterfall across 10 services, with timing

RabbitMQ (event bus): the netshop.events exchange fanning order.created to 3 worker queues, live


System architecture

The platform is organised into four layers. At the edge sit the browser-facing web app (Next.js), the frontend backend-for-frontend, and the api-gateway. The application layer holds the domain services. The worker layer reacts to events asynchronously and never sits on the request path. The data layer provides PostgreSQL, Redis, and a RabbitMQ event bus.

System architecture The full event-driven architecture: synchronous request paths (solid) and the asynchronous order.created fan-out to the worker tier (dashed).


The sixteen services

Service | Tier | Role
web | edge | Next.js storefront + live cluster console
frontend | edge | backend-for-frontend; aggregates + reads live cluster state
api-gateway | edge | north/south routing, JWT enforcement, trace endpoints
users-service | app | customer profiles (PostgreSQL)
products-service | app | product catalogue (PostgreSQL)
cart-service | app | shopping cart (Redis)
orders-service | app | order saga coordinator (PostgreSQL + RabbitMQ)
payments-service | app | synchronous payment authorisation (PostgreSQL)
search-service | app | product search
recommendations-service | app | recommendations from catalogue + event analytics
auth-service | app | issues & verifies JWTs; the identity authority
reviews-service | app | product reviews & ratings (PostgreSQL)
inventory-service | worker | stock; consumes order.created
notifications-service | worker | notifications; consumes order.created
analytics-service | worker | revenue/sales aggregation (Redis)
postgres / redis / rabbitmq | data | database, cache, event bus

Request & event flows

A request enters at the edge and is forwarded same-origin to the BFF and then the gateway, which routes it to the right service.

Edge / BFF path The edge path: browser → ingress → web → BFF → api-gateway, with the BFF also reading live cluster state from the Kubernetes API.

Order request flow A synchronous order request as it traverses the gateway and the application services down to the datastore.

The most instructive flow is checkout, which runs a saga: the orders service validates the user, reads the basket, prices the items, authorises payment, and only then persists the order — after which it simply publishes an order.created event and returns. Three independent workers pick that event up on their own schedule.

Order saga The order saga: synchronous validation + persistence, then an asynchronous event fan-out so slow background work never blocks the customer.
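
The ordering of the saga steps can be sketched in a few lines. This is a minimal in-process illustration under stated assumptions, not the orders-service code: the name `place_order`, the `Order` fields, and the collaborators passed in as plain callables are all hypothetical stand-ins for the real HTTP calls and the RabbitMQ publisher.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Order:
    user_id: str
    items: dict[str, int]  # product id -> quantity
    total: float
    status: str = "confirmed"


def place_order(
    user_id: str,
    user_exists: Callable[[str], bool],
    read_basket: Callable[[str], dict[str, int]],
    price_of: Callable[[str], float],
    authorise: Callable[[str, float], bool],
    persist: Callable[[Order], None],
    publish: Callable[[str, dict], None],
) -> Order:
    """Run the synchronous steps in order; publish only after the order is stored."""
    if not user_exists(user_id):
        raise ValueError("unknown user")
    items = read_basket(user_id)
    if not items:
        raise ValueError("empty basket")
    total = sum(price_of(pid) * qty for pid, qty in items.items())
    if not authorise(user_id, total):
        raise RuntimeError("payment declined")
    order = Order(user_id, items, total)
    persist(order)
    # Fire-and-forget: the workers consume this event on their own schedule.
    publish("order.created", {"user_id": user_id, "total": total})
    return order
```

Because every failure raises before `persist` and `publish` run, a declined payment leaves neither a stored order nor an event behind.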

Event-driven workers The RabbitMQ topic exchange routing order.created to the inventory, notifications and analytics workers — each on its own queue, resilient to brief outages.
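
To make the fan-out concrete, here is a small in-memory model of a topic exchange. It is a sketch only: the `TopicExchange` class, the queue names and the binding patterns are assumptions for illustration, and the real system uses RabbitMQ rather than this code. The matching rule follows the AMQP topic convention, where `*` matches exactly one word and `#` matches zero or more words.

```python
from collections import defaultdict


def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more."""
    def match(p: list[str], k: list[str]) -> bool:
        if not p:
            return not k
        if p[0] == "#":
            # '#' either absorbs nothing, or absorbs one word and stays active.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if k and (p[0] == "*" or p[0] == k[0]):
            return match(p[1:], k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


class TopicExchange:
    """Hypothetical in-memory stand-in for the netshop.events exchange."""

    def __init__(self) -> None:
        self.bindings: list[tuple[str, str]] = []  # (pattern, queue name)
        self.queues: dict[str, list[dict]] = defaultdict(list)

    def bind(self, queue: str, pattern: str) -> None:
        self.bindings.append((pattern, queue))

    def publish(self, routing_key: str, body: dict) -> None:
        # Each matching queue receives its own copy, at most once per message.
        targets = {q for p, q in self.bindings if topic_matches(p, routing_key)}
        for queue in targets:
            self.queues[queue].append(body)


exchange = TopicExchange()
for worker in ("inventory", "notifications", "analytics"):  # assumed queue names
    exchange.bind(worker, "order.created")
exchange.publish("order.created", {"order_id": 1})
```

Each worker drains its own queue independently, which is why a slow or briefly unavailable consumer does not hold up the others.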


The networking story

Services find each other by name, not IP: CoreDNS resolves a service name to a stable ClusterIP and kube-proxy load-balances each connection across the ready pods behind it.

CoreDNS & kube-proxy Service discovery: CoreDNS resolves svc.cluster.local names; kube-proxy spreads connections across pods (which live in different zones).
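
The two halves of discovery, a stable name and a per-connection choice among ready pods, can be modelled as follows. This is a simplified sketch: the `netshop` namespace, the endpoint dictionaries and the uniform random pick are assumptions for illustration, and real backend selection depends on the kube-proxy mode in use.

```python
import random


def service_fqdn(service: str, namespace: str = "netshop",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name that CoreDNS resolves to a ClusterIP."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pick_backend(endpoints: list[dict], rng=random) -> dict:
    """Choose one ready endpoint for a new connection (illustrative only)."""
    ready = [e for e in endpoints if e["ready"]]
    if not ready:
        raise LookupError("no ready endpoints behind this service")
    return rng.choice(ready)


endpoints = [
    {"ip": "10.0.0.1", "zone": "zone-a", "ready": True},
    {"ip": "10.0.0.2", "zone": "zone-b", "ready": False},  # failing readiness probe
    {"ip": "10.0.0.3", "zone": "zone-c", "ready": True},
]
print(service_fqdn("cart-service"), "->", pick_backend(endpoints)["ip"])
```

The caller only ever sees the name; pods that are not ready simply drop out of the candidate set.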

Just as important as enabling traffic is restricting it. A default-deny policy covers the namespace, and only the specific edges that genuinely exist in the service graph are opened — so a compromised service cannot wander toward the database.

Zero-trust NetworkPolicies The layered, zero-trust NetworkPolicies — a default-deny baseline plus explicit allows that mirror the service graph exactly, with an egress lock.
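
The default-deny idea reduces to an allow-list lookup: a connection is permitted only if its edge is explicitly listed. The sketch below models that rule in plain Python. The edge set is an illustrative subset inferred from the services table, not the project's actual NetworkPolicy manifests, and `is_allowed` is a hypothetical helper.

```python
# Illustrative subset of the service graph (assumed, not read from the manifests).
ALLOWED_EDGES = frozenset({
    ("api-gateway", "search-service"),
    ("search-service", "products-service"),
    ("products-service", "postgres"),
    ("cart-service", "redis"),
    ("orders-service", "rabbitmq"),
})


def is_allowed(src: str, dst: str, allowed: frozenset = ALLOWED_EDGES) -> bool:
    """Default deny: anything not explicitly listed is refused."""
    return (src, dst) in allowed


print(is_allowed("api-gateway", "search-service"))  # listed edge
print(is_allowed("search-service", "postgres"))     # not listed, so denied
```

Note that edges are directional and not transitive: being allowed to reach products-service does not grant search-service a path to the database.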

To survive the loss of an entire zone, every workload is spread across zones and nodes with topology-spread constraints and anti-affinity.

Multi-zone topology: control plane and workers across three availability zones, with pods spread evenly so losing a zone removes only part of the capacity.
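
A quick calculation shows what even spreading buys. The sketch below assumes an idealised round-robin placement, which is the outcome a spread constraint with a maximum skew of 1 aims for; the real scheduler weighs many more factors, and the helper names are hypothetical.

```python
from collections import Counter

ZONES = ["zone-a", "zone-b", "zone-c"]


def spread(replicas: int, zones: list[str] = ZONES) -> Counter:
    """Idealised even placement: assign replicas to zones round-robin."""
    return Counter(zones[i % len(zones)] for i in range(replicas))


def max_skew(placement: Counter, zones: list[str] = ZONES) -> int:
    """Difference between the fullest and emptiest zone."""
    counts = [placement.get(z, 0) for z in zones]
    return max(counts) - min(counts)


def surviving_fraction(placement: Counter, lost_zone: str) -> float:
    """Share of replicas still running after one zone disappears."""
    total = sum(placement.values())
    return (total - placement.get(lost_zone, 0)) / total


placement = spread(6)
print(dict(placement), max_skew(placement), surviving_fraction(placement, "zone-b"))
```

With six replicas over three zones, losing any single zone leaves two thirds of the capacity serving traffic.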

Namespace isolation boundaries that complement the NetworkPolicies and RBAC.


Scaling, storage & governance

Every application and worker carries a Horizontal Pod Autoscaler.

HPA pipeline Horizontal pod autoscaling: metrics-server → HPA → deployment, with separate stabilisation windows for fast scale-up and cautious scale-down.
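
The core of the HPA decision is the documented Kubernetes scaling rule: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. The sketch below implements only that rule. The replica bounds are assumed values, and the stabilisation windows mentioned in the caption are deliberately not modelled.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA ratio rule, then clamp to the configured bounds.

    min_replicas and max_replicas are assumptions for this example.
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


print(desired_replicas(4, metric=90, target=60))  # load above target: scale up
print(desired_replicas(4, metric=62, target=60))  # within tolerance: unchanged
print(desired_replicas(4, metric=30, target=60))  # load below target: scale down
```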

Custom-metric autoscaling Going beyond CPU: autoscaling on custom/external metrics (KEDA / Prometheus adapter) plus the cluster autoscaler adding nodes.

Volumes & storage Persistent storage: PVCs bound to PVs via StorageClasses, with per-cloud zonal disks.

Governance Namespace governance: resource quotas, limit ranges, Pod Security Admission and least-privilege RBAC.


The enterprise platform

Five cross-cutting capabilities round out the platform. Each is independent, declarative, and validated.

Service mesh & zero-trust networking

L3/L4 NetworkPolicies leave two gaps: east/west traffic is plaintext, and identity is tied to IPs rather than cryptographic workload identity. A service mesh closes both. NetShop ships two equivalent meshes (pick one) under k8s/service-mesh/: Linkerd (auto-injection, automatic mTLS, Server/AuthorizationPolicy locking the data tier to in-mesh identities) and Istio (PeerAuthentication STRICT, deny-all + per-service authorization, an L7 rule on the gateway). The edge moves from classic Ingress to the Gateway API (HTTPS listener, HTTP→HTTPS redirect, weighted canary), and Cilium adds L7 HTTP allow-lists and DNS-aware egress.

Service mesh & mTLS Envoy/Linkerd sidecars give every pod mutual TLS and a cryptographic identity; the Gateway API replaces classic Ingress at the edge.

Observability — the three pillars

Under k8s/observability/ the platform gains metrics, logs and traces, all correlated. An OpenTelemetry Collector enriches spans, derives RED metrics, and exports to Jaeger; the shared library auto-instruments FastAPI and httpx so one trace spans the whole call chain. Loki + Promtail ship structured logs correlated by trace id, and Prometheus carries multi-window burn-rate SLO rules feeding a golden-signals Grafana dashboard.

Distributed tracing: every service exports OTLP spans to the OpenTelemetry Collector, which forwards them to Jaeger.
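
One trace can span the whole call chain because each hop forwards the trace id and mints a fresh span id. The sketch below shows that mechanic using the W3C Trace Context `traceparent` header layout (version, 16-byte trace id, 8-byte parent span id, flags). It is an illustration of the idea, not the shared library's code: in NetShop the instrumentation handles this automatically, and both function names here are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: random trace id and span id, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Keep the incoming trace id, replace the span id for the outgoing call."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # e.g. created at the gateway
outgoing = child_traceparent(incoming)  # sent on the next hop's request headers
print(incoming)
print(outgoing)
```

Every span that shares the same trace id ends up in one waterfall in the tracing backend.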

The three pillars Metrics, logs and traces flowing into Prometheus, Loki/Tempo and Jaeger, unified in Grafana with metric↔trace↔log correlation.

Prometheus / Grafana The metrics pipeline: ServiceMonitors scrape every pod; alert rules and a Grafana dashboard surface the golden signals.
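
A multi-window burn-rate rule can be expressed compactly. Burn rate is the observed error ratio divided by the error budget (1 − SLO); an alert fires only when both a long and a short window exceed the threshold, so a problem that has already recovered does not page. The 99.9% SLO and the 14.4 threshold below are commonly cited example values and are assumptions here, not the values in the project's Prometheus rules.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Fire only if both windows burn above the threshold (assumed values)."""
    return (burn_rate(long_window_errors, slo) > threshold
            and burn_rate(short_window_errors, slo) > threshold)


# 2% errors over the long window, 3% over the short one: still burning, so page.
print(should_page(0.02, 0.03))
# Long window is still elevated but the short window has recovered: no page.
print(should_page(0.02, 0.0005))
```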

Highly-available data tier

The default single-instance datastores become a replicated, self-healing tier under k8s/data-ha/: CloudNativePG runs Postgres as 1 primary + 2 replicas with synchronous quorum replication (RPO≈0) fronted by PgBouncer; Redis runs with Sentinel; RabbitMQ runs as a 3-node quorum cluster; and Barman archives WAL continuously for point-in-time recovery.

HA data tier The HA data tier: CloudNativePG primary + replicas with a connection pooler, Redis + Sentinel, and a RabbitMQ quorum cluster — all spread across zones.
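
The reason a three-node quorum cluster survives one node failure is simple majority arithmetic, sketched below. This models only the generic majority rule used by consensus-based systems such as RabbitMQ quorum queues; it says nothing about the exact synchronous-replication settings of the Postgres cluster, and the helper names are hypothetical.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority remains."""
    return nodes - majority(nodes)


def can_commit(acks: int, nodes: int) -> bool:
    """A write is safe once a majority of nodes has acknowledged it."""
    return acks >= majority(nodes)


print(majority(3), tolerated_failures(3), can_commit(2, 3))
```

Spreading the three nodes across three zones means a single zone outage removes at most one vote.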

Backup & disaster recovery Backup & DR: continuous WAL archiving + scheduled backups to object storage, with point-in-time restore into a recovery cluster.

Identity & JWT authentication

auth-service is the platform's identity authority: it issues and verifies short-lived HS256 JWTs, with the signing secret injected from a Kubernetes Secret. The gateway passes /api/auth/* through and can verify tokens before forwarding protected calls.

Auth & authorization Authentication / authorization: the gateway obtains and verifies JWTs via auth-service before granting access to downstream services.

JWT lifecycle The end-to-end JWT lifecycle: login issues a signed token; a later request carries it, the gateway verifies it with auth-service, and only then reaches the protected resource.

Reviews & ratings

reviews-service owns product reviews on PostgreSQL (with an in-memory fallback so it runs offline). The storefront shows live star ratings pulled straight from it.

Reviews flow The reviews read/write paths flowing through the HA Postgres pooler, with the products-service consulted for validity.

GitOps & supply-chain security

Delivery is declarative and self-healing under gitops/ and security/. Argo CD uses an app-of-apps pattern with an AppProject security boundary; Argo Rollouts drives canary/blue-green deploys with automated analysis; and the supply chain runs lint → smoke → build → Trivy scan → SBOM → cosign sign → push → GitOps deploy, with Kyverno enforcing signed images and policy.

GitOps: Git is the single source of truth; Argo CD's app-of-apps syncs and self-heals the cluster to match it.

Supply chain The supply chain: build → Trivy scan → cosign sign → SBOM → Kyverno admission verify, so only signed, scanned images ever run.

Progressive delivery: Argo Rollouts shifts traffic to a canary while an AnalysisRun watches SLO metrics, promoting or rolling back automatically.
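
The promote-or-roll-back loop can be sketched as a simple gate over increasing traffic weights. This is a conceptual model only, not how Argo Rollouts is configured or implemented: the step weights, the 99% success threshold and the `run_canary` helper are all assumptions for illustration.

```python
from typing import Callable, Sequence


def run_canary(steps: Sequence[int],
               success_rate_at: Callable[[int], float],
               min_success: float = 0.99) -> tuple[str, int]:
    """Walk the traffic weights; abort at the first step that fails analysis.

    Returns ("promoted", 100) or ("rolled_back", weight_where_it_failed).
    """
    for weight in steps:
        if success_rate_at(weight) < min_success:
            return ("rolled_back", weight)
    return ("promoted", 100)


# A healthy canary passes every step and is promoted to full traffic.
print(run_canary((10, 25, 50), lambda weight: 0.999))
# A canary that degrades under more load is rolled back at that step.
print(run_canary((10, 25, 50), lambda weight: 0.95 if weight >= 25 else 0.999))
```

Starting with a small weight limits how many users see a bad release before the analysis catches it.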

Full CI/CD The full pipeline end to end: lint → smoke → build → scan → sign → push → GitOps deploy.

CI/CD overview The CI/CD overview that the GitHub Actions workflow implements.

Security & defense in depth

Defense in depth: edge security, zero-trust NetworkPolicies, Pod Security, RBAC, and non-root, read-only containers.

Enterprise defense layers The enterprise security layers (L1–L5), from the network edge down to workload identity and policy.

Multi-cluster DR & the control plane

Multi-cluster DR Multi-cluster / disaster-recovery topology spanning regions for resilience to a full-region loss.

Platform control plane The platform control plane: GitOps bootstrapping the mesh, secrets, observability and policy add-ons onto a fleet of clusters.


Multi-cloud infrastructure

The entire cluster can be provisioned as code on four major clouds.

Multi-cloud Terraform Terraform provisions a regional, multi-zone cluster on GKE, EKS, DOKS or AKS — so the multi-zone design is a configured fact, not a claim.


Rendered straight from the manifests

Two diagrams are drawn automatically from the actual rendered Kubernetes manifests by KubeDiagrams, so they match what really deploys.

KubeDiagrams — full Auto-generated from k8s/rendered/netshop.yaml — every Deployment, Service, HPA, PDB, Secret and NetworkPolicy the chart produces.

KubeDiagrams — GKE overlay The same, auto-generated from the GKE kustomize overlay.


Deploying it

On a freshly provisioned single VM (Ubuntu 22.04/24.04, 4 vCPU / 16 GB), the whole demo is one command. It installs Docker, kubectl, Helm and kind, builds the images, creates a 3-zone cluster, deploys everything, and prints the public URL:

git clone <repo> && cd netshop
sudo ./scripts/vm-demo.sh            # fresh server → live demo in ~5–10 min

Or use the unified deploy wrapper:

./scripts/deploy.sh local            # build images + 3-zone kind cluster + deploy
./scripts/deploy.sh cloud gke        # helm/kustomize to your current kube-context
./scripts/deploy.sh enterprise       # add mesh + observability + HA data + GitOps
./scripts/deploy.sh status           # inspect
./scripts/deploy.sh down             # tear down

To run the whole mesh without any Kubernetes (pure uvicorn, simulated zones):

./scripts/run_local.sh start         # http://localhost:8080
docker compose up --build            # full stack incl. datastores + the Next.js UI

Three deployment representations are kept in lock-step: the Helm chart (helm/netshop, the templated source of truth, 93 objects), a Kustomize base with five hardened cloud overlays derived from it, and Ansible playbooks for the add-ons and rollout. The enterprise bundles layer on top and are applied by Argo CD in production.


Repository layout

Path | What lives there
apps/ | the 14 FastAPI microservices, the shared library, and the Next.js web UI
helm/netshop/ | the templated chart that is the source of truth
kustomize/ | the base, components, and five cloud overlays derived from the chart
terraform/ | regional multi-zone cluster configs for GKE, EKS, DOKS, AKS
ansible/ | deployment playbooks and add-on installation
k8s/rendered/ | the rendered manifest, validated by kubeconform
k8s/service-mesh/ | Linkerd + Istio meshes, Gateway API edge, Cilium L7 policies
k8s/observability/ | OpenTelemetry + Jaeger, Loki + Promtail, Prometheus SLOs, Grafana
k8s/data-ha/ | CloudNativePG, PgBouncer, Redis Sentinel, RabbitMQ quorum, backups
gitops/ | Argo CD app-of-apps and Argo Rollouts
security/ | Kyverno policies, Trivy, cosign, SBOM, sealed/external secrets, CI
diagrams/ | the diagram-as-code scripts and their PNG output
scripts/ | smoke test, local runner, deploy.sh, vm-demo.sh

Verifying everything

The project is built to be checked:

make smoke        # exercise every service in-process (incl. JWT login/verify + circuit breaker)
make validate     # render the chart + kubeconform the output and all five overlays
make web-build    # compile the Next.js UI
make doc          # compile the Persian report (Tectonic)

Every standalone enterprise bundle (k8s/service-mesh, k8s/observability, k8s/data-ha) is kubeconform-clean with operator CRDs skipped. A GitHub Actions workflow runs the smoke tests, manifest validation, the web build, and terraform validate on every push.
