Learning path: SWE -> Infra Engineer. Ordered by dependency. ~16 weeks.
- Processes: fork/exec, signals, PID namespaces
- cgroups v2: cpu/memory/io limits
- Filesystems: ext4, overlayfs, inodes, /proc, /sys
- System calls and process introspection: strace, lsof, perf
- Exercise: Write a minimal container in ~50 lines of bash using unshare
- Tools: strace, lsof, bpftrace, unshare
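Shells and container runtimes build on a fork/exec/wait cycle. This minimal sketch uses Python's `os` wrappers around the same system calls. It assumes a POSIX system and Python 3.9+, and `run` is a hypothetical helper name. Running the script under `strace -f` is a quick way to see the underlying calls.

```python
import os
import sys


def run(argv: list[str]) -> int:
    """Fork a child, replace it with argv[0] via exec, and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; exit without running parent cleanup.
            os._exit(127)
    # Parent: block until the child terminates, then decode its wait status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = run([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(f"child exited with {code}")
```

A negative return value means a signal killed the child, and the value is minus that signal's number. This is how `os.waitstatus_to_exitcode` reports signal deaths.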
- OSI layers 3-7, TCP handshake/congestion control, UDP
- DNS resolution chain (inspect with dig, tcpdump)
- HTTP/1.1, HTTP/2, TLS 1.3 handshake
- Load balancers (L4 vs L7), reverse proxies, NAT
- Exercise: Capture and analyze a TLS handshake with tcpdump + Wireshark
- Tools: tcpdump, Wireshark, dig, curl, netcat
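The minimal sketch below ties the TCP handshake to code. The client's `connect()` drives the three-way handshake. The server's `accept()` returns a connection the kernel has already established. It assumes Python 3.8+ and a loopback interface. `echo_once` is a hypothetical helper, and it reads with a single `recv`, which is only safe for tiny payloads.

```python
import socket
import threading


def echo_once(payload: bytes) -> bytes:
    """Send payload to a throwaway local TCP echo server and return the reply."""
    # Port 0 lets the kernel pick a free ephemeral port.
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]

    def serve() -> None:
        # accept() returns a connection whose SYN / SYN-ACK / ACK
        # handshake the kernel has already completed.
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    t = threading.Thread(target=serve)
    t.start()
    # connect() performs the client side of the three-way handshake.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)
        reply = client.recv(1024)
    t.join()
    server.close()
    return reply
```

On Linux, run `sudo tcpdump -i lo -nn tcp` while calling `echo_once(b"hi")`. You should see the SYN, SYN-ACK, ACK sequence, then the data and the FIN exchange.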
- bash scripting with jq, awk, sed
- Process management, signal trapping
- Exercise: Write a deployment health-check script that polls an API with retries
- Tools: bash, jq, awk, sed, xargs
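The exercise targets bash, but the retry logic is the same in any language. This minimal Python sketch assumes exponential backoff. `wait_healthy` and its parameters are hypothetical, and in practice `check` would wrap a real HTTP call.

```python
import time
from typing import Callable


def wait_healthy(check: Callable[[], bool], retries: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() until it returns True, backing off exponentially.

    Returns True on success, False once all retries are exhausted.
    """
    for attempt in range(retries):
        try:
            if check():
                return True
        except Exception:
            # Treat errors (timeouts, refused connections) as a failed attempt.
            pass
        if attempt < retries - 1:
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

Injecting `sleep` keeps the function testable without real waiting. In bash, the same shape is a `for` loop around `curl --fail` with a doubling `sleep`.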
- Docker internals: overlayfs layers, OCI runtime spec, runc
- Image building: multi-stage builds, layer caching, distroless
- Container networking: bridge, host, CNI
- Exercise: Build a multi-service app with Docker Compose, then deliberately break its networking
- Tools: Docker, Docker Compose, dive (image layer inspector)
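Overlayfs resolves a path by checking the writable upper layer first, then falling through to the lower layers. A whiteout in an upper layer hides the file below it. Below is a toy model with dicts standing in for layers. The `WHITEOUT` marker is a hypothetical stand-in: the kernel uses a 0/0 character device, and OCI image layers use `.wh.<name>` files.

```python
# Hypothetical marker for a whiteout entry. The kernel uses a 0/0 character
# device; OCI image layers use ".wh.<name>" files.
WHITEOUT = object()


def lookup(path: str, layers: list[dict[str, object]]) -> object:
    """Resolve path through layers ordered top (writable upper) to bottom.

    Returns the file contents, or None if the path is absent or whited out.
    """
    for layer in layers:
        if path in layer:
            entry = layer[path]
            # The topmost layer that mentions the path decides the result.
            return None if entry is WHITEOUT else entry
    return None
```

This model also explains why `RUN rm` in a later Dockerfile step does not shrink an image: the bytes stay in the lower layer and are only hidden. Multi-stage builds avoid this by copying only the final artifacts into a fresh stage.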
- Declarative vs imperative. State management.
- Terraform: providers, resources, data sources, modules, remote state
- Pulumi: infrastructure defined in general-purpose programming languages
- Exercise: Provision VPC + subnet + EC2, destroy, re-provision
- Tools: Terraform, Pulumi, OpenTofu
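Terraform builds a dependency graph from the references between resources. It creates resources in dependency order and destroys them in reverse, and it also walks independent branches in parallel. The minimal sketch below shows only the ordering, using Python's `graphlib` (3.9+). The resource addresses and helper names are hypothetical, chosen to mirror the VPC + subnet + EC2 exercise.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each maps to the set of resources it references.
deps = {
    "aws_vpc.main": set(),
    "aws_subnet.public": {"aws_vpc.main"},
    "aws_security_group.web": {"aws_vpc.main"},
    "aws_instance.web": {"aws_subnet.public", "aws_security_group.web"},
}


def apply_order(graph: dict[str, set[str]]) -> list[str]:
    """Dependencies first: the order a create/apply must respect."""
    return list(TopologicalSorter(graph).static_order())


def destroy_order(graph: dict[str, set[str]]) -> list[str]:
    """Dependents first: the reverse of the apply order."""
    return list(reversed(apply_order(graph)))
```

Destroying the VPC before the instance inside it would fail. The reversed order is what lets `terraform destroy` tear everything down cleanly.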
- Pipeline stages: build, test, scan, deploy
- Artifact management, container registries
- Deployment strategies: rolling, blue-green, canary
- Exercise: Write a GitHub Actions pipeline that builds, tests, and pushes a Docker image
- Tools: GitHub Actions, GitLab CI, ArgoCD (intro)
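The three deployment strategies differ in what they switch:

- A rolling deployment replaces instances in batches and halts when a batch fails its health check.
- Blue-green switches all traffic between two full environments.
- Canary shifts a small share of traffic first.

Below is a minimal sketch of the rolling strategy. `deploy` and `healthy` are hypothetical callbacks standing in for real tooling.

```python
from typing import Callable


def rolling_deploy(hosts: list[str], batch_size: int,
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> list[str]:
    """Deploy to hosts batch by batch; stop at the first unhealthy batch.

    Returns the hosts that were updated (including a failing batch), so a
    caller knows what to roll back.
    """
    updated: list[str] = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host)
        updated.extend(batch)
        if not all(healthy(h) for h in batch):
            # Halt: the remaining hosts keep the old version.
            break
    return updated
```

A smaller `batch_size` limits how many hosts a bad release reaches, at the cost of a slower rollout.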
- Declarative config in git as source of truth
- Pull-based reconciliation (ArgoCD/Flux)
- Branching models for infra repos
- Exercise: Deploy an app via ArgoCD with automatic sync from git
- Tools: ArgoCD, Flux, Helm
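At its core, pull-based reconciliation diffs the desired state in git against the live cluster, acts on the difference, and repeats on an interval. Below is a minimal sketch of the diff step, with plain dicts standing in for manifests. `plan` is a hypothetical name. ArgoCD and Flux compare much richer objects and handle details this ignores, such as server-populated fields.

```python
def plan(desired: dict[str, dict], live: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare desired state (from git) with live state and list the actions
    a reconciler would take to converge them."""
    actions = []
    for name in sorted(desired.keys() | live.keys()):
        if name not in live:
            actions.append(("create", name))
        elif name not in desired:
            # Real tools gate deletion behind a prune setting.
            actions.append(("delete", name))
        elif desired[name] != live[name]:
            actions.append(("update", name))
    return actions
```

A controller would run this in a loop: read git, read the cluster, apply the actions, sleep. The loop also reverts manual drift, because the live state keeps getting pulled back toward git.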
- Architecture: control plane, kubelet, etcd, scheduler
- Core objects: Pod, Deployment, Service, Ingress, ConfigMap, Secret
- Scheduling: affinity, taints/tolerations, topology spread
- Storage: PV, PVC, StorageClass, CSI
- RBAC: roles, bindings, service accounts
- Exercise: Run a 3-tier app (frontend, API, DB) on kind/k3s with proper liveness and readiness probes
- Tools: kind, k3s, kubectl, k9s, kubectx
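The scheduler's taint check can be modeled this way: a pod fits a node only if some toleration matches every `NoSchedule` and `NoExecute` taint. Below is a simplified sketch over plain dicts shaped like the YAML fields. It ignores `tolerationSeconds`, and it treats `PreferNoSchedule` as never blocking, since the scheduler only scores it. It also leaves out that `NoExecute` evicts pods that are already running.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Does one toleration match one taint? (Simplified Kubernetes semantics.)"""
    # An empty effect in the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key with Exists matches every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """A pod may land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)
```

Taints repel pods but do not attract them. To pin a workload onto tainted nodes, pair the toleration with node affinity.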
- Compute: EC2, Auto Scaling Groups, spot instances
- Networking: VPC, subnets, security groups, NAT gateway
- Storage: S3, EBS, EFS
- IAM: policies, roles, trust relationships
- Managed services: RDS, EKS, SQS, Lambda
- Exercise: Architect and deploy the same 3-tier app using AWS managed services
- Tools: AWS CLI, aws-vault, LocalStack
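IAM evaluates a single identity policy in a fixed order. An explicit `Deny` always wins. Otherwise an `Allow` grants access, and if neither applies the request is implicitly denied. The simplified sketch below assumes `Action` and `Resource` are always lists. It ignores conditions, `NotAction`, resource-based policies, SCPs, permission boundaries, and session policies.

```python
from fnmatch import fnmatchcase


def _match(patterns: list[str], value: str, ignore_case: bool = False) -> bool:
    # fnmatch covers the "*" and "?" wildcards IAM supports. It also treats
    # "[...]" specially, which IAM does not; acceptable for a sketch.
    if ignore_case:
        value = value.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def is_allowed(statements: list[dict], action: str, resource: str) -> bool:
    """Explicit Deny beats Allow; anything not allowed is implicitly denied."""
    applicable = [
        s for s in statements
        # Action names are case-insensitive in IAM; ARNs are compared as-is here.
        if _match(s["Action"], action, ignore_case=True) and _match(s["Resource"], resource)
    ]
    if any(s["Effect"] == "Deny" for s in applicable):
        return False
    return any(s["Effect"] == "Allow" for s in applicable)
```

Least privilege follows from the implicit deny: start with no statements and add narrow `Allow`s until the workload runs.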
- Sidecar pattern, Envoy proxy
- Traffic management: routing, retries, circuit breaking, timeouts
- mTLS, authentication policies
- Exercise: Install Istio on a k8s cluster and configure traffic splitting
- Tools: Istio, Linkerd, Envoy, Cilium (intro)
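The classic application-level circuit breaker is a three-state machine:

- Closed until failures pile up.
- Open (failing fast) for a cooldown.
- Half-open to let a trial request through.

Envoy, and therefore Istio, expresses the idea differently, through connection-pool limits and outlier detection in a `DestinationRule`. Treat this sketch as intuition for the pattern, not as a model of Envoy. The class and parameter names are hypothetical, and the code assumes single-threaded use.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after max_failures; open -> half-open after reset_after
    seconds; a success in half-open (or closed) resets to closed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.max_failures:
                # Trip (or re-trip) the breaker and restart the cooldown.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects the struggling upstream from retry storms, which is why breakers and retry budgets are usually configured together.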
- Pillars: logs, metrics, traces
- Structured logging (JSON), log aggregation (Loki/ELK)
- Metrics: RED (Rate/Errors/Duration), USE (Utilization/Saturation/Errors)
- Prometheus: metric types, PromQL, alerting rules
- Distributed tracing: OpenTelemetry, Jaeger
- Exercise: Instrument the 3-tier app with OpenTelemetry and build dashboards in Grafana
- Tools: Prometheus, Grafana, Loki, OpenTelemetry, Jaeger
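Prometheus histograms store cumulative bucket counts. PromQL's `histogram_quantile` estimates a quantile by linear interpolation inside the bucket where the target rank falls. The minimal sketch below mirrors that calculation. It assumes positive bucket bounds, so the first bucket starts at 0, and 0 < q < 1.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound "le", cumulative_count) pairs.

    Buckets are sorted by bound and end with math.inf, like the _bucket
    series of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan
```

For a RED-style duration panel, the PromQL form is typically `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`. The metric name there is hypothetical. Because of the interpolation, the estimate is only as precise as the bucket it lands in, so place bucket boundaries near SLO thresholds.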
- SLI -> SLO -> SLA -> Error Budget
- Incident management: runbooks, blameless postmortems
- Toil reduction: automate repetitive ops work
- Exercise: Define SLIs for the 3-tier app, simulate failures, and write a postmortem
- Tools: PagerDuty (concept), incident.io (concept)
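The step from SLO to error budget is plain arithmetic. A 99.9% availability SLO over 30 days leaves 0.1% of 43,200 minutes, about 43.2 minutes. Burn rate compares the observed error ratio with the budgeted one. Below is a minimal sketch; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability a time-based SLO tolerates per window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return (bad_events / total_events) / (1 - slo)
```

A sustained burn rate of 5 exhausts a 30-day budget in 6 days. Fast-burn alerts are built on thresholds like this.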
- Least privilege: IAM roles, k8s RBAC, network policies
- Secrets management: Vault, Sealed Secrets, SOPS
- Supply chain: image signing (cosign), SBOM (syft/grype)
- Vulnerability scanning in CI
- Exercise: Scan container images, enforce network policies, rotate secrets with Vault
- Tools: HashiCorp Vault, cosign, trivy, kyverno
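Network policies are additive allow-lists. A pod selected by no policy accepts all ingress. Once any policy selects it, only traffic that matches some rule gets through. The simplified same-namespace sketch below uses plain `matchLabels` dicts as selectors and ignores namespaces, `ipBlock`, ports, and egress. The function names are hypothetical.

```python
def selects(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """matchLabels semantics: every selector label must match; {} selects all."""
    return all(labels.get(k) == v for k, v in selector.items())


def rule_matches(rule: dict, src: dict[str, str]) -> bool:
    peers = rule.get("from", [])
    # A rule with no "from" peers admits traffic from every source.
    return not peers or any(selects(p["podSelector"], src) for p in peers)


def ingress_allowed(policies: list[dict], src: dict[str, str], dst: dict[str, str]) -> bool:
    """Simplified same-namespace NetworkPolicy ingress check (labels only)."""
    applying = [p for p in policies if selects(p["podSelector"], dst)]
    if not applying:
        # No policy selects the destination pod: it stays non-isolated.
        return True
    # Isolated pod: allowed only if some rule in some selecting policy matches.
    return any(rule_matches(r, src) for p in applying for r in p.get("ingress", []))
```

The common pattern is a default-deny policy (`podSelector: {}` with no ingress rules) plus narrow allow rules per tier. Enforcement also requires a CNI plugin that implements NetworkPolicy.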
- Resource requests/limits, HPA, VPA, cluster autoscaler
- Right-sizing, idle resource detection
- FinOps: tagging, cost allocation, reserved instances
- Exercise: Set up an HPA driven by custom Prometheus metrics and analyze costs with Kubecost
- Tools: Kubecost, kube-state-metrics, Goldilocks
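The Kubernetes documentation gives the HPA's core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. Scaling is skipped when the ratio is within a tolerance, 0.1 by default. The minimal sketch below omits the stabilization window, readiness handling for new pods, and multi-metric selection. The function name and the min/max defaults are hypothetical.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula suggests, clamped to [min, max]."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current count to avoid flapping.
        desired = current
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

The same formula applies when the metric is a custom Prometheus series exposed through an adapter such as prometheus-adapter. Only the source of `metric` changes.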
- Read docs, then break things. Understanding comes from fixing failures.
- Every tool: install, configure, break, fix, then read source code.
- Build from scratch before using managed services. Know what abstraction hides.
- Debug with a production mindset: learn strace, tcpdump, and perf early.
- Write about it. Blogging forces clarity.