This project simulates a distributed operating system across three virtual nodes and demonstrates AI-driven, load-aware process migration using Docker and CRIU.
It combines:
- Virtualization — Vagrant + VirtualBox
- Containerization — Docker
- Live process checkpointing — CRIU
- Real-time telemetry — Python (psutil) + Flask
- Visualization — Streamlit + Matplotlib
- Machine Learning — Random Forest (scikit-learn)
The system continuously monitors CPU, memory, and load across nodes.
An ML model predicts overloads and triggers live container migration from an overloaded node to a healthier one.
Built as an academic systems project (Jan–Apr 2025).
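
The decision step described above can be sketched in a few lines. This is a minimal illustration, not the project's `ai_orchestrator.py`: the `NodeMetrics` fields, the threshold rule in `is_overloaded` (a stand-in for the trained Random Forest), and the `pick_migration` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class NodeMetrics:
    """Latest reading for one node (field names are assumptions)."""
    node: str
    cpu_percent: float
    mem_percent: float
    load1: float


def is_overloaded(m: NodeMetrics) -> bool:
    # Stand-in for the classifier's prediction; thresholds are illustrative.
    return m.cpu_percent > 85.0 or m.mem_percent > 90.0


def pick_migration(
    readings: List[NodeMetrics],
    predict: Callable[[NodeMetrics], bool] = is_overloaded,
) -> Optional[Tuple[str, str]]:
    """Return (source, destination), or None when no migration is needed."""
    overloaded = [m for m in readings if predict(m)]
    healthy = [m for m in readings if not predict(m)]
    if not overloaded or not healthy:
        return None
    # Move work off the busiest overloaded node onto the least busy healthy one.
    source = max(overloaded, key=lambda m: m.cpu_percent)
    destination = min(healthy, key=lambda m: m.cpu_percent)
    return source.node, destination.node
```

Passing the model's prediction function as `predict` would keep the selection logic independent of how overload is detected.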
- Node1 – 192.168.33.11
  Central control plane:
  - Receives metrics from all nodes
  - Stores telemetry in merged_metrics.csv
  - Hosts Flask server
  - Runs Streamlit dashboard
  - Trains AI model
  - Executes migration orchestrator
- Node2 – 192.168.33.12
  Worker node:
  - Runs stress container (load_gen)
  - Sends metrics to Node1
  - Acts as source of migration
- Node3 – 192.168.33.13
  Worker node:
  - Receives checkpoint
  - Restores container
  - Becomes destination of migration
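
As a rough sketch of how Node1 might persist incoming readings to merged_metrics.csv, the helper below appends one row at a time and writes the header only once. The column names in `FIELDS` and the `append_metrics` function are assumptions for illustration; the real schema used by metrics_server.py may differ.

```python
import csv
import os
from typing import Dict

# Column names are assumptions; the real merged_metrics.csv schema may differ.
FIELDS = ["timestamp", "node", "cpu_percent", "mem_percent", "load1"]


def append_metrics(path: str, row: Dict[str, object]) -> None:
    """Append one reading, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending row by row keeps the file readable by the dashboard and the training script while metrics are still arriving.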
Node2 ─┐
       ├── HTTP Metrics ──> Node1 (Flask + CSV + Streamlit)
Node3 ─┘
Node1 ── AI Decision ──> Node2 (Checkpoint via CRIU)
Node2 ── SCP ─────────> Node3
Node3 ── Restore ─────> Container continues execution
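
The "HTTP Metrics" arrow in the flow above could look roughly like this on a worker node. This is a sketch using only the standard library (the project itself uses psutil and requests); the `/metrics` route, port 5000 (Flask's default), and the payload field names are assumptions, since the actual endpoint of metrics_server.py is not shown here.

```python
import json
import time
import urllib.request

# Hypothetical endpoint; the real route and port of metrics_server.py may differ.
SERVER_URL = "http://192.168.33.11:5000/metrics"


def build_request(node: str, cpu: float, mem: float, load1: float,
                  url: str = SERVER_URL) -> urllib.request.Request:
    """Package one reading as a JSON POST request without sending it."""
    payload = {
        "timestamp": time.time(),
        "node": node,
        "cpu_percent": cpu,
        "mem_percent": mem,
        "load1": load1,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A sender loop would call `urllib.request.urlopen(build_request(...), timeout=5)` at a fixed interval, with the readings supplied by psutil.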
ai-process-migration/
│
├── Vagrantfile
│
├── node_common/
│ ├── Dockerfile
│ ├── load_generator.py
│ └── requirements.txt
│
├── node2_node3/
│ └── metrics_sender.py
│
├── node1_server/
│ ├── metrics_server.py
│ ├── live_dashboard.py
│ ├── metrics_plotter.py
│ ├── train_ai_model.py
│ └── ai_orchestrator.py
    vagrant up

Open three terminals:

    vagrant ssh node1
    vagrant ssh node2
    vagrant ssh node3

On Node1:

    cd ~/app
    pip3 install flask pandas matplotlib streamlit scikit-learn joblib psutil requests
    python3 metrics_server.py

In another terminal on Node1:

    streamlit run live_dashboard.py --server.address=192.168.33.11

Open in browser:

    http://192.168.33.11:8501

On both Node2 and Node3:

    cd ~/app
    pip3 install psutil requests
    python3 metrics_sender.py &

Build and run the load generator:

    docker build -t load_gen .
    docker run -d --name generator load_gen

After some metrics accumulate:

    python3 train_ai_model.py

This produces:

load_classifier.joblib
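
Before a classifier can be trained, the collected rows need to be turned into feature vectors and labels. How the project labels its data is not stated here, so the sketch below assumes a simple threshold rule; the column names, the limits, and the `load_training_data` helper are all hypothetical.

```python
import csv
import io
from typing import List, Tuple


def load_training_data(
    csv_text: str,
    cpu_limit: float = 85.0,
    mem_limit: float = 90.0,
) -> Tuple[List[List[float]], List[int]]:
    """Parse metrics CSV text into (features, labels); 1 means overloaded."""
    features: List[List[float]] = []
    labels: List[int] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cpu = float(row["cpu_percent"])
        mem = float(row["mem_percent"])
        load1 = float(row["load1"])
        features.append([cpu, mem, load1])
        # Assumed labeling rule: overloaded when CPU or memory exceeds its limit.
        labels.append(int(cpu > cpu_limit or mem > mem_limit))
    return features, labels
```

The resulting lists could then be passed to scikit-learn's `RandomForestClassifier().fit(X, y)` and the fitted model saved with `joblib.dump`.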
    python3 ai_orchestrator.py

What happens:
- Latest metrics are read
- AI model predicts overload
- If overloaded:
  - Node2 checkpoints container using CRIU
  - Checkpoint is SCP-ed to Node3
  - Node3 restores container
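
The checkpoint, transfer, and restore steps above can be sketched as a list of commands issued from Node1. This assumes the migration goes through Docker's experimental `docker checkpoint` feature (which relies on CRIU) rather than calling `criu` directly; the source does not say which. The `vagrant` SSH user, the `/tmp/checkpoints` directory, the checkpoint name `cp1`, and the `migration_commands` helper are hypothetical. The function only builds the commands; it does not run them.

```python
from typing import List


def migration_commands(
    container: str = "generator",
    image: str = "load_gen",
    checkpoint: str = "cp1",                 # hypothetical checkpoint name
    src: str = "vagrant@192.168.33.12",      # Node2, assumed SSH user
    dst: str = "vagrant@192.168.33.13",      # Node3, assumed SSH user
    ckpt_dir: str = "/tmp/checkpoints",      # hypothetical directory
) -> List[List[str]]:
    """Build (but do not run) the commands for one Node2 -> Node3 migration."""
    return [
        # 1. Checkpoint the running container on the source node.
        ["ssh", src, "docker", "checkpoint", "create",
         "--checkpoint-dir", ckpt_dir, container, checkpoint],
        # 2. Make sure the target directory exists on the destination.
        ["ssh", dst, "mkdir", "-p", ckpt_dir],
        # 3. Copy the checkpoint directory from source to destination.
        ["ssh", src, "scp", "-r", f"{ckpt_dir}/{checkpoint}", f"{dst}:{ckpt_dir}/"],
        # 4. Create a stopped container from the same image on the destination.
        ["ssh", dst, "docker", "create", "--name", container, image],
        # 5. Start it from the transferred checkpoint.
        ["ssh", dst, "docker", "start", "--checkpoint", checkpoint,
         "--checkpoint-dir", ckpt_dir, container],
    ]
```

Each command could be executed with `subprocess.run(cmd, check=True)` so that a failure at any step stops the migration. Docker's checkpoint support requires experimental mode and a working CRIU installation on both nodes, and restoring from a custom checkpoint directory may not work on every Docker version.
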
- Live CPU/memory/load graphs in Streamlit
- Real-time metrics flowing over HTTP
- Stress containers consuming resources
- Checkpoint creation on Node2
- Transfer of CRIU state
- Restore attempt on Node3
- AI model training & inference
Even where the CRIU restore step fails because of kernel limitations, every stage leading up to it (telemetry, prediction, checkpoint creation, and state transfer) runs for real.
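
One way to find out in advance whether a node's kernel supports checkpoint/restore is CRIU's own `criu check` command. The wrapper below is a small sketch; the `criu_supported` name is hypothetical, and the command usually needs root privileges to give a meaningful answer.

```python
import shutil
import subprocess
from typing import Optional


def criu_supported(timeout: float = 30.0) -> Optional[bool]:
    """Return True/False from `criu check`, or None if criu is not installed."""
    criu = shutil.which("criu")
    if criu is None:
        return None
    try:
        result = subprocess.run(
            [criu, "check"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # criu exits with status 0 when the kernel checks pass.
    return result.returncode == 0
```

Running this on Node2 and Node3 before triggering a migration would let the orchestrator skip nodes where a restore is unlikely to succeed.
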
- Distributed OS design using virtualization
- Container internals and live process state
- Real-time telemetry pipelines
- OS-level checkpoint/restore
- AI-driven orchestration logic
- Debugging low-level Linux systems
This project bridges Operating Systems, Distributed Systems, and Machine Learning into one cohesive system.
| Component | Technology Used |
|---|---|
| Virtual Machines | Vagrant + VirtualBox |
| Containers | Docker |
| Process Migration | CRIU |
| Monitoring | psutil + Flask |
| Visualization | Streamlit + Matplotlib |
| AI Model | Random Forest (scikit-learn) |
This project was built collaboratively by:
- Bind Pratap Singh
- Krishna Garg
- Parth Agrawal
- Ayush Tiwari
“This project explores a future where the operating system itself becomes intelligent - predicting overloads and moving live workloads across machines in real time.”