A robust, real-time human action recognition system powered by MediaPipe pose estimation, a PyTorch LSTM sequence model, FastAPI, and Gradio.
This project provides an end-to-end Machine Learning pipeline for real-time human action recognition. It extracts 3D skeletal keypoints from a continuous video stream using the modern MediaPipe Tasks API, processes them through a sliding window temporal buffer, and predicts the current action using a highly accurate PyTorch LSTM model.
The system includes tools for custom data collection, model training with MLflow tracking, an invisible FastAPI inference backend, and interactive user interfaces (both web and desktop).
- Real-Time Pose Extraction: Uses MediaPipe's robust
PoseLandmarkerfor high-speed, lightweight skeletal tracking. - Temporal Action Modeling: A multi-layer LSTM model that understands the sequence and timing of movements over a 30-frame window.
- End-to-End Pipeline: Includes a custom dataset collector, automated training loops, and inference engines.
- Dual Interfaces: Test your model via a beautiful Gradio Web App or a low-latency OpenCV Desktop Demo.
- Production Ready: Fully containerized backend using Docker and
uvicorn. - Experiment Tracking: Built-in MLflow integration to log metrics, confusion matrices, and model artifacts.
graph TD
A[Webcam / Video File] -->|Raw Frames| B(MediaPipe Tasks API)
B -->|132-dim Keypoints| C{Sliding Window Buffer}
C -->|Sequence of 30 Frames| D[PyTorch LSTM Model]
D -->|Action Probabilities| E(Action Classifier Engine)
E --> F[FastAPI Backend Server]
E --> G[Gradio Web Interface]
E --> H[OpenCV Realtime Demo]
I[MLflow Tracking Server] -.->|Logs & Artifacts| D
I -.->|Model Checkpoints| E
The model is pre-configured to classify the following 6 actions:
| Action | Description |
|---|---|
| push_up | Distinguishing motion in shoulders, elbows, and wrists moving up and down in a plank position. |
| squat | Distinguishing motion in hips, knees, and ankles with vertical body translation. |
| wave | Distinguishing repetitive side-to-side motion of an arm/wrist above shoulder level. |
| jumping_jack | Distinguishing synchronized outward and inward motion of both arms and legs. |
| sitting | Static downward compression of the lower body. |
| idle | Minimal to no distinguishing motion; standing still. |
| Layer | Tool | Purpose |
|---|---|---|
| Pose Extraction | MediaPipe | Extract 3D skeletal keypoints from video frames in real-time. |
| Deep Learning | PyTorch | Implement and train the LSTM sequence classification model. |
| API Backend | FastAPI | Serve the trained model via high-performance REST endpoints. |
| Web UI | Gradio | Provide a beautiful web interface for uploading and analyzing videos. |
| Tracking | MLflow | Track training runs, metrics, and download best model versions. |
| Containerization | Docker | Package the API server and MLflow tracking into portable containers. |
pose-action-recognition/
├── data/
│ └── sequences/ # Raw .npy keypoint sequence arrays
├── models/ # Saved PyTorch artifacts (best_lstm.pt)
├── app/
│ ├── main.py # FastAPI application entrypoint
│ ├── inference.py # Inference engine & ActionClassifier class
│ └── model.py # PyTorch LSTM architecture definition
├── train/
│ ├── dataset.py # PyTorch Dataset for loading sequence arrays
│ └── train.py # Training loop, evaluation, and MLflow logging
├── data_collector.py # Script to record webcam keypoints to disk
├── realtime_demo.py # OpenCV live webcam inference demonstration
├── gradio_app.py # Gradio web interface for video file analysis
├── mlflow_utils.py # MLflow helper scripts (setup, download artifact)
├── requirements.txt # Local/CPU Python dependencies
├── requirements_colab.txt # Colab/GPU training dependencies
├── Dockerfile # Container definition for the FastAPI server
└── docker-compose.yml # Orchestration for API and MLflow services
git clone https://github.com/anantha037/pose-action-recognition.git
cd pose-action-recognitionpython -m venv venv
# Windows
.\venv\Scripts\activate
# Linux/Mac
source venv/bin/activateHardware Note: This installation specifically uses the CPU-only PyTorch wheel to save >2GB of local disk space, as CPU inference is exceptionally fast (<10ms per frame) for this architecture.
pip install -r requirements.txtLaunch the beautiful, interactive web interface to upload action videos, adjust confidence thresholds, and generate annotated summaries.
python gradio_app.pyAccess at
http://localhost:7860
Run inference locally against your live webcam feed using an OpenCV overlay.
python realtime_demo.pyRun the backend server and MLflow tracking UI natively without polluting your host machine.
# Start all services
docker-compose up --build -d
# Check API health
curl http://localhost:8000/docsAPI available at
http://localhost:8000MLflow dashboard available athttp://localhost:5000
Want to add a new action? Stand in front of your webcam and record sequences directly to disk.
python data_collector.py --action my_new_action --samples 50Train the model locally or on Google Colab. Metrics are automatically logged to MLflow.
python train/train.py --epochs 60 --batch_size 32To fetch the best model from a Colab MLflow run to your local machine:
python mlflow_utils.py --experiment pose-action-recognition --output_dir models- Python 3.13 & MediaPipe Compatibility: The codebase leverages the modern
mediapipe.tasks.python.visionAPI. The legacymediapipe.solutionsAPI is fundamentally broken on Python 3.13. This project entirely bypasses the legacy API, ensuring forward-compatibility with all modern Python 3 environments. - Performance: The LSTM model contains roughly 500k parameters. When combined with MediaPipe's
Litelandmarker, the entire pipeline operates comfortably at 30-60 FPS on standard consumer CPUs (e.g., Intel i5), making GPU inference strictly optional.
Anantha Krishnan
- GitHub: https://github.com/anantha037