Ready-to-use AI model that recognizes 15 different human activities from images and videos
- 🚀 Zero Training - Uses pre-trained ViT model
- 🖼️ Image & Video - Supports both formats
- 🌐 Web UI - Beautiful Gradio interface
- ⚡ CLI Tool - Command-line interface
- 🎯 15 Actions - calling • clapping • cycling • dancing • drinking • eating • fighting • hugging • laughing • listening_to_music • running • sitting • sleeping • texting • using_laptop
# 1. Clone and install
!git clone https://github.com/azizemad-coder/human-action-recognition.git
%cd human-action-recognition
!pip install -r requirements.txt
# 2. Test with dataset sample
!python test_har.py
# 3. Launch web UI
from src.har_infer.app import build_interface
demo = build_interface()
demo.launch(share=True)  # Creates public link

git clone https://github.com/azizemad-coder/human-action-recognition.git
cd human-action-recognition
pip install -r requirements.txt
python app.py  # Opens http://127.0.0.1:7860

python app.py
Upload images or videos and get instant predictions!
# Single image
python -m har_infer.cli --image photo.jpg --top-k 5
# Video analysis
python -m har_infer.cli --video dance.mp4 --sample-fps 2

from har_infer import load_model_and_processor, predict_image
from PIL import Image
model, processor, device = load_model_and_processor()
image = Image.open("action.jpg")
predictions = predict_image(image, model, processor, device=device)
for pred in predictions:
    print(f"{pred.label}: {pred.score:.3f}")

Built on the Bingsu/Human_Action_Recognition dataset, with 18K labeled images across 15 action classes.
- Model: ViT-Base/16 fine-tuned for HAR
- Speed: ~50ms per image (GPU), ~200ms (CPU)
- Memory: ~500MB model download
- Accuracy: High precision on real-world images
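The speed figures above depend on hardware, so it can help to measure latency on your own machine. The sketch below uses only the standard library; `measure_latency_ms` is a hypothetical helper for illustration, not part of har_infer.

```python
import time
from statistics import median
from typing import Any, Callable


def measure_latency_ms(fn: Callable[[], Any], warmup: int = 2, runs: int = 10) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed; early calls often include lazy setup
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


# Usage with the Python API shown earlier (assumes model, processor, device,
# and image are already loaded):
# ms = measure_latency_ms(lambda: predict_image(image, model, processor, device=device))
# print(f"median latency: {ms:.1f} ms")
```

The median is used rather than the mean so that a single slow run, for example one interrupted by another process, does not skew the result.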
export HAR_MODEL_ID="your-custom-model-id"
python app.py

pytest -q  # Run all tests

❌ Module not found error in Colab
import sys
sys.path.insert(0, '/content/human-action-recognition/src')

❌ CUDA warnings (can be ignored)
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress TensorFlow logs

❌ Out of memory on GPU
export CUDA_VISIBLE_DEVICES="" # Force CPU usage
python app.py

❌ Model download fails
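# Clear cache and retry
import os
import shutil
# shutil.rmtree does not expand "~", so resolve the home directory first
shutil.rmtree(os.path.expanduser("~/.cache/huggingface"), ignore_errors=True)

❌ Video processing error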
# Install additional codecs
pip install opencv-python-headless

❌ Permission denied (Windows)
# Run as administrator or use:
python -m pip install --user -r requirements.txt

- 🚀 Faster inference: Use GPU if available (auto-detected)
- 💾 Reduce memory: Set torch.set_num_threads(1) for CPU
- 📱 Batch processing: Process multiple images together
- 🎥 Video optimization: Reduce sample_fps for faster processing
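To see why a lower sample_fps speeds up video processing, the sketch below shows one common way to pick which frames to classify. It is an illustration only: `sample_frame_indices` is a hypothetical helper, and the sampling logic har_infer actually uses is not shown in this README.

```python
from typing import List


def sample_frame_indices(total_frames: int, native_fps: float, sample_fps: float) -> List[int]:
    """Pick frame indices so that roughly sample_fps frames are kept per second."""
    if total_frames <= 0 or native_fps <= 0 or sample_fps <= 0:
        return []
    # Never sample more often than the video's own frame rate.
    step = max(native_fps / sample_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


# A 10-second clip at 30 fps has 300 frames.
print(len(sample_frame_indices(300, 30.0, 2.0)))  # frames classified at sample_fps=2
print(len(sample_frame_indices(300, 30.0, 1.0)))  # frames classified at sample_fps=1
```

Under this scheme, halving sample_fps halves the number of frames passed to the model, so inference time for a video drops roughly in proportion.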
🤔 Why "Texting" instead of "Sitting"?
The model may predict "texting" for people who are sitting because:
- Both actions share visual cues (seated posture, looking down)
- Context matters: if the person is holding a device, "texting" is the more likely prediction
- The training data may contain more "texting while sitting" examples
- This is expected behavior, not an error
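When two classes overlap like this, it can be useful to look past the top label and check how close the runner-up is. The sketch below is a minimal example, assuming predictions are available as (label, score) pairs. `is_ambiguous` and the 0.15 margin are hypothetical choices for illustration, and the scores in the example are made up.

```python
from typing import List, Tuple


def is_ambiguous(predictions: List[Tuple[str, float]], margin: float = 0.15) -> bool:
    """Return True when the two highest scores are within `margin` of each other."""
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0][1] - ranked[1][1]) < margin


# Illustrative scores only, not real model output.
example = [("texting", 0.46), ("sitting", 0.41), ("using_laptop", 0.08)]
if is_ambiguous(example):
    print("Top two classes are close; consider showing both labels.")

# With the Python API shown earlier, pairs could be built as:
# pairs = [(p.label, p.score) for p in predictions]
```

A close margin suggests the image supports both labels, which is a reason to request more results with --top-k rather than to treat the prediction as wrong.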
💡 Tips for better accuracy:
- Use clear, well-lit images
- Ensure the person is the main subject
- Try a different angle if the first prediction seems off
📦 human-action-recognition
├── 🚀 app.py # Launch Gradio UI
├── 🧪 test_har.py # Comprehensive test script
├── 📋 requirements.txt # Dependencies
├── 📁 src/har_infer/ # Core package
│ ├── 🖥️ app.py # Gradio interface
│ ├── ⚡ cli.py # Command-line tool
│ ├── 🤖 model.py # Model loading
│ └── 🔮 predict.py # Inference logic
└── 🧪 tests/ # Unit tests
Dataset Reference: Bingsu/Human_Action_Recognition