A powerful web-based data analysis platform that uses Large Language Models and ReAct (Reasoning and Acting) agents to source, prepare, analyze, and visualize data. This application provides both a modern web interface and a robust API for complex data analysis tasks, including web scraping, statistical analysis, and data visualization.
- Natural Language Understanding: Ask questions in plain English
- Dynamic Code Generation: LLM writes Python code for your specific query
- Safe Execution: Code runs in restricted environment with only safe modules
- Intelligent Fallback: Automatically falls back to traditional methods if LLM fails
- Modern Web Interface: Beautiful, responsive UI built with Bootstrap and JavaScript
- Real-time Results: See analysis results with visualizations in real-time
- Multiple Data Sources: Support for Wikipedia, S3 parquet files, and CSV data
- FastAPI: Modern, fast web framework with automatic API documentation
- Web Scraping: Automatic data extraction from Wikipedia and other sources
- Statistical Analysis: Correlation analysis, regression, and more
- Data Visualization: Automatic plot generation with matplotlib
- Large Dataset Support: Efficient processing with DuckDB
Visit our live demo to try the application immediately!
- Clone the repository

  ```bash
  git clone https://github.com/dewanggandhi01/TDS-Project-2.git
  cd TDS-Project-2
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up OpenAI API Key

  ```bash
  cp env_example.txt .env
  # Edit .env and add your OpenAI API key
  ```

- Start the server

  ```bash
  python run_server.py
  ```

- Open your browser
  - Web Interface: http://127.0.0.1:8000/
  - API Documentation: http://127.0.0.1:8000/docs
- Web Interface:
  - Enter a data source URL (e.g., a Wikipedia page)
  - Add analysis tasks using natural language
  - Choose LLM-powered or Traditional analysis
  - View results with interactive visualizations
```bash
# LLM-Powered Analysis
curl -X POST "https://your-app.vercel.app/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'

# Traditional Analysis
curl -X POST "https://your-app.vercel.app/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "Show correlation between rank and worldwide gross"
  }'
```

- One-click deploy: Click the button above
- Set environment variables:
  - `OPENAI_API_KEY`: Your OpenAI API key
  - `ENVIRONMENT`: Set to `production`
- Deploy: Vercel will automatically build and deploy your app
```bash
# Install Vercel CLI
npm install -g vercel

# Deploy
vercel

# Set environment variables
vercel env add OPENAI_API_KEY
vercel env add ENVIRONMENT production
```

Railway:

```bash
railway login
railway init
railway up
# Set environment variables in the Railway dashboard
```

Heroku:

```bash
heroku create your-app-name
heroku config:set OPENAI_API_KEY=your_key
heroku config:set ENVIRONMENT=production
git push heroku main
```

Environment variables:

- `OPENAI_API_KEY`: Your OpenAI API key (required for LLM features)
- `OPENAI_MODEL`: Model to use (default: gpt-4)
- `ENVIRONMENT`: Set to `production` for deployment
- `PORT`: Server port (default: 8000)
- `HOST`: Server host (default: 0.0.0.0)
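For local development, the variables above typically live in a `.env` file (copied from `env_example.txt`); a minimal sketch with placeholder values:

```ini
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4
ENVIRONMENT=production
PORT=8000
HOST=0.0.0.0
```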
```
TDS-Project-2/
├── app/                  # Core application logic
│   ├── main.py           # FastAPI application with LLM endpoints
│   ├── agent.py          # Main agent orchestrator
│   ├── llm_agent.py      # LLM-based task execution
│   ├── data_loader.py    # Data loading utilities
│   ├── analyzer.py       # Data analysis functions
│   └── visualizer.py     # Plot generation
├── templates/            # HTML templates
├── static/               # Static files (CSS, JS, images)
├── requirements.txt      # Python dependencies
├── vercel.json           # Vercel deployment config
├── run_server.py         # Local development server
└── README.md             # This file
```
- Restricted Modules: Only pandas, numpy, matplotlib allowed
- No File Access: Cannot read/write files
- No Network Access: Cannot make external requests
- Memory Limits: Prevents memory exhaustion
- Timeout Protection: Prevents infinite loops
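A minimal sketch of how such a restricted namespace can be built around `exec` (the names and whitelist below are illustrative, not the project's actual implementation; real deployments would add timeout and memory limits on top):

```python
# Illustrative sandbox: untrusted code sees only whitelisted modules and a
# stripped-down builtins dict, so open() and import are unavailable inside it.
import pandas as pd
import numpy as np

ALLOWED_GLOBALS = {
    "pd": pd,
    "np": np,
    # No open, no __import__: file and module access are cut off
    "__builtins__": {"len": len, "range": range, "min": min,
                     "max": max, "sum": sum},
}

def run_untrusted(code: str, df: pd.DataFrame):
    env = dict(ALLOWED_GLOBALS)
    env["df"] = df
    exec(code, env)           # untrusted code runs in the restricted namespace
    return env.get("result")  # convention: code stores its answer in `result`

df = pd.DataFrame({"gross_bn": [2.9, 2.2, 1.5]})
count = run_untrusted("result = (df['gross_bn'] > 2).sum()", df)
print(count)  # 2
```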
- `GET /health`
- `POST /api/llm/analyze`

  ```json
  {
    "url": "https://example.com/data",
    "task": "Analyze this data and find trends"
  }
  ```

- `POST /api/traditional/analyze`

  ```json
  {
    "url": "https://example.com/data",
    "task": "Generate basic statistics"
  }
  ```

For complete API documentation, visit `/docs` on your deployed application.
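The same two endpoints can be called from Python. The helper below is a hypothetical client-side sketch that only assembles the request (it does not contact a live server; the base URL is a placeholder):

```python
import json

API_BASE = "https://your-app.vercel.app"  # replace with your deployment URL

def build_analyze_request(mode: str, url: str, task: str):
    """Assemble endpoint, headers, and JSON body for either analysis mode."""
    assert mode in ("llm", "traditional")
    endpoint = f"{API_BASE}/api/{mode}/analyze"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"url": url, "task": task})
    return endpoint, headers, body

endpoint, headers, body = build_analyze_request(
    "llm",
    "https://example.com/data",
    "Analyze this data and find trends",
)
print(endpoint)  # https://your-app.vercel.app/api/llm/analyze
```

From here, any HTTP client (e.g. `requests.post(endpoint, headers=headers, data=body)`) completes the call.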
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Clone the repository

  ```bash
  git clone https://github.com/dewanggandhi01/TDS-Project-2.git
  cd TDS-Project-2
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up OpenAI API Key

  ```bash
  # Option 1: Environment variable
  export OPENAI_API_KEY="your_openai_api_key_here"

  # Option 2: Create .env file (copy from env_example.txt)
  cp env_example.txt .env
  # Edit .env and add your API key
  ```

- Start the server

  ```bash
  python run_server.py
  ```

- Access the web interface
  - Web Interface: http://127.0.0.1:8000/
  - API Documentation: http://127.0.0.1:8000/docs
  - Health check: http://127.0.0.1:8000/health
  - Configuration: http://127.0.0.1:8000/api/config

- Test the API

  ```bash
  python test_api.py
  ```
- Open the web interface at http://127.0.0.1:8000/
- Enter a data source URL (optional), e.g., a Wikipedia page URL
- Add analysis tasks using natural language
- Choose analysis mode: LLM-powered or Traditional
- Click "Analyze Data" to process your tasks
- View results with visualizations and statistics
```bash
curl -X POST "http://127.0.0.1:8000/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'
```

```bash
curl -X POST "http://127.0.0.1:8000/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'
```

```bash
curl "http://127.0.0.1:8000/api/" \
  -F "file=@data/question.txt" \
  -F "use_llm=true"
```

- User Input: Natural language query + dataset URL
- Data Fetching: Scrape data from the URL using BeautifulSoup
- Data Analysis: LLM generates Python code for the query
- Safe Execution: Code runs in a restricted environment
- Result Processing: Extract results and visualizations
- Response: Return structured results to the user
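The fetch-and-analyze steps above can be sketched end to end. The inline HTML table stands in for a fetched Wikipedia page, and the parser uses only the standard library (the project itself uses BeautifulSoup); helper names and numbers are illustrative:

```python
from html.parser import HTMLParser

HTML = """
<table>
  <tr><th>Rank</th><th>Film</th><th>Gross_bn</th></tr>
  <tr><td>1</td><td>Film A</td><td>2.9</td></tr>
  <tr><td>2</td><td>Film B</td><td>2.2</td></tr>
  <tr><td>3</td><td>Film C</td><td>1.5</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect each <tr> as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = TableParser()
parser.feed(HTML)                            # step 2: fetch/parse the tables
header, body = parser.rows[0], parser.rows[1:]
over_2bn = sum(1 for row in body if float(row[2]) > 2)  # steps 3-5: analyze
print(header, over_2bn)  # ['Rank', 'Film', 'Gross_bn'] 2
```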
The system uses OpenAI GPT-4 for intelligent task execution:
```python
# Example LLM system prompt
system_prompt = """
You are a data analysis expert. Given a dataset and a natural language query,
generate executable Python code to solve the query.

IMPORTANT RULES:
1. Use ONLY these modules: pandas (as pd), numpy (as np), matplotlib.pyplot (as plt)
2. The dataset is available as a pandas DataFrame called 'df'
3. For visualizations, save plots to a BytesIO buffer and return as base64 string
4. Return ONLY the Python code, no explanations
5. Handle errors gracefully
"""
```

Generated code runs in a restricted environment:
- Only safe modules allowed (pandas, numpy, matplotlib)
- No file system access
- No network access
- Memory limits enforced
- Timeout protection
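Rule 3 of the prompt above (plots to a BytesIO buffer, returned as base64) can be sketched as follows; function and variable names are illustrative:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display needed on a server
import matplotlib.pyplot as plt

def plot_to_base64(x, y) -> str:
    """Render a scatter plot into an in-memory PNG and base64-encode it."""
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure; long-running servers leak otherwise
    return base64.b64encode(buf.getvalue()).decode("ascii")

img = plot_to_base64([1, 2, 3], [3, 2, 1])
print(img[:5])  # base64 of the PNG magic bytes: "iVBOR"
```

The resulting string can be embedded directly in a JSON response or an `<img src="data:image/png;base64,...">` tag.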
- Automatically scrape tables from Wikipedia pages
- Clean and process scraped data
- Handle various table formats
- Query large datasets using DuckDB
- Support for partitioned data
- Efficient data processing
- Local CSV file processing
- Basic data analysis operations
```python
# Complex statistical analysis
task = "Calculate the correlation between movie budget and worldwide gross, and create a scatter plot"

# Time series analysis
task = "Show the trend of movie budgets over the last 20 years with a line chart"

# Advanced filtering
task = "Find all movies that grossed over $1 billion and were released in summer months"

# Custom aggregations
task = "Group movies by decade and calculate average gross for each decade"
```

```python
# Pre-defined questions for Wikipedia film data
questions = [
    "How many $2 bn movies were released before 2020?",
    "Which is the earliest film that grossed over $1.5 bn?",
    "What's the correlation between the Rank and Peak?",
    "Draw a scatterplot of Rank and Peak with regression line.",
]
```

- `OPENAI_API_KEY`: Your OpenAI API key (required for LLM features)
- `OPENAI_MODEL`: Model to use (default: gpt-4)
- `OPENAI_MAX_TOKENS`: Maximum tokens for code generation (default: 2000)
- `OPENAI_TEMPERATURE`: Creativity level (default: 0.1)
- `PORT`: Server port (default: 8000)
- `HOST`: Server host (default: 127.0.0.1)
- `LOG_LEVEL`: Logging level (default: info)
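The Rank/Peak correlation question from the predefined list above reduces to one line of pandas; a toy sketch with invented values:

```python
import pandas as pd

# Toy stand-in for the scraped film table (values invented for illustration)
df = pd.DataFrame({"Rank": [1, 2, 3, 4], "Peak": [1, 3, 2, 4]})

# Pearson correlation between the two columns
corr = df["Rank"].corr(df["Peak"])
print(round(corr, 2))  # 0.8
```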
- Timeout: 3 minutes for analysis tasks
- File size limit: 10MB for uploaded files
- Supported formats: .txt files
- LLM fallback: Automatic fallback to traditional methods
- LLM Response Time: 10-30 seconds for complex queries
- Traditional Response Time: 2-5 seconds for predefined tasks
- Memory Usage: Optimized for large datasets
- Scalability: Supports concurrent requests
- Error Handling: Comprehensive error handling and logging
```bash
python test_api.py
```

```bash
# Test health
curl http://127.0.0.1:8000/health

# Test LLM endpoint
curl -X POST "http://127.0.0.1:8000/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{"task": "Count rows in dataset"}'

# Test traditional endpoint
curl -X POST "http://127.0.0.1:8000/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{"task": "Count rows in dataset"}'
```

```bash
python run_server.py
```

- Install Railway CLI: `npm install -g @railway/cli`
- Login: `railway login`
- Initialize: `railway init`
- Set environment variables: `railway variables set OPENAI_API_KEY=your_key`
- Deploy: `railway up`
- Install Heroku CLI
- Login: `heroku login`
- Create app: `heroku create your-app-name`
- Set environment variables: `heroku config:set OPENAI_API_KEY=your_key`
- Deploy: `git push heroku main`
- Go to render.com
- Connect your GitHub repository
- Create a new Web Service
- Set build command: `pip install -r requirements.txt`
- Set start command: `python run_server.py`
- Add environment variable: `OPENAI_API_KEY`
```
TDS-Project-2/
├── app/                  # Core application logic
│   ├── __init__.py
│   ├── main.py           # FastAPI application with LLM endpoints
│   ├── agent.py          # Main agent orchestrator with LLM integration
│   ├── llm_agent.py      # LLM-based task execution
│   ├── react_agent.py    # ReAct agent implementation
│   ├── data_loader.py    # Data loading utilities
│   ├── analyzer.py       # Data analysis functions
│   ├── visualizer.py     # Plot generation
│   └── utils.py          # Utility functions
├── templates/            # HTML templates
├── static/               # Static files (CSS, JS, images)
├── data/                 # Data files
├── tests/                # Test files
├── requirements.txt      # Python dependencies
├── run_server.py         # Server startup script
├── test_api.py           # API testing script
├── env_example.txt       # Environment variables example
├── Procfile              # Heroku deployment
├── runtime.txt           # Python version specification
└── README.md             # This file
```
- Restricted Modules: Only pandas, numpy, matplotlib allowed
- No File Access: Cannot read/write files
- No Network Access: Cannot make external requests
- Memory Limits: Prevents memory exhaustion
- Timeout Protection: Prevents infinite loops
- Error Handling: Graceful failure handling
- Input Validation: All inputs validated and sanitized
- Rate Limiting: Built-in rate limiting for API calls
- Error Sanitization: No sensitive information in error messages
- CORS Configuration: Proper CORS setup for web interface
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing GPT-4 API
- FastAPI for the excellent web framework
- Pandas for data manipulation
- Matplotlib for visualization
- BeautifulSoup for web scraping
- DuckDB for efficient data querying
For questions and support:
- Create an issue on GitHub
- Check the API documentation at `/docs`
- Review the test files for usage examples
- Check the health endpoint for system status
Note: This API now supports both LLM-powered and traditional analysis. The LLM features require an OpenAI API key and may incur costs based on usage. Traditional analysis is always available as a fallback option.