A powerful web-based data analysis platform that uses Large Language Models and ReAct (Reasoning and Acting) agents to source, prepare, analyze, and visualize data. This application provides both a modern web interface and a robust API for complex data analysis tasks, including web scraping, statistical analysis, and data visualization.
- Natural Language Understanding: Ask questions in plain English
- Dynamic Code Generation: LLM writes Python code for your specific query
- Safe Execution: Code runs in restricted environment with only safe modules
- Intelligent Fallback: Automatically falls back to traditional methods if LLM fails
- Modern Web Interface: Beautiful, responsive UI built with Bootstrap and JavaScript
- Real-time Results: See analysis results with visualizations in real-time
- Multiple Data Sources: Support for Wikipedia, S3 parquet files, and CSV data
- FastAPI: Modern, fast web framework with automatic API documentation
- Web Scraping: Automatic data extraction from Wikipedia and other sources
- Statistical Analysis: Correlation analysis, regression, and more
- Data Visualization: Automatic plot generation with matplotlib
- Large Dataset Support: Efficient processing with DuckDB
Visit our live demo to try the application immediately!
- Clone the repository

  ```bash
  git clone https://github.com/dewanggandhi01/TDS-Project-2.git
  cd TDS-Project-2
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up OpenAI API Key

  ```bash
  cp env_example.txt .env
  # Edit .env and add your OpenAI API key
  ```

- Start the server

  ```bash
  python run_server.py
  ```

- Open your browser
  - Web Interface: http://127.0.0.1:8000/
  - API Documentation: http://127.0.0.1:8000/docs
- Web Interface:
  - Enter a data source URL (e.g., a Wikipedia page)
  - Add analysis tasks using natural language
  - Choose LLM-powered or Traditional analysis
  - View results with interactive visualizations
```bash
# LLM-Powered Analysis
curl -X POST "https://your-app.vercel.app/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'

# Traditional Analysis
curl -X POST "https://your-app.vercel.app/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "Show correlation between rank and worldwide gross"
  }'
```

- One-click deploy: Click the button above
- Set environment variables:
  - `OPENAI_API_KEY`: Your OpenAI API key
  - `ENVIRONMENT`: Set to `production`
- Deploy: Vercel will automatically build and deploy your app
```bash
# Install Vercel CLI
npm install -g vercel

# Deploy
vercel

# Set environment variables
vercel env add OPENAI_API_KEY
vercel env add ENVIRONMENT production
```

Railway:

```bash
railway login
railway init
railway up
# Set environment variables in the Railway dashboard
```

Heroku:

```bash
heroku create your-app-name
heroku config:set OPENAI_API_KEY=your_key
heroku config:set ENVIRONMENT=production
git push heroku main
```

Environment variables:

- `OPENAI_API_KEY`: Your OpenAI API key (required for LLM features)
- `OPENAI_MODEL`: Model to use (default: gpt-4)
- `ENVIRONMENT`: Set to `production` for deployment
- `PORT`: Server port (default: 8000)
- `HOST`: Server host (default: 0.0.0.0)
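For local development, the variables above typically live in a `.env` file (copied from `env_example.txt`); a minimal sketch with placeholder values:

```ini
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4
ENVIRONMENT=production
PORT=8000
HOST=0.0.0.0
```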
```
TDS-Project-2/
├── app/                  # Core application logic
│   ├── main.py           # FastAPI application with LLM endpoints
│   ├── agent.py          # Main agent orchestrator
│   ├── llm_agent.py      # LLM-based task execution
│   ├── data_loader.py    # Data loading utilities
│   ├── analyzer.py       # Data analysis functions
│   └── visualizer.py     # Plot generation
├── templates/            # HTML templates
├── static/               # Static files (CSS, JS, images)
├── requirements.txt      # Python dependencies
├── vercel.json           # Vercel deployment config
├── run_server.py         # Local development server
└── README.md             # This file
```
- Restricted Modules: Only pandas, numpy, matplotlib allowed
- No File Access: Cannot read/write files
- No Network Access: Cannot make external requests
- Memory Limits: Prevents memory exhaustion
- Timeout Protection: Prevents infinite loops
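A minimal sketch of how such a restricted namespace can be built around `exec` (the names and whitelist below are illustrative, not the project's actual implementation; real deployments would add timeout and memory limits on top):

```python
# Illustrative sandbox: untrusted code sees only whitelisted modules and a
# stripped-down builtins dict, so open() and import are unavailable inside it.
import pandas as pd
import numpy as np

ALLOWED_GLOBALS = {
    "pd": pd,
    "np": np,
    # No open, no __import__: file and module access are cut off
    "__builtins__": {"len": len, "range": range, "min": min,
                     "max": max, "sum": sum},
}

def run_untrusted(code: str, df: pd.DataFrame):
    env = dict(ALLOWED_GLOBALS)
    env["df"] = df
    exec(code, env)           # untrusted code runs in the restricted namespace
    return env.get("result")  # convention: code stores its answer in `result`

df = pd.DataFrame({"gross_bn": [2.9, 2.2, 1.5]})
count = run_untrusted("result = (df['gross_bn'] > 2).sum()", df)
print(count)  # 2
```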
- `GET /health`
- `POST /api/llm/analyze`

  ```json
  {
    "url": "https://example.com/data",
    "task": "Analyze this data and find trends"
  }
  ```

- `POST /api/traditional/analyze`

  ```json
  {
    "url": "https://example.com/data",
    "task": "Generate basic statistics"
  }
  ```

For complete API documentation, visit `/docs` on your deployed application.
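The same two endpoints can be called from Python. The helper below is a hypothetical client-side sketch that only assembles the request (it does not contact a live server; the base URL is a placeholder):

```python
import json

API_BASE = "https://your-app.vercel.app"  # replace with your deployment URL

def build_analyze_request(mode: str, url: str, task: str):
    """Assemble endpoint, headers, and JSON body for either analysis mode."""
    assert mode in ("llm", "traditional")
    endpoint = f"{API_BASE}/api/{mode}/analyze"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"url": url, "task": task})
    return endpoint, headers, body

endpoint, headers, body = build_analyze_request(
    "llm",
    "https://example.com/data",
    "Analyze this data and find trends",
)
print(endpoint)  # https://your-app.vercel.app/api/llm/analyze
```

From here, any HTTP client (e.g. `requests.post(endpoint, headers=headers, data=body)`) completes the call.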
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Clone the repository

  ```bash
  git clone https://github.com/dewanggandhi01/TDS-Project-2.git
  cd TDS-Project-2
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up OpenAI API Key

  ```bash
  # Option 1: Environment variable
  export OPENAI_API_KEY="your_openai_api_key_here"

  # Option 2: Create .env file (copy from env_example.txt)
  cp env_example.txt .env
  # Edit .env and add your API key
  ```

- Start the server

  ```bash
  python run_server.py
  ```

- Access the web interface
  - Web Interface: http://127.0.0.1:8000/
  - API Documentation: http://127.0.0.1:8000/docs
  - Health check: http://127.0.0.1:8000/health
  - Configuration: http://127.0.0.1:8000/api/config

- Test the API

  ```bash
  python test_api.py
  ```
- Open the web interface at http://127.0.0.1:8000/
- Enter a data source URL (optional), e.g., a Wikipedia page URL
- Add analysis tasks using natural language
- Choose analysis mode: LLM-powered or Traditional
- Click "Analyze Data" to process your tasks
- View results with visualizations and statistics
```bash
curl -X POST "http://127.0.0.1:8000/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'
```

```bash
curl -X POST "http://127.0.0.1:8000/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    "task": "How many movies grossed over $2 billion before 2020?"
  }'
```

```bash
curl "http://127.0.0.1:8000/api/" \
  -F "file=@data/question.txt" \
  -F "use_llm=true"
```

- User Input: Natural language query + dataset URL
- Data Fetching: Scrape data from the URL using BeautifulSoup
- Data Analysis: LLM generates Python code for the query
- Safe Execution: Code runs in a restricted environment
- Result Processing: Extract results and visualizations
- Response: Return structured results to the user
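The fetch-and-analyze steps above can be sketched end to end. The inline HTML table stands in for a fetched Wikipedia page, and the parser uses only the standard library (the project itself uses BeautifulSoup); helper names and numbers are illustrative:

```python
from html.parser import HTMLParser

HTML = """
<table>
  <tr><th>Rank</th><th>Film</th><th>Gross_bn</th></tr>
  <tr><td>1</td><td>Film A</td><td>2.9</td></tr>
  <tr><td>2</td><td>Film B</td><td>2.2</td></tr>
  <tr><td>3</td><td>Film C</td><td>1.5</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect each <tr> as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = TableParser()
parser.feed(HTML)                            # step 2: fetch/parse the tables
header, body = parser.rows[0], parser.rows[1:]
over_2bn = sum(1 for row in body if float(row[2]) > 2)  # steps 3-5: analyze
print(header, over_2bn)  # ['Rank', 'Film', 'Gross_bn'] 2
```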
The system uses OpenAI GPT-4 for intelligent task execution:
```python
# Example LLM system prompt
system_prompt = """
You are a data analysis expert. Given a dataset and a natural language query,
generate executable Python code to solve the query.

IMPORTANT RULES:
1. Use ONLY these modules: pandas (as pd), numpy (as np), matplotlib.pyplot (as plt)
2. The dataset is available as a pandas DataFrame called 'df'
3. For visualizations, save plots to a BytesIO buffer and return as base64 string
4. Return ONLY the Python code, no explanations
5. Handle errors gracefully
"""
```

Generated code runs in a restricted environment:
- Only safe modules allowed (pandas, numpy, matplotlib)
- No file system access
- No network access
- Memory limits enforced
- Timeout protection
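Rule 3 of the prompt above (plots to a BytesIO buffer, returned as base64) can be sketched as follows; function and variable names are illustrative:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display needed on a server
import matplotlib.pyplot as plt

def plot_to_base64(x, y) -> str:
    """Render a scatter plot into an in-memory PNG and base64-encode it."""
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure; long-running servers leak otherwise
    return base64.b64encode(buf.getvalue()).decode("ascii")

img = plot_to_base64([1, 2, 3], [3, 2, 1])
print(img[:5])  # base64 of the PNG magic bytes: "iVBOR"
```

The resulting string can be embedded directly in a JSON response or an `<img src="data:image/png;base64,...">` tag.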
- Automatically scrape tables from Wikipedia pages
- Clean and process scraped data
- Handle various table formats
- Query large datasets using DuckDB
- Support for partitioned data
- Efficient data processing
- Local CSV file processing
- Basic data analysis operations
```python
# Complex statistical analysis
task = "Calculate the correlation between movie budget and worldwide gross, and create a scatter plot"

# Time series analysis
task = "Show the trend of movie budgets over the last 20 years with a line chart"

# Advanced filtering
task = "Find all movies that grossed over $1 billion and were released in summer months"

# Custom aggregations
task = "Group movies by decade and calculate average gross for each decade"
```

```python
# Pre-defined questions for Wikipedia film data
questions = [
    "How many $2 bn movies were released before 2020?",
    "Which is the earliest film that grossed over $1.5 bn?",
    "What's the correlation between the Rank and Peak?",
    "Draw a scatterplot of Rank and Peak with regression line.",
]
```

- `OPENAI_API_KEY`: Your OpenAI API key (required for LLM features)
- `OPENAI_MODEL`: Model to use (default: gpt-4)
- `OPENAI_MAX_TOKENS`: Maximum tokens for code generation (default: 2000)
- `OPENAI_TEMPERATURE`: Creativity level (default: 0.1)
- `PORT`: Server port (default: 8000)
- `HOST`: Server host (default: 127.0.0.1)
- `LOG_LEVEL`: Logging level (default: info)
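The Rank/Peak correlation question from the predefined list above reduces to one line of pandas; a toy sketch with invented values:

```python
import pandas as pd

# Toy stand-in for the scraped film table (values invented for illustration)
df = pd.DataFrame({"Rank": [1, 2, 3, 4], "Peak": [1, 3, 2, 4]})

# Pearson correlation between the two columns
corr = df["Rank"].corr(df["Peak"])
print(round(corr, 2))  # 0.8
```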
- Timeout: 3 minutes for analysis tasks
- File size limit: 10MB for uploaded files
- Supported formats: .txt files
- LLM fallback: Automatic fallback to traditional methods
- LLM Response Time: 10-30 seconds for complex queries
- Traditional Response Time: 2-5 seconds for predefined tasks
- Memory Usage: Optimized for large datasets
- Scalability: Supports concurrent requests
- Error Handling: Comprehensive error handling and logging
```bash
python test_api.py
```

```bash
# Test health
curl http://127.0.0.1:8000/health

# Test LLM endpoint
curl -X POST "http://127.0.0.1:8000/api/llm/analyze" \
  -H "Content-Type: application/json" \
  -d '{"task": "Count rows in dataset"}'

# Test traditional endpoint
curl -X POST "http://127.0.0.1:8000/api/traditional/analyze" \
  -H "Content-Type: application/json" \
  -d '{"task": "Count rows in dataset"}'
```

```bash
python run_server.py
```

- Install Railway CLI: `npm install -g @railway/cli`
- Login: `railway login`
- Initialize: `railway init`
- Set environment variables: `railway variables set OPENAI_API_KEY=your_key`
- Deploy: `railway up`
- Install Heroku CLI
- Login: `heroku login`
- Create app: `heroku create your-app-name`
- Set environment variables: `heroku config:set OPENAI_API_KEY=your_key`
- Deploy: `git push heroku main`
- Go to render.com
- Connect your GitHub repository
- Create a new Web Service
- Set build command: `pip install -r requirements.txt`
- Set start command: `python run_server.py`
- Add environment variable: `OPENAI_API_KEY`
```
TDS-Project-2/
├── app/                  # Core application logic
│   ├── __init__.py
│   ├── main.py           # FastAPI application with LLM endpoints
│   ├── agent.py          # Main agent orchestrator with LLM integration
│   ├── llm_agent.py      # LLM-based task execution
│   ├── react_agent.py    # ReAct agent implementation
│   ├── data_loader.py    # Data loading utilities
│   ├── analyzer.py       # Data analysis functions
│   ├── visualizer.py     # Plot generation
│   └── utils.py          # Utility functions
├── templates/            # HTML templates
├── static/               # Static files (CSS, JS, images)
├── data/                 # Data files
├── tests/                # Test files
├── requirements.txt      # Python dependencies
├── run_server.py         # Server startup script
├── test_api.py           # API testing script
├── env_example.txt       # Environment variables example
├── Procfile              # Heroku deployment
├── runtime.txt           # Python version specification
└── README.md             # This file
```
- Restricted Modules: Only pandas, numpy, matplotlib allowed
- No File Access: Cannot read/write files
- No Network Access: Cannot make external requests
- Memory Limits: Prevents memory exhaustion
- Timeout Protection: Prevents infinite loops
- Error Handling: Graceful failure handling
- Input Validation: All inputs validated and sanitized
- Rate Limiting: Built-in rate limiting for API calls
- Error Sanitization: No sensitive information in error messages
- CORS Configuration: Proper CORS setup for web interface
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing GPT-4 API
- FastAPI for the excellent web framework
- Pandas for data manipulation
- Matplotlib for visualization
- BeautifulSoup for web scraping
- DuckDB for efficient data querying
For questions and support:
- Create an issue on GitHub
- Check the API documentation at `/docs`
- Review the test files for usage examples
- Check the health endpoint for system status
Note: This API now supports both LLM-powered and traditional analysis. The LLM features require an OpenAI API key and may incur costs based on usage. Traditional analysis is always available as a fallback option.