A Gen AI-powered tool that analyzes CSV files and generates natural language reports about data quality issues using Large Language Models.
Build a Gen AI agent that can analyze a CSV file and generate a natural language report about the file's data quality issues — like missing values, duplicate rows, outliers, wrong datatypes, etc.
In real-world projects, bad data = bad analytics. Data engineers usually write manual scripts to check quality. With this project, an LLM automates the boring initial analysis.
User Uploads CSV
↓
Python (Streamlit Backend) Reads File
↓
Summarizes data stats (missing %, duplicates, data types) using Pandas
↓
LLM (DeepSeek R1 via OpenRouter) Generates a human-readable Data Quality Report
↓
Returns the Report to User
| Component | Technology |
|---|---|
| Backend | Python + Streamlit |
| Data Processing | Pandas |
| LLM | DeepSeek R1 (via OpenRouter API) |
| Frontend | Streamlit |
Data-Quality-Checker/
├── .devcontainer/ # DevContainer config for GitHub Codespaces
│ └── devcontainer.json
├── .env # Environment variables (gitignored)
├── .gitignore # Git ignore rules
├── streamlit_app.py # Main application (single-file)
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE # License file
- Python 3.11 or higher
- pip (Python package manager)
- OpenRouter API key (Get one free here)
-
Clone the repository
git clone <your-repo-url> cd Data-Quality-Checker
-
Install dependencies
pip install -r requirements.txt
-
Start the Streamlit server
streamlit run streamlit_app.py
-
Open your browser
- The app will automatically open at
http://localhost:8501 - Or manually navigate to the URL shown in the terminal
- The app will automatically open at
-
Use the application
- Enter your OpenRouter API key in the input field
- Upload a CSV file
- View automatic data quality metrics
- Click "🧠 Summarize with LLM" for AI-generated insights
streamlit run streamlit_app.py --server.enableCORS false --server.enableXsrfProtection false- Total rows and columns count
- Duplicate row detection
- Missing value percentage per column
- Data type analysis for each column
- Natural language summary of data quality issues
- Highlighting of key problems
- Suggestions for fixes and improvements
- Drag-and-drop CSV upload
- Real-time data preview
- Expandable detailed reports
- Clean, modern interface
The application uses manual API key input for security and flexibility:
- Enter your OpenRouter API key directly in the UI
- Key is not stored permanently (re-enter each session)
- Prevents accidental exposure of credentials
This project includes DevContainer configuration for:
- GitHub Codespaces - Cloud-based development
- VS Code Dev Containers - Local containerized development
Features:
- Auto-installs Python 3.11 and all dependencies
- Automatically starts Streamlit server on port 8501
- Includes Python and Pylance extensions
streamlit # Web framework
pandas # Data analysis
requests # HTTP requests to OpenRouter API
python-dotenv # Environment variable management
Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Submit pull requests
See LICENSE file for details.