Course Extractor

A powerful web application that automatically extracts course information from educational websites. Built with Flask (Python) and modern web technologies, this tool helps educators, students, and researchers gather structured course data efficiently.

🌟 Features

Core Functionality

AI-Powered Extraction: Advanced algorithms automatically detect and extract course information
Multi-URL Support: Process multiple educational websites simultaneously
Comprehensive Data Extraction: Extract all available course details including:
- Course Name
- Institute Name
- Location
- Format (Online, On-campus, Hybrid, etc.)
- Faculty/Instructors
- Language of Instruction
- Dates (Start/End)
- Duration
- Suitable For (target audience, prerequisites)
- Fees/Cost
- Availability/Enrollment Status

User Experience

Modern Web Interface: Responsive design that works on all devices
Real-time Processing: Live feedback during extraction process
Smart Validation: URL validation and error handling
Processing History: Track and reprocess previous extractions
Export Options: Download results as CSV or Excel files

Technical Features

Privacy-First: Only extracts publicly available information
Error Handling: Graceful handling of invalid URLs and unsupported sites
Performance Optimized: Efficient scraping with rate limiting
Cross-Platform: Works on Windows, macOS, and Linux

🚀 Quick Start

Prerequisites

Python 3.8 or higher
Internet connection

Local Development

Clone the repository

git clone <repository-url>
cd course-extractor

Install Python dependencies
```
pip install -r requirements.txt
```
Run the application
```
python app.py
```
Open your browser and navigate to http://localhost:5000

🚀 Deploy to Production (Make Public)

Want to make your Course Extractor publicly accessible? Follow our deployment guide:

Quick Deploy Options:

Heroku (Recommended) - Paid plans only (the free tier was discontinued in November 2022)
Railway - Simple deployment
Render - Easy setup

Step-by-step guide: See DEPLOYMENT.md for detailed instructions.

Automated deployment: Use our deployment scripts:

Windows: deploy.bat
PowerShell: deploy.ps1

📖 Usage Guide

Basic Usage

Enter Website URL
- Navigate to the home page
- Enter the URL of an educational institution's website
- Click "Extract Courses"
View Results
- Results are displayed in a structured table
- Summary statistics show key metrics
- Each course row contains all extracted information
Export Data
- Use CSV export for spreadsheet applications
- Use Excel export for detailed analysis
- Files include timestamps for organization
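
The CSV export step can be sketched with the standard library alone. This is a minimal illustration, not the application's actual export code: the helper names `export_filename` and `courses_to_csv` and the `courses_export_` prefix are assumptions, while the column names come from the Course Object described in this README.

```python
import csv
import io
from datetime import datetime
from typing import Dict, List

# Column order mirrors the Course Object fields documented in this README.
FIELDS = [
    "course_name", "institute_name", "location", "format", "faculty",
    "language", "dates", "duration", "suitable_for", "fees", "availability",
]


def export_filename(now: datetime, extension: str = "csv") -> str:
    """Build a timestamped file name (hypothetical naming scheme)."""
    return f"courses_export_{now:%Y%m%d_%H%M%S}.{extension}"


def courses_to_csv(courses: List[Dict[str, str]]) -> str:
    """Serialize course dicts to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(courses)
    return buffer.getvalue()


sample = [{"course_name": "Introduction to Computer Science", "fees": "$3,500"}]
csv_text = courses_to_csv(sample)
file_name = export_filename(datetime(2024, 1, 15, 9, 30, 0))
```

Values that contain commas, such as the fee above, are quoted by the `csv` module, so the file still opens cleanly in spreadsheet applications.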

Advanced Features

Multiple URLs

Click "Add Multiple URLs" to process several websites at once
Each URL is processed independently
Results are combined for comprehensive analysis

Processing History

View previous extraction attempts
Reprocess URLs with updated data
Track success rates and course counts

Data Filtering

Filter by institute, location, or format
Sort results by any column
Search within extracted data
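
The filtering, sorting, and search behaviour described above can be expressed in a few lines. The sketch below is written in Python for illustration only; the application may well perform these steps in the browser, and the function names `filter_courses` and `sort_courses` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def filter_courses(
    courses: Iterable[Dict[str, str]],
    institute: Optional[str] = None,
    location: Optional[str] = None,
    fmt: Optional[str] = None,
    query: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep courses whose fields contain the given text (case-insensitive)."""

    def matches(course: Dict[str, str]) -> bool:
        checks = (
            ("institute_name", institute),
            ("location", location),
            ("format", fmt),
        )
        for key, wanted in checks:
            if wanted and wanted.lower() not in (course.get(key) or "").lower():
                return False
        if query:
            # Free-text search looks across every field of the course.
            haystack = " ".join(str(v) for v in course.values()).lower()
            if query.lower() not in haystack:
                return False
        return True

    return [course for course in courses if matches(course)]


def sort_courses(
    courses: Iterable[Dict[str, str]], column: str, descending: bool = False
) -> List[Dict[str, str]]:
    """Sort by any column, treating missing values as empty strings."""
    return sorted(
        courses,
        key=lambda c: str(c.get(column) or "").lower(),
        reverse=descending,
    )


courses = [
    {"course_name": "Data Structures", "institute_name": "University of Technology",
     "location": "New York, NY", "format": "On-campus"},
    {"course_name": "Algorithms", "institute_name": "City College",
     "location": "Boston, MA", "format": "Online"},
]
online_only = filter_courses(courses, fmt="online")
by_name = sort_courses(courses, "course_name")
```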

Best Practices

For Better Results

Use the main course listing page of the educational institution
Ensure the website is publicly accessible
Try department-specific course pages for specialized programs
Some websites may require multiple attempts for optimal extraction

URL Examples

University Courses: https://university.edu/courses
Department Pages: https://university.edu/computer-science/courses
Program Lists: https://college.edu/academic-programs

🏗️ Architecture

Backend (Flask)

CourseExtractor Class: Core extraction logic
Web Scraping: BeautifulSoup and Selenium integration
API Endpoints: RESTful API for frontend communication
Data Processing: Pandas for data manipulation and export

Frontend (HTML/CSS/JavaScript)

Responsive Design: Bootstrap 5 framework
Interactive UI: Modern JavaScript with ES6+ features
Real-time Updates: Asynchronous data processing
Local Storage: Persistent user preferences and history

Data Flow

User submits URL(s) via web interface
Backend processes URLs with web scraping
Extracted data is structured and validated
Results are returned to frontend
Data is displayed in table format
Export options generate downloadable files

🔧 Configuration

Environment Variables

Create a .env file in the project root:

FLASK_ENV=development
FLASK_DEBUG=True
CHROME_DRIVER_PATH=/path/to/chromedriver
MAX_CONCURRENT_REQUESTS=5
REQUEST_TIMEOUT=30
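
One way to read these values in Python is sketched below. It assumes the variables have already been loaded into the process environment (for example by python-dotenv); the `Settings` class and `load_settings` function are hypothetical names, and the fallback defaults simply repeat the sample values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    debug: bool
    chrome_driver_path: Optional[str]
    max_concurrent_requests: int
    request_timeout: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse typed settings from environment-style string values."""
    return Settings(
        debug=env.get("FLASK_DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        chrome_driver_path=env.get("CHROME_DRIVER_PATH") or None,
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "5")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
    )


# Passing a plain dict makes the parsing easy to try without touching os.environ.
settings = load_settings({"FLASK_DEBUG": "True", "MAX_CONCURRENT_REQUESTS": "3"})
```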

Customization Options

Scraping Patterns: Modify regex patterns in CourseExtractor class
Export Formats: Add new export formats in export functions
UI Themes: Customize CSS variables in static/css/style.css
API Rate Limiting: Adjust request limits and delays

📊 Data Structure

Course Object

{
  "course_name": "Introduction to Computer Science",
  "institute_name": "University of Technology",
  "location": "New York, NY",
  "format": "On-campus",
  "faculty": "Dr. John Smith",
  "language": "English",
  "dates": "2024-01-15 - 2024-05-15",
  "duration": "16 weeks",
  "suitable_for": "Undergraduate students",
  "fees": "$3,500",
  "availability": "Open for enrollment"
}
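
For code that consumes this object, a typed container can make missing fields explicit. The dataclass below is a sketch under the assumption that every field is a string and that only `course_name` is required; the class name `Course` and its `from_dict` helper are illustrative and not taken from the application.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass
class Course:
    course_name: str
    institute_name: str = ""
    location: str = ""
    format: str = ""
    faculty: str = ""
    language: str = ""
    dates: str = ""
    duration: str = ""
    suitable_for: str = ""
    fees: str = ""
    availability: str = ""

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Course":
        """Build a Course from a dict, ignoring unknown keys and None values."""
        known = {f.name for f in fields(cls)}
        cleaned = {
            key: str(value)
            for key, value in data.items()
            if key in known and value is not None
        }
        return cls(**cleaned)


course = Course.from_dict(
    {"course_name": "Introduction to Computer Science", "fees": "$3,500", "extra": 1}
)
course_dict = asdict(course)
```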

API Response

{
  "success": true,
  "results": [
    {
      "success": true,
      "url": "https://example.edu/courses",
      "courses_found": 25,
      "courses": [...]
    }
  ],
  "total_courses": 25
}
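
The response shape above could be assembled from per-URL results roughly as follows. This is a sketch, not the application's code: `url_result` and `build_response` are hypothetical helpers, and treating the top-level `success` flag as "at least one URL succeeded" is an assumption.

```python
from typing import Any, Dict, List


def url_result(url: str, courses: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap the courses extracted from one URL in the per-URL result shape."""
    return {
        "success": True,
        "url": url,
        "courses_found": len(courses),
        "courses": courses,
    }


def build_response(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-URL results and total the courses from successful URLs."""
    total = sum(r["courses_found"] for r in results if r.get("success"))
    return {
        # Assumption: the request counts as successful if any URL succeeded.
        "success": any(r.get("success") for r in results),
        "results": results,
        "total_courses": total,
    }


response = build_response(
    [url_result("https://example.edu/courses",
                [{"course_name": "Introduction to Computer Science"}])]
)
```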

🛠️ Development

Project Structure

course-extractor/
├── app.py                 # Main Flask application
├── requirements.txt       # Python dependencies
├── templates/            # HTML templates
│   └── index.html       # Main page template
├── static/              # Static assets
│   ├── css/            # Stylesheets
│   │   └── style.css   # Main CSS file
│   └── js/             # JavaScript files
│       └── app.js      # Main application logic
├── README.md            # This file
└── .gitignore          # Git ignore rules

Adding New Features

New Extraction Fields

Add field to _extract_single_course_from_container method
Update HTML table headers
Modify JavaScript display logic
Update export functions

New Export Formats

Create new route in Flask app
Implement export logic
Add frontend button
Update JavaScript export function

Custom Scraping Rules

Modify regex patterns in extraction methods
Add new CSS selectors for specific websites
Implement site-specific extraction logic
Test with target websites
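
As an illustration of what a regex-based rule might look like, the sketch below pulls a fee and a duration out of free text. The patterns and the `extract_fields` function are assumptions made for this example; the real patterns live in the extraction methods and may differ.

```python
import re
from typing import Dict

# A currency symbol followed by digits, optional thousands separators and cents.
FEE_PATTERN = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")
# A number followed by a time unit, e.g. "16 weeks" or "2 years".
DURATION_PATTERN = re.compile(
    r"\b\d+\s*(?:hour|day|week|month|year)s?\b", re.IGNORECASE
)


def extract_fields(text: str) -> Dict[str, str]:
    """Return the first fee and duration found in the text, or empty strings."""
    fee = FEE_PATTERN.search(text)
    duration = DURATION_PATTERN.search(text)
    return {
        "fees": fee.group(0) if fee else "",
        "duration": duration.group(0) if duration else "",
    }


found = extract_fields("Tuition: $3,500 per term. Duration: 16 weeks, on campus.")
```

Patterns like these are worth testing against text copied from each target website, since fee and date formats vary widely between institutions.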

Testing

# Run basic tests
python -m pytest tests/

# Test specific functionality
python -m pytest tests/test_extraction.py

# Run with coverage
python -m pytest --cov=app tests/

🚨 Troubleshooting

Common Issues

No Courses Found

Check if the website is publicly accessible
Verify the URL contains course information
Try different pages within the same website
Check browser console for JavaScript errors

Extraction Errors

Ensure all dependencies are installed
Check Chrome browser installation
Verify internet connection
Review Flask application logs

Performance Issues

Reduce number of concurrent URLs
Increase request timeout values
Check website response times
Monitor system resources

Debug Mode

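Enable debug mode for detailed error information during local development. The interactive debugger can execute arbitrary code, so bind it to 127.0.0.1 and never enable it on a publicly reachable host:

app.run(debug=True, host='127.0.0.1', port=5000)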

Logging

Application logs are available in the console. For production, configure proper logging:

import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('app.log'),
        logging.StreamHandler()
    ]
)

📈 Performance Optimization

Scraping Efficiency

Parallel Processing: Process multiple URLs concurrently
Caching: Cache previously scraped results
Rate Limiting: Respect website robots.txt and rate limits
Connection Pooling: Reuse HTTP connections
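
Parallel processing with a simple pause between requests can be sketched with a thread pool. The example assumes the extraction logic is available as a callable that takes a URL and returns a list of courses; `process_urls`, the `error` key, and the stand-in `fake_extract` function are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def process_urls(
    urls: List[str],
    extract: Callable[[str], List[Dict[str, Any]]],
    max_workers: int = 5,
    delay_seconds: float = 0.0,
) -> List[Dict[str, Any]]:
    """Run extract() on each URL concurrently, isolating failures per URL."""

    def worker(url: str) -> Dict[str, Any]:
        try:
            courses = extract(url)
            return {"success": True, "url": url,
                    "courses_found": len(courses), "courses": courses}
        except Exception as exc:  # one failing site must not abort the batch
            return {"success": False, "url": url, "error": str(exc)}
        finally:
            # Crude politeness delay: each worker pauses before its next URL.
            time.sleep(delay_seconds)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(worker, urls))


def fake_extract(url: str) -> List[Dict[str, Any]]:
    """Stand-in for the real extractor so the sketch runs without network access."""
    if "bad" in url:
        raise ValueError("unsupported site")
    return [{"course_name": "Example Course"}]


batch = process_urls(
    ["https://example.edu/courses", "https://bad.example/courses"],
    fake_extract,
    max_workers=2,
)
```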

Memory Management

Streaming: Process large datasets in chunks
Cleanup: Remove temporary files after export
Garbage Collection: Optimize Python memory usage

Database Integration

For production use, consider adding a database:

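# SQLite example
import sqlite3

conn = sqlite3.connect('courses.db')
cursor = conn.cursor()
# Create the table on first run; later runs reuse it
cursor.execute('''
    CREATE TABLE IF NOT EXISTS courses (
        id INTEGER PRIMARY KEY,
        course_name TEXT,
        institute_name TEXT,
        extracted_at TIMESTAMP
    )
''')
conn.commit()  # commit any pending transaction
conn.close()   # release the database file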
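
Saving extracted courses into a table of this shape might look like the sketch below. It uses an in-memory database so it can be run safely, and the `save_courses` helper is a hypothetical name. The `?` placeholders keep scraped text from being interpreted as SQL.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Dict, List

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS courses (
        id INTEGER PRIMARY KEY,
        course_name TEXT,
        institute_name TEXT,
        extracted_at TIMESTAMP
    )
    """
)


def save_courses(db: sqlite3.Connection, courses: List[Dict[str, str]]) -> int:
    """Insert courses with parameterized queries and return the row count."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (c.get("course_name", ""), c.get("institute_name", ""), now)
        for c in courses
    ]
    with db:  # commits on success, rolls back if an insert fails
        db.executemany(
            "INSERT INTO courses (course_name, institute_name, extracted_at) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


saved = save_courses(
    conn,
    [
        {"course_name": "Introduction to Computer Science",
         "institute_name": "University of Technology"},
        # Hostile-looking text is stored literally, not executed.
        {"course_name": "Robert'); DROP TABLE courses;--", "institute_name": "Test"},
    ],
)
row_count = conn.execute("SELECT COUNT(*) FROM courses").fetchone()[0]
```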

🔒 Security Considerations

Data Privacy

Only extract publicly available information
Respect website terms of service
Implement rate limiting to avoid overwhelming servers
Store sensitive data securely

Input Validation

Validate all user inputs
Sanitize URLs before processing
Implement CSRF protection
Use parameterized queries
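
URL validation before scraping could start from a check like the one below. It is a minimal sketch, and `is_valid_course_url` is a hypothetical helper: it accepts only http and https URLs and rejects localhost and non-public IP literals. It does not resolve DNS names, so a hostname that points at a private address would still pass.

```python
import ipaddress
from urllib.parse import urlsplit


def is_valid_course_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is not local or a private IP literal."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:  # malformed input such as unbalanced IPv6 brackets
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    host = parts.hostname
    if host == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name rather than an IP literal
    return address.is_global


checks = {
    "https://university.edu/courses": is_valid_course_url("https://university.edu/courses"),
    "ftp://university.edu/courses": is_valid_course_url("ftp://university.edu/courses"),
    "http://127.0.0.1:5000/": is_valid_course_url("http://127.0.0.1:5000/"),
}
```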

Access Control

Implement user authentication if needed
Rate limit API endpoints
Monitor for abuse patterns
Log access attempts
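
Rate limiting of API endpoints can be approximated with a sliding-window counter per client. The class below is a sketch with hypothetical names; it keeps state in memory and is not thread-safe, so a production deployment would more likely rely on a shared store or an existing rate-limiting extension.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record a request and report whether it is within the limit."""
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=5, window_seconds=60)
first_request_allowed = limiter.allow("203.0.113.7")
```

The `clock` parameter exists so the limiter can be exercised in tests with a fake clock instead of real waiting.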

🤝 Contributing

How to Contribute

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Submit a pull request

Code Style

Follow PEP 8 Python guidelines
Use meaningful variable names
Add docstrings to functions
Include type hints where appropriate

Testing Requirements

Maintain test coverage above 80%
Test edge cases and error conditions
Include integration tests for API endpoints
Test with various website structures

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

BeautifulSoup: HTML parsing and navigation
Selenium: Dynamic content extraction
Flask: Web framework
Bootstrap: UI components and responsive design
Font Awesome: Icons and visual elements

🚀 Deployment

Making Your App Public

This Course Extractor can be easily deployed to make it publicly accessible to anyone on the internet.

Quick Deploy Options

Heroku (Recommended)
- Paid plans only (the free tier was discontinued in November 2022)
- Automatic deployment from GitHub
- Easy scaling options
Railway
- Simple deployment process
- Good for small to medium apps
- Automatic HTTPS
Render
- Free tier available
- Easy GitHub integration
- Good documentation

Deployment Steps

Push to GitHub

git add .
git commit -m "Ready for deployment"
git push origin main

Deploy to Platform
- Follow platform-specific instructions
- Connect your GitHub repository
- Deploy automatically
Get Public URL
- Your app will have a public URL
- Share with others to use your Course Extractor
- Monitor usage and performance

For detailed deployment instructions, see DEPLOYMENT.md.

📞 Support

Getting Help

Issues: Report bugs and feature requests on GitHub
Documentation: Check this README and inline code comments
Community: Join discussions in project forums

Contact Information

Project Maintainer: [Your Name]
Email: [your.email@example.com]
GitHub: [github.com/yourusername]

Note: This tool is designed for educational and research purposes. Always respect website terms of service and implement appropriate rate limiting when scraping websites.
