A powerful web application that automatically extracts course information from educational websites. Built with Flask (Python) and modern web technologies, this tool helps educators, students, and researchers gather structured course data efficiently.
- AI-Powered Extraction: Automatically detects and extracts course information using pattern matching (regex patterns and CSS selectors)
- Multi-URL Support: Process multiple educational websites simultaneously
- Comprehensive Data Extraction: Extract all available course details including:
  - Course Name
  - Institute Name
  - Location
  - Format (Online, On-campus, Hybrid, etc.)
  - Faculty/Instructors
  - Language of Instruction
  - Dates (Start/End)
  - Duration
  - Suitable For (target audience, prerequisites)
  - Fees/Cost
  - Availability/Enrollment Status
- Modern Web Interface: Responsive design that works on all devices
- Real-time Processing: Live feedback during extraction process
- Smart Validation: URL validation and error handling
- Processing History: Track and reprocess previous extractions
- Export Options: Download results as CSV or Excel files
- Privacy-First: Only extracts publicly available information
- Error Handling: Graceful handling of invalid URLs and unsupported sites
- Performance Optimized: Efficient scraping with rate limiting
- Cross-Platform: Works on Windows, macOS, and Linux
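The "Smart Validation" feature listed above is not shown in this README. As a minimal sketch using only the standard library, a URL check that runs before any scraping might look like the following. The function name `is_valid_course_url` is hypothetical and is not taken from the project's code.

```python
from urllib.parse import urlparse


def is_valid_course_url(url: str) -> bool:
    """Return True if url is an absolute http(s) URL with a host name.

    Hypothetical helper for illustration; not part of the project's code.
    """
    try:
        parsed = urlparse(url.strip())
        # Only web URLs with a host can be scraped.
        return parsed.scheme in ("http", "https") and bool(parsed.hostname)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # for example an unterminated IPv6 literal.
        return False
```

A check like this only confirms the URL is well formed; whether the site is reachable or contains course data still has to be discovered at request time.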
- Python 3.8 or higher
- Internet connection
- Clone the repository:
    git clone <repository-url>
    cd course-extractor
- Install Python dependencies:
    pip install -r requirements.txt
- Run the application:
    python app.py
- Open your browser and navigate to http://localhost:5000
Want to make your Course Extractor publicly accessible? Follow our deployment guide:
Quick Deploy Options:
- Heroku (Recommended) - Paid plans only; the free tier was retired in November 2022
- Railway - Simple deployment
- Render - Easy setup
Step-by-step guide: See DEPLOYMENT.md for detailed instructions.
Automated deployment: Use our deployment scripts:
- Windows: deploy.bat
- PowerShell: deploy.ps1
- Enter Website URL
  - Navigate to the home page
  - Enter the URL of an educational institution's website
  - Click "Extract Courses"
- View Results
  - Results are displayed in a structured table
  - Summary statistics show key metrics
  - Each course row contains all extracted information
- Export Data
  - Use CSV export for spreadsheet applications
  - Use Excel export for detailed analysis
  - Files include timestamps for organization
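To make the export step more concrete, here is a minimal sketch of a CSV export with a timestamped file name, using only the standard library. It is an assumption about how such an export could work, not the project's actual export code: the helper names `courses_to_csv` and `export_filename` and the file-name pattern are hypothetical, while the column names follow the course record example given in this README.

```python
import csv
import io
from datetime import datetime

# Column names follow the course record example in this README.
FIELDS = [
    "course_name", "institute_name", "location", "format", "faculty",
    "language", "dates", "duration", "suitable_for", "fees", "availability",
]


def courses_to_csv(courses):
    """Serialize a list of course dicts to CSV text; missing fields become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(courses)
    return buffer.getvalue()


def export_filename(prefix="courses", now=None):
    """Build a timestamped file name (hypothetical naming pattern)."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.csv"
```

The `csv` module quotes values that contain commas, so a fee such as "$3,500" stays in a single column when the file is opened in a spreadsheet.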
- Click "Add Multiple URLs" to process several websites at once
- Each URL is processed independently
- Results are combined for comprehensive analysis
- View previous extraction attempts
- Reprocess URLs with updated data
- Track success rates and course counts
- Filter by institute, location, or format
- Sort results by any column
- Search within extracted data
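The filtering, sorting, and search options above can be illustrated with plain Python over a list of course dictionaries. This is only a sketch: the README does not say whether filtering happens in the browser or on the server, and the function names `filter_courses` and `sort_courses` are hypothetical.

```python
def filter_courses(courses, institute=None, location=None, fmt=None, query=None):
    """Return the courses that match every given criterion (case-insensitive)."""

    def matches(course):
        checks = (
            ("institute_name", institute),
            ("location", location),
            ("format", fmt),
        )
        for key, wanted in checks:
            if wanted and wanted.lower() not in str(course.get(key, "")).lower():
                return False
        if query:
            # Free-text search across all extracted values of the course.
            haystack = " ".join(str(value) for value in course.values()).lower()
            return query.lower() in haystack
        return True

    return [course for course in courses if matches(course)]


def sort_courses(courses, column, descending=False):
    """Sort by any column as text; missing values sort as empty strings."""
    return sorted(
        courses,
        key=lambda course: str(course.get(column, "")).lower(),
        reverse=descending,
    )
```

Because every value is compared as text, columns such as fees or duration sort alphabetically rather than numerically; numeric sorting would need the values parsed first.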
- Use the main course listing page of the educational institution
- Ensure the website is publicly accessible
- Try department-specific course pages for specialized programs
- Some websites may require multiple attempts for optimal extraction
- University Courses: https://university.edu/courses
- Department Pages: https://university.edu/computer-science/courses
- Program Lists: https://college.edu/academic-programs
- CourseExtractor Class: Core extraction logic
- Web Scraping: BeautifulSoup and Selenium integration
- API Endpoints: RESTful API for frontend communication
- Data Processing: Pandas for data manipulation and export
- Responsive Design: Bootstrap 5 framework
- Interactive UI: Modern JavaScript with ES6+ features
- Real-time Updates: Asynchronous data processing
- Local Storage: Persistent user preferences and history
- User submits URL(s) via web interface
- Backend processes URLs with web scraping
- Extracted data is structured and validated
- Results are returned to frontend
- Data is displayed in table format
- Export options generate downloadable files
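The flow above, combined with the multi-URL behavior described earlier (each URL processed independently, results combined), could be sketched as follows. This is not the project's implementation: `process_url` and `process_urls` are hypothetical names, `extract` stands in for whatever the CourseExtractor class actually does, the `error` key and the meaning of the top-level `success` flag are assumptions, and the default of 5 workers simply mirrors the MAX_CONCURRENT_REQUESTS value in the configuration example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_url(url, extract):
    """Run one extraction and wrap the outcome in a per-URL result dict."""
    try:
        courses = extract(url)
        return {
            "success": True,
            "url": url,
            "courses_found": len(courses),
            "courses": courses,
        }
    except Exception as exc:  # one failing site should not abort the batch
        return {
            "success": False,
            "url": url,
            "courses_found": 0,
            "courses": [],
            "error": str(exc),  # assumed key, not in the documented response
        }


def process_urls(urls, extract, max_workers=5):
    """Process URLs independently and combine them into one response."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input URLs.
        results = list(pool.map(lambda url: process_url(url, extract), urls))
    return {
        "success": any(result["success"] for result in results),
        "results": results,
        "total_courses": sum(result["courses_found"] for result in results),
    }
```

Threads suit this workload because scraping is dominated by waiting on network responses; catching errors per URL keeps one unsupported site from discarding the results of the others.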
Create a .env file in the project root:
FLASK_ENV=development
FLASK_DEBUG=True
CHROME_DRIVER_PATH=/path/to/chromedriver
MAX_CONCURRENT_REQUESTS=5
REQUEST_TIMEOUT=30
- Scraping Patterns: Modify regex patterns in the CourseExtractor class
- Export Formats: Add new export formats in export functions
- UI Themes: Customize CSS variables in static/css/style.css
- API Rate Limiting: Adjust request limits and delays
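As a rough illustration of how the .env values shown above might be consumed, the sketch below reads them from the process environment with the same defaults. The function name `load_settings` and the returned dictionary keys are hypothetical; how app.py actually reads its configuration is not documented here.

```python
import os


def load_settings(env=None):
    """Read scraper settings from environment variables, with fallback defaults."""
    env = os.environ if env is None else env
    return {
        # Accept common truthy spellings such as "True", "true", or "1".
        "debug": env.get("FLASK_DEBUG", "False").strip().lower() in ("1", "true", "yes"),
        "chrome_driver_path": env.get("CHROME_DRIVER_PATH"),
        "max_concurrent_requests": int(env.get("MAX_CONCURRENT_REQUESTS", "5")),
        "request_timeout": int(env.get("REQUEST_TIMEOUT", "30")),
    }
```

Note that running `python app.py` does not read a .env file by itself. Something such as python-dotenv's `load_dotenv()` would have to be called first, and whether the project does so is not stated in this README.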
{
"course_name": "Introduction to Computer Science",
"institute_name": "University of Technology",
"location": "New York, NY",
"format": "On-campus",
"faculty": "Dr. John Smith",
"language": "English",
"dates": "2024-01-15 - 2024-05-15",
"duration": "16 weeks",
"suitable_for": "Undergraduate students",
"fees": "$3,500",
"availability": "Open for enrollment"
}
{
"success": true,
"results": [
{
"success": true,
"url": "https://example.edu/courses",
"courses_found": 25,
"courses": [...]
}
],
"total_courses": 25
}
course-extractor/
├── app.py                 # Main Flask application
├── requirements.txt       # Python dependencies
├── templates/             # HTML templates
│   └── index.html         # Main page template
├── static/                # Static assets
│   ├── css/               # Stylesheets
│   │   └── style.css      # Main CSS file
│   └── js/                # JavaScript files
│       └── app.js         # Main application logic
├── README.md              # This file
└── .gitignore             # Git ignore rules
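The course record shown in the JSON example above could be modeled in Python as a dataclass, which keeps the field list in one place when the schema changes. This is a sketch under the assumption that every field is free text; the class name `Course` and its `from_dict` helper are hypothetical and do not come from app.py.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class Course:
    """One extracted course; every field is optional text, as in the JSON example."""

    course_name: str = ""
    institute_name: str = ""
    location: str = ""
    format: str = ""
    faculty: str = ""
    language: str = ""
    dates: str = ""
    duration: str = ""
    suitable_for: str = ""
    fees: str = ""
    availability: str = ""

    @classmethod
    def from_dict(cls, data):
        """Build a Course from a dict, ignoring keys that are not fields."""
        known = {field.name for field in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})
```

`asdict(course)` turns an instance back into a plain dictionary that can be serialized to JSON or written to an export file.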
- Add field to the _extract_single_course_from_container method
- Update HTML table headers
- Modify JavaScript display logic
- Update export functions
- Create new route in Flask app
- Implement export logic
- Add frontend button
- Update JavaScript export function
- Modify regex patterns in extraction methods
- Add new CSS selectors for specific websites
- Implement site-specific extraction logic
- Test with target websites
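For a sense of what regex-based extraction can look like, here is a small sketch that pulls a duration and a fee out of free text. The patterns and the function names `extract_duration` and `extract_fee` are illustrative assumptions; the project's actual patterns live in the CourseExtractor class and are not reproduced in this README.

```python
import re

# Illustrative patterns only; real course pages vary widely.
DURATION_RE = re.compile(r"\b(\d+)\s*(week|month|day|year|hour)s?\b", re.IGNORECASE)
FEE_RE = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_duration(text):
    """Return a normalized duration such as '16 weeks', or None if absent."""
    match = DURATION_RE.search(text)
    if not match:
        return None
    count, unit = int(match.group(1)), match.group(2).lower()
    return f"{count} {unit}{'' if count == 1 else 's'}"


def extract_fee(text):
    """Return the first currency amount found, or None."""
    match = FEE_RE.search(text)
    return match.group(0) if match else None
```

Returning None for missing values, rather than guessing, makes it easier to see which fields a given site does not expose and where a site-specific selector is needed.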
# Run basic tests
python -m pytest tests/
# Test specific functionality
python -m pytest tests/test_extraction.py
# Run with coverage
python -m pytest --cov=app tests/
- Check if the website is publicly accessible
- Verify the URL contains course information
- Try different pages within the same website
- Check browser console for JavaScript errors
- Ensure all dependencies are installed
- Check Chrome browser installation
- Verify internet connection
- Review Flask application logs
- Reduce number of concurrent URLs
- Increase request timeout values
- Check website response times
- Monitor system resources
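Enable debug mode for detailed error information. Use it only during local development, because Flask's interactive debugger can execute arbitrary code:
app.run(debug=True, host='0.0.0.0', port=5000)
Application logs are available in the console. For production, configure proper logging: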
import logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('app.log'),
logging.StreamHandler()
]
)
- Parallel Processing: Process multiple URLs concurrently
- Caching: Cache previously scraped results
- Rate Limiting: Respect website robots.txt and rate limits
- Connection Pooling: Reuse HTTP connections
- Streaming: Process large datasets in chunks
- Cleanup: Remove temporary files after export
- Garbage Collection: Optimize Python memory usage
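The rate-limiting item above mentions respecting robots.txt. A minimal standard-library sketch of both ideas follows: it parses robots.txt content that has already been downloaded and enforces a minimum delay between requests. The names `build_robots` and `Throttle` are hypothetical, and the project's real rate-limiting code may work differently.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._last = None  # monotonic time of the previous request

    def wait(self):
        """Sleep just long enough to keep requests delay_seconds apart."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Before each request, a scraper would check `robots.can_fetch(user_agent, url)` and call `throttle.wait()`. When the site declares a Crawl-delay, `robots.crawl_delay(user_agent)` is a reasonable value to pass to `Throttle`.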
For production use, consider adding a database:
# SQLite example
import sqlite3
conn = sqlite3.connect('courses.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS courses (
id INTEGER PRIMARY KEY,
course_name TEXT,
institute_name TEXT,
extracted_at TIMESTAMP
)
''')
- Only extract publicly available information
- Respect website terms of service
- Implement rate limiting to avoid overwhelming servers
- Store sensitive data securely
- Validate all user inputs
- Sanitize URLs before processing
- Implement CSRF protection
- Use parameterized queries
- Implement user authentication if needed
- Rate limit API endpoints
- Monitor for abuse patterns
- Log access attempts
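To illustrate the "use parameterized queries" item, the sketch below inserts into and reads from the courses table defined in the SQLite example above, binding values with `?` placeholders instead of building SQL strings. The function names `save_course` and `find_by_institute` are hypothetical, and storing the timestamp as an ISO 8601 string is an assumption.

```python
import sqlite3
from datetime import datetime, timezone


def save_course(conn, course):
    """Insert one course using placeholders so values are never spliced into SQL."""
    conn.execute(
        "INSERT INTO courses (course_name, institute_name, extracted_at) "
        "VALUES (?, ?, ?)",
        (
            course.get("course_name", ""),
            course.get("institute_name", ""),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()


def find_by_institute(conn, institute_name):
    """Look up course names for one institute with a bound parameter."""
    rows = conn.execute(
        "SELECT course_name FROM courses WHERE institute_name = ? "
        "ORDER BY course_name",
        (institute_name,),
    )
    return [row[0] for row in rows]
```

Because scraped text is untrusted input, a course name containing quotes or SQL keywords is stored as ordinary data rather than being interpreted as part of the statement.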
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
- Follow PEP 8 Python guidelines
- Use meaningful variable names
- Add docstrings to functions
- Include type hints where appropriate
- Maintain test coverage above 80%
- Test edge cases and error conditions
- Include integration tests for API endpoints
- Test with various website structures
This project is licensed under the MIT License - see the LICENSE file for details.
- BeautifulSoup: HTML parsing and navigation
- Selenium: Dynamic content extraction
- Flask: Web framework
- Bootstrap: UI components and responsive design
- Font Awesome: Icons and visual elements
This Course Extractor can be easily deployed to make it publicly accessible to anyone on the internet.
- Heroku (Recommended)
  - Paid plans only; the free tier was retired in November 2022
  - Automatic deployment from GitHub
  - Easy scaling options
- Railway
  - Simple deployment process
  - Good for small to medium apps
  - Automatic HTTPS
- Render
  - Free tier available
  - Easy GitHub integration
  - Good documentation
- Push to GitHub
    git add .
    git commit -m "Ready for deployment"
    git push origin main
- Deploy to Platform
  - Follow platform-specific instructions
  - Connect your GitHub repository
  - Deploy automatically
- Get Public URL
  - Your app will have a public URL
  - Share with others to use your Course Extractor
  - Monitor usage and performance
For detailed deployment instructions, see DEPLOYMENT.md.
- Issues: Report bugs and feature requests on GitHub
- Documentation: Check this README and inline code comments
- Community: Join discussions in project forums
- Project Maintainer: [Your Name]
- Email: [your.email@example.com]
- GitHub: [github.com/yourusername]
Note: This tool is designed for educational and research purposes. Always respect website terms of service and implement appropriate rate limiting when scraping websites.