Ethical public-data web scraper with scheduled monitoring, multi-site collection, and pre-built Excel/PowerBI analytics export — built entirely in Python and vanilla JavaScript.
WebHarvest Pro is a full-stack data collection pipeline that:
- Validates every URL against an ethics engine (robots.txt, paywall detection, auth-wall detection)
- Scrapes structured public data — headings, paragraphs, tables, links, metadata
- Fingerprints page content with SHA-256 hashes, stored in every export for change auditing (the browser UI adds scheduled re-scraping on top)
- Exports to richly formatted Excel workbooks with pre-built PowerBI/PivotTable analytics templates, or clean analysis-ready CSV files
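The hash-based change tracking mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function names `content_hash` and `has_changed`, and the whitespace normalisation step, are assumptions made for illustration.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the text after collapsing whitespace."""
    # Assumption: whitespace is normalised so layout-only edits do not count as changes.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], current_text: str) -> bool:
    """True when no earlier hash exists or the new digest differs from it."""
    return previous_hash is None or previous_hash != content_hash(current_text)


first_run = content_hash("Quote of the day:   Be curious.")
print(has_changed(first_run, "Quote of the day: Be curious."))    # False (whitespace only)
print(has_changed(first_run, "Quote of the day: Stay curious."))  # True
```

Storing the digest next to each exported row means two runs can be compared without keeping the full page text from the earlier run.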
webharvest-pro/
│
├── src/
│ ├── scraper_engine.py # Core scraping pipeline + ethics enforcement
│ ├── excel_exporter.py # 8-sheet Excel builder with analytics templates
│ ├── csv_exporter.py # Analysis-ready CSV export (3 files)
│ └── build_excel.py # CLI entry point — orchestrates full pipeline
│
├── web/
│ └── index.html # Browser UI — no install needed
│
├── docs/
│ ├── ARCHITECTURE.md # System design & module breakdown
│ ├── SKILLS_SHOWCASE.md # Annotated code — interview talking points
│ └── ETHICS.md # Compliance policy & scraping ethics
│
├── examples/
│ ├── sample_export.xlsx # Pre-generated Excel output to explore
│ └── sample_output.json # Sample raw scrape data
│
├── requirements.txt
├── LICENSE
└── README.md ← You are here
1. Download web/index.html
2. Open in any browser
3. Paste URLs → click Scrape → export CSV
# Install dependencies
pip install -r requirements.txt
# Scrape one or more public URLs
python src/build_excel.py --urls https://news.ycombinator.com https://quotes.toscrape.com
# Build Excel from a previously saved JSON file
python src/build_excel.py --json examples/sample_output.json
# CSV output only, custom output directory
python src/build_excel.py --urls https://example.com --format csv --output ./exports

| Sheet | Contents |
|---|---|
| 📊 Summary | KPI stat cards, per-site overview table, auto-status badges |
| 📄 Raw Data | All headings, paragraphs, and list items with source attribution |
| 📋 Tables | Every HTML table found, reformatted with column headers |
| 🔗 Links | Full link inventory with internal/external classification and domain |
| 📈 Analytics Template | Pre-written Excel formulas + 6 PivotTable configuration recipes |
| ⚡ PowerBI Guide | DAX formulas + step-by-step dashboard layout guide |
| 🔄 PivotTable Guide | Slicer setup, FILTER/UNIQUE/SORT formulas, chart type recommendations |
| 📝 Changelog | Timestamped run history with content hashes for change tracking |
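The internal/external classification listed for the Links sheet can be approximated with the standard library alone. The sketch below is an assumption about how such a classifier might work; `classify_link` and its returned keys are hypothetical names, not taken from `scraper_engine.py`.

```python
from urllib.parse import urljoin, urlparse


def classify_link(page_url: str, href: str) -> dict:
    """Resolve href against the page URL and flag whether it leaves the page's host."""
    absolute = urljoin(page_url, href)
    page_host = urlparse(page_url).netloc.lower()
    link_host = urlparse(absolute).netloc.lower()
    return {
        "url": absolute,
        "domain": link_host,
        # Exact host comparison: subdomains of the same site count as external here.
        "is_external": link_host != page_host,
    }


page = "https://quotes.toscrape.com/page/2/"
print(classify_link(page, "/tag/love/"))
print(classify_link(page, "https://example.com/about"))
```

Comparing full host names is the simplest rule. A stricter or looser definition of "internal" (for example, matching on the registered domain) would need a different comparison.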
| File | Contents |
|---|---|
| summary_[ts].csv | One row per URL with counts and status |
| raw_content_[ts].csv | All content rows with type labels (heading, paragraph, list_item) |
| links_[ts].csv | All links with source, text, URL, and external flag |
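As a rough illustration of the summary file listed above, the sketch below writes one row per URL with the standard `csv` module. The column names, the timestamp format, and the function name `write_summary_csv` are assumptions; the real schema lives in `csv_exporter.py`.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical column schema, for illustration only.
SUMMARY_FIELDS = ["url", "status", "headings", "paragraphs", "links"]


def write_summary_csv(rows, output_dir, timestamp=None):
    """Write summary_[ts].csv into output_dir and return the file path."""
    ts = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"summary_{ts}.csv"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=SUMMARY_FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return path


rows = [{"url": "https://example.com", "status": "ok",
         "headings": 3, "paragraphs": 10, "links": 7}]
print(write_summary_csv(rows, "./exports"))
```

`extrasaction="ignore"` drops any keys outside the schema, which keeps the column layout stable even if the scraper adds fields later.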
This tool is built around four layers of protection against unethical scraping:
| Layer | Mechanism | What it blocks |
|---|---|---|
| URL Validation | Regex pattern matching | Login pages, dashboards, private/admin paths |
| Domain Blocklist | Exact domain matching | Gmail, Instagram, LinkedIn DMs, Patreon, etc. |
| robots.txt | urllib.robotparser | Any path disallowed by the site's crawl policy |
| Paywall Detection | Keyword scan of page HTML | Subscribe walls, members-only gates |
Designed for: public news sites, government data portals, open academic resources, company public pages, open directories.
Not for: paywalled articles, authenticated portals, personal social media pages, or any content where you don't have explicit permission.
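The four layers described above can be sketched with the standard library as follows. This is only an illustration under stated assumptions: the regex, the blocklisted domains, the paywall keywords, the `WebHarvestPro` user-agent string, and the function names are all hypothetical, and the robots.txt rules are passed in as lines so the example runs without network access.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets, for illustration only.
BLOCKED_PATH = re.compile(r"/(login|signin|admin|dashboard|account)(/|$)", re.IGNORECASE)
BLOCKED_DOMAINS = {"mail.google.com", "www.instagram.com", "www.patreon.com"}
PAYWALL_HINTS = ("subscribe to continue", "members only", "subscribers only")


def check_url(url, robots_lines, user_agent="WebHarvestPro"):
    """Run the URL, domain, and robots.txt layers; return (allowed, reason)."""
    parsed = urlparse(url)
    if BLOCKED_PATH.search(parsed.path):
        return False, "url_pattern"
    if parsed.netloc.lower() in BLOCKED_DOMAINS:
        return False, "domain_blocklist"
    robots = RobotFileParser()
    robots.parse(robots_lines)  # rules supplied as an iterable of lines
    if not robots.can_fetch(user_agent, url):
        return False, "robots_txt"
    return True, "ok"


def looks_paywalled(html):
    """Keyword scan of already-fetched HTML for common paywall phrases."""
    lowered = html.lower()
    return any(hint in lowered for hint in PAYWALL_HINTS)


robots_rules = ["User-agent: *", "Disallow: /private/"]
print(check_url("https://example.com/news", robots_rules))
print(check_url("https://example.com/private/report", robots_rules))
print(looks_paywalled("<p>Subscribe to continue reading</p>"))
```

In this sketch the first three checks need only the URL and the site's robots.txt, while the paywall scan runs on page HTML and therefore has to happen after a fetch.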
| Component | Technology | Why |
|---|---|---|
| HTTP client | requests | Robust redirects, timeout handling, session management |
| HTML parsing | BeautifulSoup4 + lxml | Fast, fault-tolerant DOM traversal |
| robots.txt | urllib.robotparser (stdlib) | Zero extra dependency, RFC-compliant |
| Excel generation | openpyxl | Full formatting API: colors, borders, tables, charts |
| CSV export | csv (stdlib) | Zero-dependency CSV writing with clean column schema |
| UI | Vanilla JS + HTML/CSS | Zero-dependency browser app, no build step |
| CLI | argparse (stdlib) | Self-documenting command interface |
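To show what a self-documenting `argparse` interface looks like, here is a minimal sketch that mirrors the flags used in the commands earlier in this README (`--urls`, `--json`, `--format`, `--output`). The `excel` choice name, the defaults, and the help strings are assumptions rather than the actual definitions in `build_excel.py`.

```python
import argparse


def build_parser():
    """Build a parser with the flags shown in this README's example commands."""
    parser = argparse.ArgumentParser(
        prog="build_excel.py",
        description="Scrape public URLs or load saved JSON, then export the results.",
    )
    # Assumption: --urls and --json are alternative input sources.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--urls", nargs="+", metavar="URL",
                        help="one or more public URLs to scrape")
    source.add_argument("--json", metavar="FILE",
                        help="previously saved scrape data")
    # Assumption: 'excel' is the default format name; only 'csv' appears in the README.
    parser.add_argument("--format", choices=["excel", "csv"], default="excel",
                        help="export format")
    parser.add_argument("--output", default=".", metavar="DIR",
                        help="directory for exported files")
    return parser


args = build_parser().parse_args(
    ["--urls", "https://example.com", "--format", "csv", "--output", "./exports"]
)
print(args.urls, args.format, args.output)
```

Leaving the input group optional keeps room for a fallback mode when neither flag is given, and `--help` output is generated from the same definitions.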
A quick reference for hiring managers — each item links to the relevant code:
| Skill Area | What's built | File |
|---|---|---|
| Web scraping | Full pipeline: fetch → parse → structure → export | scraper_engine.py |
| Data engineering | Structured extraction of tables, links, headings, metadata | scraper_engine.py |
| API / HTTP | Rate limiting, redirects, User-Agent headers, timeout handling | scraper_engine.py |
| File I/O | Generate formatted multi-sheet Excel and multi-file CSV | excel_exporter.py, csv_exporter.py |
| Security thinking | Ethics engine with 4 validation layers before any request fires | scraper_engine.py |
| Automation | Scheduled re-scraping + content-hash change detection | index.html (JS scheduler) |
| Frontend | Responsive single-file browser app — no framework | index.html |
| CLI design | Argparse with modes: direct URLs, JSON input, interactive | build_excel.py |
| Data viz prep | Pre-written DAX, PivotTable recipes, chart recommendations | excel_exporter.py |
See docs/SKILLS_SHOWCASE.md for annotated code walkthroughs of each area.
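For readers curious about the rate limiting listed in the table, one common approach is a minimum delay between requests to the same host. The sketch below is an assumption about how that could be done, not a copy of the project's code; `DomainRateLimiter`, its parameters, and the one-second default are hypothetical.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was contacted too recently; return the delay applied."""
        host = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay


limiter = DomainRateLimiter(min_interval=0.2)
for target in ["https://quotes.toscrape.com/page/1/", "https://quotes.toscrape.com/page/2/"]:
    print(target, "waited", round(limiter.wait(target), 2), "s")
```

Calling `wait(url)` immediately before each HTTP request keeps the pacing logic in one place, and tracking hosts separately means a slow site does not hold up requests to other sites.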
git clone https://github.com/LSaiko/Webharvest-Pro.git
cd Webharvest-Pro
pip install -r requirements.txt
python src/build_excel.py --urls https://quotes.toscrape.com

Requirements:
requests>=2.28.0,<3.0.0
beautifulsoup4>=4.11.0,<5.0.0
lxml>=4.9.0,<6.0.0
openpyxl>=3.1.0,<4.0.0
MIT — free for personal and commercial use. See LICENSE.
Pull requests welcome. Please ensure any new scraping features:
- Pass the ethics validation pipeline
- Include a docstring explaining what data is collected
- Respect the existing rate-limiting patterns
Built as a portfolio project to demonstrate end-to-end data engineering: collection, transformation, and analytics-ready export.