⬡ WebHarvest Pro

Ethical public-data web scraper with scheduled monitoring, multi-site collection, and pre-built Excel/PowerBI analytics export — built entirely in Python and vanilla JavaScript.

📸 What It Does

WebHarvest Pro is a full-stack data collection pipeline that:

- Validates every URL against an ethics engine (robots.txt, paywall detection, auth-wall detection)
- Scrapes structured public data: headings, paragraphs, tables, links, and metadata
- Tracks content via SHA-256 hashes stored in every export for change auditing; the browser UI adds scheduled re-scraping on top
- Exports to richly formatted Excel workbooks with pre-built PowerBI/PivotTable analytics templates, or to clean analysis-ready CSV files
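
The change-auditing step in the list above can be illustrated with a minimal sketch. It assumes the hash is computed over whitespace-normalized page text; the names `content_hash` and `has_changed` are hypothetical and are not taken from `scraper_engine.py`.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of whitespace-normalized text."""
    # Assumption: collapsing whitespace avoids flagging pages whose
    # markup was merely re-indented or reflowed.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], text: str) -> bool:
    """Compare a stored hash with the hash of freshly scraped text."""
    return previous_hash != content_hash(text)
```

Storing the digest next to each exported row would let a later run call `has_changed(stored_digest, new_text)` and decide whether the page needs attention.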

🗂 Repository Structure

webharvest-pro/
│
├── src/
│   ├── scraper_engine.py      # Core scraping pipeline + ethics enforcement
│   ├── excel_exporter.py      # 8-sheet Excel builder with analytics templates
│   ├── csv_exporter.py        # Analysis-ready CSV export (3 files)
│   └── build_excel.py         # CLI entry point — orchestrates full pipeline
│
├── web/
│   └── index.html             # Browser UI — no install needed
│
├── docs/
│   ├── ARCHITECTURE.md        # System design & module breakdown
│   ├── SKILLS_SHOWCASE.md     # Annotated code — interview talking points
│   └── ETHICS.md              # Compliance policy & scraping ethics
│
├── examples/
│   ├── sample_export.xlsx     # Pre-generated Excel output to explore
│   └── sample_output.json     # Sample raw scrape data
│
├── requirements.txt
├── LICENSE
└── README.md                  ← You are here

⚡ Quick Start

Web UI (zero install)

1. Download `web/index.html`
2. Open in any browser
3. Paste URLs → click Scrape → export CSV

Python CLI

# Install dependencies
pip install -r requirements.txt

# Scrape one or more public URLs
python src/build_excel.py --urls https://news.ycombinator.com https://quotes.toscrape.com

# Build Excel from a previously saved JSON file
python src/build_excel.py --json examples/sample_output.json

# CSV output only, custom output directory
python src/build_excel.py --urls https://example.com --format csv --output ./exports
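
For readers curious how such a command line might be wired up, here is a rough `argparse` sketch. Only the flag names `--urls`, `--json`, `--format`, and `--output` come from the commands above; the defaults, the `excel` choice, and the help text are assumptions, and the real `build_excel.py` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags shown in the Quick Start."""
    parser = argparse.ArgumentParser(
        prog="build_excel.py",
        description="Scrape public URLs or load saved JSON, then export.",
    )
    # The two input sources are treated as mutually exclusive here.
    # Assumption: with neither flag given, the real CLI may fall back to
    # an interactive mode, so the group is not marked as required.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--urls", nargs="+", metavar="URL",
                        help="one or more public URLs to scrape")
    source.add_argument("--json", metavar="PATH",
                        help="previously saved scrape data")
    # Assumption: "excel" as the default and the set of choices are guesses.
    parser.add_argument("--format", choices=["excel", "csv"], default="excel",
                        help="export format")
    parser.add_argument("--output", default=".", metavar="DIR",
                        help="directory for exported files")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```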

📊 Output Formats

Excel Workbook (`.xlsx`) — 8 sheets

| Sheet | Contents |
| --- | --- |
| 📊 Summary | KPI stat cards, per-site overview table, auto-status badges |
| 📄 Raw Data | All headings, paragraphs, and list items with source attribution |
| 📋 Tables | Every HTML table found, reformatted with column headers |
| 🔗 Links | Full link inventory with internal/external classification and domain |
| 📈 Analytics Template | Pre-written Excel formulas + 6 PivotTable configuration recipes |
| ⚡ PowerBI Guide | DAX formulas + step-by-step dashboard layout guide |
| 🔄 PivotTable Guide | Slicer setup, FILTER/UNIQUE/SORT formulas, chart type recommendations |
| 📝 Changelog | Timestamped run history with content hashes for change tracking |
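
The internal/external classification on the Links sheet could work roughly as follows. This is a sketch under a stated assumption: a link counts as internal only when its hostname exactly matches the page's hostname, so subdomains are treated as external. The function name `classify_link` and the dictionary keys are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def classify_link(page_url: str, href: str) -> dict:
    """Resolve href against page_url and label it internal or external."""
    # urljoin turns relative hrefs such as "/about" into absolute URLs.
    absolute = urljoin(page_url, href)
    page_host = (urlparse(page_url).hostname or "").lower()
    link_host = (urlparse(absolute).hostname or "").lower()
    return {
        "url": absolute,
        "domain": link_host,
        # Assumption: exact hostname equality defines "internal".
        "is_external": link_host != page_host,
    }
```

Non-HTTP schemes such as `mailto:` have no hostname and would be reported as external with an empty domain, so a fuller version would probably filter them out first.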

CSV Files (3 files)

| File | Contents |
| --- | --- |
| `summary_[ts].csv` | One row per URL with counts and status |
| `raw_content_[ts].csv` | All content rows with type labels (heading, paragraph, list_item) |
| `links_[ts].csv` | All links with source, text, URL, and external flag |
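
As an illustration of the links file, the sketch below writes rows with the stdlib `csv` module. The column names and the timestamp layout are assumptions (the table only shows `[ts]` and describes the columns in prose), and `write_links_csv` is a hypothetical name.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_links_csv(rows, output_dir, timestamp=None):
    """Write link rows to links_<ts>.csv and return the file path."""
    # Assumption: the real timestamp format is not documented here.
    ts = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"links_{ts}.csv"
    # Assumption: column names inferred from the table description.
    fieldnames = ["source", "text", "url", "is_external"]
    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```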

🛡 Ethics & Compliance

This tool is built around four layers of protection against unethical scraping:

| Layer | Mechanism | What it blocks |
| --- | --- | --- |
| URL Validation | Regex pattern matching | Login pages, dashboards, private/admin paths |
| Domain Blocklist | Exact domain matching | Gmail, Instagram, LinkedIn DMs, Patreon, etc. |
| robots.txt | `urllib.robotparser` | Any path disallowed by the site's crawl policy |
| Paywall Detection | Keyword scan of page HTML | Subscribe walls, members-only gates |
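
The first three layers run before any page is fetched, and could be chained roughly like this. The regex, the blocklisted domains, and the user-agent string are illustrative assumptions, not the project's actual lists. To keep the sketch offline, the robots.txt body is passed in as text; in live use `RobotFileParser.set_url()` and `read()` would fetch it.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumption: illustrative pattern and domains, not the project's real lists.
BLOCKED_PATH = re.compile(r"/(login|signin|admin|dashboard|account)(/|$)", re.I)
BLOCKED_DOMAINS = {"mail.google.com", "instagram.com", "patreon.com"}


def check_url(url, robots_txt, user_agent="WebHarvestPro"):
    """Return (allowed, reason) after the three pre-request checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False, "not an http(s) URL"
    # Layer 1: regex match against the URL path.
    if BLOCKED_PATH.search(parsed.path):
        return False, "private or login path"
    # Layer 2: exact match against the domain blocklist.
    if parsed.hostname.lower() in BLOCKED_DOMAINS:
        return False, "blocklisted domain"
    # Layer 3: the site's own crawl policy.
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(user_agent, url):
        return False, "disallowed by robots.txt"
    return True, "ok"
```

Returning a reason alongside the verdict makes it easy to surface why a URL was skipped in a summary or status column.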

Designed for: public news sites, government data portals, open academic resources, company public pages, open directories.

Not for: paywalled articles, authenticated portals, personal social media pages, or any content where you don't have explicit permission.
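
The paywall layer differs from the others in that it inspects the fetched HTML. A minimal keyword scan might look like the sketch below; the phrase list and the name `looks_paywalled` are assumptions, since this README does not show the project's actual keywords.

```python
import re

# Assumption: illustrative phrases only.
PAYWALL_PHRASES = (
    "subscribe to continue",
    "subscribers only",
    "members only",
    "sign in to read",
)


def looks_paywalled(html: str, min_hits: int = 1) -> bool:
    """Return True when enough paywall phrases appear in the page text."""
    # Crude tag stripping is enough for a keyword scan.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    # Treat "members-only" and "members only" alike, then collapse whitespace.
    text = " ".join(text.replace("-", " ").split())
    hits = sum(1 for phrase in PAYWALL_PHRASES if phrase in text)
    return hits >= min_hits
```

Raising `min_hits` is one way to trade missed paywalls for fewer false positives on pages that merely mention subscriptions.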

🔧 Technical Stack

| Component | Technology | Why |
| --- | --- | --- |
| HTTP client | `requests` | Robust redirects, timeout handling, session management |
| HTML parsing | `BeautifulSoup4` + `lxml` | Fast, fault-tolerant DOM traversal |
| robots.txt | `urllib.robotparser` (stdlib) | Zero extra dependency, standard robots.txt parsing |
| Excel generation | `openpyxl` | Full formatting API: colors, borders, tables, charts |
| CSV export | `csv` (stdlib) | Zero-dependency CSV writing with clean column schema |
| UI | Vanilla JS + HTML/CSS | Zero-dependency browser app, no build step |
| CLI | `argparse` (stdlib) | Self-documenting command interface |
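
To give a feel for the parsing step, here is a small heading extractor. Note the swap: the project uses `BeautifulSoup4` with `lxml`, while this sketch uses the stdlib `html.parser` so that it runs without extra dependencies. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h6 elements in document order."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # tag name of the heading being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (em, a, span) is captured too.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append({"level": int(tag[1]), "text": text})
            self._current = None


def extract_headings(html: str) -> list:
    """Return a list of {"level", "text"} dicts for every heading found."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```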

💼 Skills Demonstrated

A quick reference for hiring managers. Each row names the file that contains the relevant code:

| Skill Area | What's built | File |
| --- | --- | --- |
| Web scraping | Full pipeline: fetch → parse → structure → export | `scraper_engine.py` |
| Data engineering | Structured extraction of tables, links, headings, metadata | `scraper_engine.py` |
| API / HTTP | Rate limiting, redirects, User-Agent headers, timeout handling | `scraper_engine.py` |
| File I/O | Generate formatted multi-sheet Excel and multi-file CSV | `excel_exporter.py`, `csv_exporter.py` |
| Security thinking | Ethics engine with 4 validation layers before any request fires | `scraper_engine.py` |
| Automation | Scheduled re-scraping + content-hash change detection | `index.html` (JS scheduler) |
| Frontend | Responsive single-file browser app, no framework | `index.html` |
| CLI design | Argparse with modes: direct URLs, JSON input, interactive | `build_excel.py` |
| Data viz prep | Pre-written DAX, PivotTable recipes, chart recommendations | `excel_exporter.py` |

See docs/SKILLS_SHOWCASE.md for annotated code walkthroughs of each area.

📦 Installation

git clone https://github.com/LSaiko/Webharvest-Pro.git
cd Webharvest-Pro
pip install -r requirements.txt
python src/build_excel.py --urls https://quotes.toscrape.com

Requirements:

requests>=2.28.0,<3.0.0
beautifulsoup4>=4.11.0,<5.0.0
lxml>=4.9.0,<6.0.0
openpyxl>=3.1.0,<4.0.0

📄 License

MIT — free for personal and commercial use. See LICENSE.

🤝 Contributing

Pull requests welcome. Please ensure any new scraping features:

- Pass the ethics validation pipeline
- Include a docstring explaining what data is collected
- Respect the existing rate-limiting patterns
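
This README does not spell out the rate-limiting pattern, so the following is only one plausible shape for it: a per-host minimum delay between requests. The class name, the 2-second default, and the injectable `clock` and `sleep` parameters are all assumptions made for illustration and testability.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed, record the request time, return seconds slept."""
        host = urlparse(url).hostname or ""
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # The request is considered to start once the delay has elapsed.
        self._last_request[host] = now + delay
        return delay
```

A new scraping feature would call `limiter.wait(url)` immediately before each HTTP request, so that different hosts are not slowed down by each other.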

Built as a portfolio project to demonstrate end-to-end data engineering: collection, transformation, and analytics-ready export.
