webscraping_CE_EU

Python utilities that scrape the European Circular Economy Stakeholder Platform for published good-practice case studies and generate quick visual summaries of the collected data. The repository currently contains two scripts: a scraper that walks through the paginated directory and normalizes each listing into a tabular dataset, and an analysis helper that produces count plots for the categorical attributes.

Repository contents

- scrape_ce.py: pulls listing cards from every page, extracts a consistent set of attributes, and saves them into good_practices.csv.
- analyze.py: reads the CSV output and stores distribution plots for each major column in plots/.
- plots/: destination folder for generated images; populated after running analyze.py.

Each scraped record contains the following fields:

| Column | Description |
| --- | --- |
| `Title` | Title of the good-practice entry on the platform. |
| `Description` | Short abstract/summary provided by the publisher. |
| `Link` | Canonical URL on the Circular Economy site. |
| `Organisation` | Reporting organisation/company. |
| `Type of Organisation` | Contributor category as classified by the platform. |
| `Country` | Country associated with the practice. |
| `Language` | Main language of the published material. |
| `Key Area` | Platform key area tags (comma-separated when multiple apply). |
| `Sector` | Listed sectors (comma-separated). |
| `Scope` | Geographic scope tags (comma-separated). |
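
Because `Key Area`, `Sector`, and `Scope` can hold several tags in one comma-separated cell, code that consumes the CSV usually has to split them first. The following is a minimal standard-library sketch of that step. The helper names `split_tags` and `load_records` are illustrative and not part of the repository, and the sample row is made up; it assumes the CSV header uses the column names listed above.

```python
import csv
import io

# Columns described above as comma-separated (assumption: same header names in the CSV).
MULTI_VALUE_COLUMNS = ("Key Area", "Sector", "Scope")


def split_tags(cell):
    """Split a comma-separated cell into a list of stripped, non-empty tags."""
    return [tag.strip() for tag in (cell or "").split(",") if tag.strip()]


def load_records(handle):
    """Read CSV rows from a file-like object, expanding multi-value columns into lists."""
    records = []
    for row in csv.DictReader(handle):
        for column in MULTI_VALUE_COLUMNS:
            if column in row:
                row[column] = split_tags(row[column])
        records.append(row)
    return records


# Tiny in-memory example; the row is invented for illustration.
sample = io.StringIO(
    "Title,Country,Sector\n"
    'Example practice,Belgium,"Textiles, Plastics"\n'
)
records = load_records(sample)
print(records[0]["Sector"])
```

Note that this naive split would also break apart a single tag that itself contains a comma, so it is worth checking a few rows by hand.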

Requirements

Python 3.9+ with pip
Install dependencies via:

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt

Usage

Scrape the dataset
```
python scrape_ce.py
```
- The script automatically loops through every paginated listing until an empty page is reached (hard stop at page 80 to avoid infinite loops).
- Command-line flags let you change retries, delays, destination file, and more (see CLI reference).
- Output: good_practices.csv in the repository root by default.
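
The stop condition described above (crawl until an empty page, with a hard cap) can be sketched as follows. This is not the code in scrape_ce.py: `crawl` and `fetch_page` are hypothetical names, zero-based page numbering is an assumption, and the page fetcher is passed in as a callable so the loop can be tried without network access.

```python
MAX_PAGES = 80  # hard cap mentioned above


def crawl(fetch_page, max_pages=MAX_PAGES, skip_pages=()):
    """Collect listings page by page until a page comes back empty.

    fetch_page is a hypothetical callable: page number -> list of records.
    Page numbers start at 0 here, which is an assumption about the site.
    """
    records = []
    for page in range(max_pages):
        if page in skip_pages:
            continue  # explicitly skipped pages are never requested
        listings = fetch_page(page)
        if not listings:  # an empty page marks the end of the directory
            break
        records.extend(listings)
    return records


# Stand-in for the real site: two pages with content, then an empty one.
fake_site = {0: ["a", "b"], 1: ["c"], 2: []}
print(crawl(lambda page: fake_site.get(page, [])))
```

The cap guarantees termination even if the site never returns an empty page.
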

Generate quick visualizations
```
python analyze.py
```
- Creates the plots/ folder if it does not exist.
- Saves one horizontal count plot per categorical column requested through --columns (defaults match the column table above). Each file is named <Column>_distribution.png.
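
A count plot is a bar chart of value frequencies, so the data behind each image is a simple tally per column. The sketch below shows only that tally and the file-naming pattern, using the standard library; it does not draw anything and is not taken from analyze.py. The names `value_counts` and `plot_path` are illustrative, the sample rows are made up, and whether the real script keeps spaces in file names is not stated in this README.

```python
from collections import Counter
from pathlib import Path


def value_counts(rows, column, multi_value=False):
    """Tally the values of one column across a list of row dicts.

    With multi_value=True, comma-separated cells count once per tag.
    """
    counts = Counter()
    for row in rows:
        cell = row.get(column) or ""
        values = cell.split(",") if multi_value else [cell]
        for value in values:
            value = value.strip()
            if value:
                counts[value] += 1
    return counts


def plot_path(output_dir, column):
    """Build an output path following the <Column>_distribution.png pattern."""
    return Path(output_dir) / f"{column}_distribution.png"


# Invented rows, for illustration only.
rows = [
    {"Country": "Belgium", "Sector": "Textiles, Plastics"},
    {"Country": "Italy", "Sector": "Plastics"},
    {"Country": "Belgium", "Sector": ""},
]
print(value_counts(rows, "Country"))
print(value_counts(rows, "Sector", multi_value=True))
print(plot_path("plots", "Country"))
```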

CLI reference

| Script | Useful flags |
| --- | --- |
| `scrape_ce.py` | `--output <path>` choose another CSV destination; `--max-pages <n>` limit crawl depth; `--skip-page <n>` skip specific page numbers (repeatable flag); `--retries`/`--delay` tune retry policy; `--log-level DEBUG` increase verbosity. |
| `analyze.py` | `--input <csv>` point to another dataset; `--output-dir <folder>` control where charts go; `--columns Country Language ...` limit the generated plots; `--log-level DEBUG` for troubleshooting. |
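
For readers unfamiliar with repeatable flags, here is one way the `scrape_ce.py` options listed above could be declared with `argparse`. It is a sketch, not the script's actual parser: the flag names come from the table, the `good_practices.csv` and 80-page defaults come from the Usage section, and the defaults for `--retries`, `--delay`, and `--log-level` are assumptions.

```python
import argparse


def build_parser():
    """Declare the scraper flags from the table above (defaults partly assumed)."""
    parser = argparse.ArgumentParser(description="Scrape good-practice listings.")
    parser.add_argument("--output", default="good_practices.csv",
                        help="CSV destination")
    parser.add_argument("--max-pages", type=int, default=80,
                        help="limit crawl depth")
    # action="append" collects every occurrence of the flag into one list.
    parser.add_argument("--skip-page", type=int, action="append", default=[],
                        help="page number to skip (repeatable)")
    parser.add_argument("--retries", type=int, default=3,   # assumed default
                        help="attempts per page")
    parser.add_argument("--delay", type=float, default=1.0,  # assumed default
                        help="seconds to wait between attempts")
    parser.add_argument("--log-level", default="INFO",       # assumed default
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


args = build_parser().parse_args(["--skip-page", "3", "--skip-page", "7", "--delay", "2.5"])
print(args.skip_page, args.delay, args.output)
```

On the command line, the same invocation would read `python scrape_ce.py --skip-page 3 --skip-page 7 --delay 2.5`.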

Testing

Automated tests cover the parsing logic and plot generation helpers. Run them with:

pytest

Notes and tips

- Respect the website's traffic limits. The CLI exposes --delay and --retries so you can slow down the crawl for politeness.
- Inspect good_practices.csv before sharing: it contains the full description text from the site.
- If the platform's markup changes, update the CSS selectors inside get_good_practices accordingly.
- Both scripts log progress and warnings to stdout/stderr. Use --log-level DEBUG for detailed traces or review the messages to diagnose failed pages.
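
To illustrate how a delay and a retry count typically interact, the sketch below wraps an arbitrary fetch function and pauses between failed attempts. It is an assumption about the general pattern, not the scraper's implementation: `fetch_with_retries` is a hypothetical name, `retries` is treated here as the total number of attempts, and the fetch and sleep functions are injected so the example runs without network access or real waiting.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing `delay` seconds between failed attempts.

    `retries` is the total number of attempts (an assumption for this sketch).
    Raises RuntimeError once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < retries:
                sleep(delay)  # wait before the next attempt, not after the last
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Stand-in fetcher that fails twice, then succeeds.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise OSError("temporary failure")
    return "<html>ok</html>"


pauses = []  # records requested delays instead of actually sleeping
print(fetch_with_retries(flaky_fetch, "https://example.invalid/page", delay=0.5, sleep=pauses.append))
```

A larger `delay` spaces requests further apart, which is the simplest way to keep the load on the site low.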

License

Distributed under the terms of the MIT License. See LICENSE for details.
