This scraper pulls article data from:
https://corporate.exxonmobil.com/locations/mozambique/mozambique-newsroom
It is built for the date range:
- Start: 2017-03-01
- End: 2026-12-31
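A minimal sketch of an inclusive check against this date range, assuming publish dates are available as ISO strings (YYYY-MM-DD). The names `START_DATE`, `END_DATE`, and `in_range` are illustrative and not taken from the script.

```python
from datetime import date

# Range boundaries from this README; the constant names are hypothetical.
START_DATE = date(2017, 3, 1)
END_DATE = date(2026, 12, 31)


def in_range(published: str) -> bool:
    """Return True if an ISO date string falls inside the range (inclusive)."""
    try:
        parsed = date.fromisoformat(published)
    except ValueError:
        # In this sketch, unparseable dates are treated as out of range.
        return False
    return START_DATE <= parsed <= END_DATE
```

Both boundary dates count as inside the range, so an article published on 2017-03-01 or 2026-12-31 would be kept.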
Output files:
- output/exxon_mozambique_news_2017_2026.json
- output/exxon_mozambique_news_2017_2026.csv
- output/exxon_mozambique_keyword_hits.json
- output/exxon_mozambique_keyword_paragraph_hits.json
- output/exxon_mozambique_keyword_paragraph_hits.csv
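Writing a JSON and CSV pair like the news files could look roughly like the sketch below. It uses only the standard library and assumes each record is a plain dict; `write_outputs` and its parameters are hypothetical names, and the script itself may organize this differently.

```python
import csv
import json
from pathlib import Path


def write_outputs(records, out_dir="output", stem="exxon_mozambique_news_2017_2026"):
    """Write records to <stem>.json and <stem>.csv inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # Use the union of keys, in first-seen order, as the CSV columns.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    csv_path = out / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        for rec in records:
            # Lists (for example matched keywords) are joined to fit one cell.
            row = {
                k: "; ".join(map(str, v)) if isinstance(v, list) else v
                for k, v in rec.items()
            }
            writer.writerow(row)
    return json_path, csv_path
```

JSON keeps nested lists as they are, while the CSV flattens them into a single delimited cell so the file opens cleanly in a spreadsheet.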
Each record includes:
- article title
- article URL
- published date
- article type
- read time
- location tag
- summary bullets
- matched keywords
- keyword hit count
- keyword snippets
- paragraph-level keyword hits with article link
- extracted body text
- full raw page text in the JSON output
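One way to picture a record with these fields is the dataclass sketched below, together with a simple case-insensitive scan that fills the keyword fields. All field and function names here are hypothetical; they mirror the list above and are not the script's actual schema.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ArticleRecord:
    # Hypothetical field names mirroring the record contents listed above.
    title: str
    url: str
    published_date: str
    body_text: str
    article_type: str = ""
    read_time: str = ""
    location_tag: str = ""
    raw_page_text: str = ""
    summary_bullets: list = field(default_factory=list)
    matched_keywords: list = field(default_factory=list)
    keyword_hit_count: int = 0
    keyword_snippets: list = field(default_factory=list)
    paragraph_hits: list = field(default_factory=list)


def scan_keywords(record, keywords, context=40):
    """Fill the keyword fields from case-insensitive matches in body_text."""
    record.matched_keywords = []
    record.keyword_snippets = []
    record.keyword_hit_count = 0
    text = record.body_text
    for kw in keywords:
        hits = list(re.finditer(re.escape(kw), text, flags=re.IGNORECASE))
        if not hits:
            continue
        record.matched_keywords.append(kw)
        record.keyword_hit_count += len(hits)
        for match in hits:
            # Keep a little surrounding text as the snippet.
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            record.keyword_snippets.append(text[start:end].strip())
    return record
```

The scan here is a plain substring match, so a term such as "crisis" would also match inside a longer word. Adding word boundaries would be a reasonable refinement.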
Setup:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium

Run the scraper:
python scrape_exxon_mozambique.py

To override the default scan terms:
python scrape_exxon_mozambique.py --keywords conflict "force majeure" crisis

Notes:
- The newsroom uses a "Load More" interface, so the script uses Playwright instead of plain requests for URL discovery.
- Article extraction is heuristic-based. If ExxonMobil changes the HTML structure, selectors may need a small update.
- The script filters by article publish date after fetching each page.
- A separate output/exxon_mozambique_keyword_hits.json file is written with only the articles that matched your scan terms.
- Paragraph-level matches are also exported so you can review the exact paragraph containing each keyword alongside the article URL.
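The `--keywords` override and the paragraph-level export described above can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `DEFAULT_KEYWORDS` is a placeholder (the real default scan terms are not listed here), `paragraph_hits` is a hypothetical helper, and paragraphs are assumed to be separated by blank lines in the extracted body text.

```python
import argparse
import re

# Placeholder default; the script's real default scan terms are not listed here.
DEFAULT_KEYWORDS = ["example term"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Keyword scan sketch")
    # nargs="+" accepts one or more terms; a quoted phrase stays one term.
    parser.add_argument("--keywords", nargs="+", default=DEFAULT_KEYWORDS)
    return parser.parse_args(argv)


def paragraph_hits(url, body_text, keywords):
    """Return one row per (paragraph, keyword) match, tagged with the URL."""
    rows = []
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body_text) if p.strip()]
    for index, paragraph in enumerate(paragraphs, start=1):
        lowered = paragraph.lower()
        for kw in keywords:
            if kw.lower() in lowered:
                rows.append(
                    {
                        "url": url,
                        "paragraph_index": index,
                        "keyword": kw,
                        "paragraph": paragraph,
                    }
                )
    return rows
```

Because each row carries the article URL, the rows can be written straight to a flat file and reviewed without opening the full article records.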