This project automates web scraping, text preprocessing, and linguistic analysis for a set of articles. It combines Natural Language Processing (NLP) techniques and readability metrics to generate structured insights in an Excel report.
- Automated scraping of article content from URLs
- Text cleaning and tokenization
- Sentiment analysis (positive, negative, polarity, subjectivity)
- Readability metrics (average sentence length, fog index, percentage of complex words)
- Lexical and grammatical features (word count, syllables per word, personal pronouns, average word length)
- Outputs a comprehensive report in Output.xlsx
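The readability metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in main.py: the function names `count_syllables` and `readability` are hypothetical, the syllable counter is a rough vowel-group heuristic, a "complex word" is assumed to mean three or more syllables, and the Fog Index uses the standard Gunning form, 0.4 × (average sentence length + percentage of complex words).

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" (as in "make") as non-syllabic, but keep "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Compute average sentence length, percentage of complex words, and Fog Index."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "pct_complex_words": 0.0, "fog_index": 0.0}

    # Assumption: a word is "complex" when it has three or more syllables.
    complex_count = sum(1 for w in words if count_syllables(w) >= 3)

    avg_sentence_length = len(words) / len(sentences)
    pct_complex_words = 100 * complex_count / len(words)
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex_words,
        "fog_index": fog_index,
    }


if __name__ == "__main__":
    print(readability("The cat sat. The analysis is complicated."))
```

If the project defines the percentage of complex words as a fraction rather than a value out of 100, the Fog Index will come out on a different scale, so it is worth checking which convention main.py follows before comparing numbers.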
Open a terminal in the project directory and run:

    pip install pandas requests beautifulsoup4 nltk openpyxl

- Place Input.xlsx in the same folder as main.py.
- Ensure the folders MasterDictionary/ and StopWords/ are present in the project directory.

Then run the script:

    python main.py

- Scraped articles are saved in the articles/ folder.
- Final processed report is generated as Output.xlsx.
- pandas: Data manipulation
- requests: HTTP requests
- beautifulsoup4: HTML parsing & scraping
- nltk: Text preprocessing, tokenization, sentiment resources
- openpyxl: Excel report generation
- Scraped articles (raw text)
- Processed metrics (sentiment + readability)
- Final report (Output.xlsx)
- End-to-end NLP pipeline: scraping → preprocessing → analysis → reporting
- Uses lexical dictionaries for sentiment classification
- Implements Fog Index and other linguistic complexity metrics
- Condenses each article into comparable numeric metrics, so the same analysis scales to large sets of articles
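As an illustration of dictionary-based sentiment scoring, the sketch below counts how many cleaned tokens appear in a positive and a negative word set. It is an assumption-laden example rather than the project's actual code: the names `tokenize` and `sentiment_scores` are hypothetical, the word sets stand in for the files under MasterDictionary/ and StopWords/, and the polarity and subjectivity formulas are one common convention that main.py may or may not share.

```python
import re


def tokenize(text: str, stop_words=frozenset()) -> list:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


def sentiment_scores(text: str, positive, negative, stop_words=frozenset()) -> dict:
    """Score text against positive and negative word sets (hypothetical helper)."""
    tokens = tokenize(text, stop_words)
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)

    eps = 1e-6  # avoids division by zero when no words match
    # Assumed formulas: polarity lies in [-1, 1], subjectivity in [0, 1].
    polarity = (pos - neg) / (pos + neg + eps)
    subjectivity = (pos + neg) / (len(tokens) + eps)
    return {
        "positive": pos,
        "negative": neg,
        "polarity": polarity,
        "subjectivity": subjectivity,
    }


if __name__ == "__main__":
    # Tiny stand-in word lists, for illustration only.
    positive_words = {"good", "great"}
    negative_words = {"bad"}
    stops = {"the", "is", "and", "but"}
    print(sentiment_scores(
        "The product is good and great but the support is bad.",
        positive_words, negative_words, stops,
    ))
```

Using sets for the word lists keeps each membership check constant-time, which matters once the dictionaries hold thousands of entries and the pipeline runs over many articles.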