Professional Python project for Web Mining and Applied NLP.
Web Mining and Applied NLP focus on retrieving, processing, and analyzing text from the web and other digital sources. This course builds those capabilities through working projects.
In the age of generative AI, durable skills are grounded in real work: setting up a professional environment, reading and running code, understanding the logic, and pushing work to a shared repository. Each project follows a similar structure based on professional Python projects. These projects are hands-on textbooks for learning Web Mining and Applied NLP.
This project introduces text preprocessing.
The goal is to copy this repository, set up your environment, run the example analysis, and explore how raw text is cleaned and prepared for natural language processing.
You will run the example pipeline, read the code, and make small modifications to understand how the preprocessing workflow works.
The example pipeline reads text records from a file in data/.
We use Python to preprocess the text by applying steps such as tokenization, normalization, punctuation removal, and stop word filtering. The results show how raw text changes as it moves through the preprocessing pipeline.
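The steps above can be sketched with only the standard library. This is a minimal illustration, not the project's actual pipeline, and the small stop word set here is a placeholder (real pipelines use larger lists, such as spaCy's):

```python
import re

# Hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "one"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize, and filter stop words."""
    text = text.lower()                  # normalization
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
    tokens = text.split()                # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word filtering

print(preprocess("The quick, brown fox is one of a kind!"))
# ['quick', 'brown', 'fox', 'kind']
```

Each stage transforms the text a little further; printing the intermediate values is a good way to see how a raw record becomes a clean token list.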
You'll work with just these areas:
- notebooks/ - Jupyter notebooks for exploration
- src/nlp/ - Python code (verifies .venv/)
- pyproject.toml - update authorship, links, and dependencies
- zensical.toml - update authorship and links
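Updating authorship in pyproject.toml means editing the standard PEP 621 `[project]` fields. A hedged sketch (the names, version, and email are placeholders; keep the project name that is already in your file):

```toml
[project]
name = "nlp-02-text-preprocessing"
version = "0.1.0"
authors = [{ name = "Your Name", email = "you@example.com" }]
```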
Follow the step-by-step workflow guide to complete:
- Phase 1. Start & Run
- Phase 2. Change Authorship
- Phase 3. Read & Understand
- Phase 4. Technical Modifications
- Phase 5. Apply Skills to a New Problem
Challenges are expected. Sometimes instructions may not quite match your operating system. When issues occur, share screenshots, error messages, and details about what you tried. Working through issues is an important part of implementing professional projects.
After completing Phase 1. Start & Run, you'll have your own GitHub project running on your machine, and running the example will print:

========================
Pipeline executed successfully!
========================

A new file named project.log will appear in the project folder.
The commands below are used in the workflow guide above. They are provided here for convenience.
Follow the guide for the full instructions.
After you get a copy of this repo in your own GitHub account,
open a machine terminal in your Repos folder:
# Replace username with YOUR GitHub username.
git clone https://github.com/RucuAvinash/nlp-02-text-preprocessing
cd nlp-02-text-preprocessing
code .
uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade
uvx pre-commit install
git add -A
uvx pre-commit run --all-files
# Later, we install the spaCy data model
# en_core_web_sm = English, core, web, small
# It's big: spaCy + data is ~200+ MB with the model installed
# ~350-450 MB for .venv is normal for NLP
# uv run python -m spacy download en_core_web_sm
# First, run the module
# IMPORTANT: Close each figure after viewing so execution continues
uv run python -m nlp.text_preprocessing_Rucu
# Then, open the notebook.
# IMPORTANT: Select the kernel and Run All:
# notebooks/text_preprocessing_case.ipynb
uv run ruff format .
uv run ruff check . --fix
uv run zensical build
git add -A
git commit -m "update"
git push -u origin main

- Use the UP ARROW and DOWN ARROW in the terminal to scroll through past commands.
- Use CTRL+F to find (and replace) text within a file.
