
CSroseX/olx-pyhton-scraper

Financial Data Processing & Web Data Collection Suite

A comprehensive dual-module solution for automated financial data extraction and manual web marketplace data collection. This project demonstrates expertise in shell scripting for API integration, Python development for data management, and practical approaches to structured data handling in environments where traditional web scraping is constrained.

The suite addresses real-world challenges in data acquisition from diverse sources, employing bash scripting with curl for RESTful API consumption, Python for interactive CLI applications, and multi-format data serialization (JSON, CSV, TSV) for downstream analysis and database ingestion.

Key Technical Learnings:

  • RESTful API consumption and data transformation using shell utilities (curl, awk, sed)
  • Python-based interactive CLI development with structured data output
  • Multi-format data serialization strategies for analytical workflows
  • Error handling and data validation in production-grade shell scripts
  • File I/O operations with proper encoding management for international character sets

Tech Stack

  • Bash 4.0+ with curl and GNU awk/sed (AMFI NAV Data Extractor)
  • Python 3.6+ standard library only: json, csv, datetime (OLX Data Collector)

Project Modules

1. AMFI NAV Data Extractor

A bash-based automated pipeline for retrieving and processing daily Net Asset Value (NAV) data from the Association of Mutual Funds in India (AMFI) API. Implements ETL (Extract, Transform, Load) principles to convert semicolon-delimited raw data into analysis-ready tab-separated values.

Technical Highlights:

  • HTTP request handling via curl with error management
  • Stream processing with AWK for pattern matching and field extraction
  • Regular expression-based data validation
  • TSV output optimized for database ingestion and spreadsheet analysis
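The AWK transformation step can be sketched as follows. The field layout mirrors the typical semicolon-delimited AMFI NAVAll feed (Scheme Code;ISIN;ISIN;Scheme Name;NAV;Date), but the exact field positions, validation rules, and the sample record below are assumptions for illustration, not taken from data_extractor.sh:

```shell
# Hypothetical sketch: transform one semicolon-delimited AMFI-style record
# into the scheme-name + NAV TSV format described above. A real run would
# pipe the curl output of the NAVAll feed into the same awk program.
sample='119551;INF209K01157;-;Sample Liquid Fund - Growth;345.6789;01-Jan-2024'

printf '%s\n' "$sample" |
awk -F';' '
  # Keep only data rows: at least 5 fields and a numeric NAV in field 5.
  NF >= 5 && $5 ~ /^[0-9.]+$/ { print $4 "\t" $5 }
'
```

The regex guard on field 5 is what filters out the feed's header and section-title lines, which lack a numeric NAV.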

2. OLX Marketplace Data Collector

A Python-based interactive command-line interface (CLI) for structured collection and storage of e-commerce listing metadata from OLX India. Designed for scenarios where DOM parsing restrictions or anti-bot measures necessitate manual data entry.

Technical Highlights:

  • Object-oriented data modeling with Python dictionaries
  • Dual-format export (JSON with ISO 8601 timestamps, CSV with DictWriter)
  • UTF-8 encoding support for multi-language marketplace data
  • Interactive user prompts with input validation and flow control
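The dual-format export described above can be sketched as follows. The field names follow the README's "Data Fields Collected" list and the file-name pattern matches the documented olx_manual_data_YYYYMMDD_HHMMSS outputs, but the exact schema and JSON layout in web_scraper.py may differ:

```python
import csv
import json
from datetime import datetime

# Hypothetical listing records; field names mirror the README's
# "Data Fields Collected" list, not the actual web_scraper.py schema.
listings = [
    {"Title": "Used bicycle", "Price": "₹3,500", "Location": "Pune",
     "Date": "Today", "URL": "https://www.olx.in/item/example",
     "Description": "Good condition"},
]

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# JSON export: wrap the records with an ISO 8601 collection timestamp.
with open(f"olx_manual_data_{stamp}.json", "w", encoding="utf-8") as f:
    json.dump({"collected_at": datetime.now().isoformat(),
               "listings": listings},
              f, ensure_ascii=False, indent=2)

# CSV export via DictWriter; utf-8 (ensure_ascii=False above, too)
# preserves multi-language marketplace text such as the ₹ symbol.
with open(f"olx_manual_data_{stamp}.csv", "w", encoding="utf-8",
          newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(listings[0]))
    writer.writeheader()
    writer.writerows(listings)
```

Both writers share the same timestamp, so each collection session produces one matched JSON/CSV pair.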

Installation & Dependencies

Prerequisites

# For AMFI NAV Data Extractor (Linux/macOS/WSL)
- bash (version 4.0+)
- curl
- awk (GNU awk recommended)

# For OLX Data Collector
- Python 3.6 or higher

Setup

# Clone the repository
git clone <repository-url>
cd temp_assignment

# Verify bash script permissions
chmod +x data_extractor/data_extractor.sh

# Python dependencies are part of standard library (json, csv, datetime)
# No additional pip installations required

Usage

AMFI NAV Data Extractor

Navigate to the data extraction module and execute the shell script:

cd data_extractor
./data_extractor.sh

Output:

  • amfi_nav_data.tsv - Tab-separated file containing scheme names and NAV values
  • Console displays record count and sample data preview

Use Cases:

  • Daily mutual fund performance tracking
  • Historical NAV database population
  • Financial analysis and reporting pipelines
  • Integration with data visualization tools

OLX Data Collector

Navigate to the web scraping module and run the Python script:

cd "olx scrapper"
python web_scraper.py

Interactive Menu Options:

  1. Manual Entry - Enter listing data interactively via CLI prompts
  2. Load Sample Data - Generate sample output files for testing
  3. Instructions - Display browser console snippet for advanced users

Output Files:

  • olx_manual_data_YYYYMMDD_HHMMSS.json - Structured JSON with metadata and ISO timestamps
  • olx_manual_data_YYYYMMDD_HHMMSS.csv - Comma-separated format for spreadsheet import

Data Fields Collected:

  • Title, Price, Location, Date, URL, Description

Use Cases:

  • Market research and competitive analysis
  • Price tracking and trend analysis
  • Database population for e-commerce analytics
  • Training datasets for machine learning models

Project Structure

temp_assignment/
├── data_extractor/
│   ├── data_extractor.sh       # Bash ETL pipeline for AMFI data
│   ├── amfi_nav_data.tsv       # Generated NAV dataset
│   └── readme.txt              # Module-specific documentation
├── olx scrapper/
│   ├── web_scraper.py          # Python CLI data collector
│   ├── olx_manual_data_*.json  # Generated JSON datasets
│   ├── olx_manual_data_*.csv   # Generated CSV datasets
│   └── readme.txt              # Module-specific documentation
└── README.md                    # This file

Technical Architecture

Data Pipeline Flow - AMFI Module:

AMFI API → HTTP GET (curl) → Raw Semicolon-Delimited Data → 
AWK Stream Processing → Field Extraction & Validation → 
TSV Output → File System Storage

Data Collection Flow - OLX Module:

User Input (CLI) → Python Dictionary Structures → 
Validation & Storage → Dual Serialization (JSON + CSV) → 
Timestamped File Output

Future Enhancements

  • Database integration (PostgreSQL/MySQL) for persistent storage
  • RESTful API wrapper for programmatic data access
  • Automated scheduling with cron jobs for daily NAV updates
  • Enhanced data validation with pandas DataFrames
  • Web dashboard for data visualization using Flask/FastAPI
  • Selenium-based automation for OLX data extraction (where permitted)

License

This project is available for portfolio and educational purposes.


Developed with focus on clean code architecture, data integrity, and production-ready error handling.
