A comprehensive dual-module solution for automated financial data extraction and manual web marketplace data collection. This project demonstrates expertise in shell scripting for API integration, Python development for data management, and practical approaches to structured data handling in environments where traditional web scraping is constrained.
The suite addresses real-world challenges in data acquisition from diverse sources, employing bash scripting with curl for RESTful API consumption, Python for interactive CLI applications, and multi-format data serialization (JSON, CSV, TSV) for downstream analysis and database ingestion.
Key Technical Learnings:
- RESTful API consumption and data transformation using shell utilities (curl, awk, sed)
- Python-based interactive CLI development with structured data output
- Multi-format data serialization strategies for analytical workflows
- Error handling and data validation in production-grade shell scripts
- File I/O operations with proper encoding management for international character sets
A bash-based automated pipeline for retrieving and processing daily Net Asset Value (NAV) data from the Association of Mutual Funds in India (AMFI) API. Implements ETL (Extract, Transform, Load) principles to convert semicolon-delimited raw data into analysis-ready tab-separated values.
Technical Highlights:
- HTTP request handling via curl with error management
- Stream processing with AWK for pattern matching and field extraction
- Regular expression-based data validation
- TSV output optimized for database ingestion and spreadsheet analysis
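The module itself is implemented in bash with curl and awk, but the extract-and-validate logic can be sketched in Python for readers who prefer it. The sample line and the field positions (scheme name in field 4, NAV in field 5 of AMFI's semicolon-delimited NAVAll format) are assumptions for illustration:

```python
# Equivalent of the curl + awk pipeline, sketched in Python.
# Assumes AMFI's NAVAll layout: semicolon-delimited records with the
# scheme name in field 4 and the NAV in field 5 (1-based); header and
# section-title lines are skipped because they fail the field count
# or the numeric NAV check.
import re

SAMPLE = """Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Date
Open Ended Schemes(Debt Scheme - Banking and PSU Fund)
119551;INF209KA12Z1;INF209KA13Z9;Aditya Birla Sun Life Banking & PSU Debt Fund;306.84;20-Sep-2024
"""

def extract_nav(raw: str) -> list[tuple[str, str]]:
    """Return (scheme_name, nav) pairs, validating NAV as a decimal number."""
    rows = []
    for line in raw.splitlines():
        fields = line.split(";")
        if len(fields) < 6:                     # skip section headers / blanks
            continue
        name, nav = fields[3].strip(), fields[4].strip()
        if re.fullmatch(r"\d+(\.\d+)?", nav):   # regex-based data validation
            rows.append((name, nav))
    return rows

if __name__ == "__main__":
    for name, nav in extract_nav(SAMPLE):
        print(f"{name}\t{nav}")                 # TSV output
```

The header row is rejected by the NAV regex rather than by position, which mirrors how the awk script can stay robust against section titles interleaved in the feed.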
A Python-based interactive command-line interface (CLI) for structured collection and storage of e-commerce listing metadata from OLX India. Designed for scenarios where DOM parsing restrictions or anti-bot measures necessitate manual data entry.
Technical Highlights:
- Structured data modeling with Python dictionaries
- Dual-format export (JSON with ISO 8601 timestamps, CSV with DictWriter)
- UTF-8 encoding support for multi-language marketplace data
- Interactive user prompts with input validation and flow control
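The dual-format export described above can be sketched with only the standard library. The field names and the sample schema below are illustrative assumptions; the file naming follows the `olx_manual_data_YYYYMMDD_HHMMSS` pattern used by the actual script:

```python
# Minimal sketch of the dual-format (JSON + CSV) export step.
# The FIELDS schema is assumed for illustration; the real script
# defines its own field set.
import csv
import json
from datetime import datetime

FIELDS = ["title", "price", "location", "date", "url", "description"]

def export_listings(listings: list[dict], basename: str) -> tuple[str, str]:
    """Write listings to timestamped JSON and CSV files; return both paths."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    json_path = f"{basename}_{stamp}.json"
    csv_path = f"{basename}_{stamp}.csv"

    # JSON: wrap the records with metadata and an ISO 8601 timestamp
    payload = {"collected_at": datetime.now().isoformat(), "listings": listings}
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

    # CSV: DictWriter keeps the columns in a fixed, documented order
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(listings)

    return json_path, csv_path
```

`ensure_ascii=False` together with UTF-8 file handles preserves non-ASCII marketplace text (such as ₹ price strings or regional-language titles) verbatim rather than as `\uXXXX` escapes.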
# For AMFI NAV Data Extractor (Linux/macOS/WSL)
- bash (version 4.0+)
- curl
- awk (GNU awk recommended)
# For OLX Data Collector
- Python 3.6 or higher

```bash
# Clone the repository
git clone <repository-url>
cd temp_assignment

# Verify bash script permissions
chmod +x data_extractor/data_extractor.sh

# Python dependencies (json, csv, datetime) are part of the standard
# library; no additional pip installations are required
```

Navigate to the data extraction module and execute the shell script:

```bash
cd data_extractor
./data_extractor.sh
```

Output:
- `amfi_nav_data.tsv` - tab-separated file containing scheme names and NAV values
- Console displays record count and a sample data preview
Use Cases:
- Daily mutual fund performance tracking
- Historical NAV database population
- Financial analysis and reporting pipelines
- Integration with data visualization tools
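Because the output is plain TSV, downstream ingestion needs nothing beyond the standard `csv` module. A minimal sketch, assuming the two-column (scheme name, NAV) layout produced by `data_extractor.sh`; the data is inlined here so the example runs standalone:

```python
# Illustrative downstream step: load the generated TSV and query it.
# The two-column (scheme name, NAV) layout is assumed from the module's
# output description; sample values are made up.
import csv
import io

tsv_text = "Scheme A\t101.25\nScheme B\t98.40\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
navs = {name: float(nav) for name, nav in reader}
print(max(navs, key=navs.get))   # scheme with the highest NAV
```

For a real run, replace the `StringIO` with `open("amfi_nav_data.tsv", encoding="utf-8", newline="")`.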
Navigate to the OLX data collection module and run the Python script:

```bash
cd "olx scrapper"
python web_scraper.py
```

Interactive Menu Options:
- Manual Entry - Enter listing data interactively via CLI prompts
- Load Sample Data - Generate sample output files for testing
- Instructions - Display browser console snippet for advanced users
Output Files:
- `olx_manual_data_YYYYMMDD_HHMMSS.json` - structured JSON with metadata and ISO timestamps
- `olx_manual_data_YYYYMMDD_HHMMSS.csv` - comma-separated format for spreadsheet import
Data Fields Collected:
- Title, Price, Location, Date, URL, Description
Use Cases:
- Market research and competitive analysis
- Price tracking and trend analysis
- Database population for e-commerce analytics
- Training datasets for machine learning models
```
temp_assignment/
├── data_extractor/
│   ├── data_extractor.sh        # Bash ETL pipeline for AMFI data
│   ├── amfi_nav_data.tsv        # Generated NAV dataset
│   └── readme.txt               # Module-specific documentation
├── olx scrapper/
│   ├── web_scraper.py           # Python CLI data collector
│   ├── olx_manual_data_*.json   # Generated JSON datasets
│   ├── olx_manual_data_*.csv    # Generated CSV datasets
│   └── readme.txt               # Module-specific documentation
└── README.md                    # This file
```
Data Pipeline Flow - AMFI Module:

```
AMFI API → HTTP GET (curl) → Raw Semicolon-Delimited Data →
AWK Stream Processing → Field Extraction & Validation →
TSV Output → File System Storage
```
Data Collection Flow - OLX Module:

```
User Input (CLI) → Python Dictionary Structures →
Validation & Storage → Dual Serialization (JSON + CSV) →
Timestamped File Output
```
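The "Validation & Storage" stage can be sketched as a required-field check applied to each dictionary before serialization. The field names and the URL-prefix rule below are hypothetical, chosen only to illustrate the pattern:

```python
# Hypothetical sketch of the validation step: each listing dictionary is
# checked before it is serialized. Field names and the URL rule are
# illustrative, not the actual script's schema.
def validate_listing(listing: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field in ("title", "price", "url"):
        if not listing.get(field, "").strip():
            problems.append(f"missing {field}")
    if listing.get("url") and not listing["url"].startswith("https://www.olx.in"):
        problems.append("url is not an OLX India listing")
    return problems
```

Returning a list of problems (rather than raising on the first failure) lets the interactive CLI report every issue at once and re-prompt the user in a single pass.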
- Database integration (PostgreSQL/MySQL) for persistent storage
- RESTful API wrapper for programmatic data access
- Automated scheduling with cron jobs for daily NAV updates
- Enhanced data validation with pandas DataFrames
- Web dashboard for data visualization using Flask/FastAPI
- Selenium-based automation for OLX data extraction (where permitted)
This project is available for portfolio and educational purposes.
Developed with focus on clean code architecture, data integrity, and production-ready error handling.