
# Big Data Projects Portfolio

**AWS EMR · Hadoop · Spark · HBase · Hive · Python**

## Overview

This repository demonstrates hands-on experience with modern big data technologies and cloud computing infrastructure. Through a series of structured exercises and projects, I've built practical expertise in distributed data processing, NoSQL databases, data warehousing, and real-time analytics using industry-standard tools hosted on AWS Elastic MapReduce (EMR).

## Technical Stack

- **Cloud Platform:** Amazon Web Services (AWS) Elastic MapReduce (EMR)
- **Distributed Processing:** Apache Hadoop MapReduce, Apache Spark (PySpark)
- **Data Warehousing:** Apache Hive (HiveQL)
- **NoSQL Database:** Apache HBase
- **Programming Languages:** Python, SQL
- **Environment:** Jupyter Notebooks, AWS EMR clusters

## Projects & Exercises

### 1. Hadoop MapReduce

View Project Documentation

- Implemented scalable MapReduce jobs for distributed data processing
- Designed custom mapper and reducer functions for large-scale data transformations
- Optimized data partitioning and shuffling strategies for performance
- Processed multi-gigabyte datasets across distributed cluster nodes

**Key Skills:** Distributed computing, parallel processing, data partitioning, performance optimization
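The map and reduce phases above follow the classic Hadoop Streaming pattern. As a minimal sketch (not the repo's actual code), the two phases can be written as pure Python functions, so the same logic runs under `hadoop-streaming` or locally:

```python
"""Word-count sketch in the Hadoop Streaming style. The mapper emits
(word, 1) pairs; the reducer sums counts per key, relying on the
sort-by-key guarantee that Hadoop's shuffle provides."""
from itertools import groupby


def mapper(lines):
    """Map phase: one (word, 1) pair per word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce phase: sum counts per word. Assumes pairs arrive
    sorted by key, as after Hadoop's shuffle/sort."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Hadoop Streaming would invoke this script twice (-mapper and
    # -reducer) over stdin; here the phases are chained on a sample.
    sample = ["the quick brown fox", "the lazy dog"]
    for word, n in reducer(sorted(mapper(sample))):
        print(f"{word}\t{n}")
```

Keeping the phases as plain generators makes them unit-testable without a cluster, with the framework supplying the sort between them.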


### 2. HBase: NoSQL Database Management

View Project Documentation

- Designed and implemented scalable NoSQL database schemas for high-throughput applications
- Performed CRUD operations on large-scale distributed datasets
- Optimized row key design for efficient data retrieval and scan operations
- Managed column families and implemented data versioning strategies

**Key Skills:** NoSQL database design, schema optimization, distributed data storage, data modeling
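Row key design is the heart of HBase performance. As an illustrative sketch (the bucket count and key layout are assumptions, not taken from the repo), a salted, reverse-timestamp key spreads writes across regions while keeping an entity's newest rows first in a prefix scan:

```python
"""Sketch of an HBase row-key scheme: salt | entity | reversed timestamp.
The salt spreads sequential writes across regions; reversing the
timestamp makes newer rows sort (and scan) first."""
import zlib

NUM_SALT_BUCKETS = 16            # illustrative choice
MAX_TS_MILLIS = 10 ** 13         # upper bound on epoch millis


def make_row_key(entity_id: str, ts_millis: int) -> bytes:
    """Compose the full row key for one event of one entity."""
    # crc32 is stable across runs, unlike Python's built-in hash()
    salt = zlib.crc32(entity_id.encode()) % NUM_SALT_BUCKETS
    reversed_ts = MAX_TS_MILLIS - ts_millis
    return f"{salt:02d}|{entity_id}|{reversed_ts:013d}".encode()


def scan_prefix(entity_id: str) -> bytes:
    """Prefix covering all rows of one entity, newest first."""
    salt = zlib.crc32(entity_id.encode()) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{entity_id}|".encode()
```

With a client such as `happybase`, these keys would feed `table.put(make_row_key(...), {...})` for writes and `table.scan(row_prefix=scan_prefix(...))` for reads.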


### 3. Hive: Data Warehousing & Analytics

View Project Documentation

- Executed complex SQL-like queries using HiveQL for data warehousing tasks
- Created and managed partitioned and bucketed tables for optimized query performance
- Implemented ETL pipelines for structured and semi-structured data
- Performed advanced analytical queries including joins, aggregations, and window functions

**Key Skills:** Data warehousing, HiveQL, ETL processes, query optimization, analytical reporting
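Partitioning and bucketing combine in a single table definition. The HiveQL below is a hypothetical example (table and column names are illustrative, not from the repo) showing a partitioned, bucketed table plus a query that uses partition pruning, aggregation, and a window function:

```sql
-- Illustrative schema: partitioned by date, bucketed by customer.
CREATE TABLE IF NOT EXISTS sales (
  order_id  BIGINT,
  customer  STRING,
  amount    DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer) INTO 32 BUCKETS
STORED AS ORC;

-- The WHERE clause on the partition column prunes to one partition's
-- files; RANK() is evaluated over the grouped aggregates.
SELECT customer,
       SUM(amount)                              AS total_spend,
       RANK() OVER (ORDER BY SUM(amount) DESC)  AS spend_rank
FROM sales
WHERE order_date = '2024-01-01'
GROUP BY customer;
```

Bucketing by the join/group key also enables bucket map joins when two tables share the same clustering.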


### 4. Apache Spark: Real-Time & Batch Processing

View Project Documentation

- Built distributed data processing applications using PySpark
- Implemented RDD transformations and actions for efficient data manipulation
- Developed batch processing workflows for large-scale data analysis
- Extracted actionable insights from complex datasets using Spark DataFrames and Spark SQL
- 
**Key Skills:** PySpark, RDD operations, Spark DataFrames, Spark SQL, batch processing, data analysis
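The RDD transformation/action pattern can be sketched as follows. This is a minimal illustration, not the repo's code: `count_words` reproduces the pipeline's semantics in plain Python so the logic is checkable without a cluster, and the guarded `__main__` section assumes a local `pyspark` install:

```python
"""PySpark batch sketch: flatMap -> map -> reduceByKey, with a pure
Python reference implementation of the same aggregation."""
from collections import Counter
from operator import add


def count_words(lines):
    """Reference semantics of the RDD pipeline below."""
    return Counter(w for line in lines for w in line.lower().split())


if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("pyspark not installed; skipping the cluster demo")
    else:
        spark = (SparkSession.builder
                 .master("local[2]").appName("wc-sketch").getOrCreate())
        rdd = spark.sparkContext.parallelize(["hello world", "hello spark"])
        counts = (rdd.flatMap(lambda l: l.lower().split())  # one record per word
                     .map(lambda w: (w, 1))                 # pair with a count
                     .reduceByKey(add))                     # shuffle + sum per key
        print(dict(counts.collect()))                       # action triggers the job
        spark.stop()
```

The same aggregation in the DataFrame API would be `df.groupBy("word").count()`, letting Catalyst optimize the plan instead of hand-tuning RDD operations.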


## Key Accomplishments

- ✅ Deployed and managed multiple AWS EMR clusters for distributed computing
- ✅ Processed and analyzed large-scale datasets (GB to TB range) using distributed frameworks
- ✅ Demonstrated proficiency across the big data ecosystem: storage, processing, and querying
- ✅ Applied industry best practices for data partitioning, optimization, and cluster management
- ✅ Integrated multiple big data technologies into end-to-end data pipelines

## Technologies Deep Dive

| Technology | Use Case | Key Features Implemented |
| --- | --- | --- |
| Hadoop MapReduce | Distributed batch processing | Custom mappers/reducers, data partitioning, shuffling optimization |
| HBase | Real-time NoSQL storage | Row key design, column families, distributed scans, bulk loading |
| Hive | Data warehousing & SQL analytics | Partitioning, bucketing, complex queries, ETL pipelines |
| Spark | In-memory distributed processing | RDD transformations, DataFrames, Spark SQL, batch analytics |
| AWS EMR | Managed Hadoop/Spark clusters | Cluster provisioning, scaling, resource management |
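Provisioning a cluster like those used here can be done from the AWS CLI. The invocation below is illustrative only: the cluster name, release label, instance sizing, and key pair are placeholders, not values from this repo:

```shell
# Illustrative EMR cluster with all four applications installed.
aws emr create-cluster \
  --name "bigdata-portfolio" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Spark Name=Hive Name=HBase \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair
```

`--use-default-roles` assumes the default EMR service and EC2 instance roles already exist in the account (they can be created once with `aws emr create-default-roles`).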

## Learning Outcomes

Through these projects, I developed comprehensive skills in:

- **Distributed Systems Architecture:** Understanding of cluster computing, data locality, and fault tolerance
- **Cloud Infrastructure:** Hands-on experience with AWS EMR for big data workloads
- **Data Pipeline Development:** End-to-end ETL processes from ingestion to analysis
- **Performance Optimization:** Query tuning, data partitioning, and resource management
- **Multi-Technology Integration:** Combining different tools for comprehensive data solutions

## Getting Started

Each project folder contains detailed documentation with:

- Problem statements and objectives
- Implementation approaches and code samples
- Results and performance metrics
- Key learnings and best practices

## Contact & Collaboration

Interested in discussing big data solutions or potential collaboration opportunities?

LinkedIn GitHub


> **Note:** All projects were completed as part of advanced coursework in Big Data Technologies and deployed on AWS EMR infrastructure.

Last Updated: December 2025
