A scalable, fault-tolerant distributed system for analyzing text files using Stanford NLP Parser. Built on AWS infrastructure (EC2, S3, SQS) with a Manager-Worker architecture.
- Fully Scalable: Dynamically scales workers based on workload - handles millions of concurrent requests
- Fault Tolerant: Automatic worker replacement, message retry, and crash detection
- Multi-Client Support: Processes multiple concurrent client requests simultaneously
- Three NLP Analysis Types:
- Part-of-Speech (POS) tagging
- Constituency parsing
- Dependency parsing
- System Architecture
- Quick Start
- Usage
- Implementation Details
- Fault Tolerance
- Threading Model
- Scalability
LocalApp (Client):
- Uploads input file to S3
- Starts Manager if not already running
- Creates unique response queue per client
- Waits for completion and downloads results
- Handles Manager crash detection
Manager:
- Single EC2 instance coordinating all work
- Receives tasks from multiple clients concurrently
- Splits tasks into subtasks (one per URL+analysis pair)
- Dynamic worker scaling: Launches workers based on the `n` ratio
- Collects results and generates HTML summaries
- Automatic worker replacement: Monitors worker health every 60s
Workers:
- Multiple EC2 instances processing tasks in parallel
- Single-threaded design (horizontal scaling, not vertical)
- Downloads text from URLs
- Performs CPU-intensive Stanford NLP parsing
- Uploads results to S3
```
┌─────────────────┐
│ LocalApp Client │
└────────┬────────┘
         │ Upload input → S3
         │ Send "new task"
         ▼
┌─────────────────────────────────────┐
│ manager-input-queue (SQS)           │
└────────┬────────────────────────────┘
         │ Manager polls
         ▼
┌─────────────────────────────────────┐
│ MANAGER                             │
│ • Downloads input from S3           │
│ • Creates subtasks                  │
│ • Launches workers (dynamic)        │
└────────┬────────────────────────────┘
         │ Enqueue subtasks
         ▼
┌─────────────────────────────────────┐
│ worker-tasks-queue (SQS)            │ ← Shared by all workers
└────────┬────────────────────────────┘
         │ Workers poll (distributed)
         ▼
┌─────────────────────────────────────┐
│ WORKERS (1 to 1000s)                │
│ • Download text from URL            │
│ • Perform NLP analysis (CPU)        │
│ • Upload result to S3               │
└────────┬────────────────────────────┘
         │ Send completion
         ▼
┌─────────────────────────────────────┐
│ worker-results-queue (SQS)          │
└────────┬────────────────────────────┘
         │ Manager polls
         ▼
┌─────────────────────────────────────┐
│ MANAGER                             │
│ • Collects results                  │
│ • Generates HTML summary            │
│ • Uploads summary to S3             │
└────────┬────────────────────────────┘
         │ Send "DONE"
         ▼
┌─────────────────────────────────────┐
│ response-{clientId}-queue (SQS)     │ ← Unique per client
└────────┬────────────────────────────┘
         │ Client polls
         ▼
┌─────────────────┐
│ LocalApp Client │
│ Downloads HTML  │
└─────────────────┘
```
- AWS Account with appropriate permissions (EC2, S3, SQS)
- Java 17+
- Maven 3.6+
- IAM Role with EC2, S3, and SQS permissions
```
mvn clean package
```

This generates:

- `target/manager.jar` - Manager application
- `target/worker.jar` - Worker application
- `target/local-application.jar` - Client application
1. Launch an EC2 instance (Amazon Linux 2023, t3.micro)

2. Install Java 17

3. Upload the JARs to `/opt/dsp-app/` on the instance:

   ```
   sudo mkdir -p /opt/dsp-app
   sudo cp manager.jar worker.jar /opt/dsp-app/
   ```

4. Create an AMI from the instance

5. Update the AMI ID in `LocalApplication.java`:

   ```java
   private static final String AMI_ID = "ami-xxxxxxxxx"; // Your AMI ID
   ```

6. Rebuild the LocalApplication:

   ```
   mvn clean package
   ```
```
java -jar target/local-application.jar <inputFile> <outputFile> <n> [terminate]
```

| Parameter | Description | Example |
|---|---|---|
| `inputFile` | Input file with URLs and analysis types | `input.txt` |
| `outputFile` | Output HTML file path | `output.html` |
| `n` | Files-per-worker ratio | `10` (1 worker per 10 URLs) |
| `terminate` | (Optional) Terminate the Manager after completion | `terminate` |
Each line: `<ANALYSIS_TYPE><TAB><URL>`

```
POS https://www.gutenberg.org/files/1661/1661-0.txt
CONSTITUENCY https://www.gutenberg.org/files/1342/1342-0.txt
DEPENDENCY https://example.com/sample.txt
```
Analysis Types:
- `POS` - Part-of-Speech tagging
- `CONSTITUENCY` - Constituency parsing (tree structure)
- `DEPENDENCY` - Dependency parsing (word relationships)
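As a sketch of how a client or Manager might validate one input line, the helper below parses the `<ANALYSIS_TYPE><TAB><URL>` format; the class and method names are illustrative, not the project's actual code:

```java
// Illustrative parser for one input line: <ANALYSIS_TYPE><TAB><URL>.
public class InputLine {
    public final String type;
    public final String url;

    public InputLine(String type, String url) {
        this.type = type;
        this.url = url;
    }

    public static InputLine parse(String line) {
        // Split on the first tab only, so the URL is kept intact.
        String[] parts = line.split("\t", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <TYPE><TAB><URL>: " + line);
        }
        String type = parts[0].trim();
        if (!type.equals("POS") && !type.equals("CONSTITUENCY") && !type.equals("DEPENDENCY")) {
            throw new IllegalArgumentException("Unknown analysis type: " + type);
        }
        return new InputLine(type, parts[1].trim());
    }
}
```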
HTML with clickable links to the input and result files:

```html
<html><body>
POS: <a href="https://input-url.com/file.txt">https://input-url.com/file.txt</a>
<a href="https://s3.amazonaws.com/bucket/result.txt">https://s3.amazonaws.com/bucket/result.txt</a><br>
CONSTITUENCY: <a href="...">...</a> <a href="...">...</a><br>
</body></html>
```

Error handling:

```html
POS: <a href="https://bad-url.com/file.txt">https://bad-url.com/file.txt</a> ERROR: Connection timeout<br>
```

```
# Start system with 1 worker per 5 files
java -jar target/local-application.jar input.txt output.html 5

# With automatic termination
java -jar target/local-application.jar input.txt output.html 10 terminate
```

SQS Queues (4 total):
| Queue | Direction | Purpose | Message Format |
|---|---|---|---|
| `manager-input-queue` | LocalApp → Manager | Task submissions | `new task\t{s3_key}\t{n}\t{response_queue}` |
| `worker-tasks-queue` | Manager → Workers | Subtasks (shared) | `{taskId}\t{analysisType}\t{fileUrl}` |
| `worker-results-queue` | Workers → Manager | Completion messages | `{taskId}\t{analysisType}\t{url}\t{result}` |
| `response-{uuid}-queue` | Manager → LocalApp | Final results (unique per client) | `DONE\t{summary_s3_key}` |
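A minimal sketch of building and parsing the tab-delimited subtask bodies from the table; the `Messages` helper and its method names are assumptions for illustration, not the project's actual classes:

```java
// Illustrative helpers for the worker-tasks-queue message format:
// {taskId}\t{analysisType}\t{fileUrl}
public class Messages {
    public static String subtask(String taskId, String analysisType, String fileUrl) {
        return String.join("\t", taskId, analysisType, fileUrl);
    }

    // Returns [taskId, analysisType, fileUrl]; the limit of 3 keeps the
    // URL in one piece even though split() is tab-delimited.
    public static String[] parseSubtask(String body) {
        return body.split("\t", 3);
    }
}
```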
EC2 Configuration:
- Manager: t3.micro instance
- Workers: t3.micro instances (dynamically scaled)
- Region: us-east-1
- AMI: Custom AMI with Java 17 + JARs
S3 Storage:
- Input files uploaded by clients
- Result files (POS/CONSTITUENCY/DEPENDENCY output) uploaded by workers
- HTML summary files generated by Manager
SQS Settings:
- Long polling: 20 seconds (reduces API costs)
- Visibility timeout: 4000 seconds (~66 minutes)
- Message retention: 4 days
IAM Role-Based Authentication (Zero credentials in code):
- All instances use the `LabInstanceProfile` IAM role
- Temporary credentials auto-rotated by AWS
- No hardcoded access keys anywhere
- Prevents credential leakage
```java
// SDK automatically uses the IAM instance profile
S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build();
```

Worker crash:
- Detection: Manager health checks every 60 seconds via the EC2 API
- Recovery: Automatic worker replacement when tasks are active
- Message Safety: SQS visibility timeout (66 min) ensures messages reappear if worker dies
- Result: Zero data loss, automatic recovery
Manager crash:
- Detection: LocalApp polls the Manager's status every 2 minutes
- Response: Throws an exception with cleanup instructions
- Limitation: In-memory state is lost (a design choice for the assignment scope)
Stalled worker:
- Detection: Messages reappear after the 66-minute visibility timeout
- Recovery: Stalled workers are detected on the next health check; their messages are re-processed
- Prevention: Messages are deleted from the queue only after successful completion
Clean shutdown when the client passes the `terminate` flag:
- Manager stops accepting new tasks
- Manager waits for all active tasks to complete
- Manager terminates all worker instances
- Manager deletes system queues
- Manager self-terminates
- LocalApp deletes its response queue
Result: Zero orphaned resources
```
Main Thread
├─ Monitors shutdown condition
└─ Performs worker health checks (every 60s)

ClientListener Thread
├─ Polls manager-input-queue
├─ Submits tasks to executor (non-blocking)
└─ Handles new task requests

WorkerResultListener Thread
├─ Polls worker-results-queue
├─ Submits result processing to executor (non-blocking)
└─ Collects worker outputs

Executor Pool (CachedThreadPool - unlimited threads)
├─ Processes client messages concurrently
├─ Processes worker results concurrently
└─ Scales dynamically based on load
```
Why multi-threaded?
- Concurrent client handling: Process multiple clients simultaneously
- Non-blocking listeners: Queue polling never blocks message processing
- Parallel result aggregation: Collect results from many workers at once
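The listener-to-executor hand-off can be sketched as below. This is a simplified stand-in, not the project's actual classes: the message source is abstracted as a `Supplier` (in the real system, an SQS `ReceiveMessage` call), and the latch is only there so the demo can observe completion.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Sketch of a non-blocking listener: each polled message is handed to a
// shared cached thread pool, so polling never waits on processing.
public class Listener {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public void dispatch(Supplier<List<String>> poll, Consumer<String> handler) {
        List<String> batch = poll.get();              // e.g. one SQS receive
        CountDownLatch done = new CountDownLatch(batch.size());
        for (String msg : batch) {
            pool.submit(() -> {                       // concurrent processing
                try {
                    handler.accept(msg);
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();                             // demo-only: wait for batch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```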
```
Main Thread
└─ while(true):
   ├─ Poll worker-tasks-queue (blocking, 20s)
   ├─ Download text from URL
   ├─ Perform NLP parsing (CPU-intensive)
   ├─ Upload result to S3
   ├─ Send completion message
   └─ Delete SQS message (ACK)
```
Why single-threaded?
- CPU-bound workload: Stanford NLP parsing is 95%+ CPU time
- Horizontal scaling: Add more workers instead of more threads per worker
- Simplicity: No thread synchronization complexity
- Fair distribution: SQS automatically distributes work across workers
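The loop's delete-after-success discipline can be sketched with an in-memory queue standing in for SQS (names and structure are illustrative): the message is removed only after processing succeeds, so a crash mid-task leaves it visible for another worker, mirroring the SQS visibility-timeout behavior.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Sketch of one worker iteration with an in-memory queue as an SQS stand-in.
public class WorkerLoop {
    public static String handleOne(BlockingQueue<String> tasks,
                                   Function<String, String> process) {
        String msg = tasks.peek();           // receive without removing
        if (msg == null) return null;        // nothing to do
        // In the real worker: download text, run Stanford NLP, upload to S3.
        String result = process.apply(msg);
        tasks.remove(msg);                   // delete (ACK) only after success
        return result;
    }
}
```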
When the Manager receives a new task, it:
1. Parses the input file and counts the total URLs
2. Calculates the required workers: `required_workers = ceil(total_urls / n)`
3. Determines how many workers to launch: `workers_to_launch = max(0, required_workers - active_workers)`
4. Launches EC2 instances with the Worker JAR in the user data
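The two scaling formulas translate directly to integer arithmetic; the class name here is illustrative:

```java
// Worker-count math for the scaling formulas above.
public class Scaling {
    // required_workers = ceil(total_urls / n), via integer ceiling division
    public static int requiredWorkers(int totalUrls, int n) {
        return (totalUrls + n - 1) / n;
    }

    // workers_to_launch = max(0, required_workers - active_workers)
    public static int workersToLaunch(int totalUrls, int n, int activeWorkers) {
        return Math.max(0, requiredWorkers(totalUrls, n) - activeWorkers);
    }
}
```

For example, 25 URLs with `n = 10` requires 3 workers; if 1 is already active, 2 more are launched.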
- Each client creates a unique response queue (`response-{uuid}-queue`)
- Manager processes all client tasks concurrently
- Workers pull from a shared task queue (SQS load balancing)
- No starvation: each client's messages are independent
| Scenario | System Response |
|---|---|
| 1 million clients | ✅ Scale to ~100,000 workers (AWS quota dependent) |
| 1 billion tasks | ✅ Queue unbounded, workers scale horizontally |
| Manager bottleneck | Production optimization: use multiple Managers with a load balancer or sharded task queues |
- Java 17 - Application runtime
- Maven - Dependency management
- AWS EC2 - Compute instances
- AWS S3 - Object storage
- AWS SQS - Message queuing
- Stanford NLP Parser 3.6.0 - Natural language processing
This project was developed as part of a Distributed Systems Programming course.
Ben Kapon and Ori Cohen