
Commit 37a34b2

docs: clarify deployment options, configuration, and summarization behavior
Documentation:
- Reorder README installation sections: pre-built image first, local build second
- Add external Ollama section with cpu-external/gpu-external profile instructions
- Fix default ports: 8000 → 8002 in all examples (align with APP_PORT default)
- Clarify OLLAMA_NUM_PARALLEL: configures bundled Ollama container, not RAPTOR app
- Clarify LOG_LEVEL: set in Docker environment but not consumed by Python code
- Add …
1 parent dabd1a6 commit 37a34b2

7 files changed

Lines changed: 104 additions & 1240 deletions

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -88,4 +88,5 @@ config.local.yaml
 .claude/
 CLAUDE.md
 CLAUDE.local.md
-.claudeignore
+.claudeignore
+docs/

README.md

Lines changed: 88 additions & 24 deletions
@@ -24,7 +24,7 @@ The system is designed for production environments, offering a robust REST API,
 - **Intelligent Resource Management**: Optimizes CPU and memory usage based on available system resources
 - **Production-Ready API**: FastAPI-based REST interface with automatic documentation and validation
 - **Docker Integration**: Easy deployment with Docker and docker-compose for both CPU and GPU environments
-- **Configurable Processing**: Adjustable parameters for summarization depth, model selection, and processing options
+- **Configurable Processing**: Adjustable parameters for model selection, temperature, token limits, and processing options (summarization hierarchy is fixed at 3 levels)
 - **Model Caching**: Efficient model management with lifespan context managers for improved performance
 - **Comprehensive Logging**: Detailed logging with rotating file handlers for debugging and monitoring
 - **Thread-Safe Processing**: Concurrent processing capabilities with proper resource management
@@ -44,15 +44,17 @@ The system is designed for production environments, offering a robust REST API,
 - [Prerequisites](#prerequisites)
 - [Getting the Code](#getting-the-code)
 - [Local Installation with Uvicorn](#local-installation-with-uvicorn)
-- [Docker Deployment (Recommended)](#docker-deployment-recommended)
+- [Option A: Pre-built Image from GitHub Container Registry](#option-a-pre-built-image-from-github-container-registry)
+- [Option B: Docker Compose (Local Build)](#option-b-docker-compose-local-build)
+- [Using an external Ollama instance](#using-an-external-ollama-instance)
 - [Ollama Setup](#ollama-setup)
 - [Using the API](#using-the-api)
 - [API Endpoints](#api-endpoints)
 - [Example API Call](#example-api-call)
 - [Response Format](#response-format)
 - [Configuration](#configuration)
-- [Custom Prompt Templates](#custom-prompt-templates)
-- [Default Prompt Template](#default-prompt-template)
+- [Custom Instructions](#custom-instructions)
+- [Default Instructions](#default-instructions)
 - [Contributing](#contributing)

 ## How the Summarization Algorithm Works
@@ -212,36 +214,33 @@ cd Progressive-Summarizer-RAPTOR
 pip install -r requirements.txt
 ```

-Note: For GPU support, ensure you have the appropriate PyTorch version installed:
-```bash
-pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
-```
-
 3. Run the FastAPI server:
 ```bash
-uvicorn raptor_api:app --reload --host 0.0.0.0 --port 8000
+uvicorn raptor_api:app --reload --port 8002
 ```

-4. The API will be available at `http://localhost:8000`.
+4. The API will be available at `http://localhost:8002`.

-Access the API documentation and interactive testing interface at `http://localhost:8000/docs`.
+Access the API documentation and interactive testing interface at `http://localhost:8002/docs`.

-### Docker Deployment (Recommended)
+### Option B: Docker Compose (Local Build)

 1. Create required directories for persistent storage:
 ```bash
 # Linux/macOS
 mkdir -p models logs
-
+
 # Windows CMD
 mkdir models
 mkdir logs
-
+
 # Windows PowerShell
 New-Item -ItemType Directory -Path models -Force
 New-Item -ItemType Directory -Path logs -Force
 ```

+> **Note**: Docker Compose mounts three named volumes automatically: `raptor_models` (downloaded embedding models), `raptor_logs` (application logs), and `raptor_cache` (Hugging Face / PyTorch caches). The `models` and `logs` directories above are for reference only; data is persisted in Docker named volumes.
+
 2. Deploy with Docker Compose:

 **CPU-only deployment**:
@@ -260,13 +259,77 @@ cd Progressive-Summarizer-RAPTOR
 ```bash
 # To stop CPU deployment
 docker compose --profile cpu down
-
+
 # To stop GPU deployment
 docker compose --profile gpu down
+
+# To stop CPU (external Ollama) deployment
+docker compose --profile cpu-external down
+
+# To stop GPU (external Ollama) deployment
+docker compose --profile gpu-external down
 ```

 3. The API will be available at `http://localhost:8002` (configurable via `APP_PORT`).

+### Option A: Pre-built Image from GitHub Container Registry
+
+The easiest way to deploy is using our pre-built Docker images published to GitHub Container Registry.
+
+Pull the latest image:
+```bash
+docker pull ghcr.io/smart-models/progressive-summarizer-raptor:latest
+```
+
+Run with GPU acceleration (recommended, requires NVIDIA GPU + drivers):
+```bash
+docker run -d \
+  --name progressive-summarizer-raptor \
+  --gpus all \
+  -p 8002:8000 \
+  -v $(pwd)/logs:/app/logs \
+  ghcr.io/smart-models/progressive-summarizer-raptor:latest
+```
+
+Windows PowerShell:
+```powershell
+docker run -d `
+  --name progressive-summarizer-raptor `
+  --gpus all `
+  -p 8002:8000 `
+  -v ${PWD}/logs:/app/logs `
+  ghcr.io/smart-models/progressive-summarizer-raptor:latest
+```
+
+Run on CPU only (fallback for systems without GPU):
+```bash
+docker run -d \
+  --name progressive-summarizer-raptor \
+  -p 8002:8000 \
+  -v $(pwd)/logs:/app/logs \
+  ghcr.io/smart-models/progressive-summarizer-raptor:latest
+```
+
+Use a specific version (recommended for production):
+```bash
+# Replace v1.0.0 with your desired version
+docker pull ghcr.io/smart-models/progressive-summarizer-raptor:v1.0.0
+docker run -d --gpus all -p 8002:8000 \
+  -v $(pwd)/logs:/app/logs \
+  ghcr.io/smart-models/progressive-summarizer-raptor:v1.0.0
+```
+
+Verify the service is running:
+```bash
+curl http://localhost:8002/
+```
+
+Stop and remove the container:
+```bash
+docker stop progressive-summarizer-raptor
+docker rm progressive-summarizer-raptor
+```
+
 ### Using an external Ollama instance

 If you already have Ollama running (local network, cloud VM, managed service, etc.), use the
@@ -380,18 +443,18 @@ The RAPTOR API will connect to Ollama at `http://localhost:11434` by default. Yo
 **Using cURL:**
 ```bash
 # Basic usage (no authentication)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json"

 # With authentication (when API_TOKEN is set)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json" \
   -H "Authorization: Bearer your-token-here"

 # With custom parameters
-curl -X POST "http://localhost:8000/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
+curl -X POST "http://localhost:8002/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
   -F "file=@document.json" \
   -H "accept: application/json"
 ```
@@ -402,7 +465,7 @@ import requests
 import json

 # API endpoint
-api_url = 'http://localhost:8000/raptor/'
+api_url = 'http://localhost:8002/raptor/'
 file_path = 'document.json'

 # Prepare the document
@@ -518,14 +581,14 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `OLLAMA_NUM_THREAD` | CPU threads for Ollama inference | `8` |
 | `OLLAMA_NUM_GPU` | GPU layers for Ollama (99 = all on GPU) | `99` |
 | `OLLAMA_NUM_PREDICT` | Max output tokens per LLM generation | `512` |
-| `OLLAMA_NUM_PARALLEL` | Max parallel requests Ollama can handle | `2` |
+| `OLLAMA_NUM_PARALLEL` | Max parallel requests the bundled Ollama container can handle. Configures Ollama, not the RAPTOR app directly (set ≥ `LLM_MAX_WORKERS`) | `2` |
 | `EMBEDDER_MODEL` | Sentence-Transformer model used for embeddings | `BAAI/bge-m3` |
 | `TEMPERATURE` | Sampling temperature for the LLM | `0.1` |
 | `CONTEXT_WINDOW` | Maximum token window supplied to the LLM | `16384` |
 | `RANDOM_SEED` | Seed for deterministic operations | `224` |
 | `MAX_WORKERS` | Number of worker threads (absolute or percentage) | `75% of CPU cores` |
 | `MODEL_CACHE_TIMEOUT` | Seconds before an unused model is evicted from cache | `3600` |
-| `LOG_LEVEL` | Logging verbosity (honoured by Docker, Python defaults to INFO) | `INFO` |
+| `LOG_LEVEL` | Logging verbosity passed to the Docker container environment. Note: the Python application sets logging to `INFO` unconditionally and does not read this variable at runtime. | `INFO` |

 ### Docker-Specific Variables
 | Variable | Description | Default |
@@ -539,8 +602,9 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `PYTHONUNBUFFERED` | Python output buffering | `1` |
 | `OLLAMA_VERSION` | Ollama image version tag for bundled containers. Leave empty for `latest`. Pin for reproducible deploys (e.g. `0.6.5`). | *(latest)* |
 | `OLLAMA_PORT` | Host port for the bundled Ollama container | `11435` |
+| `OLLAMA_CONTEXT_SIZE` | Context size passed to the bundled Ollama container (sets the model context window at the Ollama level) | `16384` |

-**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 517) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is evaluated by the Docker start-up script; the Python code sets logging to `INFO` by default. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).
+**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 566) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is set in the Docker environment but is not consumed by the Python application, which logs at `INFO` unconditionally. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).

 ## Custom Instructions

@@ -584,7 +648,7 @@ Do **NOT** include `{chunk}` or XML tags in your custom instructions. The system

 ### Important Note about LLM models with thinking abilities

-By default, thinking abilities are disabled in the RAPTOR API. When using models with chain-of-thought capabilities, the `think` parameter is set to `False` in API requests to Ollama.
+RAPTOR does not send a `think` parameter to Ollama. Models with chain-of-thought or reasoning capabilities will use their default behavior as configured in Ollama.

 ## Contributing
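The configuration notes above are easier to see in code. Below is a minimal Python sketch, not taken from the project, of how the documented variables could be read and sanity-checked at startup; the consistency warning for `OLLAMA_NUM_PARALLEL` versus `LLM_MAX_WORKERS` is only a suggested practice implied by the table above.

```python
import os

# Illustrative only: mirrors the documented defaults, not the project's actual startup code.
MODEL_CACHE_TIMEOUT = int(os.getenv("MODEL_CACHE_TIMEOUT", "3600"))  # read by the API
LLM_MAX_WORKERS = int(os.getenv("LLM_MAX_WORKERS", "2"))             # RAPTOR-side concurrency
OLLAMA_NUM_PARALLEL = int(os.getenv("OLLAMA_NUM_PARALLEL", "2"))     # bundled Ollama container

# LOG_LEVEL is set in the Docker environment but is not consumed by the Python app,
# so reading it here would not change the application's INFO-level logging.

if OLLAMA_NUM_PARALLEL < LLM_MAX_WORKERS:
    print(
        f"Warning: OLLAMA_NUM_PARALLEL ({OLLAMA_NUM_PARALLEL}) is lower than "
        f"LLM_MAX_WORKERS ({LLM_MAX_WORKERS}); concurrent LLM requests may queue in Ollama."
    )
```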

WHAT_IS_IT.md

Lines changed: 14 additions & 10 deletions
@@ -54,14 +54,14 @@ The core summarization flow consists of several stages:
 5. **Recursive Clustering (Level 2)**: Clustering Level 1 summaries to identify higher-level relationships
 6. **Intermediate Summarization**: Generating second-level summaries from Level 2 clusters
 7. **Final Consolidation (Level 3)**: Combining Level 2 summaries to create a comprehensive final summary
-8. **Token Optimization**: Ensuring summaries stay within configurable token limits
+8. **Token Optimization**: At each summarization level, oversized summaries are split into multiple chunks to stay within configurable token limits (splitting, not truncation)
 9. **Hierarchical Output**: Returning all three levels with detailed metadata

 ### 4. LLM Integration

 RAPTOR connects with Ollama for LLM capabilities, making use of template-based prompting to guide the summarization process. The system uses environment variables like `OLLAMA_BASE_URL` to configure the LLM endpoint, making deployment flexible across different environments.

-The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom prompt templates through the API, allowing users to tailor the summarization process to specific domains or requirements.
+The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom instructions through the API, allowing users to tailor the summarization process to specific domains or requirements.

 ## Implementation Details
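A minimal sketch of the splitting behaviour described in step 8 of the updated list above. The whitespace token count and the sentence-based split are assumptions for illustration; they are not the project's actual tokenizer or splitting logic.

```python
def count_tokens(text: str) -> int:
    # Stand-in for the real tokenizer: whitespace word count.
    return len(text.split())


def split_by_threshold(summary: str, threshold_tokens: int) -> list[str]:
    """Split an oversized summary into several chunks below the threshold (no text is dropped)."""
    if count_tokens(summary) <= threshold_tokens:
        return [summary]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in summary.split(". "):
        candidate = current + [sentence]
        if current and count_tokens(". ".join(candidate)) > threshold_tokens:
            chunks.append(". ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(". ".join(current))
    return chunks
```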

@@ -87,31 +87,35 @@ The summarization prompts are designed to produce consistent, high-quality outpu
 - Primary POST `/raptor/` endpoint for document processing
 - Health check GET `/` endpoint for service status
 - RESTful design with comprehensive parameter validation
-- Stateless architecture for scalability
+- Stateless per-request design enabling horizontal scaling (the server maintains an in-memory model cache for performance)

 ### Configuration and Environment

 RAPTOR is designed for flexible deployment with configuration via environment variables:

 - `API_TOKEN`: Bearer token for API authentication. Leave empty (default) to disable. When set, all POST requests require `Authorization: Bearer <token>`; `GET /` is always public.
 - `OLLAMA_BASE_URL`: Configures the endpoint for LLM services (default: http://localhost:11434)
-- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Only applicable with the `cpu-external` / `gpu-external` Docker profiles. Leave empty to disable.
+- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Leave empty to disable.
 - `OLLAMA_VERSION`: Optional version tag for the bundled Ollama container image (`cpu` / `gpu` profiles only). Leave empty to use `latest`. Set to a specific version (e.g. `0.6.5`) for reproducible deployments.
 - `LLM_MODEL`: Override the default LLM model (`gemma3:4b`) used for summarization
 - `EMBEDDER_MODEL`: Override the default embedding model (`BAAI/bge-m3`)
-- `TEMPERATURE`: Override the default sampling temperature (0.1)
-- `CONTEXT_WINDOW`: Override the default LLM context window (16384)
+- `TEMPERATURE`: Default sampling temperature (0.1) — overridable per-request via the API `temperature` parameter; not read from environment at runtime
+- `CONTEXT_WINDOW`: Default LLM context window (16384) — overridable per-request via the API `context_window` parameter; not read from environment at runtime
 - `LLM_MAX_WORKERS`: Max concurrent LLM requests (default: 2)
 - `LLM_MAX_RETRIES`: Number of retry attempts for failed LLM requests (default: 3)
 - `LLM_BASE_DELAY`: Base delay in seconds for exponential backoff between retries (default: 1.0)
 - `LLM_TIMEOUT`: Timeout in seconds for each LLM request (default: 600)
 - `OLLAMA_NUM_THREAD`: CPU threads for Ollama inference (default: 8)
 - `OLLAMA_NUM_GPU`: GPU layers for Ollama, 99 = all on GPU (default: 99)
-- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (default: 512)
+- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (RAPTOR default: 512; bundled Docker Ollama containers default to 2048)
+- `OLLAMA_NUM_PARALLEL`: Max parallel requests Ollama can handle (default: 2)
 - `RANDOM_SEED`: Set the random seed for reproducibility (default: 224)
 - `MAX_WORKERS`: Number of parallel threads used for processing (default: 75% of available CPU cores)
 - `MODEL_CACHE_TIMEOUT`: Seconds before an unused model is evicted from cache (default: 3600)
 - `LOG_LEVEL`: Controls logging verbosity **(Docker only – not consumed by the Python code)** (default: INFO)
+- `OLLAMA_CONTEXT_SIZE`: Context size passed to the bundled Ollama container (default: 16384; `cpu` / `gpu` profiles only)
+- `APP_PORT`: Host port exposed by the RAPTOR API container (default: 8002)
+- `OLLAMA_PORT`: Host port for the bundled Ollama container (default: 11435)

 The system supports both local deployment with Uvicorn and containerized deployment with Docker and docker-compose, with four profiles: `cpu` and `gpu` for deployments with a bundled Ollama container, and `cpu-external` and `gpu-external` for deployments that connect to an existing Ollama instance on the network.
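The retry variables above (`LLM_MAX_RETRIES`, `LLM_BASE_DELAY`, `LLM_TIMEOUT`) describe retries with exponential backoff. A minimal sketch of one common reading, with the delay doubling per attempt, follows; the exact formula RAPTOR uses is not documented here, so treat it as an assumption.

```python
import time


def call_with_retries(send_request, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a failing LLM call, doubling the delay after each attempt (illustrative only)."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1.0s, 2.0s, 4.0s, ...
```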

@@ -133,7 +137,7 @@ The system supports both local deployment with Uvicorn and containerized deploym

 - `llm_model`: LLM model to use for summarization (default: gemma3:4b)
 - `embedder_model`: Model for generating embeddings (default: BAAI/bge-m3)
-- `threshold_tokens`: Maximum token limit for summaries
+- `threshold_tokens`: Token threshold that triggers splitting of oversized summaries into multiple chunks (not truncation)
 - `temperature`: Controls randomness in LLM output (default: 0.1)
 - `context_window`: Maximum context window size for LLM (default: 16384)
 - `custom_instructions`: Optional custom instructions for summarization (text chunk is added automatically)
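The parameters listed above are passed as query parameters, as in the cURL examples from the README diff. A small Python sketch using `requests` follows; the parameter names come from the documentation, while the values and the instruction text are placeholders.

```python
import requests

params = {
    "llm_model": "gemma3:4b",
    "threshold_tokens": 4000,  # splitting threshold for oversized summaries
    "temperature": 0.1,
    "context_window": 16384,
    "custom_instructions": "Focus on technical decisions and open risks.",
}

with open("document.json", "rb") as f:
    response = requests.post(
        "http://localhost:8002/raptor/",
        params=params,  # sent as query parameters, matching the cURL examples
        files={"file": f},
        headers={"accept": "application/json"},
    )
response.raise_for_status()
result = response.json()
```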
@@ -143,5 +147,5 @@ The system supports both local deployment with Uvicorn and containerized deploym

 The API returns a JSON structure containing:

-- `chunks`: Array of summary objects with text, token count, cluster level, and ID
-- `metadata`: Detailed processing information including input counts, cluster counts per level, reduction ratio, model names, and processing times
+- `chunks`: Array of summary objects with `text`, `token_count`, `cluster_level`, and `id` fields (additional fields from `chunk_metadata_json` are merged at the top level of each chunk)
+- `metadata`: Detailed processing information including `input_chunks`, cluster counts per level (`level_1_clusters`, `level_2_clusters`, `level_3_clusters`, `total_clusters`), `reduction_ratio`, `llm_model`, `embedder_model`, `temperature`, `context_window`, `custom_prompt_used` (boolean), `source` (uploaded filename), and `processing_time` (broken down by level)
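To make the response shape above concrete, here is a small helper that reads the documented fields from a parsed response (for example the `result` dict from the request sketch earlier); only the field names are taken from the description above, the rest is illustrative.

```python
def summarize_result(result: dict) -> None:
    """Print the documented fields of a /raptor/ response."""
    for chunk in result["chunks"]:
        print(chunk["cluster_level"], chunk["id"], chunk["token_count"])
        # Fields merged from chunk_metadata_json appear alongside these keys.

    meta = result["metadata"]
    print("Input chunks:", meta["input_chunks"])
    print("Clusters per level:",
          meta["level_1_clusters"], meta["level_2_clusters"],
          meta["level_3_clusters"], meta["total_clusters"])
    print("Reduction ratio:", meta["reduction_ratio"])
    print("Models:", meta["llm_model"], "/", meta["embedder_model"])
    print("Processing time:", meta["processing_time"])
```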
