docs: clarify deployment options, configuration, and summarization behavior
Documentation:
- Reorder README installation sections: pre-built image first, local build second
- Add external Ollama section with cpu-external/gpu-external profile instructions
- Fix default ports: 8000 → 8002 in all examples (align with APP_PORT default)
- Clarify OLLAMA_NUM_PARALLEL: configures bundled Ollama container, not RAPTOR app
- Clarify LOG_LEVEL: set in Docker environment but not consumed by Python code
- Add
README.md: 88 additions & 24 deletions
@@ -24,7 +24,7 @@ The system is designed for production environments, offering a robust REST API,
 - **Intelligent Resource Management**: Optimizes CPU and memory usage based on available system resources
 - **Production-Ready API**: FastAPI-based REST interface with automatic documentation and validation
 - **Docker Integration**: Easy deployment with Docker and docker-compose for both CPU and GPU environments
-- **Configurable Processing**: Adjustable parameters for summarization depth, model selection, and processing options
+- **Configurable Processing**: Adjustable parameters for model selection, temperature, token limits, and processing options (summarization hierarchy is fixed at 3 levels)
 - **Model Caching**: Efficient model management with lifespan context managers for improved performance
 - **Comprehensive Logging**: Detailed logging with rotating file handlers for debugging and monitoring
 - **Thread-Safe Processing**: Concurrent processing capabilities with proper resource management
@@ -44,15 +44,17 @@ The system is designed for production environments, offering a robust REST API,
 - [Prerequisites](#prerequisites)
 - [Getting the Code](#getting-the-code)
 - [Local Installation with Uvicorn](#local-installation-with-uvicorn)

-4. The API will be available at `http://localhost:8000`.
+4. The API will be available at `http://localhost:8002`.

-Access the API documentation and interactive testing interface at `http://localhost:8000/docs`.
+Access the API documentation and interactive testing interface at `http://localhost:8002/docs`.

-### Docker Deployment (Recommended)
+### Option B: Docker Compose (Local Build)

 1. Create required directories for persistent storage:
 ```bash
 # Linux/macOS
 mkdir -p models logs

 # Windows CMD
 mkdir models
 mkdir logs

 # Windows PowerShell
 New-Item -ItemType Directory -Path models -Force
 New-Item -ItemType Directory -Path logs -Force
 ```

+> **Note**: Docker Compose mounts three named volumes automatically: `raptor_models` (downloaded embedding models), `raptor_logs` (application logs), and `raptor_cache` (Hugging Face / PyTorch caches). The `models` and `logs` directories above are for reference only; data is persisted in Docker named volumes.
+
 2. Deploy with Docker Compose:

 **CPU-only deployment**:
@@ -260,13 +259,77 @@ cd Progressive-Summarizer-RAPTOR
 ```bash
 # To stop CPU deployment
 docker compose --profile cpu down

 # To stop GPU deployment
 docker compose --profile gpu down
+
+# To stop CPU (external Ollama) deployment
+docker compose --profile cpu-external down
+
+# To stop GPU (external Ollama) deployment
+docker compose --profile gpu-external down
 ```

 3. The API will be available at `http://localhost:8002` (configurable via `APP_PORT`).

+### Option A: Pre-built Image from GitHub Container Registry
+
+The easiest way to deploy is using our pre-built Docker images published to GitHub Container Registry.

+If you already have Ollama running (local network, cloud VM, managed service, etc.), use the
@@ -380,18 +443,18 @@ The RAPTOR API will connect to Ollama at `http://localhost:11434` by default. Yo
 **Using cURL:**
 ```bash
 # Basic usage (no authentication)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json"

 # With authentication (when API_TOKEN is set)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json" \
   -H "Authorization: Bearer your-token-here"

 # With custom parameters
-curl -X POST "http://localhost:8000/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
+curl -X POST "http://localhost:8002/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
   -F "file=@document.json" \
   -H "accept: application/json"
 ```
@@ -402,7 +465,7 @@ import requests
 import json

 # API endpoint
-api_url = 'http://localhost:8000/raptor/'
+api_url = 'http://localhost:8002/raptor/'
 file_path = 'document.json'

 # Prepare the document
@@ -518,14 +581,14 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `OLLAMA_NUM_THREAD` | CPU threads for Ollama inference | `8` |
 | `OLLAMA_NUM_GPU` | GPU layers for Ollama (99 = all on GPU) | `99` |
 | `OLLAMA_NUM_PREDICT` | Max output tokens per LLM generation | `512` |
-| `OLLAMA_NUM_PARALLEL` | Max parallel requests Ollama can handle | `2` |
+| `OLLAMA_NUM_PARALLEL` | Max parallel requests the bundled Ollama container can handle. Configures Ollama, not the RAPTOR app directly (set ≥ `LLM_MAX_WORKERS`) | `2` |
 | `EMBEDDER_MODEL` | Sentence-Transformer model used for embeddings | `BAAI/bge-m3` |
 | `TEMPERATURE` | Sampling temperature for the LLM | `0.1` |
 | `CONTEXT_WINDOW` | Maximum token window supplied to the LLM | `16384` |
 | `RANDOM_SEED` | Seed for deterministic operations | `224` |
 | `MAX_WORKERS` | Number of worker threads (absolute or percentage) | `75% of CPU cores` |
 | `MODEL_CACHE_TIMEOUT` | Seconds before an unused model is evicted from cache | `3600` |
-| `LOG_LEVEL` | Logging verbosity (honoured by Docker, Python defaults to INFO) | `INFO` |
+| `LOG_LEVEL` | Logging verbosity passed to the Docker container environment. Note: the Python application sets logging to `INFO` unconditionally and does not read this variable at runtime. | `INFO` |

 ### Docker-Specific Variables
 | Variable | Description | Default |
@@ -539,8 +602,9 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `OLLAMA_VERSION` | Ollama image version tag for bundled containers. Leave empty for `latest`. Pin for reproducible deploys (e.g. `0.6.5`). | *(latest)* |
 | `OLLAMA_PORT` | Host port for the bundled Ollama container | `11435` |
+| `OLLAMA_CONTEXT_SIZE` | Context size passed to the bundled Ollama container (sets the model context window at the Ollama level) | `16384` |

-**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 517) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is evaluated by the Docker start-up script; the Python code sets logging to `INFO` by default. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).
+**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 566) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is set in the Docker environment but is not consumed by the Python application, which logs at `INFO` unconditionally. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).

 ## Custom Instructions
@@ -584,7 +648,7 @@ Do **NOT** include `{chunk}` or XML tags in your custom instructions. The system
 ### Important Note about LLM models with thinking abilities

-By default, thinking abilities are disabled in the RAPTOR API. When using models with chain-of-thought capabilities, the `think` parameter is set to `False` in API requests to Ollama.
+RAPTOR does not send a `think` parameter to Ollama. Models with chain-of-thought or reasoning capabilities will use their default behavior as configured in Ollama.
 6. **Intermediate Summarization**: Generating second-level summaries from Level 2 clusters
 7. **Final Consolidation (Level 3)**: Combining Level 2 summaries to create a comprehensive final summary
-8. **Token Optimization**: Ensuring summaries stay within configurable token limits
+8. **Token Optimization**: At each summarization level, oversized summaries are split into multiple chunks to stay within configurable token limits (splitting, not truncation)
 9. **Hierarchical Output**: Returning all three levels with detailed metadata

 ### 4. LLM Integration

 RAPTOR connects with Ollama for LLM capabilities, making use of template-based prompting to guide the summarization process. The system uses environment variables like `OLLAMA_BASE_URL` to configure the LLM endpoint, making deployment flexible across different environments.

-The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom prompt templates through the API, allowing users to tailor the summarization process to specific domains or requirements.
+The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom instructions through the API, allowing users to tailor the summarization process to specific domains or requirements.

 ## Implementation Details
@@ -87,31 +87,35 @@ The summarization prompts are designed to produce consistent, high-quality outpu
 - Primary POST `/raptor/` endpoint for document processing
 - Health check GET `/` endpoint for service status
 - RESTful design with comprehensive parameter validation
-- Stateless architecture for scalability
+- Stateless per-request design enabling horizontal scaling (the server maintains an in-memory model cache for performance)

 ### Configuration and Environment

 RAPTOR is designed for flexible deployment with configuration via environment variables:

 - `API_TOKEN`: Bearer token for API authentication. Leave empty (default) to disable. When set, all POST requests require `Authorization: Bearer <token>`; `GET /` is always public.
 - `OLLAMA_BASE_URL`: Configures the endpoint for LLM services (default: http://localhost:11434)
-- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Only applicable with the `cpu-external` / `gpu-external` Docker profiles. Leave empty to disable.
+- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Leave empty to disable.
 - `OLLAMA_VERSION`: Optional version tag for the bundled Ollama container image (`cpu` / `gpu` profiles only). Leave empty to use `latest`. Set to a specific version (e.g. `0.6.5`) for reproducible deployments.
 - `LLM_MODEL`: Override the default LLM model (`gemma3:4b`) used for summarization
 - `EMBEDDER_MODEL`: Override the default embedding model (`BAAI/bge-m3`)
-- `TEMPERATURE`: Override the default sampling temperature (0.1)
-- `CONTEXT_WINDOW`: Override the default LLM context window (16384)
+- `TEMPERATURE`: Default sampling temperature (0.1) — overridable per-request via the API `temperature` parameter; not read from environment at runtime
+- `CONTEXT_WINDOW`: Default LLM context window (16384) — overridable per-request via the API `context_window` parameter; not read from environment at runtime
 - `LLM_MAX_WORKERS`: Max concurrent LLM requests (default: 2)
 - `LLM_MAX_RETRIES`: Number of retry attempts for failed LLM requests (default: 3)
 - `LLM_BASE_DELAY`: Base delay in seconds for exponential backoff between retries (default: 1.0)
 - `LLM_TIMEOUT`: Timeout in seconds for each LLM request (default: 600)
 - `OLLAMA_NUM_THREAD`: CPU threads for Ollama inference (default: 8)
 - `OLLAMA_NUM_GPU`: GPU layers for Ollama, 99 = all on GPU (default: 99)
-- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (default: 512)
+- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (RAPTOR default: 512; bundled Docker Ollama containers default to 2048)
+- `OLLAMA_NUM_PARALLEL`: Max parallel requests Ollama can handle (default: 2)
 - `RANDOM_SEED`: Set the random seed for reproducibility (default: 224)
 - `MAX_WORKERS`: Number of parallel threads used for processing (default: 75% of available CPU cores)
 - `MODEL_CACHE_TIMEOUT`: Seconds before an unused model is evicted from cache (default: 3600)
 - `LOG_LEVEL`: Controls logging verbosity **(Docker only – not consumed by the Python code)** (default: INFO)
+- `OLLAMA_CONTEXT_SIZE`: Context size passed to the bundled Ollama container (default: 16384; `cpu` / `gpu` profiles only)
+- `APP_PORT`: Host port exposed by the RAPTOR API container (default: 8002)
+- `OLLAMA_PORT`: Host port for the bundled Ollama container (default: 11435)

 The system supports both local deployment with Uvicorn and containerized deployment with Docker and docker-compose, with four profiles: `cpu` and `gpu` for deployments with a bundled Ollama container, and `cpu-external` and `gpu-external` for deployments that connect to an existing Ollama instance on the network.
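For orientation, here is a minimal sketch of bringing up the external-Ollama profile described above. It assumes the compose file reads `OLLAMA_BASE_URL`, `OLLAMA_API_KEY`, and `APP_PORT` from the shell environment and that `up` mirrors the `down` commands shown in the diff; the Ollama address is a placeholder, so adjust everything to the actual compose file.

```bash
# Hypothetical example: point RAPTOR at an Ollama instance that is already running
# elsewhere, then start the API without a bundled Ollama container.
export OLLAMA_BASE_URL="http://192.168.1.50:11434"   # assumed address of the existing Ollama
export OLLAMA_API_KEY=""                             # set only if the external Ollama requires auth
export APP_PORT=8002                                 # default host port for the RAPTOR API

docker compose --profile cpu-external up -d

# Quick check that the API is reachable (GET / is the public health check)
curl "http://localhost:${APP_PORT}/"
```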
@@ -133,7 +137,7 @@ The system supports both local deployment with Uvicorn and containerized deploym
 - `llm_model`: LLM model to use for summarization (default: gemma3:4b)
 - `embedder_model`: Model for generating embeddings (default: BAAI/bge-m3)
-- `threshold_tokens`: Maximum token limit for summaries
+- `threshold_tokens`: Token threshold that triggers splitting of oversized summaries into multiple chunks (not truncation)
 - `temperature`: Controls randomness in LLM output (default: 0.1)
 - `context_window`: Maximum context window size for LLM (default: 16384)
 - `custom_instructions`: Optional custom instructions for summarization (text chunk is added automatically)
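As a usage sketch (not part of the diff), these parameters can be combined in one call. The query-parameter style follows the cURL examples earlier in the README; passing `context_window` the same way is an assumption, and `document.json` is a placeholder file name.

```bash
# Hypothetical request: a lower threshold_tokens triggers more splitting of oversized
# summaries, with a slightly higher temperature and an explicit context window.
curl -X POST "http://localhost:8002/raptor/?threshold_tokens=2000&temperature=0.2&context_window=16384" \
  -F "file=@document.json" \
  -H "accept: application/json"
```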
@@ -143,5 +147,5 @@ The system supports both local deployment with Uvicorn and containerized deploym
 The API returns a JSON structure containing:

-- `chunks`: Array of summary objects with text, token count, cluster level, and ID
-- `metadata`: Detailed processing information including input counts, cluster counts per level, reduction ratio, model names, and processing times
+- `chunks`: Array of summary objects with `text`, `token_count`, `cluster_level`, and `id` fields (additional fields from `chunk_metadata_json` are merged at the top level of each chunk)
+- `metadata`: Detailed processing information including `input_chunks`, cluster counts per level (`level_1_clusters`, `level_2_clusters`, `level_3_clusters`, `total_clusters`), `reduction_ratio`, `llm_model`, `embedder_model`, `temperature`, `context_window`, `custom_prompt_used` (boolean), `source` (uploaded filename), and `processing_time` (broken down by level)
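To make the response shape concrete, a hedged sketch of pulling a few of the documented fields out of the returned JSON with `jq`; the field names are taken from the list above, the exact nesting may differ, and `response.json` is a placeholder file name.

```bash
# Hypothetical post-processing of a saved response.
curl -X POST "http://localhost:8002/raptor/" \
  -F "file=@document.json" \
  -H "accept: application/json" > response.json

jq '.metadata.reduction_ratio' response.json                       # overall compression achieved
jq '.metadata.level_3_clusters' response.json                      # number of top-level clusters
jq '.chunks[] | {id, cluster_level, token_count}' response.json    # id, level, and size of each summary chunk
```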