docs: clarify deployment options, configuration, and summarization behavior
Documentation:
- Reorder README installation sections: pre-built image first, local build second
- Add external Ollama section with cpu-external/gpu-external profile instructions
- Fix default ports: 8000 → 8002 in all examples (align with APP_PORT default)
- Clarify OLLAMA_NUM_PARALLEL: configures bundled Ollama container, not RAPTOR app
- Clarify LOG_LEVEL: set in Docker environment but not consumed by Python code
- Add
README.md: 88 additions & 24 deletions
@@ -24,7 +24,7 @@ The system is designed for production environments, offering a robust REST API,
 - **Intelligent Resource Management**: Optimizes CPU and memory usage based on available system resources
 - **Production-Ready API**: FastAPI-based REST interface with automatic documentation and validation
 - **Docker Integration**: Easy deployment with Docker and docker-compose for both CPU and GPU environments
-- **Configurable Processing**: Adjustable parameters for summarization depth, model selection, and processing options
+- **Configurable Processing**: Adjustable parameters for model selection, temperature, token limits, and processing options (summarization hierarchy is fixed at 3 levels)
 - **Model Caching**: Efficient model management with lifespan context managers for improved performance
 - **Comprehensive Logging**: Detailed logging with rotating file handlers for debugging and monitoring
 - **Thread-Safe Processing**: Concurrent processing capabilities with proper resource management
@@ -44,15 +44,17 @@ The system is designed for production environments, offering a robust REST API,
 - [Prerequisites](#prerequisites)
 - [Getting the Code](#getting-the-code)
 - [Local Installation with Uvicorn](#local-installation-with-uvicorn)

-4. The API will be available at `http://localhost:8000`.
+4. The API will be available at `http://localhost:8002`.

-Access the API documentation and interactive testing interface at `http://localhost:8000/docs`.
+Access the API documentation and interactive testing interface at `http://localhost:8002/docs`.

-### Docker Deployment (Recommended)
+### Option B: Docker Compose (Local Build)

 1. Create required directories for persistent storage:
 ```bash
 # Linux/macOS
 mkdir -p models logs

 # Windows CMD
 mkdir models
 mkdir logs

 # Windows PowerShell
 New-Item -ItemType Directory -Path models -Force
 New-Item -ItemType Directory -Path logs -Force
 ```

+> **Note**: Docker Compose mounts three named volumes automatically: `raptor_models` (downloaded embedding models), `raptor_logs` (application logs), and `raptor_cache` (Hugging Face / PyTorch caches). The `models` and `logs` directories above are for reference only; data is persisted in Docker named volumes.
+
 2. Deploy with Docker Compose:

 **CPU-only deployment**:
@@ -260,13 +259,77 @@ cd Progressive-Summarizer-RAPTOR
 ```bash
 # To stop CPU deployment
 docker compose --profile cpu down

 # To stop GPU deployment
 docker compose --profile gpu down
+
+# To stop CPU (external Ollama) deployment
+docker compose --profile cpu-external down
+
+# To stop GPU (external Ollama) deployment
+docker compose --profile gpu-external down
 ```

 3. The API will be available at `http://localhost:8002` (configurable via `APP_PORT`).

+### Option A: Pre-built Image from GitHub Container Registry
+
+The easiest way to deploy is using our pre-built Docker images published to GitHub Container Registry.

+If you already have Ollama running (local network, cloud VM, managed service, etc.), use the
@@ -380,18 +443,18 @@ The RAPTOR API will connect to Ollama at `http://localhost:11434` by default. Yo
 **Using cURL:**
 ```bash
 # Basic usage (no authentication)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json"

 # With authentication (when API_TOKEN is set)
-curl -X POST "http://localhost:8000/raptor/" \
+curl -X POST "http://localhost:8002/raptor/" \
   -F "file=@document.json" \
   -H "accept: application/json" \
   -H "Authorization: Bearer your-token-here"

 # With custom parameters
-curl -X POST "http://localhost:8000/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
+curl -X POST "http://localhost:8002/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
   -F "file=@document.json" \
   -H "accept: application/json"
 ```
@@ -402,7 +465,7 @@ import requests
 import json

 # API endpoint
-api_url = 'http://localhost:8000/raptor/'
+api_url = 'http://localhost:8002/raptor/'
 file_path = 'document.json'

 # Prepare the document
@@ -518,14 +581,14 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `OLLAMA_NUM_THREAD` | CPU threads for Ollama inference | `8` |
 | `OLLAMA_NUM_GPU` | GPU layers for Ollama (99 = all on GPU) | `99` |
 | `OLLAMA_NUM_PREDICT` | Max output tokens per LLM generation | `512` |
-| `OLLAMA_NUM_PARALLEL` | Max parallel requests Ollama can handle | `2` |
+| `OLLAMA_NUM_PARALLEL` | Max parallel requests the bundled Ollama container can handle. Configures Ollama, not the RAPTOR app directly (set ≥ `LLM_MAX_WORKERS`) | `2` |
 | `EMBEDDER_MODEL` | Sentence-Transformer model used for embeddings | `BAAI/bge-m3` |
 | `TEMPERATURE` | Sampling temperature for the LLM | `0.1` |
 | `CONTEXT_WINDOW` | Maximum token window supplied to the LLM | `16384` |
 | `RANDOM_SEED` | Seed for deterministic operations | `224` |
 | `MAX_WORKERS` | Number of worker threads (absolute or percentage) | `75% of CPU cores` |
 | `MODEL_CACHE_TIMEOUT` | Seconds before an unused model is evicted from cache | `3600` |
-| `LOG_LEVEL` | Logging verbosity (honoured by Docker, Python defaults to INFO) | `INFO` |
+| `LOG_LEVEL` | Logging verbosity passed to the Docker container environment. Note: the Python application sets logging to `INFO` unconditionally and does not read this variable at runtime. | `INFO` |

 ### Docker-Specific Variables
 | Variable | Description | Default |
@@ -539,8 +602,9 @@ RAPTOR can be tuned through environment variables (for Docker deployments) or a
 | `OLLAMA_VERSION` | Ollama image version tag for bundled containers. Leave empty for `latest`. Pin for reproducible deploys (e.g. `0.6.5`). | *(latest)* |
 | `OLLAMA_PORT` | Host port for the bundled Ollama container | `11435` |
+| `OLLAMA_CONTEXT_SIZE` | Context size passed to the bundled Ollama container (sets the model context window at the Ollama level) | `16384` |

-**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 517) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is evaluated by the Docker start-up script; the Python code sets logging to `INFO` by default. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).
+**Note**: `MODEL_CACHE_TIMEOUT` is read directly by the API (`raptor_api.py`, line 566) to control how long a model remains in the on-disk cache. The `LOG_LEVEL` variable is set in the Docker environment but is not consumed by the Python application, which logs at `INFO` unconditionally. Docker deployments use different `OLLAMA_BASE_URL` defaults depending on the profile (CPU/GPU).

 ## Custom Instructions
@@ -584,7 +648,7 @@ Do **NOT** include `{chunk}` or XML tags in your custom instructions. The system
 ### Important Note about LLM models with thinking abilities

-By default, thinking abilities are disabled in the RAPTOR API. When using models with chain-of-thought capabilities, the `think` parameter is set to `False` in API requests to Ollama.
+RAPTOR does not send a `think` parameter to Ollama. Models with chain-of-thought or reasoning capabilities will use their default behavior as configured in Ollama.
 6. **Intermediate Summarization**: Generating second-level summaries from Level 2 clusters
 7. **Final Consolidation (Level 3)**: Combining Level 2 summaries to create a comprehensive final summary
-8. **Token Optimization**: Ensuring summaries stay within configurable token limits
+8. **Token Optimization**: At each summarization level, oversized summaries are split into multiple chunks to stay within configurable token limits (splitting, not truncation)
 9. **Hierarchical Output**: Returning all three levels with detailed metadata

 ### 4. LLM Integration

 RAPTOR connects with Ollama for LLM capabilities, making use of template-based prompting to guide the summarization process. The system uses environment variables like `OLLAMA_BASE_URL` to configure the LLM endpoint, making deployment flexible across different environments.

-The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom prompt templates through the API, allowing users to tailor the summarization process to specific domains or requirements.
+The summarization prompts are designed to produce consistent, high-quality outputs, with careful attention to template string formatting to ensure proper content insertion at runtime. The system supports custom instructions through the API, allowing users to tailor the summarization process to specific domains or requirements.

 ## Implementation Details
@@ -87,31 +87,35 @@ The summarization prompts are designed to produce consistent, high-quality outpu
 - Primary POST `/raptor/` endpoint for document processing
 - Health check GET `/` endpoint for service status
 - RESTful design with comprehensive parameter validation
-- Stateless architecture for scalability
+- Stateless per-request design enabling horizontal scaling (the server maintains an in-memory model cache for performance)

 ### Configuration and Environment

 RAPTOR is designed for flexible deployment with configuration via environment variables:

 - `API_TOKEN`: Bearer token for API authentication. Leave empty (default) to disable. When set, all POST requests require `Authorization: Bearer <token>`; `GET /` is always public.
 - `OLLAMA_BASE_URL`: Configures the endpoint for LLM services (default: http://localhost:11434)
-- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Only applicable with the `cpu-external` / `gpu-external` Docker profiles. Leave empty to disable.
+- `OLLAMA_API_KEY`: Optional API key for authenticated external Ollama instances. Sent as `Authorization: Bearer <key>` on every Ollama request. Leave empty to disable.
 - `OLLAMA_VERSION`: Optional version tag for the bundled Ollama container image (`cpu` / `gpu` profiles only). Leave empty to use `latest`. Set to a specific version (e.g. `0.6.5`) for reproducible deployments.
 - `LLM_MODEL`: Override the default LLM model (`gemma3:4b`) used for summarization
 - `EMBEDDER_MODEL`: Override the default embedding model (`BAAI/bge-m3`)
-- `TEMPERATURE`: Override the default sampling temperature (0.1)
-- `CONTEXT_WINDOW`: Override the default LLM context window (16384)
+- `TEMPERATURE`: Default sampling temperature (0.1) — overridable per-request via the API `temperature` parameter; not read from environment at runtime
+- `CONTEXT_WINDOW`: Default LLM context window (16384) — overridable per-request via the API `context_window` parameter; not read from environment at runtime
 - `LLM_MAX_WORKERS`: Max concurrent LLM requests (default: 2)
 - `LLM_MAX_RETRIES`: Number of retry attempts for failed LLM requests (default: 3)
 - `LLM_BASE_DELAY`: Base delay in seconds for exponential backoff between retries (default: 1.0)
 - `LLM_TIMEOUT`: Timeout in seconds for each LLM request (default: 600)
 - `OLLAMA_NUM_THREAD`: CPU threads for Ollama inference (default: 8)
 - `OLLAMA_NUM_GPU`: GPU layers for Ollama, 99 = all on GPU (default: 99)
-- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (default: 512)
+- `OLLAMA_NUM_PREDICT`: Max output tokens per LLM generation (RAPTOR default: 512; bundled Docker Ollama containers default to 2048)
+- `OLLAMA_NUM_PARALLEL`: Max parallel requests Ollama can handle (default: 2)
 - `RANDOM_SEED`: Set the random seed for reproducibility (default: 224)
 - `MAX_WORKERS`: Number of parallel threads used for processing (default: 75% of available CPU cores)
 - `MODEL_CACHE_TIMEOUT`: Seconds before an unused model is evicted from cache (default: 3600)
 - `LOG_LEVEL`: Controls logging verbosity **(Docker only – not consumed by the Python code)** (default: INFO)
+- `OLLAMA_CONTEXT_SIZE`: Context size passed to the bundled Ollama container (default: 16384; `cpu` / `gpu` profiles only)
+- `APP_PORT`: Host port exposed by the RAPTOR API container (default: 8002)
+- `OLLAMA_PORT`: Host port for the bundled Ollama container (default: 11435)

 The system supports both local deployment with Uvicorn and containerized deployment with Docker and docker-compose, with four profiles: `cpu` and `gpu` for deployments with a bundled Ollama container, and `cpu-external` and `gpu-external` for deployments that connect to an existing Ollama instance on the network.
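For orientation, here is a minimal sketch of bringing up the external-Ollama profile described above. It assumes the compose file reads `OLLAMA_BASE_URL`, `OLLAMA_API_KEY`, and `APP_PORT` from the shell environment and that `up` mirrors the `down` commands shown in the diff; the Ollama address is a placeholder, so adjust everything to the actual compose file.

```bash
# Hypothetical example: point RAPTOR at an Ollama instance that is already running
# elsewhere, then start the API without a bundled Ollama container.
export OLLAMA_BASE_URL="http://192.168.1.50:11434"   # assumed address of the existing Ollama
export OLLAMA_API_KEY=""                             # set only if the external Ollama requires auth
export APP_PORT=8002                                 # default host port for the RAPTOR API

docker compose --profile cpu-external up -d

# Quick check that the API is reachable (GET / is the public health check)
curl "http://localhost:${APP_PORT}/"
```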
@@ -133,7 +137,7 @@ The system supports both local deployment with Uvicorn and containerized deploym
 - `llm_model`: LLM model to use for summarization (default: gemma3:4b)
 - `embedder_model`: Model for generating embeddings (default: BAAI/bge-m3)
-- `threshold_tokens`: Maximum token limit for summaries
+- `threshold_tokens`: Token threshold that triggers splitting of oversized summaries into multiple chunks (not truncation)
 - `temperature`: Controls randomness in LLM output (default: 0.1)
 - `context_window`: Maximum context window size for LLM (default: 16384)
 - `custom_instructions`: Optional custom instructions for summarization (text chunk is added automatically)
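As a usage sketch (not part of the diff), these parameters can be combined in one call. The query-parameter style follows the cURL examples earlier in the README; passing `context_window` the same way is an assumption, and `document.json` is a placeholder file name.

```bash
# Hypothetical request: a lower threshold_tokens triggers more splitting of oversized
# summaries, with a slightly higher temperature and an explicit context window.
curl -X POST "http://localhost:8002/raptor/?threshold_tokens=2000&temperature=0.2&context_window=16384" \
  -F "file=@document.json" \
  -H "accept: application/json"
```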
@@ -143,5 +147,5 @@ The system supports both local deployment with Uvicorn and containerized deploym
 The API returns a JSON structure containing:

-- `chunks`: Array of summary objects with text, token count, cluster level, and ID
-- `metadata`: Detailed processing information including input counts, cluster counts per level, reduction ratio, model names, and processing times
+- `chunks`: Array of summary objects with `text`, `token_count`, `cluster_level`, and `id` fields (additional fields from `chunk_metadata_json` are merged at the top level of each chunk)
+- `metadata`: Detailed processing information including `input_chunks`, cluster counts per level (`level_1_clusters`, `level_2_clusters`, `level_3_clusters`, `total_clusters`), `reduction_ratio`, `llm_model`, `embedder_model`, `temperature`, `context_window`, `custom_prompt_used` (boolean), `source` (uploaded filename), and `processing_time` (broken down by level)
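To make the response shape concrete, a hedged sketch of pulling a few of the documented fields out of the returned JSON with `jq`; the field names are taken from the list above, the exact nesting may differ, and `response.json` is a placeholder file name.

```bash
# Hypothetical post-processing of a saved response.
curl -X POST "http://localhost:8002/raptor/" \
  -F "file=@document.json" \
  -H "accept: application/json" > response.json

jq '.metadata.reduction_ratio' response.json                       # overall compression achieved
jq '.metadata.level_3_clusters' response.json                      # number of top-level clusters
jq '.chunks[] | {id, cluster_level, token_count}' response.json    # id, level, and size of each summary chunk
```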