feat: support Granite-Docling model #2109
Open
dhdaines wants to merge 5 commits into abetlen:main from dhdaines:granite-docling
Conversation
dhdaines force-pushed the granite-docling branch from cdcadc5 to 03cf904
dhdaines (Author): Ready for review!
dhdaines force-pushed the granite-docling branch from 3a04e21 to 8790ce6
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field
- FIX CRITICAL: kv_cache_seq_rm must preserve seq_id=-1 semantics (all sequences)
  * The wrapper was incorrectly converting -1 to 0, breaking context rewind
  * This caused 'discontinuity' errors on multi-turn conversations

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params
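A minimal sketch of the bool -> enum flash-attention migration described above, assuming the enum constants are exposed under the upstream llama.cpp names (the exact Python-level spellings in this branch are an assumption, as is the backward-compatible bool on the high-level wrapper):

```python
# Sketch only: illustrates the flash_attn bool -> flash_attn_type enum migration.
# Constant names follow upstream llama.cpp (LLAMA_FLASH_ATTN_TYPE_*); their exact
# Python spellings here are assumptions, not confirmed by this PR.
import llama_cpp

# Old style (<= 0.3.x): boolean toggle; the high-level wrapper is said to keep
# accepting it for backward compatibility.
llm = llama_cpp.Llama(model_path="model.gguf", flash_attn=True)

# New style (0.4.0): low-level context params carry an enum instead of a bool,
# with AUTO (-1) letting the backend decide, DISABLED (0), and ENABLED (1).
ctx_params = llama_cpp.llama_context_default_params()
ctx_params.flash_attn_type = llama_cpp.LLAMA_FLASH_ATTN_TYPE_AUTO  # assumed constant name
```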
dhdaines force-pushed the granite-docling branch from 8790ce6 to 9cd71ba
avion23 pushed a commit to avion23/llama-cpp-python that referenced this pull request on Jan 14, 2026:

After external code review (GPT-5.2), fixed 4 critical issues:
1. CRITICAL: Fixed tokens[:-1] bug in prefix matching
   - Was silently breaking prefix matching for ALL models
   - Caused false rewind detection and cache inefficiency
   - Impact: Transformers AND recurrent models
2. CRITICAL: Implement proper reset() for recurrent models
   - Now actually clears llama_memory backend state
   - Root cause fix for 'sequence positions not consecutive' crash
   - Without this, reset was a no-op for recurrent models
3. CRITICAL: Enforce strict append policy for recurrent models
   - Prevents KV cache rewinding that's impossible without state snapshots
   - Forces full reset on history edits instead of crashing
4. Performance: Cache _is_recurrent to avoid repeated FFI calls
5. Documentation: Simplified comments and updated docstring
6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated

Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models.

Reviewed-by: GPT-5.2 (OpenAI)
Tested-by: pytest + Mistral-Small-3.2-24B
Fixes: abetlen#2108 (recurrent model crashes)
Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
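A minimal sketch of the prefix-matching pitfall in point 1 above: comparing the cached tokens against a prompt with its last token sliced off can undercount the shared prefix, so an unchanged prompt is misread as an edited history and triggers an unnecessary rewind or reset. Function and variable names are illustrative, not the library's actual code.

```python
# Illustrative sketch of the tokens[:-1] prefix-matching bug (names are hypothetical).
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count how many leading token ids the cached and incoming prompts share."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

cached = [1, 15, 27, 42, 99]          # tokens already represented in the KV cache
incoming = [1, 15, 27, 42, 99, 104]   # same history plus one new token

# Correct comparison: the whole cached sequence is a prefix of the incoming
# prompt, so only the final token (104) needs to be evaluated.
assert common_prefix_len(cached, incoming) == 5

# Buggy comparison: if the incoming prompt equals the cached one (e.g. a retried
# request), slicing off its last token makes the match look partial, which reads
# as a history edit and forces a spurious rewind (or full reset on recurrent models).
assert common_prefix_len(cached, cached[:-1]) == 4
```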
PR description (dhdaines):
Applies on top of #2108, which has the necessary changes to MTMD.
This adds chat formats to support https://huggingface.co/ggml-org/granite-docling-258M-GGUF and its ancestor, https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF (and various other SmolVLM models).
To use Granite-Docling effectively for table structure, equation, and layout extraction, special tokens must be enabled in the chat completion output, so this also adds a special flag to all of the chat completion functions, matching what --special does in llama-cli (it is enabled by default in llama-mtmd-cli).
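A hypothetical usage sketch based on the description above: the special parameter name comes from this PR, while the model filename, prompt text, and the omission of the image-handling setup from #2108 are illustrative assumptions.

```python
# Sketch only: special=True is the flag this PR describes; the model file name and
# prompt are placeholders, and image input (via the MTMD changes in #2108) is
# omitted for brevity.
from llama_cpp import Llama

llm = Llama(
    model_path="granite-docling-258M-Q8_0.gguf",  # assumed local GGUF filename
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Convert this document page to markup."}],
    special=True,  # keep special tokens (table/equation/layout markers) in the output
)
print(out["choices"][0]["message"]["content"])
```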