Conversation

@ServeurpersoCom (Collaborator) commented Dec 21, 2025

Make sure to read the contributing guidelines before submitting a PR

Add batch-level prompt preprocessing progress

Track n_batches_total and n_batches_processed per slot. Emit prompt_progress
chunks after each llama_decode() during prompt processing. This activates automatically
when the prompt requires 2+ batches (controlled by the -b flag).
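
For illustration only, here is the bookkeeping described above as a minimal Python sketch (the real change lives in the C++ server; the names and the emit callback are made up):

import math

# Illustrative sketch of the per-slot batch tracking described above.
# emit_progress_chunk stands in for streaming a prompt_progress chunk to the client.
def process_prompt(n_prompt_tokens: int, n_batch: int, emit_progress_chunk) -> None:
    n_batches_total = math.ceil(n_prompt_tokens / n_batch)
    n_batches_processed = 0
    for start in range(0, n_prompt_tokens, n_batch):
        # ... llama_decode() would run here on tokens [start, start + n_batch) ...
        n_batches_processed += 1
        processed = min(start + n_batch, n_prompt_tokens)
        # Only emit intermediate progress when the prompt spans 2+ batches.
        if n_batches_total >= 2 and n_batches_processed < n_batches_total:
            emit_progress_chunk({"total": n_prompt_tokens, "processed": processed})

# process_prompt(509, 128, print) would emit progress at 128, 256 and 384 tokens.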

Setup (a CPU-only model was added on a testing server for easier testing):

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off                 ; Disable automatic memory fitting
ngl = 999                 ; Full GPU offload
ctk = q8_0                ; KV cache key quantization
ctv = q8_0                ; KV cache value quantization
fa = on                   ; Enable flash attention
mlock = on                ; Lock model in RAM
np = 4                    ; Parallel request batching
kvu = on                  ; Unified KV cache buffer
sleep-idle-seconds = 3600 ; Unload weights on child process
b = 128                   ; Logical maximum batch size (default: 2048)
ub = 512                  ; Physical maximum batch size (default: 512)

; Testing prompt progress on CPU
[CPU-MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
ngl = 0                   ; No GPU offload
device = none             ; Disable GPU device
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
c = 32768

...Other GPU or hybrid models...

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072                ; Context size in tokens for this model
load-on-startup = 1       ; Load immediately on server startup
...

Backend testing command

# OpenAI test (big prompt to force slow preprocessing)
curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

# Anthropic test ("cache_prompt": false doesn't work!) OK, no need: it is a proprietary chunk for the WebUI
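
For reference, a small Python client can consume the same stream and surface the prompt_progress chunks; a minimal sketch assuming the requests library, with the URL and payload mirroring the curl test above:

import json
import requests

# Mirrors the curl test above; adjust URL / model to your own server.
url = "https://www.serveurperso.com/ia/webui/v1/chat/completions"
payload = {
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "Test " * 500}],
    "stream": True,
    "max_tokens": 10,
    "cache_prompt": False,
}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        progress = chunk.get("prompt_progress")
        if progress:
            print(f"prompt processing: {progress['processed']}/{progress['total']} tokens "
                  f"({progress['time_ms']} ms)")
        else:
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                print(delta, end="", flush=True)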

Close #17079

@ngxson (Collaborator) commented Dec 21, 2025

IMO hooking into eval_cb can be quite risky and messy if the backend scheduler works in an asynchronous way.

Also, technically speaking, the backend never processes token-by-token. The whole batch of tokens is represented as a 2D matrix and they are all processed at once.

To have more frequent updates, simply lower the number of tokens for each batch (controlled via the -b and -ub args).

@ServeurpersoCom (Collaborator, Author) commented Dec 21, 2025

Also, technically speaking, the backend never processes token-by-token. The whole batch of tokens is represented as a 2D matrix and they are all processed at once.

Right. Seen from this perspective, if I'm doing fake-time interpolation on the backend, it's not even worth trying to make it smooth; it's better to just track progress for each batch! I'll start again:

Track total batches (n_tokens / n_batch) and increment after each llama_decode() call. Progress chunks will only appear when there are 2+ batches (which happens automatically with large prompts), and users can reduce -b/-ub for finer granularity if needed. Much cleaner approach, no core callbacks required.
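
For a sense of the granularity, a quick illustrative calculation of how many prompt-processing batches (and hence potential progress updates) a 509-token prompt yields at a few -b values:

import math

n_prompt_tokens = 509  # size of the test prompt used below
for n_batch in (2048, 512, 128, 64):
    print(n_batch, math.ceil(n_prompt_tokens / n_batch))
# -> 2048: 1 batch (no intermediate updates), 512: 1, 128: 4, 64: 8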

@ExtReMLapin (Contributor):

Why not just use streamed prompt_progress object ???

@ServeurpersoCom (Collaborator, Author):

Why not just use streamed prompt_progress object ???

Yes, I already had this working with high-frequency emission (100ms intervals). Now reimplementing it at batch frequency as suggested by ngxson: cleaner approach.

@ServeurpersoCom (Collaborator, Author) commented Dec 22, 2025

I track total batches and increment after each llama_decode(). The existing prompt_progress object is streamed at batch boundaries with estimated token counts. It only activates when there are 2+ batches, so large prompts automatically get progress updates.
Tested with b=128 on a 509-token prompt, this produced 3 progress chunks showing 127, 254, and 381 tokens processed:

...,"prompt_progress":{"total":509,"cache":0,"processed":127,"time_ms":1901}}
...,"prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}}
...,"prompt_progress":{"total":509,"cache":0,"processed":381,"time_ms":6079}}

(root|~/llama.cpp.pascal) curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

<- synthetic chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412609,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":127,"time_ms":1901}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412611,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412613,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":381,"time_ms":6079}}

<- normal chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"It"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" looks"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" like"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ve"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" past"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"ed"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" long"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" sequence"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"length","index":0,"delta":{}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":509,"prompt_ms":8337.839,"prompt_per_token_ms":16.380823182711197,"prompt_per_second":61.04699311176433,"predicted_n":10,"predicted_ms":390.215,"predicted_per_token_ms":39.021499999999996,"predicted_per_second":25.626897992132545}}

data: [DONE]

@ngxson (Collaborator) commented Dec 22, 2025

Yes, I already had this working with high-frequency emission (100ms intervals). Now reimplementing it at batch frequency as suggested by ngxson: cleaner approach.

maybe you misunderstood the question from @ExtReMLapin

he meant that we already had this exact function that you are trying to implement in this PR, and its name is prompt_progress. looking at server docs:

return_progress: Include prompt processing progress in stream mode. The progress will be contained inside prompt_progress with 4 values: total, cache, processed, and time_ms. The overall progress is processed/total, while the actual timed progress is (processed-cache)/(total-cache). The time_ms field contains the elapsed time in milliseconds since prompt processing started. Default: false

I don't get why you need to add some extra calculations in this PR for that - are there any cases where the current progress calculation is wrong? (asking to make sure that we are not duplicating effort here)

slot.t_start_generation = 0;

const int32_t batch_size = std::max<int32_t>(1, llama_n_batch(ctx));
slot.n_batches_total = (slot.task->n_tokens() + batch_size - 1) / batch_size;

Collaborator:

this calculation is likely wrong as it doesn't take into account cached tokens

Collaborator Author:

Thanks for the catch on cached tokens! I haven't tested the cache scenario yet. I'll apply and test your suggested fix.
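
To illustrate the reviewer's point, the same calculation with and without the cached tokens, in Python for brevity (the numbers and variable names are hypothetical, not the actual slot fields):

import math

n_prompt_tokens = 509   # total prompt tokens for the request
n_cached_tokens = 256   # e.g. a prefix reused from a previous request
n_batch = 128           # logical batch size (-b)

# Only the non-cached tokens go through llama_decode(), so the batch count
# should be based on (n_prompt_tokens - n_cached_tokens).
wrong = math.ceil(n_prompt_tokens / n_batch)                      # 4, ignores the cache
right = math.ceil((n_prompt_tokens - n_cached_tokens) / n_batch)  # 2
print(wrong, right)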

@ServeurpersoCom (Collaborator, Author):

Yes, I already had this working with high-frequency emission (100ms intervals). Now reimplementing it at batch frequency as suggested by ngxson: cleaner approach.

maybe you misunderstood the question from @ExtReMLapin

he meant that we already had this exact function that you are trying to implement in this PR, and its name is prompt_progress. looking at server docs:

return_progress: Include prompt processing progress in stream mode. The progress will be contained inside prompt_progress with 4 values: total, cache, processed, and time_ms. The overall progress is processed/total, while the actual timed progress is (processed-cache)/(total-cache). The time_ms field contains the elapsed time in milliseconds since prompt processing started. Default: false

I don't get why you need to add some extra calculations in this PR for that - are there any cases where the current progress calculation is wrong? (asking to make sure that we are not duplicating effort here)

The existing prompt_progress with return_progress: true only emits once after prompt processing completes. This PR streams progress during processing at batch boundaries. It's about real-time updates, not a final summary. Without --stream-prompt-progress, behavior is unchanged.

@ngxson (Collaborator) commented Dec 22, 2025

The existing prompt_progress with return_progress: true only emits once after prompt processing completes.

No, it is supposed to be sent on each batch (in real time):

def test_return_progress(n_batch, batch_count, reuse_cache):

Maybe something is wrong with your test (I assume?), but this will return on each processed batch. If it doesn't, you may be re-using cached prompts.

@ServeurpersoCom (Collaborator, Author):

The existing prompt_progress with return_progress: true only emits once after prompt processing completes.

No, it is supposed to be sent on each batch (in real time):

def test_return_progress(n_batch, batch_count, reuse_cache):

Maybe something is wrong with your test (I assume?), but this will return on each processed batch. If it doesn't, you may be re-using cached prompts.

Oh OK, I never tested "return_progress": true on the API with a large prompt / small batch! I'll try it... It's possible that from the beginning, all that was needed was front-end work.

@ServeurpersoCom (Collaborator, Author):

The existing implementation streams progress at each batch boundary exactly as intended. I completely missed this during my initial testing: sorry for the noise!

Closing this PR. Thanks for the patience and for pointing me to "test_return_progress".

(root|~) curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "return_progress": true,
    "cache_prompt": false,
    "max_tokens": 10
  }'

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421373,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":128,"time_ms":2087}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421375,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":256,"time_ms":4222}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421378,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":384,"time_ms":6470}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":509,"time_ms":8745}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"It"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" looks"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" like"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ve"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" past"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"ed"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" long"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" sequence"}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"length","index":0,"delta":{}}],"created":1766421380,"id":"chatcmpl-UAryn5Ky3sZDrPbYttKJB8tP2iKE6TNx","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":509,"prompt_ms":8745.15,"prompt_per_token_ms":17.181041257367387,"prompt_per_second":58.20369004533942,"predicted_n":10,"predicted_ms":406.724,"predicted_per_token_ms":40.672399999999996,"predicted_per_second":24.5866976131234}}

data: [DONE]

@ngxson (Collaborator) commented Dec 22, 2025

IIRC return_progress was added for this exact use case, and ideally what will be displayed on webui is (processed-cache)/(total-cache) as mentioned in the help docs (that will match the UX on LM Studio)

Some downstream projects also use this exact feature for receiving real-time prompt processing progress, and so far I haven't received any issues about it
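
For a downstream client or the webui, the two ratios from the quoted docs as a small sketch (not part of the server code):

def progress_fractions(pp: dict) -> tuple[float, float]:
    # Overall vs timed progress, as described in the return_progress docs.
    # Assumes total > cache, i.e. at least one token still has to be processed.
    overall = pp["processed"] / pp["total"]
    timed = (pp["processed"] - pp["cache"]) / (pp["total"] - pp["cache"])
    return overall, timed

# Example with one of the chunks above (cache == 0, so both ratios are equal):
print(progress_fractions({"total": 509, "cache": 0, "processed": 128, "time_ms": 2087}))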

Successfully merging this pull request may close: Feature Request: webui: add parsing progress