Fix logprobs when multiple tokens are returned at once. #141
zewt wants to merge 2 commits into theroyallab:main
Conversation
Force-pushed from e55d3c7 to eefb572.
This also brings the two chat/completions code paths back into alignment.
Token healing is tricky. I think the behavior I described above is correct: if the tokens in logprobs are "http://" and "https://", but "http" is part of the token-healing overlap and not actually output, the API should strip it from logprobs too and return "://" and "s://". But doing this needs more information from exllamav2. I tried implementing it by looking at the lengths of the tokens to make a guess, but that isn't correct in general (for example, it's wrong if skip_special_tokens is false). text_offset has a similar problem: it advances by the length of the token, which is wrong in several cases (skip_special_tokens false, token healing, perhaps others). The same information from exllamav2 might help here too, e.g. having exllamav2 calculate the offset into the text, since it has the missing information to do this correctly. I'm also still not sure whether text_offset is meant to be measured from the start of the response or from the start of the context, since I can't find OAI docs for it. I think these are separate issues and should be explored separately from this patch.
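A minimal sketch of what that trimming could look like, assuming the engine reported how many characters of the emitted token were regenerated by token healing (`healed_prefix_len` is a hypothetical value here, not an existing exllamav2 field; it is exactly the missing information described above):

```python
# Sketch only: healed_prefix_len is hypothetical; exllamav2 does not currently
# expose this value.
def trim_healed_prefix(top_logprobs: dict[str, float], healed_prefix_len: int) -> dict[str, float]:
    """Strip the token-healing overlap so logprob keys contain only newly generated text."""
    if healed_prefix_len <= 0:
        return top_logprobs
    return {token[healed_prefix_len:]: lp for token, lp in top_logprobs.items()}

# The example above: "http" (4 chars) is overlap already present in the context,
# so "http://" and "https://" become "://" and "s://". Logprob values are illustrative.
print(trim_healed_prefix({"http://": -0.2, "https://": -1.7}, healed_prefix_len=4))
```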
Here's a simple repro. This is with turboderp/Llama-3.1-8B-Instruct-exl2_5.0bpw, but I think any Llama version will reproduce it. The "Helloxxx" stop sequence just makes it easier to reproduce, since it triggers buffering in exllamav2 when the model says "Hello". The attached response.txt shows the issues from the original example.
One thing this doesn't show is that "tokens" always uses the first entry from top_logprobs instead of the token that was actually chosen. To see that, set temperature to 2 and change the prompt to just "Hello!". The tokens array will be completely different from the actual results: the higher temperature causes different tokens to be chosen, but the tokens array doesn't reflect them.
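For convenience, here's roughly what that repro request looks like. This is a hedged sketch: the base URL, port, and `x-api-key` header are assumptions about a local tabbyAPI setup, and the model from the comment above is assumed to already be loaded server-side.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",   # assumed local tabbyAPI endpoint
    headers={"x-api-key": "YOUR_API_KEY"},     # assumed auth setup
    json={
        "prompt": "Hello!",        # per the comment above, for the tokens-array mismatch
        "max_tokens": 20,
        "logprobs": 5,             # request top-5 candidates per generated token
        "stop": ["Helloxxx"],      # triggers buffering in exllamav2 when the model says "Hello"
        "temperature": 2,          # high temperature makes the mismatch obvious
    },
    timeout=60,
)
print(resp.json()["choices"][0]["logprobs"])
```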
This fixes a few issues with logprobs:
Here's an example of the current output. To reproduce this more easily, I set "Helloxxx" as a stop string, which causes "Hello" + " !" to be returned together by exllamav2:
Note that "tokens" is "Hi", even though the actual text is "Hello!", and the logprobs for the two are lumped together. With this update:
On the chat completion side, with a similar output where "Hello" + "!" are returned together:
The tokens are mismatched: the "!" token is missing and the top_logprobs are off by one. This now returns:
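For context, since the example outputs aren't reproduced here: the two response shapes being discussed follow the OpenAI API, roughly as below. The values are illustrative only, not captured from this PR.

```python
# Illustrative shapes only (values are made up). The completions endpoint uses
# parallel arrays, while chat completions nests per-token objects under "content".
completion_logprobs = {
    "tokens": ["Hello", "!"],
    "token_logprobs": [-0.11, -0.52],
    "top_logprobs": [
        {"Hello": -0.11, "Hi": -2.31},
        {"!": -0.52, ".": -1.87},
    ],
    "text_offset": [0, 5],
}

chat_completion_logprobs = {
    "content": [
        {
            "token": "Hello",
            "logprob": -0.11,
            "top_logprobs": [
                {"token": "Hello", "logprob": -0.11},
                {"token": "Hi", "logprob": -2.31},
            ],
        },
        {
            "token": "!",
            "logprob": -0.52,
            "top_logprobs": [
                {"token": "!", "logprob": -0.52},
                {"token": ".", "logprob": -1.87},
            ],
        },
    ],
}
```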
A couple of things still need to be figured out:
I'm not sure whether text_offset is supposed to be the offset into the text string (this is close to what it was doing before, so I went with that for now) or the offset into the full context. I can't find OAI docs on this, but from some API snippets I've seen it might be the latter. (It's simple to derive from the other data, so maybe nobody's actually using this field right now.)
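On "simple to derive": under the interpretation used here (offsets measured from the start of the generated text, with the per-entry tokens concatenating to that text), the derivation is just a running sum of token lengths. A minimal sketch:

```python
# Minimal sketch, assuming text_offset is measured from the start of the
# generated text and that the per-entry tokens concatenate to that text.
def derive_text_offsets(tokens: list[str]) -> list[int]:
    offsets, pos = [], 0
    for token in tokens:
        offsets.append(pos)
        pos += len(token)
    return offsets

print(derive_text_offsets(["Hello", "!"]))  # [0, 5]
```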
Results are odd when token healing is enabled, since the regenerated initial token is included in the list. For example, if the context was "https://", and token healing backs up by three characters and generates "://www", it currently returns that whole underlying token (and a text_offset of -3, since the token starts three characters before the start of the output). But from the client's perspective all that the model actually generated was "www". The token healing overlap should probably be trimmed off from the output, so concatenating the "token" in each entry always gives the same result as "text". I'll return to this after discussion.
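If the overlap does end up being trimmed as proposed, the resulting invariant is easy to check. A hypothetical sanity check (names and helper are illustrative, not part of this PR):

```python
# Hypothetical check for the proposed behavior: after trimming the healing
# overlap, concatenating the per-entry tokens reproduces the returned text,
# and the derived offsets stay non-negative.
def logprobs_consistent(tokens: list[str], text: str, text_offset: list[int]) -> bool:
    return "".join(tokens) == text and all(off >= 0 for off in text_offset)

# e.g. for the "https://" example above: tokens should be ["www"], not ["://www"].
print(logprobs_consistent(["www"], "www", [0]))  # True
```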