server : display token probabilities in the UI #2489
jhen0409 merged 41 commits into ggml-org:master from
Conversation
Ah, this is fun. It works as expected on Android too.
Very nice! But maybe you could round the probabilities to 2 decimals or so?
Sure, I'll do it later. Thanks! Maybe just change the numbers to percentages.
I just had the craziest idea. I'm not requesting anything, just wondering about the viability of this idea: do you think this could be used to create spelling suggestions while typing, like on an Android keyboard?
If it were percentages, it would be cool. The byte thing should be fixed at the API level, but at the time we added the probabilities, we couldn't figure it out.
I have read PR #1962, and I'm a bit confused about this: shouldn't we improve it by converting the bytes on the UI side? I'm thinking maybe we can merge the bytes to get a readable result (also helpful for Chinese and other language users), but I'm not sure if it will have other problems.
I've confirmed that the other byte pairs cannot be decoded successfully, so I just hide them. I see that the OpenAI playground does the same thing.
I'm not very sure why it happened; maybe the completion_probabilities of some partial responses is not an array, but as far as I know, server.cpp should ensure that it is an array. I just removed the array check on completion_probabilities for messages and only check params.n_probs > 0 instead, which should avoid this problem.
```cpp
{
    // Always send partial response
    // so we can get the correct partial response of the last to_send in the client
    const json data = format_partial_response(llama, to_send, probs_output);
```
I also made the last to_send produce a partial response, so we can correctly get the probabilities of the last message (the final response includes all probabilities).
Thank you for the fixes, my testing shows it's working.
Generally, the partial response includes a single … but things like …
Confirmed it was a problem. Log: … I'll fix this later. UPDATED: The fix is here, but it was a problem with sent_token_probs_index; the above log is expected, as we need to wait for possible stop words.
examples/server/server.cpp:

```cpp
// before:
const std::string to_send = llama.generated_text.substr(pos, stop_pos);

// after:
const std::string to_send = stop_pos == std::string::npos
    ? llama.generated_text.substr(pos, std::string::npos)
    : ""; // just don't send anything if we're not done
```
Before this fix, to_send was a whitespace when it got stop_pos 1, and then sent_token_probs_index would be incorrect.
Also, I merged the master branch, so we need to use GGUF models for testing here.
I got the same in master; in this case the model responds with content like …
After fixing the newline issue, I think this can be merged. Thank you guys!
Nice one. What I'd like to have in the future is a notebook mode (so basic completion instead of chat). Do you have any plans for that? I could maybe hack it together a few weeks from now when I'm a bit less busy.
I'm also thinking about having pure text completion in the web UI, but the plan is not very clear yet. Currently I'm using the vim plugin for that, but the web UI could provide more visual capabilities. It's a low priority for me, but interesting.







#2423
This is a simple implementation for displaying the probabilities of the llama response.
It renders a popover for each token. The popover is based on preact-portal; it's short, so I made some modifications and copied it into index.html.
Dark mode: (screenshot)
Light mode: (screenshot)
For bytes, I just add a bottom border line to split them: https://github.com/ggerganov/llama.cpp/assets/3001525/ad92444e-58cc-445a-b8a9-44704236e285 (screenshots updated after 04b6f2c)
We can set `More options` -> `Show Probabilities` to use the `n_probs` param.