Add support for tokenize and untokenize of UTF-8 encoding in prompt/output#87
wizd wants to merge 8 commits into ggml-org:master from
Conversation
|
make doesn't compile; |
|
Need to add sentencepiece library manually |
|
@wizd I'm using your fork, and interactive mode doesn't work: it gets stuck in an infinite loop. |
|
Resolved in #79 |
|
Oh wait, did I get confused? |
|
I think it does. Are you still able to reproduce the issues? |
|
I reran it. There are still 2 problems:
|
|
Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly. You'll also need to re-generate your models from scratch, since this PR changes how the ggml files are created. |
|
@ggerganov just in case: did you re-run the quantization script as well? |
Oops .. all good now 🦙 |
|
Suggestion: can we add a magic version number? I feel we'll get further updates.
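A magic number plus a version field at the start of the file is the usual way to make a binary format self-identifying and evolvable. A minimal sketch of the idea in Python — the magic value 0x67676d6c ("ggml" as ASCII) and the single version field are illustrative assumptions, not the actual ggml file layout:

```python
import io
import struct

MAGIC = 0x67676d6c  # "ggml" packed into a 32-bit value; illustrative choice
VERSION = 1         # bump whenever the file layout changes

def write_header(f):
    # Two little-endian uint32s at the start of the file: magic, then version.
    f.write(struct.pack("<II", MAGIC, VERSION))

def read_header(f):
    magic, version = struct.unpack("<II", f.read(8))
    if magic != MAGIC:
        raise ValueError("not a ggml file (bad magic)")
    if version != VERSION:
        raise ValueError(f"unsupported file version {version}")
    return version

buf = io.BytesIO()
write_header(buf)
buf.seek(0)
print(read_header(buf))  # → 1
```

With this in place, a loader can fail fast with "unsupported file version" instead of silently mis-parsing models generated before a format change.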
|
|
Has this been merged into master? How do I test it? wizd's branch doesn't work well in interactive mode. |
|
@ggerganov |


The tokenization process of LLaMA is filled with magic numbers and not easily replicable. However, I have found that using the SentencePiece library works well. It's possible that the original LLaMA model also used SentencePiece for its tokenization.
test prompt: '我静静的坐在雨中,思考着'
I sit quietly in the rain, thinking
This sentence gets tokenized largely into <0x??> byte tokens, making it very difficult to replicate.
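Those <0x??> tokens come from SentencePiece-style byte fallback: a multibyte UTF-8 character that isn't in the vocabulary is emitted as one token per raw byte, so correct detokenization has to accumulate bytes and decode the whole buffer at the end rather than decoding token by token. A minimal sketch in pure Python (the function name and token format are illustrative, not llama.cpp's actual API):

```python
import re

# Matches SentencePiece byte-fallback tokens such as "<0xE9>".
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def detokenize_byte_tokens(tokens):
    """Join a token stream in which unknown multibyte UTF-8 characters
    appear as byte-fallback tokens.  Plain tokens contribute their UTF-8
    encoding; byte tokens contribute a single raw byte.  Decoding happens
    once on the full buffer, so a character split across several byte
    tokens is reassembled correctly."""
    buf = bytearray()
    for tok in tokens:
        m = BYTE_TOKEN.fullmatch(tok)
        if m:
            buf.append(int(m.group(1), 16))
        else:
            buf.extend(tok.encode("utf-8"))
    return buf.decode("utf-8")

# '雨' (rain) is 0xE9 0x9B 0xA8 in UTF-8, so a byte-fallback tokenizer
# may emit it as three separate tokens:
tokens = ["I sit in the ", "<0xE9>", "<0x9B>", "<0xA8>"]
print(detokenize_byte_tokens(tokens))  # → I sit in the 雨
```

Decoding each byte token individually would fail here, since a lone 0xE9 is not valid UTF-8 on its own; buffering the bytes first is what makes prompts like the Chinese test sentence above round-trip.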