
Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed

wizd wants to merge 8 commits into ggml-org:master from wizd:master

Conversation

@wizd

@wizd wizd commented Mar 13, 2023

The tokenization process of LLaMA is filled with magic numbers and not easily replicable. However, I have found that using the SentencePiece library works well. It's possible that the original LLaMA model also used SentencePiece for its tokenization.

test prompt: '我静静的坐在雨中,思考着' (I sit quietly in the rain, thinking)

This sentence is tokenized almost entirely into <0x??> byte-fallback tokens, which makes the output very hard to reproduce.

[Screenshot: 2023-03-13 at 5:14:58 PM]

@baifachuan

baifachuan commented Mar 13, 2023

make fails to compile:

/usr/local/include/sentencepiece_processor.h:683:40: error: ‘absl::string_view’ has not been declared
  683 |   util::Status ParseExtraOptions(absl::string_view extra_option,
      |                                        ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:13: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |             ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:38: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |                                      ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                         ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
/usr/local/include/sentencepiece_processor.h:692:54: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                      ^~~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                         ^~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 2 is invalid
/usr/local/include/sentencepiece_processor.h:721:35: error: ‘string_view’ is not a member of ‘absl’
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:721:59: error: expected primary-expression before ‘*’ token
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                           ^
/usr/local/include/sentencepiece_processor.h:721:60: error: ‘model_proto’ was not declared in this scope; did you mean ‘ModelProto’?
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                            ^~~~~~~~~~~
      |                                                            ModelProto
/usr/local/include/sentencepiece_processor.h:724:35: error: ‘string_view’ is not a member of ‘absl’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:724:48: error: expected primary-expression before ‘const’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                                ^~~~~
utils.cpp: In function ‘std::vector<int> llama_tokenize(const gpt_vocab&, const string&, bool)’:
utils.cpp:291:13: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
  291 |     sp.Load("./models/tokenizer.model");
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |             |
      |             const char*
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:244:47: note:   initializing argument 1 of ‘virtual sentencepiece::util::Status sentencepiece::SentencePieceProcessor::Load(int)’
  244 |   virtual util::Status Load(absl::string_view filename);
      |                             ~~~~~~~~~~~~~~~~~~^~~~~~~~
utils.cpp:294:27: error: cannot convert ‘const string’ {aka ‘const std::__cxx11::basic_string<char>’} to ‘int’
  294 |     return sp.EncodeAsIds(text);
      |                           ^~~~
      |                           |
      |                           const string {aka const std::__cxx11::basic_string<char>}
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:457:58: note:   initializing argument 1 of ‘virtual std::vector<int> sentencepiece::SentencePieceProcessor::EncodeAsIds(int) const’
  457 |   virtual std::vector<int> EncodeAsIds(absl::string_view input) const {
      |                                        ~~~~~~~~~~~~~~~~~~^~~~~
make: *** [Makefile:185: utils.o] Error 1

@wizd
Author

wizd commented Mar 13, 2023

You need to install the sentencepiece library manually. On macOS:
https://github.com/google/sentencepiece#build-and-install-using-vcpkg

@lucasjinreal

@wizd I'm using your fork, and interactive mode does not work:

[screenshot]

It gets stuck in an infinite loop.

@ggerganov
Member

ggerganov commented Mar 13, 2023

Resolved in #79

@ggerganov ggerganov closed this Mar 13, 2023
@ggerganov
Member

Oh wait, did I get confused?
#79 does not resolve the tokenizer issues?

@kharvd
Contributor

kharvd commented Mar 13, 2023

I think it does. Are you still able to reproduce the issues?

@ggerganov
Member

I reran the convert script and I get the following:

make -j && ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -s 11 -p "我静静的坐在雨中,思考着"
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 11
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: '我静静的坐在雨中,思考着'
main: number of tokens in prompt = 2
     1 -> ''
 30672 -> '我'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


我们已经开始了。 (We've already begun.)
������行������。 (A camel caravan travels in a circle. )
The above-mentioned idioms and phrases are what I found on Chinese websites when googling

main: mem per token = 22439492 bytes
main:     load time =  2962.67 ms
main:   sample time =    59.34 ms
main:  predict time =  5717.17 ms / 87.96 ms per token
main:    total time = 10370.07 ms

There are still two problems:

  • The prompt is not converted to tokens
  • The generated text has invalid characters

@j-f1
Contributor

j-f1 commented Mar 13, 2023

Can you try running from shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly.

You’ll also need to re-generate your models from scratch since this PR changes how the ggml files are created.

@kharvd
Contributor

kharvd commented Mar 13, 2023

@ggerganov just in case: did you re-run the quantization script as well?

@ggerganov
Member

@ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

@wizzard0
Contributor

wizzard0 commented Mar 13, 2023 via email

@lucasjinreal

Has this been merged into master? How do I test it? wizd's branch doesn't work well in interactive mode.

@zhoujian1028

@ggerganov
[screenshot]
The prompt is not converted to tokens. How did you solve it? Thanks!

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
Fix TypeError in low_level chat
InfernalDread pushed a commit to InfernalDread/llama.cpp that referenced this pull request Apr 23, 2026
…l-fix

vulkan: fix turbo3 build + coopmat FA after April upstream sync
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* iq3_k: fix Metal dot product

I was accessing the scales as 4-byte aligned, but iq3_k is
not 4-byte aligned. Instead of throwing an error (as it happens
on CUDA when one makes this mistake), Metal silently accepts
and we get garbage.

* iq3_k: slightly faster Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

8 participants