Fix converter for internlm2 #8321
Merged
ggerganov merged 3 commits into ggml-org:master on Jul 10, 2024
Conversation
Force-pushed from bb4c5c8 to 32b6d12
compilade reviewed on Jul 7, 2024
compilade approved these changes on Jul 9, 2024
Comment on lines +2148 to +2150:
# take care of unused raw token
if piece.startswith('[UNUSED'):
    toktype = SentencePieceTokenTypes.UNKNOWN
compilade (Collaborator):
I wonder, does it work without this if the checks for UNKNOWN below are replaced with checks for UNUSED?
Author (Contributor):
Hi, these raw tokens are in fact used: they are updated from added_tokens_decoder. There are assertions at lines 2180 and 2199, which are aligned with what is done for phi3. Without this check the tokens would all be read as SentencePieceTokenTypes.NORMAL, and the assertions would fail if this were removed or changed to the UNUSED type.
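For context, a minimal sketch of the flow under discussion, using simplified, assumed names rather than the actual convert_hf_to_gguf.py code: the [UNUSED...] placeholders are first typed UNKNOWN, then overwritten from added_tokens_decoder, whose assertions expect exactly that placeholder type.

```python
from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):
    NORMAL = 1
    UNKNOWN = 2
    CONTROL = 3
    USER_DEFINED = 4

def classify_piece(piece: str, score: float):
    toktype = SentencePieceTokenTypes.NORMAL
    # take care of unused raw tokens (the hunk under review)
    if piece.startswith('[UNUSED'):
        toktype = SentencePieceTokenTypes.UNKNOWN
    return piece.encode("utf-8"), score, toktype

def apply_added_tokens(tokens: list, toktypes: list, added_tokens_decoder: dict):
    # added_tokens_decoder maps token ids to surface forms; the simplified
    # {token_id: content} shape here is an assumption for this sketch
    for token_id, content in added_tokens_decoder.items():
        # mirrors the assertions near lines 2180/2199: the slot being
        # replaced must still hold the UNKNOWN placeholder type
        assert toktypes[token_id] == SentencePieceTokenTypes.UNKNOWN
        tokens[token_id] = content.encode("utf-8")
        toktypes[token_id] = SentencePieceTokenTypes.USER_DEFINED
```

Under this reading, typing the placeholders as UNUSED instead of UNKNOWN would trip the assertion, which matches the author's explanation above.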
compilade added a commit that referenced this pull request on Jul 10, 2024:
This makes the changes from #8321 more consistent with the other changes made here.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Jul 13, 2024:
* update internlm2
* remove unused file
* fix lint
compilade added a commit that referenced this pull request on Jul 14, 2024:
* llama : fix mpt and olmo pre-tokenizer
* llama : pre-tokenize non-special user-defined tokens first
* llama : fix detection of control-like user-defined tokens
* convert_hf : identify which user-defined tokens are control tokens
  Only used in _set_vocab_gpt2() for now.
* convert_hf : identify more added control tokens for SPM tokenizers
  This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (The implementation in llama.cpp has the same behavior as SentencePiece.)
* llama : fix wrong pre-tokenization of byte tokens
* llama : fix Viking pre-tokenizer regex
  The order was previously wrong, which caused errors in some tests.
* llama : fix command-r detokenization
* convert_hf : reduce usages of the UNKNOWN token type
* llama : add UNKNOWN tokens in the special tokens cache
* convert_hf : reduce usages of UNKNOWN for InternLM2
  This makes the changes from #8321 more consistent with the other changes made here.
* test-tokenizer-random : reduce potential conflicts with #8379
* test-tokenizer-random : add a failing edge case for falcon
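The "identify which user-defined tokens are control tokens" step above is the part most relevant to InternLM2. A hedged sketch of one plausible heuristic, with assumed names (this is not the exact rule in convert_hf_to_gguf.py):

```python
# Hedged sketch of the "identify which user-defined tokens are control
# tokens" idea. The heuristic below is an assumption for illustration,
# not the exact rule in convert_hf_to_gguf.py.
def looks_like_control_token(content: str, is_special: bool) -> bool:
    if is_special:
        # tokens flagged "special" in tokenizer_config.json never occur in
        # plain text, so they can safely be exported as CONTROL
        return True
    # marker-style tokens such as <|im_end|> behave like control tokens
    # even when the tokenizer config does not flag them as special
    return content.startswith("<|") and content.endswith("|>")

assert looks_like_control_token("<|im_end|>", is_special=False)
assert looks_like_control_token("</s>", is_special=True)
assert not looks_like_control_token("hello", is_special=False)
```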
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request on Jul 15, 2024 (same commit message as the Jul 14 commit above, with ggml-org#-prefixed cross-repo references).
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Jul 27, 2024 (same commit message as the Jul 14 commit above).
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026 (same commit message as the Jul 13 commit above).
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026 (same commit message as the Jul 14 commit above).
PR description: fix InternLM2 GGUF conversion so that the model outputs the EOS token at the end of each conversation.
Output example:
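A rough sketch of what this fix guarantees, with a hypothetical helper name and illustrative token ids rather than the converter's real code: the conversation-ending piece <|im_end|> must resolve to a token id that the GGUF metadata records as EOS, otherwise generation never stops at the end of a turn.

```python
# Hedged sketch, not the converter's real code: the conversation-ending
# piece must map to a token id the runtime recognizes as EOS. The helper
# name and token ids below are illustrative assumptions.
def find_chat_eos_id(vocab: dict, eos_piece: str = "<|im_end|>") -> int:
    if eos_piece not in vocab:
        raise ValueError(f"tokenizer has no {eos_piece!r} token; the "
                         "converted model would not stop after a reply")
    return vocab[eos_piece]

vocab = {"<s>": 1, "</s>": 2, "<|im_end|>": 92542}  # ids illustrative
print(find_chat_eos_id(vocab))  # -> 92542
```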