Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 by goerch · Pull Request #3170 · ggml-org/llama.cpp

goerch · 2023-09-14T15:16:28Z

The last deviations are fixed now, too.

We don't seem to need Unicode normalization immediately.

Work on compiler warnings.

…izer-1

common/common.cpp

cebtenzzre

I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text parameter represent a UTF-8 string, or something else?

goerch · 2023-09-14T17:59:49Z

> I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text parameter represent a UTF-8 string, or something else?

As far as I understand it, U+0000 is a valid Unicode code point that encodes to the single byte 0x00 in UTF-8, and vice versa.
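
To make the encoding claim concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The helper name `utf8_encode` is hypothetical and is not taken from llama.cpp or sentencepiece; it only illustrates that code points in the range U+0000 to U+007F, including U+0000 itself, are encoded as one byte with the same value.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as UTF-8.
// Hypothetical helper for illustration only; not part of llama.cpp.
std::string utf8_encode(uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // One-byte form: the byte equals the code point,
        // so U+0000 becomes the single byte 0x00.
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        // Two-byte form: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        // Surrogates are not valid scalar values and cannot be encoded.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            throw std::invalid_argument("surrogate code point");
        }
        // Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        // Four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

Because the result is a `std::string`, `utf8_encode(0)` has size 1 even though its only byte is NUL; the same byte passed around as a plain `const char *` would look like an empty string.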

cebtenzzre · 2023-09-14T18:09:46Z

> As far as I understand it, 0x00 is a valid Unicode point which converts to 0x00 in UTF8.

Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes.

goerch · 2023-09-14T18:17:50Z

> > As far as I understand it, 0x00 is a valid Unicode point which converts to 0x00 in UTF8.
>
> Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes.

The small print says: 'Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard)...'

cebtenzzre · 2023-09-14T18:30:21Z

If the POSIX description isn't sufficient, what about MIME types?

```console
$ printf 'hello\n' >text.bin
$ printf 'hello\0\n' >not_text.bin
$ file -i text.bin
text.bin: text/plain; charset=us-ascii
$ file -i not_text.bin
not_text.bin: application/octet-stream; charset=binary
```

goerch · 2023-09-14T18:43:41Z

@cebtenzzre: I believe that the NUL character is rare in practice and poorly supported by existing software. But this PR aims to improve sentencepiece compatibility for a specific test case. So either we go with this PR or we change the test case. How do you propose I change the test case? Ignore token 3 and code point 0? Then I can simply change the test case and retract all changes to the llama.cpp kernel.

slaren · 2023-09-14T18:53:33Z

I am not sure why POSIX is relevant here. Is there any reason to believe that the tokenizer needs to respect any part of the POSIX spec? I think it is simpler than that: if the sentencepiece tokenizer can encode NUL characters but our implementation can't, then it is a bug in our implementation and it should be fixed. Supporting non-NUL-terminated strings seems the natural way to do it.
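
As a rough illustration of why an explicit length matters, the sketch below contrasts a C-string interface with a pointer-plus-length interface. Both functions are hypothetical stand-ins (a byte-level "tokenizer" that emits one id per input byte) and do not reproduce the actual `llama_tokenize` signature or its behavior.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical byte-level "tokenizer": one token id per input byte.
// The caller passes the length explicitly, so embedded NUL bytes are
// treated like any other byte.
std::vector<int> tokenize_len(const char * text, std::size_t text_len) {
    std::vector<int> tokens;
    tokens.reserve(text_len);
    for (std::size_t i = 0; i < text_len; ++i) {
        tokens.push_back(static_cast<unsigned char>(text[i]));
    }
    return tokens;
}

// Hypothetical C-string variant: the length is discovered with strlen,
// so the input is silently cut at the first NUL byte.
std::vector<int> tokenize_cstr(const char * text) {
    return tokenize_len(text, std::strlen(text));
}
```

For a three-byte input built as `std::string s("a\0b", 3)`, `tokenize_cstr(s.c_str())` sees only the leading `a`, while `tokenize_len(s.data(), s.size())` sees all three bytes, including the NUL in the middle.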

cebtenzzre

Okay, I'll defer to your judgement in this case.

common/common.cpp

…izer-1 (ggml-org#3170)

* Fix for ggml-org#2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific
  Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1
* Simplify logic
* Add missing change...
* Fix ugly compiler warning
* llama_tokenize should accept strings containing NUL now
* Adding huichen's test case

goerch added 16 commits August 22, 2023 21:37

Fix for #2721

3d59f50

Merge branch 'master' of https://github.com/goerch/llama.cpp

84220df

Merge branch 'ggerganov:master' into master

9a953a4

Reenable tokenizer test for LLaMa

89a7277

Add console.cpp dependency

52c9ecf

Fix dependency to common

4ee2152

Fixing wrong fix.

e903d5f

Make console usage platform specific

96533e0

Work on compiler warnings.

Adapting makefile

28b7494

Remove trailing whitespace

516a0d5

Adapting the other parts of the makefile

75a20d5

Fix typo.

16bf5f2

Fixing the last deviations from sentencepiece indicated by test-token…

64b0b74

…izer-1

Merge branch 'master' of https://github.com/goerch/llama.cpp

5d528ed

Simplify logic

01b0105

Add missing change...

c7c0fcb

cebtenzzre reviewed Sep 14, 2023

common/common.cpp (outdated, resolved)

Fix ugly compiler warning

a90bf49

ggerganov approved these changes Sep 14, 2023

cebtenzzre suggested changes Sep 14, 2023

cebtenzzre approved these changes Sep 14, 2023

cebtenzzre reviewed Sep 14, 2023

common/common.cpp (outdated, resolved)

llama_tokenize should accept strings containing NUL now

e41209a

goerch mentioned this pull request Sep 16, 2023

Fix the tokenizer #2023

Closed

Adding huichen's test case

afc0d0d

slaren merged commit b08e75b into ggml-org:master Sep 16, 2023

This was referenced Sep 18, 2023

Bump llama_tokenize APIs to latest specs abetlen/llama-cpp-python#730

Closed

Segfaults now with latest llama.cpp commits abetlen/llama-cpp-python#727

Closed

slaren merged 19 commits into ggml-org:master from goerch:master

4 participants
