Fixing the last deviations from sentencepiece indicated by test-tokenizer-1#3170
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1#3170slaren merged 19 commits intoggml-org:masterfrom
Conversation
Work on compiler warnings.
cebtenzzre
left a comment
There was a problem hiding this comment.
I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text parameter represent a UTF-8 string, or something else?
As far as I understand it, |
Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes. |
The small print says: 'Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard)...' |
|
If the POSIX description isn't sufficient, what about MIME types? $ printf 'hello\n' >text.bin
$ printf 'hello\0\n' >not_text.bin
$ file -i text.bin
text.bin: text/plain; charset=us-ascii
$ file -i not_text.bin
not_text.bin: application/octet-stream; charset=binary |
|
@cebtenzzre : I believe that the NUL character is improbable and badly supported by software around. But this PR is aiming to improve |
|
I am not sure why POSIX is relevant here. Is there any reason to believe that the tokenizer needs to respect any part of the POSIX spec? I think this is simpler that, if the sentencepiece tokenizer can encode NUL characters, but our implementation can't, then it is a bug in our implementation and it should be fixed. Supporting non-NUL terminated strings seems the natural way to do it. |
cebtenzzre
left a comment
There was a problem hiding this comment.
Okay, I'll defer to your judgement in this case.
…izer-1 (ggml-org#3170) * Fix für ggml-org#2721 * Reenable tokenizer test for LLaMa * Add `console.cpp` dependency * Fix dependency to `common` * Fixing wrong fix. * Make console usage platform specific Work on compiler warnings. * Adapting makefile * Remove trailing whitespace * Adapting the other parts of the makefile * Fix typo. * Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 * Simplify logic * Add missing change... * Fix ugly compiler warning * llama_tokenize should accept strings containing NUL now * Adding huichen's test case
The last deviations are fixed now, too.
We don't seem to need Unicode normalization immediately.