Add BPE pre-tokenization for Command-R/R+ #7063
Conversation
This PR and #7041 have different regexes. Which one is correct?
@slaren There is a mention of an assumption about digits, which I haven't included but can add if needed. The regex in this PR has been tested with
Hi, does this mean that Command-R was always running at reduced quality, and we just didn't know until recently? Or have the recent Llama 3 changes to the llama.cpp tokenizer resulted in this update being needed to get it back to where it was before the Llama 3 changes went in? |
I haven't really tested Command-R with any math or numbers before, but isn't this a similar issue to Llama 3, where runs of digits were grouped and tokenized incorrectly?
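Since the discussion is about how runs of digits are split before BPE, here is a minimal sketch of the difference. The patterns below are hypothetical simplifications for illustration only; the actual Command-R pre-tokenizer regex is far more elaborate and uses Unicode categories such as `\p{N}`.

```python
import re

# Hypothetical simplified patterns (NOT the real pre-tokenizer regex):
# grouping a run of digits vs. splitting each digit individually changes
# which pieces the subsequent BPE merge step operates on.
grouped    = re.findall(r"[A-Za-z]+|[0-9]+|\S", "x = 12345")  # digits kept as one run
individual = re.findall(r"[A-Za-z]+|[0-9]|\S",  "x = 12345")  # each digit on its own

print(grouped)     # ['x', '=', '12345']
print(individual)  # ['x', '=', '1', '2', '3', '4', '5']
```

With individual digits, BPE always sees the same single-digit tokens regardless of the surrounding number, which is the behavior the individual-digits regex in this PR enforces.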
I had to update to the new `transformers` in `requirements/requirements-convert.txt`:

```diff
diff --git a/requirements/requirements-convert.txt b/requirements/requirements-convert.txt
index a3d6ecec..5520ba73 100644
--- a/requirements/requirements-convert.txt
+++ b/requirements/requirements-convert.txt
@@ -1,5 +1,5 @@
 numpy~=1.24.4
 sentencepiece~=0.1.98
-transformers>=4.35.2,<5.0.0
+transformers>=4.40.1,<5.0.0
 gguf>=0.1.0
 protobuf>=4.21.0,<5.0.0
```

Otherwise, I got this error:
Let's rebase on latest
The branch was updated from 7bfc01b to d5d6731.
@ggerganov Thanks, the PR has been rebased and I added the transformers change. |
* Add BPE pre-tokenization for Command-R/R+.
* Bump transformers convert requirement.
* command-r : add individual digits regex

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>




This replaces PR #7033 as a result of merging PR #6511.
Closes #7030 and #7040.