wpm : portable unicode tolower #6305

Merged
cebtenzzre merged 7 commits into master from ceb/wpm-portable-tolower on Mar 26, 2024

Collaborator

cebtenzzre commented Mar 25, 2024

This is a portable implementation of Unicode tolower for BERT embedding models that use the WPM (WordPiece) tokenizer.

We need this because we can't assume that the en_US.UTF-8 locale is available; see #5740 (comment).
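
For illustration only, here is a minimal sketch of one way to lowercase code points without relying on the process locale: a sorted table of (uppercase, lowercase) pairs searched with `std::lower_bound`. The names `k_lower_map` and `portable_tolower` are hypothetical and not taken from this PR, and the table holds only a handful of sample entries; a real table would be generated from the Unicode Character Database.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>

// Hypothetical excerpt of an uppercase -> lowercase code point table,
// sorted by the first element. Only a few sample entries are listed here.
static const std::pair<uint32_t, uint32_t> k_lower_map[] = {
    {0x0041, 0x0061}, // LATIN CAPITAL LETTER A -> a
    {0x005A, 0x007A}, // LATIN CAPITAL LETTER Z -> z
    {0x00C9, 0x00E9}, // LATIN CAPITAL LETTER E WITH ACUTE -> lowercase form
    {0x0391, 0x03B1}, // GREEK CAPITAL LETTER ALPHA -> lowercase form
    {0x0416, 0x0436}, // CYRILLIC CAPITAL LETTER ZHE -> lowercase form
};

// Locale-independent lowercase: binary-search the table and return the
// code point unchanged when it has no entry.
static uint32_t portable_tolower(uint32_t cp) {
    const auto first = std::begin(k_lower_map);
    const auto last  = std::end(k_lower_map);
    const auto it = std::lower_bound(first, last, cp,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != last && it->first == cp) {
        return it->second;
    }
    return cp;
}
```

Because the lookup depends only on the table, the result is the same whether or not a UTF-8 locale such as en_US.UTF-8 is installed. Code points without a table entry, for example digits or already-lowercase letters, pass through unchanged.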

Wikitext tokenizer diff with this change (same as before):

--- good_tokens.txt	2024-03-25 16:26:39.506423621 -0400
+++ lcpp_tokens.txt	2024-03-25 16:26:39.970426376 -0400
@@ -200554,7 +200554,6 @@
 1337: ष
 29870: ##ल
 29869: ##र
-29879: ##ो
 29863: ##न
 1317: ग
 1000: "
@@ -200633,7 +200632,11 @@
 29836: ##و
 29817: ##ت
 25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
 23856: kota
 16183: sal
 6784: ##ud

cebtenzzre marked this pull request as ready for review March 25, 2024 20:27
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Mar 25, 2024
excludes unicodedata.cpp split

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Comment thread unicodedata.cpp
cebtenzzre merged commit 32c8486 into master Mar 26, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.