fix: crash on bge-m3 embedding model by thxCode · Pull Request #8883 · ggml-org/llama.cpp

thxCode · 2024-08-06T10:01:39Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

llama-server crashes when testing embeddings with bge-m3.

The first commit aligns n_ubatch to n_batch if the model is non-causal, as "embedding: adjust n_ubatch value, print error on insufficient n_batch value" (#6296) did.
For an SPM vocabulary, the loader checks whether a linefeed_id exists and uses special_pad_id instead if it is not found. The second commit makes this respect the special_pad_id from the metadata.
The process panics as follows if the char byte is not found: libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found. The fix is to use special_unk_id if the byte is not found.
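The first and third items above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the types `sketch_params` and `sketch_vocab`, and the functions `align_ubatch` and `byte_to_token_or_unk`, are hypothetical stand-ins, and the `<0xXX>` spelling of byte tokens is an assumption about how an SPM vocabulary stores them. It uses `find` instead of `at`, so a missing key never throws std::out_of_range.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration only; not the llama.cpp types.
struct sketch_params {
    uint32_t n_batch     = 2048;
    uint32_t n_ubatch    = 512;
    bool     causal_attn = true;
};

struct sketch_vocab {
    std::unordered_map<std::string, int32_t> token_to_id;
    int32_t special_unk_id = 0;
};

// Item 1: when the model is non-causal, align n_ubatch to n_batch.
inline void align_ubatch(sketch_params & p) {
    if (!p.causal_attn) {
        p.n_ubatch = p.n_batch;
    }
}

// Item 3: map a raw byte to a token id, falling back to UNK when no entry exists.
// unordered_map::at would throw std::out_of_range here; find does not.
inline int32_t byte_to_token_or_unk(const sketch_vocab & v, uint8_t ch) {
    static const char * hex = "0123456789ABCDEF";
    // Assumed byte-token spelling, e.g. "<0x0A>" for '\n'.
    const std::string byte_piece = { '<', '0', 'x', hex[ch >> 4], hex[ch & 15], '>' };
    auto it = v.token_to_id.find(byte_piece);
    if (it != v.token_to_id.end()) {
        return it->second;
    }
    // Second attempt: the byte itself as a one-character string.
    it = v.token_to_id.find(std::string(1, static_cast<char>(ch)));
    if (it != v.token_to_id.end()) {
        return it->second;
    }
    return v.special_unk_id;
}
```

With a vocabulary that lacks an entry for a given byte, `byte_to_token_or_unk` returns `special_unk_id` rather than terminating the process.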

I am unsure whether there are other corner cases; please let me know.

Signed-off-by: thxCode <thxcode0824@gmail.com>

When vocab.type is SPM, we determine the linefeed_id by searching for the char, and use special_pad_id instead if it is not found. The special_*_id values are usually recorded in the metadata; to ensure that special_pad_id is used correctly, we need to obtain it from the metadata first and then perform the lookup described above. Signed-off-by: thxCode <thxcode0824@gmail.com>
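The ordering described in this commit message can be sketched as below. It is an illustration under stated assumptions: `spm_vocab_sketch`, `metadata_sketch`, and `resolve_linefeed` are hypothetical names, the metadata is modelled as a plain map, the key `tokenizer.ggml.padding_token_id` is assumed to be where the pad id is stored, and the linefeed token is assumed to be spelled `<0x0A>`.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration only; not the llama.cpp types.
struct spm_vocab_sketch {
    std::unordered_map<std::string, int32_t> token_to_id;
    int32_t special_pad_id = -1;
    int32_t linefeed_id    = -1;
};

// Hypothetical flat view of the model file's integer metadata.
using metadata_sketch = std::unordered_map<std::string, int32_t>;

inline void resolve_linefeed(spm_vocab_sketch & v, const metadata_sketch & meta) {
    // Step 1: read special_pad_id from the metadata first, if it is present.
    auto pad = meta.find("tokenizer.ggml.padding_token_id"); // assumed key name
    if (pad != meta.end()) {
        v.special_pad_id = pad->second;
    }
    // Step 2: only then look for the linefeed token, falling back to the
    // (now up-to-date) special_pad_id when the vocabulary has no such token.
    auto lf = v.token_to_id.find("<0x0A>"); // assumed spelling of '\n'
    v.linefeed_id = (lf != v.token_to_id.end()) ? lf->second : v.special_pad_id;
}
```

If the two steps ran in the opposite order, the fallback would pick up whatever default `special_pad_id` held before the metadata was read, which is the situation the commit message describes avoiding.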

Signed-off-by: thxCode <thxcode0824@gmail.com>

ExtReMLapin · 2024-08-06T15:35:03Z

Btw, it's still not faster than transformers, so why use it?

ggerganov · 2024-08-06T15:42:54Z

+                llama_vocab::id token_id;
+                try {
+                    token_id = llama_byte_to_token_impl(vocab, symbol.text[j]);
+                } catch(const std::exception & e) {
+                    // not found, use UNK token instead.
+                    token_id = vocab.special_unk_id;
+                }


I'm unsure about this change - if this happened, wouldn't it imply a problem with the model / tokenizer? Seems better to find and fix the root of the problem instead of hiding it
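One way to look for the root cause rather than mask it would be to check, once at load time, which byte values have no token at all. The sketch below is only an illustration of that idea and is not part of the PR: `missing_byte_tokens` is a hypothetical helper, the vocabulary is modelled as a plain map, and the `<0xXX>` spelling of byte tokens is an assumption.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns every byte value that has neither a "<0xXX>"
// entry nor a one-character entry in the vocabulary map.
inline std::vector<uint8_t> missing_byte_tokens(
        const std::unordered_map<std::string, int32_t> & token_to_id) {
    static const char * hex = "0123456789ABCDEF";
    std::vector<uint8_t> missing;
    for (int b = 0; b < 256; ++b) {
        // Assumed byte-token spelling, e.g. "<0x41>" for 'A'.
        const std::string piece = { '<', '0', 'x', hex[b >> 4], hex[b & 15], '>' };
        const std::string raw(1, static_cast<char>(b));
        if (token_to_id.count(piece) == 0 && token_to_id.count(raw) == 0) {
            missing.push_back(static_cast<uint8_t>(b));
        }
    }
    return missing;
}
```

A non-empty result would point at the model or tokenizer conversion as the place to investigate, before any fallback is applied during tokenization.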

Got it. This fix was inspired by https://github.com/ggerganov/llama.cpp/blame/1e6f6554aa11fa10160a5fda689e736c3c34169f/src/llama.cpp#L5560-L5565; maybe my understanding is not correct.

@ggerganov should I close this PR if the last commit is not a reasonable change? Thanks.

thxCode added 3 commits August 6, 2024 17:17

refactor: let ubatch-size = batch-size if non-casual

bb55b19

Signed-off-by: thxCode <thxcode0824@gmail.com>

fix: crash on token not found at spm

6ed2f79

Signed-off-by: thxCode <thxcode0824@gmail.com>

ggerganov reviewed Aug 6, 2024

mofosyne added the "Review Complexity : Low" label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Aug 8, 2024

thxCode wants to merge 3 commits into ggml-org:master from thxCode:embedding

4 participants
