feat: add DISTILUSE_BASE_MULTILINGUAL_CASED_V2 text embeddings model #1098
msluszniak merged 6 commits into main
Conversation
Addresses the multilingual half of #945. Shipping only the WordPiece tokenizer model for now — paraphrase-multilingual-MiniLM-L12-v2 needs Unigram/Precompiled/Metaspace support in executorch/extension/llm/ tokenizers, which is in-flight upstream. The model lives at software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2 under tag v0.9.0, so the constant uses NEXT_VERSION_TAG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chmjkb
left a comment
I haven't verified the demo apps, but the code looks good. Any idea how the performance compares to other embedding models? 500 MB seems like a lot.
To be honest, more than half of our current text embedding models are in the same order of magnitude. Look: https://huggingface.co/collections/software-mansion/text-embeddings
Ok, I was biased by the MiniLM, which is super small. Fine.
Follows the same scheme-suffix convention used for LLaMA (`_QLORA`, `_SPINQUANT`) — each variant has its own constant so the caller picks exactly the quantization / backend combo they want:

- `DISTILUSE_BASE_MULTILINGUAL_CASED_V2`: XNNPACK fp32 (baseline)
- `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W`: XNNPACK 8da4w
- `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32`: Core ML fp32 (iOS/macOS)
- `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP16`: Core ML fp16 (iOS/macOS)

All four point at the same HF repo tag v0.9.0; tokenizer.json is shared.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
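For concreteness, here is a minimal sketch of what these constants could look like in `modelUrls.ts`. The constant names and `NEXT_VERSION_TAG` (= `resolve/v0.9.0`) are from this PR; the base-URL shape and the `.pte` filenames are illustrative assumptions, not the literal file contents.

```typescript
// Sketch only: constant names and NEXT_VERSION_TAG are from this PR;
// the base-URL shape and the .pte filenames below are assumptions.
const NEXT_VERSION_TAG = 'resolve/v0.9.0';
const BASE =
  'https://huggingface.co/software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2';

export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2 =
  `${BASE}/${NEXT_VERSION_TAG}/xnnpack/model_fp32.pte`; // XNNPACK fp32 (baseline)
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W =
  `${BASE}/${NEXT_VERSION_TAG}/xnnpack/model_8da4w.pte`; // XNNPACK 8da4w
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32 =
  `${BASE}/${NEXT_VERSION_TAG}/coreml/model_fp32.pte`; // Core ML fp32
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP16 =
  `${BASE}/${NEXT_VERSION_TAG}/coreml/model_fp16.pte`; // Core ML fp16
```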
…o benchmarks

Size data belongs next to the other model sizes, not inline in the hook reference. The useTextEmbeddings page now lists only the model family (one row) and leaves variant enumeration to the API reference + the model-size benchmark table. The model column in model-size.md is renamed from "XNNPACK [MB]" to just "Size [MB]" since the table now mixes XNNPACK and CoreML rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds inference-time and memory-usage rows for all four distiluse-base-multilingual-cased-v2 variants (XNNPACK fp32, XNNPACK 8da4w, Core ML fp32, Core ML fp16). Captured on a OnePlus 12 (Android, debug build) and an iPhone 17 Pro (iOS, debug build) with a fixed ~80-token sentence over 100 measured forwards, JS-side wall-clock around model.forward(). The memory column reports the peak resident-set delta vs. the pre-model-load baseline, sampled with adb dumpsys meminfo on Android and Xcode's Debug Navigator on iOS.

Also normalizes the text-embeddings table headers to match the Classification section convention: the column header drops the "(XNNPACK)" suffix and the backend now lives in the per-row label, which lets multi-backend models (fp32 / 8da4w / Core ML) share a single table without an artificial column split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
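A rough sketch of the JS-side measurement loop described above. `model.forward` and the 100 measured forwards are from the commit message; the warm-up count, the model type, and the stats helpers are illustrative assumptions.

```typescript
// Sketch of the wall-clock measurement described above. The 100 measured
// iterations come from the commit message; warm-up count and percentile
// choice are assumptions.
async function benchmarkForward(
  model: { forward: (text: string) => Promise<Float32Array> },
  sentence: string // the fixed ~80-token input
): Promise<{ avgMs: number; p99Ms: number }> {
  for (let i = 0; i < 5; i++) await model.forward(sentence); // warm-up
  const samples: number[] = [];
  for (let i = 0; i < 100; i++) {
    const t0 = performance.now();
    await model.forward(sentence);
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  return {
    avgMs: samples.reduce((s, x) => s + x, 0) / samples.length,
    p99Ms: samples[Math.floor(samples.length * 0.99)],
  };
}
```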
NorbertKlockiewicz
left a comment
I think we should ship just 1 version of the model per backend in modelUrls
Renames `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32` → `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML` to match the existing convention where the default-precision XNNPACK variant has no precision suffix. The fp16 variant keeps its suffix since it's non-default. Per review feedback on #1098.
Tested all 4 variants off-device via the executorch Python runtime on Tatoeba bitext-mining (eng↔X for X ∈ {pol, deu, fra, spa, rus, jpn}, 1000 pairs each, 6000 total). All four land within 0.2 pp R@1 / 0.1 pp R@10 across every language pair. CoreML fp32 is bit-exact with XNNPACK fp32 (cosine 1.0). CoreML fp16 differs only at the 5th decimal. XNNPACK 8da4w drifts ~1% in cosine but retrieval is unaffected (sometimes +0.1 pp R@1, sometimes −0.2 pp — quantization noise either way). Proposing to drop two (details in the commit message below).
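For context, R@1 in this bitext-mining setup is just nearest-neighbor retrieval by cosine similarity over the paired embeddings. A minimal TypeScript sketch of the metric (the actual evaluation ran through the ExecuTorch Python runtime, so nothing here is the real harness):

```typescript
// Illustrative only: R@1 for bitext mining. For each source embedding,
// check whether its gold-paired target is the single nearest neighbor
// by cosine similarity.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function recallAt1(src: Float32Array[], tgt: Float32Array[]): number {
  let hits = 0;
  for (let i = 0; i < src.length; i++) {
    let best = -1, bestSim = -Infinity;
    for (let j = 0; j < tgt.length; j++) {
      const sim = cosine(src[i], tgt[j]);
      if (sim > bestSim) { bestSim = sim; best = j; }
    }
    if (best === i) hits++; // pair (i, i) is the gold alignment
  }
  return hits / src.length;
}
```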
Naming: with no bare
Tatoeba bitext-mining (eng↔X for X ∈ {pol, deu, fra, spa, rus, jpn},
1000 pairs each) shows all 4 variants land within 0.2 pp R@1 / 0.1 pp
R@10 of each other. CoreML fp32 is bit-exact with XNNPACK fp32; CoreML
fp16 differs at the 5th decimal; 8da4w drifts ~1% cosine but retrieval
is unaffected.
Drop:
- XNNPACK fp32 (bare _V2): Pareto-dominated on iPhone by COREML and on
Android by 8DA4W (speed and memory). No retained quality benefit.
- COREML_FP16: identical retrieval quality to COREML fp32 but slower
on iPhone (19 vs 15 ms) and uses more memory (143 vs 55 MB).
Ships as _8DA4W (Android) and _COREML (iOS) only.
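With one variant per backend, a caller would pick the constant per platform. A sketch, assuming React Native's `Platform` API and that both post-rename constants are exported from the package root:

```typescript
// Sketch: choose the shipped variant per platform, per the decision above.
// The constant names follow the post-rename convention; the import path
// is an assumption.
import { Platform } from 'react-native';
import {
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W,
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
} from 'react-native-executorch';

const DISTILUSE_MODEL =
  Platform.OS === 'ios'
    ? DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML // Core ML on iOS
    : DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W; // XNNPACK 8da4w elsewhere
```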
Description
Addresses the multilingual half of #945 by shipping the first multilingual text embeddings model, `distiluse-base-multilingual-cased-v2`, covering 50+ languages at 512 embedding dims. The `paraphrase-multilingual-MiniLM-L12-v2` model from the same issue is deferred — its tokenizer pipeline is Unigram + Precompiled normalizer + Metaspace decoder, and `executorch/extension/llm/tokenizers` (the C++ lib RNE links) only supports BPE + WordPiece + `BertNormalizer`. Unigram support is in flight upstream; we'll ship paraphrase-multilingual in a follow-up once the runtime picks it up.

What the diff does:

- `modelUrls.ts` — adds `DISTILUSE_BASE_MULTILINGUAL_CASED_V2` (XNNPACK fp32), `_8DA4W` (XNNPACK 8-bit dynamic-act / 4-bit weight via torchao), `_COREML_FP32`, and `_COREML_FP16` pointing at the new HF repo under `NEXT_VERSION_TAG` (= `resolve/v0.9.0`), and registers all four in `MODEL_REGISTRY.ALL_MODELS`. Files are uploaded under `xnnpack/` and `coreml/` subfolders following the newer CLIP repo convention.
- `types/textEmbeddings.ts` — extends `TextEmbeddingsModelName` with `'distiluse-base-multilingual-cased-v2'` (see the sketch below).
- `apps/text-embeddings/.../index.tsx` — adds the model to the playground picker.
- `useTextEmbeddings.md` — adds a row to the "Supported models" table.
- `.cspell-wordlist.txt` — adds `DISTILUSE`, `distiluse`, `Distil` for the spell-check hook.

HF repo (live): software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2, `main` branch + `v0.9.0` tag.
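To make the caller-facing surface concrete, here is a hedged sketch of the type change plus hook usage. The new union member and constant name are from this PR; the pre-existing union member shown and the exact `useTextEmbeddings` props/return shape are assumptions, not the verified API.

```typescript
// Sketch only: the new union member and constant are from this PR; the
// existing member shown and the hook's exact shape are assumptions.
import {
  useTextEmbeddings,
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2,
} from 'react-native-executorch';

// types/textEmbeddings.ts — the extended union described above:
type TextEmbeddingsModelName =
  | 'all-MiniLM-L6-v2' // illustrative pre-existing member
  | 'distiluse-base-multilingual-cased-v2'; // added by this diff

function EmbedDemo({ text }: { text: string }) {
  const model = useTextEmbeddings({
    modelSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2,
  });
  // Per the testing instructions below, forward() should resolve to a
  // 512-dim Float32Array in ~35 ms.
  const embed = () => model.forward(text);
  return null;
}
```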
Introduces a breaking change?
Type of change
Tested on
Testing instructions
`cd apps/text-embeddings && yarn ios` (or `yarn android`). `forward` should return a 512-dim `Float32Array` in ~35 ms.

Screenshots
Related issues
Closes the multilingual-BERT half of #945. The `paraphrase-multilingual-MiniLM-L12-v2` half stays open pending Unigram/Precompiled/Metaspace support in ExecuTorch's tokenizer lib.

Checklist
Additional notes
The exporter wrapper passes `attention_mask=None` into the underlying transformer even though the RNE runtime always supplies an all-ones mask. This is deliberate: HF's `create_bidirectional_mask` / `masking_utils` otherwise emits a chain of `where`/`eq`/`any`/`logical_not`/`expand_copy` ops on the mask that XNNPACK can't delegate, costing ~10 ms per forward with no observable effect on the output. With the bypass, the .pte is bit-exact with eager PyTorch (RMSE 0.0 on fp32 random input) and XNNPACK delegation stays around 89–91% of graph runtime.

The concrete non-delegated ops that remain (LayerNorm, the residual mask prep inside HF, the explicit `expand` in mean pooling) are all inside upstream code paths — pushing past ~91% would need either a `StaticLayerNormalization` match in XNNPACK's partitioner or surgery on HF's mask utils. Out of scope here; worth flagging for a future iteration.

Export of fp16 for XNNPACK succeeded but produced NaNs, so it isn't useful.
The exporter script, profiling setup, and the full write-up of the above live in the internal export-scripts repo.