
feat: add DISTILUSE_BASE_MULTILINGUAL_CASED_V2 text embeddings model #1098

Merged
msluszniak merged 6 commits into main from feat/multilingual-text-embeddings
Apr 28, 2026

Conversation

@msluszniak
Member

@msluszniak msluszniak commented Apr 24, 2026

Description

Addresses the multilingual half of #945 by shipping the first multilingual text embeddings model, distiluse-base-multilingual-cased-v2, covering 50+ languages with 512-dimensional embeddings.

The paraphrase-multilingual-MiniLM-L12-v2 model from the same issue is deferred — its tokenizer pipeline is Unigram + Precompiled normalizer + Metaspace decoder, and executorch/extension/llm/tokenizers (the C++ library RNE links against) only supports BPE + WordPiece + BertNormalizer. Unigram support is in flight upstream; we'll ship paraphrase-multilingual in a follow-up once the runtime picks it up.

What the diff does:

  • modelUrls.ts — adds DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK fp32), _8DA4W (XNNPACK 8-bit dynamic-act / 4-bit weight via torchao), _COREML_FP32, and _COREML_FP16 pointing at the new HF repo under NEXT_VERSION_TAG (= resolve/v0.9.0), and registers all four in MODEL_REGISTRY.ALL_MODELS. Files are uploaded under xnnpack/ and coreml/ subfolders following the newer CLIP repo convention. (A sketch of these constants follows this list.)
  • types/textEmbeddings.ts — extends TextEmbeddingsModelName with 'distiluse-base-multilingual-cased-v2'.
  • apps/text-embeddings/.../index.tsx — adds the model to the playground picker.
  • useTextEmbeddings.md — adds a row to the "Supported models" table.
  • .cspell-wordlist.txt — adds DISTILUSE, distiluse, Distil for the spell-check hook.
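
For reference, a minimal sketch of the modelUrls.ts additions. The constant names, repo, tag, and .pte paths come from this PR; the surrounding structure (plain exported string constants) is an assumption:

```ts
// Sketch only: names, repo, tag, and file paths are taken from this PR;
// the exact structure of modelUrls.ts is assumed.
const DISTILUSE_REPO =
  'https://huggingface.co/software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2';
const NEXT_VERSION_TAG = 'resolve/v0.9.0';

export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2 = `${DISTILUSE_REPO}/${NEXT_VERSION_TAG}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_fp32.pte`;
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W = `${DISTILUSE_REPO}/${NEXT_VERSION_TAG}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte`;
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32 = `${DISTILUSE_REPO}/${NEXT_VERSION_TAG}/coreml/distiluse-base-multilingual-cased-v2_coreml_fp32.pte`;
export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP16 = `${DISTILUSE_REPO}/${NEXT_VERSION_TAG}/coreml/distiluse-base-multilingual-cased-v2_coreml_fp16.pte`;

// types/textEmbeddings.ts widens the union with the new model name:
// export type TextEmbeddingsModelName =
//   /* existing names */ | 'distiluse-base-multilingual-cased-v2';
```

Note that review feedback later in this thread renames _COREML_FP32 to _COREML and ultimately ships only _8DA4W and _COREML, so the merged file differs from this initial shape.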

HF repo (live): software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2, main branch + v0.9.0 tag.

├── README.md
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── xnnpack/
│   ├── distiluse-base-multilingual-cased-v2_xnnpack_fp32.pte   (541 MB)
│   └── distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte  (393 MB)
└── coreml/
    ├── distiluse-base-multilingual-cased-v2_coreml_fp32.pte    (541 MB)
    └── distiluse-base-multilingual-cased-v2_coreml_fp16.pte    (271 MB)

Introduces a breaking change?

  • [ ] Yes
  • [x] No

Type of change

  • [ ] Bug fix (change which fixes an issue)
  • [x] New feature (change which adds functionality)
  • [ ] Documentation update (improves or adds clarity to existing documentation)
  • [ ] Other (chores, tests, code style improvements etc.)

Tested on

  • [x] iOS
  • [x] Android

Testing instructions

  1. Run the text-embeddings playground: cd apps/text-embeddings && yarn ios (or yarn android).
  2. In the model picker, select "Multilingual DistilUSE".
  3. Enter any sentence; forward should return a 512-dim Float32Array in ~35 ms.
  4. Cross-lingual retrieval is the model's strength — try indexing a Polish sentence and querying with an English equivalent (or vice versa); the top match should be the aligned pair (see the sketch after this list).
  5. Short single-word non-English queries over short targets are the weakest case (inherent to the model, not an export issue) — use inputs of at least one sentence for best results.
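
A rough sketch of the cross-lingual check in steps 3–4. The hook name and model constant come from this PR; the hook's option key and return shape are assumptions, and the sentences are arbitrary examples:

```ts
// Assumed usage inside a component (the option name is a guess):
//   const model = useTextEmbeddings({ modelSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2 });

// Cosine similarity over two same-length embedding vectors.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export async function crossLingualSmokeTest(
  model: { forward: (text: string) => Promise<Float32Array> },
): Promise<void> {
  const pl = await model.forward('Kot śpi na kanapie.'); // "The cat is sleeping on the couch."
  const en = await model.forward('The cat is sleeping on the couch.');
  const de = await model.forward('Der Zug kommt um acht Uhr an.'); // unrelated sentence
  console.log('aligned  :', cosine(pl, en)); // expect a clearly higher score here
  console.log('unrelated:', cosine(pl, de)); // than for the unrelated pair
}
```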

Screenshots

Related issues

Closes the multilingual-BERT half of #945. The paraphrase-multilingual-MiniLM-L12-v2 half stays open pending Unigram/Precompiled/Metaspace support in ExecuTorch's tokenizer lib.

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

The exporter wrapper passes attention_mask=None into the underlying transformer even though the RNE runtime always supplies an all-ones mask. This is deliberate: HF's create_bidirectional_mask / masking_utils otherwise emits a chain of where / eq / any / logical_not / expand_copy ops on the mask that XNNPACK can't delegate, costing ~10 ms per forward with no observable effect on the output. With the bypass the .pte is bit-exact with eager PyTorch (RMSE 0.0 on fp32 random input) and XNNPACK delegation stays around 89–91% of graph runtime.

The concrete non-delegated ops that remain (LayerNorm, the residual mask prep inside HF, the explicit expand in mean pooling) are all inside upstream code paths — pushing past ~91% would need either a StaticLayerNormalization match in XNNPACK's partitioner or surgery on HF's mask utils. Out of scope here; worth flagging for a future iteration.

An fp16 export for XNNPACK succeeded but produced NaNs, so it is not usable.

The exporter script, profiling setup, and the full write-up of the above live in the internal export-scripts repo.

Addresses the multilingual half of #945. Shipping only the WordPiece
tokenizer model for now — paraphrase-multilingual-MiniLM-L12-v2 needs
Unigram/Precompiled/Metaspace support in executorch/extension/llm/
tokenizers, which is in-flight upstream.

The model lives at
software-mansion/react-native-executorch-distiluse-base-multilingual-cased-v2
under tag v0.9.0, so the constant uses NEXT_VERSION_TAG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@msluszniak msluszniak self-assigned this Apr 24, 2026
@msluszniak msluszniak added the feature PRs that implement a new feature label Apr 24, 2026
@msluszniak msluszniak requested a review from IgorSwat April 24, 2026 13:32
Collaborator

@chmjkb chmjkb left a comment


I haven't verified the demo apps, but the code looks good. Any idea how the performance compares to other embedding models? 500 MB seems like a lot

@msluszniak
Member Author

> I haven't verified the demo apps, but the code looks good. Any idea how the performance compares to other embedding models? 500 MB seems like a lot

To be fair, more than half of our current text embedding models are in the same order of magnitude. Look: https://huggingface.co/collections/software-mansion/text-embeddings

@chmjkb
Collaborator

chmjkb commented Apr 24, 2026

> > I haven't verified the demo apps, but the code looks good. Any idea how the performance compares to other embedding models? 500 MB seems like a lot
>
> To be fair, more than half of our current text embedding models are in the same order of magnitude. Look: https://huggingface.co/collections/software-mansion/text-embeddings

ok, I was biased by the minilm, which is super small, fine

msluszniak and others added 2 commits April 24, 2026 16:38
Follows the same scheme-suffix convention used for LLaMA (`_QLORA`,
`_SPINQUANT`) — each variant has its own constant so the caller picks
exactly the quantization / backend combo they want:

  DISTILUSE_BASE_MULTILINGUAL_CASED_V2               xnnpack fp32 (baseline)
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W         xnnpack 8da4w
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32   coreml  fp32 (iOS/macOS)
  DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP16   coreml  fp16 (iOS/macOS)

All four point at the same HF repo tag v0.9.0; tokenizer.json is shared.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o benchmarks

Size data belongs next to the other model sizes, not inline in the hook
reference. The useTextEmbeddings page now only lists the model family
(one row) and leaves variant enumeration to the API reference + the
model-size benchmark table.

The size column in model-size.md is renamed from "XNNPACK [MB]" to just
"Size [MB]" since the table now mixes XNNPACK and CoreML rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@msluszniak

This comment was marked as resolved.

Adds inference-time and memory-usage rows for all four
distiluse-base-multilingual-cased-v2 variants (XNNPACK fp32, XNNPACK
8da4w, Core ML fp32, Core ML fp16). Captured on a OnePlus 12 (Android,
debug build) and iPhone 17 Pro (iOS, debug build) with a fixed ~80-token
sentence over 100 measured forwards, JS-side wall-clock around
model.forward(). Memory column reports peak resident-set delta vs the
pre-model-load baseline, sampled with adb dumpsys meminfo on Android
and Xcode's Debug Navigator on iOS.

Also normalizes the text-embeddings table headers to match the
Classification section convention: column header drops the
"(XNNPACK)" suffix and the backend now lives in the per-row label,
which lets multi-backend models (fp32 / 8da4w / Core ML) share a
single table without an artificial column split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
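
For context, the JS-side wall-clock measurement described in the commit message above could be reproduced with a loop like the one below. This is an illustrative sketch, not the actual benchmark harness (that lives in the internal export-scripts repo); the warm-up count is an assumption:

```ts
// Illustrative only: mirrors the stated methodology (fixed sentence,
// 100 measured forwards, wall-clock around model.forward()).
async function benchmarkForward(
  model: { forward: (text: string) => Promise<Float32Array> },
  sentence: string, // the fixed ~80-token sentence
  runs = 100,
  warmup = 10, // warm-up count is an assumption
): Promise<{ meanMs: number; p95Ms: number }> {
  for (let i = 0; i < warmup; i++) await model.forward(sentence); // unmeasured warm-up
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    await model.forward(sentence);
    samples.push(performance.now() - t0);
  }
  samples.sort((x, y) => x - y);
  return {
    meanMs: samples.reduce((s, x) => s + x, 0) / runs,
    p95Ms: samples[Math.floor(runs * 0.95)],
  };
}
```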
Contributor

@NorbertKlockiewicz NorbertKlockiewicz left a comment


I think we should ship just one version of the model per backend in modelUrls

Comment thread on packages/react-native-executorch/src/constants/modelUrls.ts (outdated)
Comment thread on packages/react-native-executorch/src/types/textEmbeddings.ts (outdated)
Renames `DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_FP32` →
`DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML` to match the existing
convention where the default-precision XNNPACK variant has no precision
suffix. The fp16 variant keeps its suffix since it's non-default.

Per review feedback on #1098.
@msluszniak
Member Author

Tested all 4 variants off-device via the executorch Python runtime on Tatoeba bitext-mining (eng↔X for X ∈ {pol, deu, fra, spa, rus, jpn}, 1000 pairs each, 6000 total).

All four land within 0.2 pp R@1 / 0.1 pp R@10 across every language pair. CoreML fp32 is bit-exact with XNNPACK fp32 (cosine 1.0). CoreML fp16 differs only at the 5th decimal. XNNPACK 8da4w drifts ~1% in cosine but retrieval is unaffected (sometimes +0.1 pp R@1, sometimes −0.2 pp — quantization noise either way).

Proposing to drop two variants:

  • _V2 (XNNPACK fp32) — Pareto-dominated on iPhone by _COREML (15 vs 47 ms, 55 vs 175 MB) and on Android by _8DA4W (15 vs 41 ms, 44 vs 196 MB). The "Android accuracy fallback" rationale doesn't survive the data.
  • _COREML_FP16 — quality identical to _COREML fp32, but per the docs benchmarks it's slower on iPhone (19 vs 15 ms) and uses more memory (143 vs 55 MB). Only edge is download size (271 vs 541 MB). That memory delta is surprising — worth one re-measure (cold vs warm) before final delete; if fp16 actually wins on memory, the call flips.

Naming: with no bare _V2 shipped, both surviving variants carry explicit suffixes (_8DA4W, _COREML) — slightly inconsistent with other text-embedding models where bare = default XNNPACK fp32. I'd keep the explicit suffixes (clear > convention-pure). Alternative: rename _8DA4W → bare _V2 since it becomes the canonical Android variant — preserves the convention but redefines "default" as quantized for this one model.

Tatoeba bitext-mining (eng↔X for X ∈ {pol, deu, fra, spa, rus, jpn},
1000 pairs each) shows all 4 variants land within 0.2 pp R@1 / 0.1 pp
R@10 of each other. CoreML fp32 is bit-exact with XNNPACK fp32; CoreML
fp16 differs at the 5th decimal; 8da4w drifts ~1% cosine but retrieval
is unaffected.

Drop:
- XNNPACK fp32 (bare _V2): Pareto-dominated on iPhone by COREML and on
  Android by 8DA4W (speed and memory). No retained quality benefit.
- COREML_FP16: identical retrieval quality to COREML fp32 but slower
  on iPhone (19 vs 15 ms) and uses more memory (143 vs 55 MB).

Ships as _8DA4W (Android) and _COREML (iOS) only.
@msluszniak msluszniak merged commit 04852be into main Apr 28, 2026
5 checks passed
@msluszniak msluszniak deleted the feat/multilingual-text-embeddings branch April 28, 2026 13:12
