From d51b75080383a5775dda452c36f89bfb1a7ef77e Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Tue, 31 Mar 2026 10:09:15 -0400 Subject: [PATCH 1/4] docs: add text embeddings guide and release notes for PR #1346 Add Fern documentation for vLLM and Sentence Transformers embedding support. Creates new Text Embeddings section with overview and vLLM Embedder pages. Updates 26.04 release notes and expands semdedup page with vLLM embedding example. Co-Authored-By: Claude Opus 4.6 Signed-off-by: Lawrence Lane --- fern/versions/v26.04.yml | 9 ++ .../pages/about/release-notes/index.mdx | 13 ++ .../process-data/deduplication/semdedup.mdx | 43 ++++++ .../process-data/embeddings/index.mdx | 107 ++++++++++++++ .../process-data/embeddings/vllm-embedder.mdx | 138 ++++++++++++++++++ 5 files changed, 310 insertions(+) create mode 100644 fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx create mode 100644 fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx diff --git a/fern/versions/v26.04.yml b/fern/versions/v26.04.yml index d414654330..a0e535a950 100644 --- a/fern/versions/v26.04.yml +++ b/fern/versions/v26.04.yml @@ -189,6 +189,15 @@ navigation: - page: Text Cleaning path: ./v26.04/pages/curate-text/process-data/content-processing/text-cleaning.mdx slug: text-cleaning + - section: Embeddings + slug: embeddings + contents: + - page: Overview + path: ./v26.04/pages/curate-text/process-data/embeddings/index.mdx + slug: "" + - page: vLLM Embedder + path: ./v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx + slug: vllm-embedder - section: Deduplication slug: deduplication contents: diff --git a/fern/versions/v26.04/pages/about/release-notes/index.mdx b/fern/versions/v26.04/pages/about/release-notes/index.mdx index aba4938146..13837ac14e 100644 --- a/fern/versions/v26.04/pages/about/release-notes/index.mdx +++ b/fern/versions/v26.04/pages/about/release-notes/index.mdx @@ -12,6 +12,17 @@ modality: "universal" ## What's New in 26.04 +### vLLM and Sentence Transformers Embedding Support (PR #1346) + +Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs: + +- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers. +- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem. +- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location. +- **New `vllm` optional dependency**: Install with `pip install nemo_curator[vllm]` (x86_64 Linux only). The `sentence-transformers` package is now included in the `text_cpu` extra. + +For usage details, see [Text Embeddings](/curate-text/process-data/embeddings) and [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder). 
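+
+A minimal usage sketch of the new `EmbeddingCreatorStage` options (the cache path is illustrative):
+
+```python
+from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
+
+# use_sentence_transformer defaults to True (Sentence Transformers backend);
+# set it to False to use the Hugging Face AutoModel backend instead.
+stage = EmbeddingCreatorStage(
+    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
+    use_sentence_transformer=True,
+    cache_dir="./model_cache",  # new in this release: controls download location
+)
+```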
+ ### Cosmos-Xenna 0.2.0 (PR #1571) Upgraded Cosmos-Xenna from 0.1.2 to 0.2.0 with a simplified resource model and improved GPU management: @@ -24,6 +35,8 @@ Upgraded Cosmos-Xenna from 0.1.2 to 0.2.0 with a simplified resource model and i - **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model - **Ray**: Updated to 2.54 +- **sentence-transformers**: Added to the `text_cpu` optional dependency group +- **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only) ## Breaking Changes diff --git a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx index 846f7b1cf2..19941a8742 100644 --- a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx +++ b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx @@ -180,12 +180,55 @@ workflow = TextSemanticDeduplicationWorkflow( ) ``` +**vLLM Embedder** (recommended for large models): + +For large embedding models, you can generate embeddings separately using `VLLMEmbeddingModelStage` before running the deduplication workflow. This provides better GPU utilization and throughput for models with 500M+ parameters. See [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) for details. + +```python +from nemo_curator.backends.xenna import XennaExecutor +from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage +from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.text.io.reader import ParquetReader +from nemo_curator.stages.text.io.writer import ParquetWriter + +executor = XennaExecutor() + +# Step 1: Generate embeddings with vLLM +embedding_pipeline = Pipeline( + name="vllm_embedding_pipeline", + stages=[ + ParquetReader(file_paths=input_path, files_per_partition=1, fields=["text"], _generate_ids=True), + VLLMEmbeddingModelStage( + model_identifier="large-embedding-model", + text_field="text", + embedding_field="embeddings", + pretokenize=True, + ), + ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]), + ], +) +embedding_pipeline.run(executor) + +# Step 2: Run deduplication on pre-computed embeddings +semantic_workflow = SemanticDeduplicationWorkflow( + input_path=embedding_output_path, + output_path=output_path, + n_clusters=100, + eps=0.07, + id_field="_curator_dedup_id", + embedding_field="embeddings", +) +semantic_workflow.run() +``` + **When choosing a model**: - Ensure compatibility with your data type - Adjust `embedding_model_inference_batch_size` for memory requirements - Choose models appropriate for your language or domain - Avoid generic decoder-only LLMs (e.g., OPT/GPT) for embeddings; prefer models trained for sentence embeddings (e.g., E5/BGE/SBERT) +- For large models (>500M params), consider using [VLLMEmbeddingModelStage](/curate-text/process-data/embeddings/vllm-embedder) for better throughput diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx new file mode 100644 index 0000000000..e82074b388 --- /dev/null +++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx @@ -0,0 +1,107 @@ +--- +description: "Generate text embeddings using vLLM, Sentence Transformers, or HuggingFace models for deduplication, similarity search, and 
downstream tasks" +categories: ["how-to-guides"] +tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "intermediate" +content_type: "how-to" +modality: "text-only" +--- + +# Text Embedding + +Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering. + +## How It Works + +NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements: + +1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag. +2. **`VLLMEmbeddingModelStage`** — A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains. +3. **`SentenceTransformerEmbeddingModelStage`** — A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`. + +## Choosing an Embedding Backend + +| Backend | Best For | GPU Utilization | Setup | +| --- | --- | --- | --- | +| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Minimal — included in `text_cpu` extra | +| `VLLMEmbeddingModelStage` | Large models (e.g., Gemma-based embedders) | Excellent | Requires `vllm` extra (`pip install nemo_curator[vllm]`) | +| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` | + + +Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks. 
+ + +## Quick Start + +### EmbeddingCreatorStage (Recommended for Most Use Cases) + +```python +from nemo_curator.backends.xenna import XennaExecutor +from nemo_curator.stages.text.embedders import EmbeddingCreatorStage +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.text.io.reader import ParquetReader +from nemo_curator.stages.text.io.writer import ParquetWriter + +pipeline = Pipeline( + name="text_embeddings", + stages=[ + ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]), + EmbeddingCreatorStage( + model_identifier="sentence-transformers/all-MiniLM-L6-v2", + text_field="text", + embedding_field="embeddings", + model_inference_batch_size=256, + ), + ParquetWriter(path="output/", fields=["text", "embeddings"]), + ], +) + +executor = XennaExecutor() +pipeline.run(executor) +``` + +### VLLMEmbeddingModelStage (For Large Models) + +```python +from nemo_curator.backends.xenna import XennaExecutor +from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.text.io.reader import ParquetReader +from nemo_curator.stages.text.io.writer import ParquetWriter + +pipeline = Pipeline( + name="vllm_embeddings", + stages=[ + ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]), + VLLMEmbeddingModelStage( + model_identifier="google/gemma-embedding-model", + text_field="text", + embedding_field="embeddings", + pretokenize=True, # Recommended for best throughput + ), + ParquetWriter(path="output/", fields=["text", "embeddings"]), + ], +) + +executor = XennaExecutor() +pipeline.run(executor) +``` + +--- + +## Available Embedding Tools + + + + +Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models. + + + + +--- + +## Integration with Semantic Deduplication + +Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `EmbeddingCreatorStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process. diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx new file mode 100644 index 0000000000..3314e1d7be --- /dev/null +++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx @@ -0,0 +1,138 @@ +--- +description: "Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models" +categories: ["how-to-guides"] +tags: ["embeddings", "vllm", "gpu-accelerated", "large-models"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "intermediate" +content_type: "how-to" +modality: "text-only" +--- + +# vLLM Embedder + +Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models where vLLM's batching and GPU memory management provide significant performance advantages over Sentence Transformers. + + +**Installation**: The vLLM embedder requires the `vllm` optional dependency. Install it with: + +```bash +pip install nemo_curator[vllm] +``` + +vLLM is only available on x86_64 Linux systems. 
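+
+To confirm that the dependency is available in your environment, a quick import check should succeed and print the installed version:
+
+```python
+# Verify that the optional vLLM dependency is importable.
+import vllm
+
+print(vllm.__version__)
+```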
+
+## How It Works
+
+`VLLMEmbeddingModelStage` is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.
+
+Key features:
+
+- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
+- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
+- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
+- **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization
+
+## Quick Start
+
+```python
+from nemo_curator.backends.xenna import XennaExecutor
+from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.io.reader import ParquetReader
+from nemo_curator.stages.text.io.writer import ParquetWriter
+
+pipeline = Pipeline(
+    name="vllm_embeddings",
+    stages=[
+        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
+        VLLMEmbeddingModelStage(
+            model_identifier="google/gemma-embedding-model",
+            text_field="text",
+            embedding_field="embeddings",
+            pretokenize=True,
+        ),
+        ParquetWriter(path="output/", fields=["text", "embeddings"]),
+    ],
+)
+
+executor = XennaExecutor()
+pipeline.run(executor)
+```
+
+## Configuration
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
+| `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
+| `text_field` | `str` | `"text"` | Name of the input text column in the data |
+| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Recommended for best throughput |
+| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
+| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
+| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
+| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |
+| `verbose` | `bool` | `False` | Enable verbose logging and progress bars |
+
+### vLLM Engine Options
+
+Pass additional vLLM configuration through `vllm_init_kwargs`:
+
+```python
+VLLMEmbeddingModelStage(
+    model_identifier="large-embedding-model",
+    pretokenize=True,
+    vllm_init_kwargs={
+        "enforce_eager": True,  # Disable CUDA graph for debugging
+        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
+        "gpu_memory_utilization": 0.9,
+        "max_model_len": 512,
+    },
+)
+```
+
+Default vLLM settings applied by the stage (can be overridden):
+
+- `enforce_eager=False` — Uses CUDA graphs for faster inference
+- `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
+- `model_impl="vllm"` — Uses vLLM's native model implementation
+- `disable_log_stats=True` — Suppresses stats logging when `verbose=False`
+
+### Pretokenization
+
+When `pretokenize=True`, the stage:
+
+1. Loads a Hugging Face `AutoTokenizer` for the specified model
+2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
+3. Passes token IDs directly to vLLM using `TokensPrompt`
+
+This mode is recommended for production workloads. Benchmarks show it provides the best per-task throughput across both small and large embedding models by reducing GPU idle time during tokenization.
+
+```python
+# Pretokenize mode (recommended)
+VLLMEmbeddingModelStage(
+    model_identifier="google/gemma-embedding-model",
+    pretokenize=True,  # Tokenize on CPU, embed on GPU
+)
+
+# Direct text mode (simpler, slightly less throughput)
+VLLMEmbeddingModelStage(
+    model_identifier="google/gemma-embedding-model",
+    pretokenize=False,  # vLLM handles tokenization internally
+)
+```
+
+## Resources
+
+The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.
+
+## When to Use vLLM vs. Sentence Transformers
+
+| Scenario | Recommendation |
+| --- | --- |
+| Large embedding models (>500M params) | vLLM — better GPU utilization and memory management |
+| Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup |
+| High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput |
+| Quick prototyping | `EmbeddingCreatorStage` — simpler setup, no extra dependency |

From 50901341019b2ed1bb6ecd8ec012ca4fe6a072ca Mon Sep 17 00:00:00 2001
From: Lawrence Lane
Date: Tue, 31 Mar 2026 10:16:31 -0400
Subject: [PATCH 2/4] docs: deduplicate code snippets and fix placeholder model names

Replace duplicated vLLM Quick Start in embeddings overview and semdedup
page with cross-references to the canonical vllm-embedder page. Replace
placeholder "large-embedding-model" with consistent model identifiers.

Co-Authored-By: Claude Opus 4.6
Signed-off-by: Lawrence Lane
---
 .../process-data/deduplication/semdedup.mdx   | 27 +++----------------
 .../process-data/embeddings/index.mdx         | 25 +----------------
 .../process-data/embeddings/vllm-embedder.mdx |  2 +-
 3 files changed, 5 insertions(+), 49 deletions(-)

diff --git a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
index 19941a8742..6bf4ca2fa8 100644
--- a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
+++ b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
@@ -184,33 +184,12 @@
 
 For large embedding models, you can generate embeddings separately using `VLLMEmbeddingModelStage` before running the deduplication workflow. This provides better GPU utilization and throughput for models with 500M+ parameters. See [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) for details.
+Generate embeddings with `VLLMEmbeddingModelStage` using the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) pipeline, then pass the output to `SemanticDeduplicationWorkflow`: + ```python -from nemo_curator.backends.xenna import XennaExecutor -from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow -from nemo_curator.pipeline import Pipeline -from nemo_curator.stages.text.io.reader import ParquetReader -from nemo_curator.stages.text.io.writer import ParquetWriter - -executor = XennaExecutor() - -# Step 1: Generate embeddings with vLLM -embedding_pipeline = Pipeline( - name="vllm_embedding_pipeline", - stages=[ - ParquetReader(file_paths=input_path, files_per_partition=1, fields=["text"], _generate_ids=True), - VLLMEmbeddingModelStage( - model_identifier="large-embedding-model", - text_field="text", - embedding_field="embeddings", - pretokenize=True, - ), - ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]), - ], -) -embedding_pipeline.run(executor) -# Step 2: Run deduplication on pre-computed embeddings +# After generating embeddings to embedding_output_path using VLLMEmbeddingModelStage semantic_workflow = SemanticDeduplicationWorkflow( input_path=embedding_output_path, output_path=output_path, diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx index e82074b388..3d38d83eb1 100644 --- a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx +++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx @@ -63,30 +63,7 @@ pipeline.run(executor) ### VLLMEmbeddingModelStage (For Large Models) -```python -from nemo_curator.backends.xenna import XennaExecutor -from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage -from nemo_curator.pipeline import Pipeline -from nemo_curator.stages.text.io.reader import ParquetReader -from nemo_curator.stages.text.io.writer import ParquetWriter - -pipeline = Pipeline( - name="vllm_embeddings", - stages=[ - ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]), - VLLMEmbeddingModelStage( - model_identifier="google/gemma-embedding-model", - text_field="text", - embedding_field="embeddings", - pretokenize=True, # Recommended for best throughput - ), - ParquetWriter(path="output/", fields=["text", "embeddings"]), - ], -) - -executor = XennaExecutor() -pipeline.run(executor) -``` +For large embedding models (>500M parameters), use `VLLMEmbeddingModelStage` for better GPU utilization and throughput. See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples. 
--- diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx index 3314e1d7be..7ad4706046 100644 --- a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx +++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx @@ -82,7 +82,7 @@ Pass additional vLLM configuration through `vllm_init_kwargs`: ```python VLLMEmbeddingModelStage( - model_identifier="large-embedding-model", + model_identifier="google/gemma-embedding-model", pretokenize=True, vllm_init_kwargs={ "enforce_eager": True, # Disable CUDA graph for debugging From 023d66830d6a1e93d4e4bc55d96b78a23a5e6cdd Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Mon, 6 Apr 2026 13:12:37 -0400 Subject: [PATCH 3/4] docs: fix internal stage reference for TextSemanticDeduplicationWorkflow The workflow now uses VLLMEmbeddingModelStage internally, not EmbeddingCreatorStage. Signed-off-by: Logan Lane Co-Authored-By: Claude Opus 4.6 Signed-off-by: Lawrence Lane --- .../v26.04/pages/curate-text/process-data/embeddings/index.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx index 01344a4e07..a9cf636d7f 100644 --- a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx +++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx @@ -81,4 +81,4 @@ Generate embeddings using vLLM for high-throughput GPU-accelerated inference wit ## Integration with Semantic Deduplication -Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `EmbeddingCreatorStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process. +Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process. 
From 1a19dd866c32fb798dc39186b3e5f38177cd30d8 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Tue, 7 Apr 2026 15:49:11 -0400 Subject: [PATCH 4/4] docs: address sarahyurick review feedback on embedding docs - Fix "HuggingFace" to "Hugging Face" everywhere - Remove vllm install instructions (included in text_cuda12) - Fix "classe" typo to "classes" in release notes - Update Setup column to recommend text_cuda12 - Position vLLM as recommended for semantic dedup - Fix pretokenize recommendation (model-dependent, not universal) - Remove vLLM vs ST comparison table per reviewer request - Use correct model identifier google/embeddinggemma-300m Signed-off-by: Lawrence Lane --- .../pages/about/release-notes/index.mdx | 5 ++-- .../process-data/embeddings/index.mdx | 6 ++--- .../process-data/embeddings/vllm-embedder.mdx | 24 +++++++------------ 3 files changed, 13 insertions(+), 22 deletions(-) diff --git a/fern/versions/v26.04/pages/about/release-notes/index.mdx b/fern/versions/v26.04/pages/about/release-notes/index.mdx index 3ce4523aa3..6c04504c8f 100644 --- a/fern/versions/v26.04/pages/about/release-notes/index.mdx +++ b/fern/versions/v26.04/pages/about/release-notes/index.mdx @@ -18,8 +18,7 @@ Added two new embedding backends for text curation, giving users flexibility to - **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers. - **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem. -- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location. -- **New `vllm` optional dependency**: The `vllm` package is automatically included with the relevant modality installations (`text_cuda12`, `video_cuda12`, `math_cuda12`). The `sentence-transformers` package is now included in the `text_cpu` extra. +- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location. For usage details, see [Text Embeddings](/curate-text/process-data/embeddings) and [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder). @@ -117,7 +116,7 @@ Resolved four HIGH-severity vulnerabilities affecting Curator dependencies: - **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model - **Ray**: Updated to 2.54 - **sentence-transformers**: Added to the `text_cpu` optional dependency group -- **vllm**: New `vllm` optional dependency group +- **vllm**: New vllm optional dependency group - **uv**: Added minimum required version (>=0.7.0) to prevent lockfile revision drift - **nemo-toolkit**: Bumped `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2` to address deserialization CVEs. Only affects `audio_cpu` and `audio_cuda12` extras. 
 - **xgrammar**: Moved from `constraint-dependencies` (`>=0.1.21`) to `override-dependencies` (`>=0.1.32`) to override vLLM's pinned version and address CVE-2026-25048.
diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx
index a9cf636d7f..41341a8af2 100644
--- a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx
+++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx
@@ -1,5 +1,5 @@
 ---
-description: "Generate text embeddings using vLLM, Sentence Transformers, or HuggingFace models for deduplication, similarity search, and downstream tasks"
+description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
 categories: ["how-to-guides"]
 tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
 personas: ["data-scientist-focused", "mle-focused"]
@@ -24,7 +24,7 @@ NeMo Curator provides three embedding backends for text data, each suited to dif
 
 | Backend | Best For | GPU Utilization | Setup |
 | --- | --- | --- | --- |
-| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra |
+| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra |
 | `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
 | `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` (see the sketch below) |
@@ -34,7 +34,7 @@ Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Tran
 
 ## Quick Start
 
-### EmbeddingCreatorStage (Sentence Transformers)
+### EmbeddingCreatorStage
 
 ```python
 from nemo_curator.backends.xenna import XennaExecutor
diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
index 3e113801ac..d2dfcf325c 100644
--- a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
+++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
@@ -30,7 +30,7 @@ Key features:
 
 - **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
 - **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
-- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
+- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
 - **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization
 
 ## Quick Start
@@ -68,7 +68,7 @@
 | `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
 | `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
 | `text_field` | `str` | `"text"` | Name of the input text column in the data |
-| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Recommended for best throughput |
+| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Whether this improves throughput is model-dependent |
 | `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
 | `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
 | `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
@@ -103,23 +103,23 @@ Default vLLM settings applied by the stage (can be overridden):
 
 When `pretokenize=True`, the stage:
 
 1. Loads a Hugging Face `AutoTokenizer` for the specified model
 2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
 3. Passes token IDs directly to vLLM using `TokensPrompt`
 
 Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.
 
 ```python
-# Pretokenize mode (recommended)
+# Direct text mode (recommended for google/embeddinggemma-300m)
 VLLMEmbeddingModelStage(
     model_identifier="google/embeddinggemma-300m",
-    pretokenize=True,  # Tokenize on CPU, embed on GPU
+    pretokenize=False,  # vLLM handles tokenization internally
 )
 
-# Direct text mode (simpler, slightly less throughput)
+# Pretokenize mode (can improve throughput for other models)
 VLLMEmbeddingModelStage(
-    model_identifier="google/embeddinggemma-300m",
-    pretokenize=False,  # vLLM handles tokenization internally
+    model_identifier="intfloat/e5-large-v2",
+    pretokenize=True,  # Tokenize on CPU, embed on GPU
 )
 ```
@@ -127,11 +127,3 @@ VLLMEmbeddingModelStage(
 
 ## Resources
 
 The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.
 
-## When to Use vLLM vs. Sentence Transformers
-
-| Scenario | Recommendation |
-| --- | --- |
-| Large embedding models (>500M params) | vLLM — better GPU utilization and memory management |
-| Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup |
-| Semantic deduplication | vLLM with `google/embeddinggemma-300m` — the default backend |
-| Quick prototyping | `EmbeddingCreatorStage` — simpler setup |