9 changes: 9 additions & 0 deletions fern/versions/v26.04.yml
@@ -189,6 +189,15 @@ navigation:
        - page: Text Cleaning
          path: ./v26.04/pages/curate-text/process-data/content-processing/text-cleaning.mdx
          slug: text-cleaning
      - section: Embeddings
        slug: embeddings
        contents:
          - page: Overview
            path: ./v26.04/pages/curate-text/process-data/embeddings/index.mdx
            slug: ""
          - page: vLLM Embedder
            path: ./v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
            slug: vllm-embedder
      - section: Deduplication
        slug: deduplication
        contents:
12 changes: 12 additions & 0 deletions fern/versions/v26.04/pages/about/release-notes/index.mdx
@@ -12,6 +12,16 @@ modality: "universal"

## What's New in 26.04

### vLLM and Sentence Transformers Embedding Support (PR #1346)

Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs:

- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location.

For usage details, see [Text Embeddings](/curate-text/process-data/embeddings) and [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder).

### Inference Server (Ray Serve)

Built-in LLM serving alongside curation pipelines using Ray Serve and vLLM:
@@ -105,6 +115,8 @@ Resolved four HIGH-severity vulnerabilities affecting Curator dependencies:

- **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model
- **Ray**: Updated to 2.54
- **sentence-transformers**: Added to the `text_cpu` optional dependency group
- **vllm**: Added a new `vllm` optional dependency group
- **uv**: Added minimum required version (>=0.7.0) to prevent lockfile revision drift
- **nemo-toolkit**: Bumped `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2` to address deserialization CVEs. Only affects `audio_cpu` and `audio_cuda12` extras.
- **xgrammar**: Moved from `constraint-dependencies` (`>=0.1.21`) to `override-dependencies` (`>=0.1.32`) to override vLLM's pinned version and address CVE-2026-25048.
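
As a hedged illustration of picking up the new groups (a sketch; the extra names are taken from the bullets above, and the `uv pip` pattern follows the install note later in this PR):

```bash
# Sketch: installing the new optional dependency groups named above.
uv pip install "nemo_curator[text_cpu]"   # adds sentence-transformers
uv pip install "nemo_curator[vllm]"       # adds vllm
```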
@@ -188,13 +188,38 @@ workflow = TextSemanticDeduplicationWorkflow(
)
```

**vLLM Embedder** (recommended for large models):

For large embedding models, you can generate embeddings separately with `VLLMEmbeddingModelStage` before running the deduplication workflow; this provides better GPU utilization and throughput for models with 500M+ parameters. Generate the embeddings using the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) pipeline, then pass the output to `SemanticDeduplicationWorkflow`:

```python
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# After generating embeddings to embedding_output_path using VLLMEmbeddingModelStage
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=output_path,
    n_clusters=100,
    eps=0.07,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
)
semantic_workflow.run()

# Step 3: Filter original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step
```

**Review comment** (Contributor, on lines +197 to +209):

**P1: Missing imports and undefined `executor` in vLLM code block**

The new vLLM embedding code block is missing several imports and uses an undefined variable `executor`:

1. **Undefined `executor`**: `embedding_pipeline.run(executor)` references `executor`, which is never defined in this snippet. The analogous Step-by-Step Workflow above calls `embedding_pipeline.run()` with no executor argument. Either add `executor = XennaExecutor()` (plus its import) or remove the argument to match the existing pattern.
2. **`SemanticDeduplicationWorkflow` not imported**: Used on line 207 but not imported. The correct import is `from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow`.
3. **`Pipeline`, `ParquetReader`, `ParquetWriter` not imported**: Needed for the first half of the snippet but absent. Users copying this block will see multiple `NameError`s.

A standalone, self-contained snippet should include all its imports:

```python
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

executor = XennaExecutor()
```

**When choosing a model**:

- Use models that support vLLM pooling (embedding) mode
- Choose models appropriate for your language or domain
- Prefer models trained for sentence embeddings (for example, EmbeddingGemma, E5, BGE, or SBERT)
- Use `embedding_pretokenize=True` for models that benefit from explicit tokenization control
- Pass additional vLLM configuration through `embedding_vllm_init_kwargs`
- For more control over the embedding process, consider using [VLLMEmbeddingModelStage](/curate-text/process-data/embeddings/vllm-embedder) separately
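
A minimal sketch of the last two options above (hypothetical values; `embedding_pretokenize` and `embedding_vllm_init_kwargs` are the workflow parameters named in the list, and the input/output arguments are placeholders):

```python
# Sketch: forwarding vLLM options through the workflow-level embedding parameters.
# TextSemanticDeduplicationWorkflow is imported as in the example at the top of this page.
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="dedup_output/",
    embedding_pretokenize=True,  # explicit tokenization control
    embedding_vllm_init_kwargs={"gpu_memory_utilization": 0.9},  # extra vLLM engine config
)
workflow.run()
```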
</Accordion>

<Accordion title="Advanced Configuration">
Expand Down
@@ -0,0 +1,84 @@
---
description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Text Embedding

Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.

## How It Works

NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.
2. **`VLLMEmbeddingModelStage`** — A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains.
3. **`SentenceTransformerEmbeddingModelStage`** — A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`.

## Choosing an Embedding Backend

| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra |
| `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` |
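
For the AutoModel path in the last row, a minimal sketch (assuming the Quick Start constructor arguments below plus the `use_sentence_transformer` and `cache_dir` parameters described in the release notes):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# Sketch: select the Hugging Face AutoModel backend instead of SentenceTransformer.
stage = EmbeddingCreatorStage(
    model_identifier="intfloat/e5-large-v2",
    use_sentence_transformer=False,  # use AutoModel for custom pooling strategies
    cache_dir="/models/hf_cache",    # control where model weights are downloaded
    text_field="text",
    embedding_field="embeddings",
)
```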

<Note>
Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks.
</Note>

## Quick Start

### EmbeddingCreatorStage

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="text_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            embedding_field="embeddings",
            model_inference_batch_size=256,
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

**Review comment** (Contributor, on lines +60 to +61):

**P1: Missing `XennaExecutor` import in both code snippets**

`XennaExecutor` is used on lines 59 and 85 without an import. Users copying either snippet will get a `NameError: name 'XennaExecutor' is not defined`. The correct import (as used throughout the tutorials) is `from nemo_curator.backends.xenna import XennaExecutor`.

This applies to both the `EmbeddingCreatorStage` (line 59) and `VLLMEmbeddingModelStage` (line 85) examples. Add the import at the top of each snippet:

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter
```

The same missing import also affects `vllm-embedder.mdx` line 58.

### VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)

`VLLMEmbeddingModelStage` is the default embedding backend for semantic deduplication, using `google/embeddinggemma-300m` as the default model. It provides better GPU utilization and throughput for large embedding models. See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples.
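
For reference, a one-stage sketch (constructor arguments follow the vLLM Embedder guide):

```python
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

# Sketch: the default semantic-deduplication embedder, configured standalone.
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    text_field="text",
    embedding_field="embeddings",
)
```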

---

## Available Embedding Tools

<Cards>

<Card title="vLLM Embedder" href="/curate-text/process-data/embeddings/vllm-embedder">
Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models.
</Card>

</Cards>

---

## Integration with Semantic Deduplication

Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process.
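
A short sketch of that handoff (a minimal configuration; other arguments such as `id_field`, `n_clusters`, and `eps` follow the semantic deduplication guide):

```python
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# Sketch: feed precomputed embeddings (e.g., the Parquet output of the
# Quick Start pipeline above) into semantic deduplication.
SemanticDeduplicationWorkflow(
    input_path="output/",          # Parquet files with an "embeddings" column
    output_path="dedup_output/",
    embedding_field="embeddings",
).run()
```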
@@ -0,0 +1,129 @@
---
description: "Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "gpu-accelerated", "large-models"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# vLLM Embedder

Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models where vLLM's batching and GPU memory management provide significant performance advantages over Sentence Transformers.

<Note>
**Installation**: The vLLM embedder is included in the `text_cuda12` installation. Install it with:

```bash
uv pip install "nemo_curator[text_cuda12]"
```

vLLM is only available on x86_64 Linux systems.
</Note>

## How It Works

`VLLMEmbeddingModelStage` is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.

Key features:

- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
- **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization

## Quick Start

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

**Review comment** (Contributor, on lines +48 to +54):

**P2: Quick Start uses a small model that contradicts vLLM guidance**

The Quick Start example uses `sentence-transformers/all-MiniLM-L6-v2` (~22M parameters), but the entire page is dedicated to vLLM for large models. The comparison table at the bottom of this same page explicitly states "Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup", making this example self-contradictory.

Consider replacing the model identifier with a large embedding model or a descriptive placeholder like `"your-large-embedding-model"` that signals this backend is intended for large models. Using a small model here risks training users to reach for vLLM even when Sentence Transformers would perform better.

## Configuration

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
| `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
| `text_field` | `str` | `"text"` | Name of the input text column in the data |
| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Whether this improves throughput is model-dependent |
| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |
| `verbose` | `bool` | `False` | Enable verbose logging and progress bars |

### vLLM Engine Options

Pass additional vLLM configuration through `vllm_init_kwargs`:

```python
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,  # Disable CUDA graph for debugging
        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)
```

Default vLLM settings applied by the stage (can be overridden):

- `enforce_eager=False` — Uses CUDA graphs for faster inference
- `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
- `model_impl="vllm"` — Uses vLLM's native model implementation
- `disable_log_stats=True` — Suppresses stats logging when `verbose=False`

### Pretokenization

When `pretokenize=True`, the stage:

1. Loads a Hugging Face `AutoTokenizer` for the specified model
2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`

Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

```python
# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)
```

## Resources

The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.
