9 changes: 9 additions & 0 deletions fern/versions/v26.04.yml
@@ -189,6 +189,15 @@ navigation:
        - page: Text Cleaning
          path: ./v26.04/pages/curate-text/process-data/content-processing/text-cleaning.mdx
          slug: text-cleaning
      - section: Embeddings
        slug: embeddings
        contents:
          - page: Overview
            path: ./v26.04/pages/curate-text/process-data/embeddings/index.mdx
            slug: ""
          - page: vLLM Embedder
            path: ./v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
            slug: vllm-embedder
      - section: Deduplication
        slug: deduplication
        contents:
12 changes: 12 additions & 0 deletions fern/versions/v26.04/pages/about/release-notes/index.mdx
@@ -12,6 +12,16 @@ modality: "universal"

## What's New in 26.04

### vLLM and Sentence Transformers Embedding Support (PR #1346)

Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs:

- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location.

For usage details, see [Text Embeddings](/curate-text/process-data/embeddings) and [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder).

### Inference Server (Ray Serve)

Built-in LLM serving alongside curation pipelines using Ray Serve and vLLM:
@@ -105,6 +115,8 @@ Resolved four HIGH-severity vulnerabilities affecting Curator dependencies:

- **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model
- **Ray**: Updated to 2.54
- **sentence-transformers**: Added to the `text_cpu` optional dependency group
- **vllm**: Added a new `vllm` optional dependency group
- **uv**: Added minimum required version (>=0.7.0) to prevent lockfile revision drift
- **nemo-toolkit**: Bumped `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2` to address deserialization CVEs. Only affects `audio_cpu` and `audio_cuda12` extras.
- **xgrammar**: Moved from `constraint-dependencies` (`>=0.1.21`) to `override-dependencies` (`>=0.1.32`) to override vLLM's pinned version and address CVE-2026-25048.
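
As a hedged illustration of picking up the new groups (a sketch; the extra names are taken from the bullets above, and the `uv pip` pattern follows the install note later in this PR):

```bash
# Sketch: installing the new optional dependency groups named above.
uv pip install "nemo_curator[text_cpu]"   # adds sentence-transformers
uv pip install "nemo_curator[vllm]"       # adds vllm
```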
@@ -188,13 +188,38 @@ workflow = TextSemanticDeduplicationWorkflow(
)
```

**vLLM Embedder** (recommended for large models):

For large embedding models, you can generate embeddings separately with `VLLMEmbeddingModelStage` before running the deduplication workflow; this provides better GPU utilization and throughput for models with 500M+ parameters. Generate the embeddings using the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) pipeline, then pass the output to `SemanticDeduplicationWorkflow`:

```python
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# After generating embeddings to embedding_output_path using VLLMEmbeddingModelStage
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=output_path,
    n_clusters=100,
    eps=0.07,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
)
semantic_workflow.run()

# Step 3: Filter original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step
```

**Review comment** (Contributor, on lines +197 to +209):

**P1: Missing imports and undefined `executor` in vLLM code block**

The new vLLM embedding code block is missing several imports and uses an undefined variable `executor`:

1. **Undefined `executor`**: `embedding_pipeline.run(executor)` references `executor`, which is never defined in this snippet. The analogous Step-by-Step Workflow above calls `embedding_pipeline.run()` with no executor argument. Either add `executor = XennaExecutor()` (plus its import) or remove the argument to match the existing pattern.
2. **`SemanticDeduplicationWorkflow` not imported**: Used on line 207 but not imported. The correct import is `from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow`.
3. **`Pipeline`, `ParquetReader`, `ParquetWriter` not imported**: Needed for the first half of the snippet but absent. Users copying this block will see multiple `NameError`s.

A standalone, self-contained snippet should include all its imports:

```python
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

executor = XennaExecutor()
```

**When choosing a model**:

- Use models that support vLLM pooling (embedding) mode
- Choose models appropriate for your language or domain
- Prefer models trained for sentence embeddings (for example, EmbeddingGemma, E5, BGE, or SBERT)
- Use `embedding_pretokenize=True` for models that benefit from explicit tokenization control
- Pass additional vLLM configuration through `embedding_vllm_init_kwargs`
- For more control over the embedding process, consider using [VLLMEmbeddingModelStage](/curate-text/process-data/embeddings/vllm-embedder) separately
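
A minimal sketch of the last two options above (hypothetical values; `embedding_pretokenize` and `embedding_vllm_init_kwargs` are the workflow parameters named in the list, and the input/output arguments are placeholders):

```python
# Sketch: forwarding vLLM options through the workflow-level embedding parameters.
# TextSemanticDeduplicationWorkflow is imported as in the example at the top of this page.
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="dedup_output/",
    embedding_pretokenize=True,  # explicit tokenization control
    embedding_vllm_init_kwargs={"gpu_memory_utilization": 0.9},  # extra vLLM engine config
)
workflow.run()
```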
</Accordion>

<Accordion title="Advanced Configuration">
Expand Down
@@ -0,0 +1,84 @@
---
description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Text Embedding

Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.

## How It Works

NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.
2. **`VLLMEmbeddingModelStage`** — A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains.
3. **`SentenceTransformerEmbeddingModelStage`** — A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`.

## Choosing an Embedding Backend

| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra |
| `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` |
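
For the AutoModel path in the last row, a minimal sketch (assuming the Quick Start constructor arguments below plus the `use_sentence_transformer` and `cache_dir` parameters described in the release notes):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# Sketch: select the Hugging Face AutoModel backend instead of SentenceTransformer.
stage = EmbeddingCreatorStage(
    model_identifier="intfloat/e5-large-v2",
    use_sentence_transformer=False,  # use AutoModel for custom pooling strategies
    cache_dir="/models/hf_cache",    # control where model weights are downloaded
    text_field="text",
    embedding_field="embeddings",
)
```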

<Note>
Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks.
</Note>

## Quick Start

### EmbeddingCreatorStage

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="text_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            embedding_field="embeddings",
            model_inference_batch_size=256,
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

**Review comment** (Contributor, on lines +60 to +61):

**P1: Missing `XennaExecutor` import in both code snippets**

`XennaExecutor` is used on lines 59 and 85 without an import. Users copying either snippet will get a `NameError: name 'XennaExecutor' is not defined`. The correct import (as used throughout the tutorials) is `from nemo_curator.backends.xenna import XennaExecutor`.

This applies to both the `EmbeddingCreatorStage` (line 59) and `VLLMEmbeddingModelStage` (line 85) examples. Add the import at the top of each snippet:

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter
```

The same missing import also affects `vllm-embedder.mdx` line 58.

### VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)

`VLLMEmbeddingModelStage` is the default embedding backend for semantic deduplication, using `google/embeddinggemma-300m` as the default model. It provides better GPU utilization and throughput for large embedding models. See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples.
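
For reference, a one-stage sketch (constructor arguments follow the vLLM Embedder guide):

```python
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

# Sketch: the default semantic-deduplication embedder, configured standalone.
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    text_field="text",
    embedding_field="embeddings",
)
```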

---

## Available Embedding Tools

<Cards>

<Card title="vLLM Embedder" href="/curate-text/process-data/embeddings/vllm-embedder">
Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models.
</Card>

</Cards>

---

## Integration with Semantic Deduplication

Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process.
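
A short sketch of that handoff (a minimal configuration; other arguments such as `id_field`, `n_clusters`, and `eps` follow the semantic deduplication guide):

```python
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# Sketch: feed precomputed embeddings (e.g., the Parquet output of the
# Quick Start pipeline above) into semantic deduplication.
SemanticDeduplicationWorkflow(
    input_path="output/",          # Parquet files with an "embeddings" column
    output_path="dedup_output/",
    embedding_field="embeddings",
).run()
```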
@@ -0,0 +1,129 @@
---
description: "Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "gpu-accelerated", "large-models"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# vLLM Embedder

Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models where vLLM's batching and GPU memory management provide significant performance advantages over Sentence Transformers.

<Note>
**Installation**: The vLLM embedder is included in the `text_cuda12` installation. Install it with:

```bash
uv pip install "nemo_curator[text_cuda12]"
```

vLLM is only available on x86_64 Linux systems.
</Note>

## How It Works

`VLLMEmbeddingModelStage` is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.

Key features:

- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
- **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization

## Quick Start

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

**Review comment** (Contributor, on lines +48 to +54):

**P2: Quick Start uses a small model that contradicts vLLM guidance**

The Quick Start example uses `sentence-transformers/all-MiniLM-L6-v2` (~22M parameters), but the entire page is dedicated to vLLM for large models. The comparison table at the bottom of this same page explicitly states "Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup", making this example self-contradictory.

Consider replacing the model identifier with a large embedding model or a descriptive placeholder like `"your-large-embedding-model"` that signals this backend is intended for large models. Using a small model here risks training users to reach for vLLM even when Sentence Transformers would perform better.

## Configuration

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
| `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
| `text_field` | `str` | `"text"` | Name of the input text column in the data |
| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Whether this improves throughput is model-dependent |
| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |
| `verbose` | `bool` | `False` | Enable verbose logging and progress bars |

### vLLM Engine Options

Pass additional vLLM configuration through `vllm_init_kwargs`:

```python
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,  # Disable CUDA graph for debugging
        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)
```

Default vLLM settings applied by the stage (can be overridden):

- `enforce_eager=False` — Uses CUDA graphs for faster inference
- `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
- `model_impl="vllm"` — Uses vLLM's native model implementation
- `disable_log_stats=True` — Suppresses stats logging when `verbose=False`

### Pretokenization

When `pretokenize=True`, the stage:

1. Loads a Hugging Face `AutoTokenizer` for the specified model
2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`

Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

```python
# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)
```

## Resources

The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.
