
docs: add text embeddings guide and release notes for PR #1346 (#1687)

Merged
lbliii merged 5 commits into 26.04-staging from lbliii/pr-1346-docs
Apr 7, 2026

Conversation

@lbliii
Contributor

@lbliii lbliii commented Mar 31, 2026

Description

Adds Fern documentation for the vLLM and Sentence Transformers embedding support introduced in PR #1346. Creates a new Text Embeddings section under Curate Text > Process Data with an overview page covering all three embedding backends and a dedicated vLLM Embedder guide. Updates 26.04 release notes with the feature summary and dependency additions. Expands the semantic deduplication page with a vLLM-based embedding example for large models.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@lbliii lbliii requested a review from a team as a code owner March 31, 2026 13:56
@lbliii lbliii requested review from suiyoubi and removed request for a team March 31, 2026 13:56
Comment on lines +213 to +222
# Step 2: Run deduplication on pre-computed embeddings
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=output_path,
    n_clusters=100,
    eps=0.07,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
)
semantic_workflow.run()
Contributor


P1 Incomplete workflow — output_path contains IDs to remove, not deduplicated text

SemanticDeduplicationWorkflow writes a file of IDs to remove to output_path (see the docstring: "Directory to write output files (i.e. ids to remove)"). After semantic_workflow.run(), users still need a Step 3 to filter their original text dataset using those IDs.

The existing Step-by-Step Workflow accordion on this same page acknowledges this with # Step 6: Remove duplicates from original dataset, but the new vLLM block gives no indication that the workflow is incomplete, which will confuse users who expect output_path to contain final deduplicated documents.

Consider adding a placeholder comment (or a link to TextDuplicatesRemovalWorkflow) so that the gap is visible:

semantic_workflow.run()

# Step 3: Filter original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step
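The gap the comment describes can be made concrete with a small, hypothetical sketch of the missing removal step. The record layout and helper name below are illustrative only; they are not Curator's actual I/O, and the real mechanism is the `TextDuplicatesRemovalWorkflow` the comment points to. Only the field name `_curator_dedup_id` is taken from the snippet above.

```python
# Hypothetical sketch of Step 3: filter the original dataset using the
# IDs that SemanticDeduplicationWorkflow wrote to output_path.
# The in-memory record format here is illustrative, not Curator's actual I/O.

def remove_duplicates(records: list[dict], ids_to_remove: set[str]) -> list[dict]:
    """Keep only records whose dedup ID was not flagged for removal."""
    return [r for r in records if r["_curator_dedup_id"] not in ids_to_remove]

records = [
    {"_curator_dedup_id": "a", "text": "doc one"},
    {"_curator_dedup_id": "b", "text": "doc two"},
    {"_curator_dedup_id": "c", "text": "near-duplicate of doc two"},
]

# Pretend the workflow flagged "c" for removal.
deduped = remove_duplicates(records, ids_to_remove={"c"})
```

The point of the sketch is only that `output_path` holds IDs to drop, so a final filtering pass over the original text is still required.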

Replace duplicated vLLM Quick Start in embeddings overview and semdedup
page with cross-references to the canonical vllm-embedder page. Replace
placeholder "large-embedding-model" with consistent model identifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
- **New install extras**: `inference_server` (Ray Serve + vLLM dependencies) and `sdg_cuda12` (SDG with local inference support).
- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
Contributor


Suggested change
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.

- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
- **New `vllm` optional dependency**: Install with `pip install nemo_curator[vllm]` (x86_64 Linux only). The `sentence-transformers` package is now included in the `text_cpu` extra.
Contributor


Users should never install Curator with only the vllm dependency. It is automatically included with the relevant modality installations (text_cuda12, video_cuda12, math_cuda12).


Fixed a race condition in `CaptionGenerationStage` and `CaptionEnhancementStage` where multiple workers simultaneously initializing vLLM would race on the shared `torch.compile` cache directory, causing `FileNotFoundError`. Model initialization now runs once per node in `setup_on_node()` instead of per-worker in `setup()`, matching the pattern used by text vLLM stages.
- **sentence-transformers**: Added to the `text_cpu` optional dependency group
- **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only)
Contributor


Suggested change
- **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only)
- **vllm**: New `vllm` optional dependency group


NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag.
Contributor


Suggested change
1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag.
1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.


| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model |
Contributor


Suggested change
| `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model |
| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |

| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
| `hf_token` | `str` | `None` | HuggingFace token for accessing gated models |
Contributor


Suggested change
| `hf_token` | `str` | `None` | HuggingFace token for accessing gated models |
| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |


When `pretokenize=True`, the stage:

1. Loads a HuggingFace `AutoTokenizer` for the specified model
Contributor


Suggested change
1. Loads a HuggingFace `AutoTokenizer` for the specified model
1. Loads a Hugging Face `AutoTokenizer` for the specified model

2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`

This mode is recommended for production workloads. Benchmarks show it provides the best per-task throughput across both small and large embedding models by reducing GPU idle time during tokenization.
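The three steps above can be sketched end to end. The toy tokenizer and prompt stand-ins below are NOT the real libraries (in the actual stage they are a Hugging Face `AutoTokenizer` and vLLM's `TokensPrompt`); the sketch only shows the data flow: tokenize on CPU, truncate to `max_model_len`, hand token IDs straight to the engine.

```python
# Illustrative sketch of the pretokenize=True path, with pure-Python
# stand-ins for AutoTokenizer and vLLM's TokensPrompt.

MAX_MODEL_LEN = 4  # tiny value for illustration


def toy_tokenize(text: str) -> list[int]:
    """Stand-in for AutoTokenizer: one token ID per whitespace word."""
    return [hash(w) % 30000 for w in text.split()]


def tokens_prompt(token_ids: list[int]) -> dict:
    """Stand-in for vLLM's TokensPrompt({"prompt_token_ids": ...})."""
    return {"prompt_token_ids": token_ids}


batch = ["short doc", "a much longer document that exceeds the model length"]

prompts = []
for text in batch:
    # Step 2: tokenize on CPU with truncation to max_model_len
    ids = toy_tokenize(text)[:MAX_MODEL_LEN]
    # Step 3: pass token IDs directly to the inference engine
    prompts.append(tokens_prompt(ids))
```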
Contributor


pretokenize=False is recommended for Embedding Gemma, which is why it is the default for semantic deduplication. It can be very model-dependent.

| Scenario | Recommendation |
| --- | --- |
| Large embedding models (>500M params) | vLLM — better GPU utilization and memory management |
| Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup |
| High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput |
Contributor


Suggested change
| High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput |

- Apply sarahyurick's review feedback:
  - Fix EmbeddingCreatorStage description to reference SentenceTransformer and AutoModel classes
  - Update vllm dependency info: included via text_cuda12, not installed separately
  - Use uv instead of pip in install commands
  - Fix model identifier to google/embeddinggemma-300m
  - Update vLLM as recommended backend for semantic dedup
  - Fix pretokenize recommendation (False for Embedding Gemma)
  - Fix HuggingFace -> Hugging Face capitalization
  - Update comparison tables and recommendations
- Resolve merge conflicts with 26.04-staging
- Add missing removal step comment in semdedup vLLM workflow

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx Outdated
The workflow now uses VLLMEmbeddingModelStage internally, not
EmbeddingCreatorStage.

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Contributor

@sarahyurick sarahyurick left a comment


Thanks! Left a few more small comments.


- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.
Contributor


Suggested change
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location.


| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra |
Contributor


Suggested change
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra |

Since all embedding stages need GPU, we should recommend this one.


- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
Contributor


Suggested change
- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub

- Fix "HuggingFace" to "Hugging Face" everywhere
- Remove vllm install instructions (included in text_cuda12)
- Fix "classe" typo to "classes" in release notes
- Update Setup column to recommend text_cuda12
- Position vLLM as recommended for semantic dedup
- Fix pretokenize recommendation (model-dependent, not universal)
- Remove vLLM vs ST comparison table per reviewer request
- Use correct model identifier google/embeddinggemma-300m

Signed-off-by: Lawrence Lane <llane@nvidia.com>
