docs: add text embeddings guide and release notes for PR #1346 #1687
lbliii merged 5 commits into 26.04-staging
Conversation
| # Step 2: Run deduplication on pre-computed embeddings | ||
| semantic_workflow = SemanticDeduplicationWorkflow( | ||
| input_path=embedding_output_path, | ||
| output_path=output_path, | ||
| n_clusters=100, | ||
| eps=0.07, | ||
| id_field="_curator_dedup_id", | ||
| embedding_field="embeddings", | ||
| ) | ||
| semantic_workflow.run() |
Incomplete workflow — output_path contains IDs to remove, not deduplicated text
SemanticDeduplicationWorkflow writes a file of IDs to remove to output_path (see the docstring: "Directory to write output files (i.e. ids to remove)"). After semantic_workflow.run(), users still need a Step 3 to filter their original text dataset using those IDs.
The existing Step-by-Step Workflow accordion on this same page acknowledges this with # Step 6: Remove duplicates from original dataset, but the new vLLM block gives no indication that the workflow is incomplete, which will confuse users who expect output_path to contain final deduplicated documents.
Consider adding a placeholder comment (or a link to TextDuplicatesRemovalWorkflow) so that the gap is visible:
semantic_workflow.run()
# Step 3: Filter original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step

Replace duplicated vLLM Quick Start in embeddings overview and semdedup page with cross-references to the canonical vllm-embedder page. Replace placeholder "large-embedding-model" with consistent model identifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
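For readers following this thread, the missing removal step amounts to an anti-join against the IDs file: drop every document whose ID appears in the IDs-to-remove output. A minimal, library-free sketch of that filtering logic (the in-memory record layout is an assumption for illustration; in practice `TextDuplicatesRemovalWorkflow` handles this at scale):

```python
def filter_removed(records, ids_to_remove, id_field="_curator_dedup_id"):
    """Keep only records whose ID is not flagged for removal."""
    remove = set(ids_to_remove)
    return [r for r in records if r[id_field] not in remove]

# Toy stand-in for the original text dataset and the IDs-to-remove output.
records = [
    {"_curator_dedup_id": 0, "text": "unique doc"},
    {"_curator_dedup_id": 1, "text": "near-duplicate"},
    {"_curator_dedup_id": 2, "text": "another unique doc"},
]
kept = filter_removed(records, ids_to_remove=[1])
```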
| - **New install extras**: `inference_server` (Ray Serve + vLLM dependencies) and `sdg_cuda12` (SDG with local inference support). | ||
| - **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers. | ||
| - **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem. | ||
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location. |
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location. | |
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location. |
| - **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers. | ||
| - **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem. | ||
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location. | ||
| - **New `vllm` optional dependency**: Install with `pip install nemo_curator[vllm]` (x86_64 Linux only). The `sentence-transformers` package is now included in the `text_cpu` extra. |
Users should never install Curator with only the vllm dependency. It is automatically included with the relevant modality installations (text_cuda12, video_cuda12, math_cuda12).
| Fixed a race condition in `CaptionGenerationStage` and `CaptionEnhancementStage` where multiple workers simultaneously initializing vLLM would race on the shared `torch.compile` cache directory, causing `FileNotFoundError`. Model initialization now runs once per node in `setup_on_node()` instead of per-worker in `setup()`, matching the pattern used by text vLLM stages. | ||
| - **sentence-transformers**: Added to the `text_cpu` optional dependency group | ||
| - **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only) |
| - **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only) | |
| - **vllm**: New `vllm` optional dependency group |
| NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements: | ||
| 1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag. |
| 1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag. | |
| 1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag. |
| | Parameter | Type | Default | Description | | ||
| | --- | --- | --- | --- | | ||
| | `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model | |
| | `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model | | |
| | `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model | |
| | `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column | | ||
| | `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) | | ||
| | `cache_dir` | `str` | `None` | Directory for caching downloaded model files | | ||
| | `hf_token` | `str` | `None` | HuggingFace token for accessing gated models | |
| | `hf_token` | `str` | `None` | HuggingFace token for accessing gated models | | |
| | `hf_token` | `str` | `None` | Hugging Face token for accessing gated models | |
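Taken together, the parameters above form a small configuration surface. The dataclass below is an illustrative mirror of the table only, not the actual stage signature:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingStageConfig:
    # Illustrative config mirroring the parameter table above;
    # not the real NeMo Curator stage signature.
    model_identifier: str                  # required: Hugging Face model name or path
    embedding_field: str = "embeddings"    # name of the output embedding column
    max_chars: Optional[int] = None        # truncate documents before tokenization
    cache_dir: Optional[str] = None        # model download cache location
    hf_token: Optional[str] = None         # token for gated models

cfg = EmbeddingStageConfig(model_identifier="google/embeddinggemma-300m")
```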
| When `pretokenize=True`, the stage: | ||
| 1. Loads a HuggingFace `AutoTokenizer` for the specified model |
| 1. Loads a HuggingFace `AutoTokenizer` for the specified model | |
| 1. Loads a Hugging Face `AutoTokenizer` for the specified model |
| 2. Tokenizes the input text batch on CPU with truncation to `max_model_len` | ||
| 3. Passes token IDs directly to vLLM using `TokensPrompt` | ||
| This mode is recommended for production workloads. Benchmarks show it provides the best per-task throughput across both small and large embedding models by reducing GPU idle time during tokenization. |
pretokenize=False is recommended (for Embedding Gemma), which is why it is the default for semantic deduplication. The best setting can be very model dependent.
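The three pretokenization steps quoted above can be sketched without the real libraries; here a toy whitespace tokenizer stands in for `AutoTokenizer`, and a plain dict stands in for vLLM's `TokensPrompt` (both stand-ins are assumptions for illustration):

```python
def toy_tokenize(text: str) -> list[int]:
    # Stand-in for a Hugging Face AutoTokenizer: map each whitespace
    # token to a fake vocabulary ID.
    return [hash(tok) % 30000 for tok in text.split()]

def pretokenize_batch(texts: list[str], max_model_len: int) -> list[dict]:
    # CPU-side tokenization with truncation to max_model_len, producing
    # dicts shaped like vLLM's TokensPrompt ({"prompt_token_ids": [...]}).
    prompts = []
    for text in texts:
        ids = toy_tokenize(text)[:max_model_len]
        prompts.append({"prompt_token_ids": ids})
    return prompts

batch = pretokenize_batch(["one two three four", "short"], max_model_len=3)
```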
| | --- | --- | | ||
| | Large embedding models (>500M params) | vLLM — better GPU utilization and memory management | | ||
| | Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup | | ||
| | High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput | |
| | High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput | |
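The size thresholds in the table above reduce to a simple dispatch rule; a hedged sketch (the cutoff and the string labels are illustrative, not an official API):

```python
def choose_embedding_backend(n_params: int) -> str:
    # Illustrative rule of thumb from the comparison table above:
    # vLLM for large models (>500M params), Sentence Transformers
    # otherwise. Per the review discussion, the pretokenize setting
    # is model dependent, so it is not hard-coded here.
    if n_params > 500_000_000:
        return "VLLMEmbeddingModelStage"
    return "EmbeddingCreatorStage (Sentence Transformers)"

backend = choose_embedding_backend(300_000_000)
```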
- Apply sarahyurick's review feedback:
  - Fix EmbeddingCreatorStage description to reference SentenceTransformer and AutoModel classes
  - Update vllm dependency info: included via text_cuda12, not installed separately
  - Use uv instead of pip in install commands
  - Fix model identifier to google/embeddinggemma-300m
  - Update vLLM as recommended backend for semantic dedup
  - Fix pretokenize recommendation (False for Embedding Gemma)
  - Fix HuggingFace -> Hugging Face capitalization
  - Update comparison tables and recommendations
- Resolve merge conflicts with 26.04-staging
- Add missing removal step comment in semdedup vLLM workflow

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The workflow now uses VLLMEmbeddingModelStage internally, not EmbeddingCreatorStage.

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
sarahyurick left a comment
Thanks! Left a few more small comments.
| - **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers. | ||
| - **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem. | ||
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location. |
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location. | |
| - **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location. |
| | Backend | Best For | GPU Utilization | Setup | | ||
| | --- | --- | --- | --- | | ||
| | `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra | |
| | `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra | | |
| | `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra | |
Since all embedding stages need GPU, we should recommend this one.
| - **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput | ||
| - **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization | ||
| - **Model download caching**: Automatically downloads and caches models from HuggingFace Hub |
| - **Model download caching**: Automatically downloads and caches models from HuggingFace Hub | |
| - **Model download caching**: Automatically downloads and caches models from Hugging Face Hub |
- Fix "HuggingFace" to "Hugging Face" everywhere
- Remove vllm install instructions (included in text_cuda12)
- Fix "classe" typo to "classes" in release notes
- Update Setup column to recommend text_cuda12
- Position vLLM as recommended for semantic dedup
- Fix pretokenize recommendation (model-dependent, not universal)
- Remove vLLM vs ST comparison table per reviewer request
- Use correct model identifier google/embeddinggemma-300m

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Description
Adds Fern documentation for the vLLM and Sentence Transformers embedding support introduced in PR #1346. Creates a new Text Embeddings section under Curate Text > Process Data with an overview page covering all three embedding backends and a dedicated vLLM Embedder guide. Updates 26.04 release notes with the feature summary and dependency additions. Expands the semantic deduplication page with a vLLM-based embedding example for large models.
Checklist