
docs: add text embeddings guide and release notes for PR #1346 (#1687)

Merged
lbliii merged 5 commits into 26.04-staging from lbliii/pr-1346-docs
Apr 7, 2026

Conversation

@lbliii
Contributor

@lbliii lbliii commented Mar 31, 2026

Description

Adds Fern documentation for the vLLM and Sentence Transformers embedding support introduced in PR #1346. Creates a new Text Embeddings section under Curate Text > Process Data with an overview page covering all three embedding backends and a dedicated vLLM Embedder guide. Updates 26.04 release notes with the feature summary and dependency additions. Expands the semantic deduplication page with a vLLM-based embedding example for large models.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@lbliii lbliii requested a review from a team as a code owner March 31, 2026 13:56
@lbliii lbliii requested review from suiyoubi and removed request for a team March 31, 2026 13:56
Comment on lines +213 to +222
# Step 2: Run deduplication on pre-computed embeddings
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=output_path,
    n_clusters=100,
    eps=0.07,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
)
semantic_workflow.run()
Contributor


P1 Incomplete workflow — output_path contains IDs to remove, not deduplicated text

SemanticDeduplicationWorkflow writes a file of IDs to remove to output_path (see the docstring: "Directory to write output files (i.e. ids to remove)"). After semantic_workflow.run(), users still need a Step 3 to filter their original text dataset using those IDs.

The existing Step-by-Step Workflow accordion on this same page acknowledges this with # Step 6: Remove duplicates from original dataset, but the new vLLM block gives no indication that the workflow is incomplete, which will confuse users who expect output_path to contain final deduplicated documents.

Consider adding a placeholder comment (or a link to TextDuplicatesRemovalWorkflow) so that the gap is visible:

semantic_workflow.run()

# Step 3: Filter original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step
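The gap the comment describes can be made concrete with a small, hypothetical sketch of the missing removal step. The record layout and helper name below are illustrative only; they are not Curator's actual I/O, and the real mechanism is the `TextDuplicatesRemovalWorkflow` the comment points to. Only the field name `_curator_dedup_id` is taken from the snippet above.

```python
# Hypothetical sketch of Step 3: filter the original dataset using the
# IDs that SemanticDeduplicationWorkflow wrote to output_path.
# The in-memory record format here is illustrative, not Curator's actual I/O.

def remove_duplicates(records: list[dict], ids_to_remove: set[str]) -> list[dict]:
    """Keep only records whose dedup ID was not flagged for removal."""
    return [r for r in records if r["_curator_dedup_id"] not in ids_to_remove]

records = [
    {"_curator_dedup_id": "a", "text": "doc one"},
    {"_curator_dedup_id": "b", "text": "doc two"},
    {"_curator_dedup_id": "c", "text": "near-duplicate of doc two"},
]

# Pretend the workflow flagged "c" for removal.
deduped = remove_duplicates(records, ids_to_remove={"c"})
```

The point of the sketch is only that `output_path` holds IDs to drop, so a final filtering pass over the original text is still required.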

Replace duplicated vLLM Quick Start in embeddings overview and semdedup
page with cross-references to the canonical vllm-embedder page. Replace
placeholder "large-embedding-model" with consistent model identifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
- **New install extras**: `inference_server` (Ray Serve + vLLM dependencies) and `sdg_cuda12` (SDG with local inference support).
- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
Contributor


Suggested change
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.

- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers and HuggingFace `AutoModel` backends. Added `cache_dir` parameter for controlling model download location.
- **New `vllm` optional dependency**: Install with `pip install nemo_curator[vllm]` (x86_64 Linux only). The `sentence-transformers` package is now included in the `text_cpu` extra.
Contributor


Users should never install Curator with only the vllm dependency. It is automatically included with the relevant modality installations (text_cuda12, video_cuda12, math_cuda12).


Fixed a race condition in `CaptionGenerationStage` and `CaptionEnhancementStage` where multiple workers simultaneously initializing vLLM would race on the shared `torch.compile` cache directory, causing `FileNotFoundError`. Model initialization now runs once per node in `setup_on_node()` instead of per-worker in `setup()`, matching the pattern used by text vLLM stages.
- **sentence-transformers**: Added to the `text_cpu` optional dependency group
- **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only)
Contributor


Suggested change
- **vllm**: New `vllm` optional dependency group (`pip install nemo_curator[vllm]`, x86_64 Linux only)
- **vllm**: New `vllm` optional dependency group


NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag.
Contributor


Suggested change
1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers and HuggingFace `AutoModel` backends via the `use_sentence_transformer` flag.
1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.


| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model |
Contributor


Suggested change
| `model_identifier` | `str` | Required | HuggingFace model name or path for the embedding model |
| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |

| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
| `hf_token` | `str` | `None` | HuggingFace token for accessing gated models |
Contributor


Suggested change
| `hf_token` | `str` | `None` | HuggingFace token for accessing gated models |
| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |


When `pretokenize=True`, the stage:

1. Loads a HuggingFace `AutoTokenizer` for the specified model
Contributor


Suggested change
1. Loads a HuggingFace `AutoTokenizer` for the specified model
1. Loads a Hugging Face `AutoTokenizer` for the specified model

2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`

This mode is recommended for production workloads. Benchmarks show it provides the best per-task throughput across both small and large embedding models by reducing GPU idle time during tokenization.
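The three steps above can be sketched end to end. The toy tokenizer and prompt stand-ins below are NOT the real libraries (in the actual stage they are a Hugging Face `AutoTokenizer` and vLLM's `TokensPrompt`); the sketch only shows the data flow: tokenize on CPU, truncate to `max_model_len`, hand token IDs straight to the engine.

```python
# Illustrative sketch of the pretokenize=True path, with pure-Python
# stand-ins for AutoTokenizer and vLLM's TokensPrompt.

MAX_MODEL_LEN = 4  # tiny value for illustration


def toy_tokenize(text: str) -> list[int]:
    """Stand-in for AutoTokenizer: one token ID per whitespace word."""
    return [hash(w) % 30000 for w in text.split()]


def tokens_prompt(token_ids: list[int]) -> dict:
    """Stand-in for vLLM's TokensPrompt({"prompt_token_ids": ...})."""
    return {"prompt_token_ids": token_ids}


batch = ["short doc", "a much longer document that exceeds the model length"]

prompts = []
for text in batch:
    # Step 2: tokenize on CPU with truncation to max_model_len
    ids = toy_tokenize(text)[:MAX_MODEL_LEN]
    # Step 3: pass token IDs directly to the inference engine
    prompts.append(tokens_prompt(ids))
```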
Contributor


pretokenize=False is recommended for Embedding Gemma, which is why it is the default for semantic deduplication. It can be very model-dependent.

| Scenario | Recommendation |
| --- | --- |
| Large embedding models (>500M params) | vLLM — better GPU utilization and memory management |
| Small embedding models (<100M params) | Sentence Transformers — lower overhead, faster startup |
| High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput |
Contributor


Suggested change
| High-throughput production pipelines | vLLM with `pretokenize=True` — best amortized throughput |

- Apply sarahyurick's review feedback:
  - Fix EmbeddingCreatorStage description to reference SentenceTransformer and AutoModel classes
  - Update vllm dependency info: included via text_cuda12, not installed separately
  - Use uv instead of pip in install commands
  - Fix model identifier to google/embeddinggemma-300m
  - Update vLLM as recommended backend for semantic dedup
  - Fix pretokenize recommendation (False for Embedding Gemma)
  - Fix HuggingFace -> Hugging Face capitalization
  - Update comparison tables and recommendations
- Resolve merge conflicts with 26.04-staging
- Add missing removal step comment in semdedup vLLM workflow

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx Outdated
The workflow now uses VLLMEmbeddingModelStage internally, not
EmbeddingCreatorStage.

Signed-off-by: Logan Lane <lbliii@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Contributor

@sarahyurick sarahyurick left a comment


Thanks! Left a few more small comments.


- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.
Contributor


Suggested change
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classe. Added `cache_dir` parameter for controlling model download location.
- **`EmbeddingCreatorStage` enhancements**: Added `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added `cache_dir` parameter for controlling model download location.


| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra |
Contributor


Suggested change
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cpu` extra |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in `text_cuda12` extra |

Since all embedding stages need GPU, we should recommend this one.


- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
Contributor


Suggested change
- **Model download caching**: Automatically downloads and caches models from HuggingFace Hub
- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub

- Fix "HuggingFace" to "Hugging Face" everywhere
- Remove vllm install instructions (included in text_cuda12)
- Fix "classe" typo to "classes" in release notes
- Update Setup column to recommend text_cuda12
- Position vLLM as recommended for semantic dedup
- Fix pretokenize recommendation (model-dependent, not universal)
- Remove vLLM vs ST comparison table per reviewer request
- Use correct model identifier google/embeddinggemma-300m

Signed-off-by: Lawrence Lane <llane@nvidia.com>
