docs: add text embeddings guide and release notes for PR #1346 #1687

---
description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Text Embedding

Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.

## How It Works

NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.
2. **`VLLMEmbeddingModelStage`** — A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains.
3. **`SentenceTransformerEmbeddingModelStage`** — A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`.

## Choosing an Embedding Backend

| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., `all-MiniLM-L6-v2`) | Good | Included in `text_cuda12` extra |
| `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` |
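
For the AutoModel row, the switch is a single flag. A minimal sketch, using only parameters that appear elsewhere in this guide (the model identifier is an illustrative choice):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# AutoModel path: load the model through Hugging Face's AutoModel classes
# instead of SentenceTransformer (see "How It Works" above).
stage = EmbeddingCreatorStage(
    model_identifier="intfloat/e5-large-v2",  # illustrative model choice
    use_sentence_transformer=False,
    text_field="text",
    embedding_field="embeddings",
    model_inference_batch_size=256,
)
```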

<Note>
Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks.
</Note>

## Quick Start

### EmbeddingCreatorStage

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="text_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            embedding_field="embeddings",
            model_inference_batch_size=256,
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

### VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)

`VLLMEmbeddingModelStage` is the default embedding backend for semantic deduplication, using `google/embeddinggemma-300m`. It provides better GPU utilization and throughput for large embedding models. See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples.

---

## Available Embedding Tools

<Cards>

<Card title="vLLM Embedder" href="/curate-text/process-data/embeddings/vllm-embedder">
Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models.
</Card>

</Cards>

---

## Integration with Semantic Deduplication

Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process.
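
A minimal sketch of that two-step pattern follows. The embedding half reuses stages shown in this guide; the `SemanticDeduplicationWorkflow` import path is taken from the review discussion on this PR, and its constructor arguments and `run()` call are hypothetical placeholders, so check the semantic deduplication guide for the actual signature:

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# Step 1: generate embeddings with the same stage the workflow uses internally.
embedding_pipeline = Pipeline(
    name="embeddings_for_dedup",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="embeddings/", fields=["text", "embeddings"]),
    ],
)
executor = XennaExecutor()
embedding_pipeline.run(executor)

# Step 2: feed the precomputed embeddings into the deduplication workflow.
# The keyword arguments below are hypothetical placeholders for illustration.
workflow = SemanticDeduplicationWorkflow(
    input_path="embeddings/",      # assumed: directory written in step 1
    output_path="dedup_output/",   # assumed
    embedding_field="embeddings",  # assumed: reuse the field name from step 1
)
workflow.run()
```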

---
description: "Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "gpu-accelerated", "large-models"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# vLLM Embedder

Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models where vLLM's batching and GPU memory management provide significant performance advantages over Sentence Transformers.

<Note>
**Installation**: The vLLM embedder is included in the `text_cuda12` installation. Install it with:

```bash
uv pip install "nemo_curator[text_cuda12]"
```

vLLM is only available on x86_64 Linux systems.
</Note>

## How It Works

`VLLMEmbeddingModelStage` is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.

Key features:

- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
- **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization

## Quick Start

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

## Configuration

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
| `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
| `text_field` | `str` | `"text"` | Name of the input text column in the data |
| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM. Whether this improves throughput is model-dependent |
| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |
| `verbose` | `bool` | `False` | Enable verbose logging and progress bars |
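
The remaining options compose in the obvious way. A configuration sketch using only keywords from the table above, with illustrative placeholder values:

```python
import os

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

# Every keyword below is documented in the parameter table; the concrete
# values (paths, character limit, token) are placeholders.
stage = VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    text_field="text",
    embedding_field="embeddings",
    max_chars=20_000,                     # truncate long documents before tokenizing
    cache_dir="/models/hf_cache",         # reuse downloaded weights across runs
    hf_token=os.environ.get("HF_TOKEN"),  # only needed for gated models
    verbose=True,                         # enable logging and progress bars
)
```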

### vLLM Engine Options

Pass additional vLLM configuration through `vllm_init_kwargs`:

```python
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,  # Disable CUDA graphs for debugging
        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)
```

Default vLLM settings applied by the stage (can be overridden):

- `enforce_eager=False` — Uses CUDA graphs for faster inference
- `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
- `model_impl="vllm"` — Uses vLLM's native model implementation
- `disable_log_stats=True` — Suppresses stats logging when `verbose=False`

### Pretokenization

When `pretokenize=True`, the stage:

1. Loads a Hugging Face `AutoTokenizer` for the specified model
2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`
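
The sketch below illustrates what those three steps amount to, written against `transformers` and vLLM directly rather than NeMo Curator internals; it assumes a recent vLLM release where `LLM.embed` and the `runner="pooling"` option (listed under the defaults above) are available:

```python
from transformers import AutoTokenizer
from vllm import LLM
from vllm.inputs import TokensPrompt

texts = ["first document", "second document"]

# 1. Load a Hugging Face tokenizer for the embedding model.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")

# 2. Tokenize on CPU, truncating to the engine's maximum length (512 here).
encoded = tokenizer(texts, truncation=True, max_length=512)

# 3. Pass token IDs straight to vLLM as TokensPrompt objects.
llm = LLM(model="intfloat/e5-large-v2", runner="pooling", max_model_len=512)
prompts = [TokensPrompt(prompt_token_ids=ids) for ids in encoded["input_ids"]]
outputs = llm.embed(prompts)
embeddings = [o.outputs.embedding for o in outputs]
```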

Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

```python
# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)
```

## Resources

The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.