diff --git a/fern/versions/v26.04.yml b/fern/versions/v26.04.yml
index eb4ea19bd3..372963dfe3 100644
--- a/fern/versions/v26.04.yml
+++ b/fern/versions/v26.04.yml
@@ -189,6 +189,15 @@ navigation:
       - page: Text Cleaning
         path: ./v26.04/pages/curate-text/process-data/content-processing/text-cleaning.mdx
         slug: text-cleaning
+      - section: Embeddings
+        slug: embeddings
+        contents:
+          - page: Overview
+            path: ./v26.04/pages/curate-text/process-data/embeddings/index.mdx
+            slug: ""
+          - page: vLLM Embedder
+            path: ./v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
+            slug: vllm-embedder
       - section: Deduplication
         slug: deduplication
         contents:
diff --git a/fern/versions/v26.04/pages/about/release-notes/index.mdx b/fern/versions/v26.04/pages/about/release-notes/index.mdx
index ae1a9ec90d..6c04504c8f 100644
--- a/fern/versions/v26.04/pages/about/release-notes/index.mdx
+++ b/fern/versions/v26.04/pages/about/release-notes/index.mdx
@@ -12,6 +12,16 @@ modality: "universal"
 
 ## What's New in 26.04
 
+### vLLM and Sentence Transformers Embedding Support (PR #1346)
+
+Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs:
+
+- **`VLLMEmbeddingModelStage`**: A new standalone embedding stage powered by [vLLM](https://docs.vllm.ai/) for high-throughput GPU-accelerated inference. Supports optional pretokenization (`pretokenize=True`), which can improve per-task throughput depending on the model. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
+- **`SentenceTransformerEmbeddingModelStage`**: A new embedding stage using the `sentence-transformers` library directly, providing native support for models from the Sentence Transformers ecosystem.
+- **`EmbeddingCreatorStage` enhancements**: Added a `use_sentence_transformer` flag (defaults to `True`) to select between Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes. Added a `cache_dir` parameter for controlling the model download location.
+
+For usage details, see [Text Embeddings](/curate-text/process-data/embeddings) and [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder).
+
 ### Inference Server (Ray Serve)
 
 Built-in LLM serving alongside curation pipelines using Ray Serve and vLLM:
@@ -105,6 +115,8 @@ Resolved four HIGH-severity vulnerabilities affecting Curator dependencies:
 
 - **Cosmos-Xenna**: Updated from 0.1.2 to 0.2.0 with simplified resource model
 - **Ray**: Updated to 2.54
+- **sentence-transformers**: Added to the `text_cpu` optional dependency group
+- **vllm**: Added a new `vllm` optional dependency group
 - **uv**: Added minimum required version (>=0.7.0) to prevent lockfile revision drift
 - **nemo-toolkit**: Bumped `nemo_toolkit[asr]` from `==2.4.0` to `>=2.7.2` to address deserialization CVEs. Only affects `audio_cpu` and `audio_cuda12` extras.
 - **xgrammar**: Moved from `constraint-dependencies` (`>=0.1.21`) to `override-dependencies` (`>=0.1.32`) to override vLLM's pinned version and address CVE-2026-25048.
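As a minimal usage sketch of the new `EmbeddingCreatorStage` options described in the release notes above (the model identifier and cache path are illustrative, not defaults):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# Illustrative values; use_sentence_transformer and cache_dir are the
# options added in this release.
stage = EmbeddingCreatorStage(
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    use_sentence_transformer=True,  # default; set False to use Hugging Face's AutoModel
    cache_dir="/models/hf_cache",   # controls where model weights are downloaded
)
```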
diff --git a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
index 95d9886282..3a5d60c65f 100644
--- a/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
+++ b/fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx
@@ -188,6 +188,30 @@ workflow = TextSemanticDeduplicationWorkflow(
 )
 ```
+**vLLM Embedder** (recommended for large models):
+
+For large embedding models (500M+ parameters), you can generate embeddings separately using `VLLMEmbeddingModelStage` before running the deduplication workflow, which provides better GPU utilization and throughput.
+
+Generate embeddings using the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) pipeline, then pass the output to `SemanticDeduplicationWorkflow`:
+
+```python
+from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
+
+# Step 1: Generate embeddings to embedding_output_path using VLLMEmbeddingModelStage
+# Step 2: Cluster the embeddings and identify which document IDs to remove
+semantic_workflow = SemanticDeduplicationWorkflow(
+    input_path=embedding_output_path,
+    output_path=output_path,
+    n_clusters=100,
+    eps=0.07,
+    id_field="_curator_dedup_id",
+    embedding_field="embeddings",
+)
+semantic_workflow.run()
+
+# Step 3: Filter the original text dataset using the IDs to remove
+# (see TextDuplicatesRemovalWorkflow for the removal step)
+```
 
 **When choosing a model**:
 
 - Use models that support vLLM pooling (embedding) mode
@@ -195,6 +219,7 @@ workflow = TextSemanticDeduplicationWorkflow(
 )
 ```
 - Prefer models trained for sentence embeddings (for example, EmbeddingGemma, E5, BGE, or SBERT)
 - Use `embedding_pretokenize=True` for models that benefit from explicit tokenization control
 - Pass additional vLLM configuration through `embedding_vllm_init_kwargs`
+- For more control over the embedding process, consider using [VLLMEmbeddingModelStage](/curate-text/process-data/embeddings/vllm-embedder) separately
diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx
new file mode 100644
index 0000000000..41341a8af2
--- /dev/null
+++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/index.mdx
@@ -0,0 +1,84 @@
+---
+description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
+categories: ["how-to-guides"]
+tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "intermediate"
+content_type: "how-to"
+modality: "text-only"
+---
+
+# Text Embeddings
+
+Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.
+
+## How It Works
+
+NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:
+
+1. **`EmbeddingCreatorStage`** — A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.
+2. **`VLLMEmbeddingModelStage`** — A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains.
+3. **`SentenceTransformerEmbeddingModelStage`** — A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`.
+
+## Choosing an Embedding Backend
+
+| Backend | Best For | GPU Utilization | Setup |
+| --- | --- | --- | --- |
+| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., `all-MiniLM-L6-v2`) | Good | Included in `text_cuda12` extra |
+| `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
+| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` |
+
+Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks.
+
+## Quick Start
+
+### EmbeddingCreatorStage
+
+```python
+from nemo_curator.backends.xenna import XennaExecutor
+from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.io.reader import ParquetReader
+from nemo_curator.stages.text.io.writer import ParquetWriter
+
+pipeline = Pipeline(
+    name="text_embeddings",
+    stages=[
+        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
+        EmbeddingCreatorStage(
+            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
+            text_field="text",
+            embedding_field="embeddings",
+            model_inference_batch_size=256,
+        ),
+        ParquetWriter(path="output/", fields=["text", "embeddings"]),
+    ],
+)
+
+executor = XennaExecutor()
+pipeline.run(executor)
+```
+
+### VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)
+
+`VLLMEmbeddingModelStage` is the default embedding backend for semantic deduplication, using `google/embeddinggemma-300m`. It provides better GPU utilization and throughput for large embedding models. See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples.
+
+---
+
+## Available Embedding Tools
+
+- [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder): Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models.
+
+---
+
+## Integration with Semantic Deduplication
+
+Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process, as sketched below.
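+As a minimal sketch of that hand-off, assuming embeddings have already been written to `embedding_output_path` by one of the stages above (parameter values mirror the semantic deduplication guide):
+
+```python
+from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
+
+# Cluster the precomputed embeddings and identify near-duplicate IDs
+semantic_workflow = SemanticDeduplicationWorkflow(
+    input_path=embedding_output_path,  # Parquet output of the embedding pipeline
+    output_path=output_path,
+    n_clusters=100,
+    eps=0.07,
+    id_field="_curator_dedup_id",
+    embedding_field="embeddings",
+)
+semantic_workflow.run()
+```
+
+See the [semantic deduplication guide](/curate-text/process-data/deduplication/semdedup) for the removal step that filters the original text dataset.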
diff --git a/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
new file mode 100644
index 0000000000..d2dfcf325c
--- /dev/null
+++ b/fern/versions/v26.04/pages/curate-text/process-data/embeddings/vllm-embedder.mdx
@@ -0,0 +1,129 @@
+---
+description: "Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models"
+categories: ["how-to-guides"]
+tags: ["embeddings", "vllm", "gpu-accelerated", "large-models"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "intermediate"
+content_type: "how-to"
+modality: "text-only"
+---
+
+# vLLM Embedder
+
+Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models, where vLLM's batching and GPU memory management offer significant performance advantages over Sentence Transformers.
+
+**Installation**: The vLLM embedder is included in the `text_cuda12` installation. Install it with:
+
+```bash
+uv pip install nemo_curator[text_cuda12]
+```
+
+vLLM is only available on x86_64 Linux systems.
+
+## How It Works
+
+`VLLMEmbeddingModelStage` handles both tokenization and embedding generation in a single stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.
+
+Key features:
+
+- **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, which can reduce GPU idle time and improve throughput
+- **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
+- **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
+- **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization
+
+## Quick Start
+
+```python
+from nemo_curator.backends.xenna import XennaExecutor
+from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.io.reader import ParquetReader
+from nemo_curator.stages.text.io.writer import ParquetWriter
+
+pipeline = Pipeline(
+    name="vllm_embeddings",
+    stages=[
+        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
+        VLLMEmbeddingModelStage(
+            model_identifier="google/embeddinggemma-300m",
+            text_field="text",
+            embedding_field="embeddings",
+        ),
+        ParquetWriter(path="output/", fields=["text", "embeddings"]),
+    ],
+)
+
+executor = XennaExecutor()
+pipeline.run(executor)
+```
+
+## Configuration
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `model_identifier` | `str` | Required | Hugging Face model name or path for the embedding model |
+| `vllm_init_kwargs` | `dict` | `None` | Additional keyword arguments passed to `vllm.LLM()` for engine configuration |
+| `text_field` | `str` | `"text"` | Name of the input text column in the data |
+| `pretokenize` | `bool` | `False` | Tokenize text on CPU before passing to vLLM; whether this improves throughput is model-dependent |
+| `embedding_field` | `str` | `"embeddings"` | Name of the output embedding column |
+| `max_chars` | `int` | `None` | Maximum characters per document (truncates before tokenization) |
+| `cache_dir` | `str` | `None` | Directory for caching downloaded model files |
+| `hf_token` | `str` | `None` | Hugging Face token for accessing gated models |
+| `verbose` | `bool` | `False` | Enable verbose logging and progress bars |
+
+### vLLM Engine Options
+
+Pass additional vLLM configuration through `vllm_init_kwargs`:
+
+```python
+VLLMEmbeddingModelStage(
+    model_identifier="google/embeddinggemma-300m",
+    pretokenize=True,
+    vllm_init_kwargs={
+        "enforce_eager": True,  # Disable CUDA graphs for debugging
+        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
+        "gpu_memory_utilization": 0.9,
+        "max_model_len": 512,
+    },
+)
+```
+
+Default vLLM settings applied by the stage (can be overridden):
+
+- `enforce_eager=False` — Uses CUDA graphs for faster inference
+- `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
+- `model_impl="vllm"` — Uses vLLM's native model implementation
+- `disable_log_stats=True` — Suppresses stats logging when `verbose=False`
+
+### Pretokenization
+
+When `pretokenize=True`, the stage:
+
+1. Loads a Hugging Face `AutoTokenizer` for the specified model
+2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
+3. Passes token IDs directly to vLLM using `TokensPrompt`
+
+Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is both the default and the recommended setting. For other models, benchmarks show that pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.
+
+```python
+# Direct text mode (recommended for google/embeddinggemma-300m)
+VLLMEmbeddingModelStage(
+    model_identifier="google/embeddinggemma-300m",
+    pretokenize=False,  # vLLM handles tokenization internally
+)
+
+# Pretokenize mode (can improve throughput for other models)
+VLLMEmbeddingModelStage(
+    model_identifier="intfloat/e5-large-v2",
+    pretokenize=True,  # Tokenize on CPU, embed on GPU
+)
+```
+
+## Resources
+
+The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`, as sketched below.
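+As a minimal multi-GPU sketch combining the sharding, caching, and gated-model options from the sections above (the model choice, paths, and values are illustrative):
+
+```python
+import os
+
+from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
+
+# Illustrative configuration: shard the model across 2 GPUs, cache weights
+# locally, and pass a Hugging Face token in case the model is gated.
+stage = VLLMEmbeddingModelStage(
+    model_identifier="google/embeddinggemma-300m",
+    cache_dir="/models/hf_cache",
+    hf_token=os.environ.get("HF_TOKEN"),
+    vllm_init_kwargs={
+        "tensor_parallel_size": 2,
+        "gpu_memory_utilization": 0.9,
+    },
+)
+```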