Add vLLM and Sentence Transformers support for embedding generation #1346
Conversation
Greptile Summary

This PR adds vLLM and Sentence Transformers support for embedding generation, providing performance improvements for large models based on experimental benchmarks with 5 GB of Common Crawl data.

Key Changes:
Performance Findings:

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant VLLMStage as VLLMEmbeddingModelStage
participant vLLM
participant GPU
User->>VLLMStage: initialize(model_identifier, pretokenize)
User->>VLLMStage: setup_on_node()
VLLMStage->>vLLM: Download model files
VLLMStage->>vLLM: Initialize LLM instance
vLLM->>GPU: Load model weights
GPU-->>VLLMStage: Model ready on GPU
User->>VLLMStage: setup()
alt pretokenize enabled
VLLMStage->>VLLMStage: Load separate tokenizer
end
User->>VLLMStage: process(batch)
Note over VLLMStage: Extract text from batch<br/>Apply character limit
alt pretokenize enabled
VLLMStage->>VLLMStage: Tokenize text batch<br/>with truncation
VLLMStage->>vLLM: embed(preprocessed_ids)
else pretokenize disabled
VLLMStage->>vLLM: embed(raw_text)
end
vLLM->>GPU: Compute embeddings
GPU-->>vLLM: Return embedding vectors
vLLM-->>VLLMStage: EmbeddingRequestOutput
VLLMStage-->>User: DocumentBatch with embeddings
```
```python
import torch
import torch.nn.functional as F  # noqa: N812
from loguru import logger
from sentence_transformers import SentenceTransformer
```
[P0] This unconditional import will cause ImportError for users who install without the text_cuda12 extra. The import should be conditional or guarded since sentence-transformers is in optional dependencies. Consider importing inside the class methods or using a try/except block.
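A minimal sketch of one possible guard (deferring the failure until the class is used; the sentinel pattern is illustrative):

```python
# Guarded import: sentence-transformers is an optional dependency, so the
# failure is deferred until the SentenceTransformer stage is actually used.
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None  # checked in __init__ before use
```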
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
[P2] The tokenizer loading should respect the cache directory and authentication settings passed to the stage constructor. Currently only model_identifier is used, which may cause inconsistent behavior with model downloads.
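A hedged sketch of the suggested fix, assuming the stage stores `cache_dir` and `hf_token` from its constructor:

```python
from transformers import AutoTokenizer

# Resolve the tokenizer against the same cache and credentials used for the
# model download, so both point at the files fetched in setup_on_node().
self.tokenizer = AutoTokenizer.from_pretrained(
    self.model_identifier,
    cache_dir=self.cache_dir,
    token=self.hf_token,
)
```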
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
the code accesses self.model.model_config.max_model_len without checking if self.model is initialized. if process() is called before setup(), this will raise an AttributeError: 'NoneType' object has no attribute 'model_config'
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     msg = "Model is not initialized. Please call setup() before processing."
+     raise ValueError(msg)
+ max_model_len = self.model.model_config.max_model_len
```
```python
if not self.verbose and "disable_log_stats" not in vllm_init_kwargs:
    vllm_init_kwargs["disable_log_stats"] = True

self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)
```
the cache_dir parameter from __init__ is not passed to the LLM initialization. vLLM's LLM class accepts a download_dir parameter that should be used here. currently, vLLM will use its default cache location which may differ from where snapshot_download cached the model in setup_on_node(), potentially causing duplicate downloads or cache misses
```diff
- self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)
+ self.model = LLM(model=self.model_identifier, download_dir=self.cache_dir, **vllm_init_kwargs)
```
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
```
the SentenceTransformer initialization doesn't support a custom cache directory. The SentenceTransformer class accepts a cache_folder parameter, but this stage doesn't accept or use it. Consider adding cache_dir parameter support for consistency with other model stages:
```diff
- self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
+ self.model = SentenceTransformer(
+     self.model_identifier,
+     cache_folder=self.cache_dir if hasattr(self, 'cache_dir') and self.cache_dir else None,
+     local_files_only=True
+ )
```
and update `__init__` to accept and store a `cache_dir` parameter.
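One possible shape for that change (a sketch; the exact parent signature is assumed):

```python
def __init__(self, model_identifier: str, cache_dir: str | None = None, **kwargs) -> None:
    super().__init__(model_identifier=model_identifier, **kwargs)
    # Stored so setup() can forward it as SentenceTransformer(cache_folder=...)
    self.cache_dir = cache_dir
```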
Additional Comments (1)
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
)
# Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
self.unpack_inference_batch = False
self.embedding_field = embedding_field
```
the parent EmbeddingModelStage.__init__ sets self.embedding_field using the default value, and then line 124 sets it again with the same value from the parameter. This is redundant. Additionally, the parent sets self.pooling = "mean_pooling" (the default), which is never used by SentenceTransformer, creating a misleading attribute.
While this doesn't cause runtime errors, it's confusing for maintainability. Consider either:
- passing `embedding_field` and `pooling` to the parent constructor explicitly (see the sketch below), or
- not setting `self.embedding_field` again after the parent call
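The first option might look like this (a sketch; whether the parent accepts these keywords is an assumption):

```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
    embedding_field=embedding_field,  # set once, in the parent
    pooling=None,  # explicit: SentenceTransformer handles pooling internally
)
```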
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer loading doesn't pass cache_dir, which means it will look in the default HuggingFace cache directory instead of using self.cache_dir specified in the constructor. This will fail when a custom cache directory is used, since setup_on_node downloads to self.cache_dir but the tokenizer loads from the default location.
```diff
- self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier, cache_dir=self.cache_dir)
```
```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
)
# Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
self.unpack_inference_batch = False
self.embedding_field = embedding_field
```
the initialization pattern here is fragile. super().__init__() is called with parameters that set self.unpack_inference_batch = True (via parent's init at line 53), then line 123 immediately overrides it to False. while this works, it's error-prone because:
- the parent class initialization does unnecessary work that's immediately overridden
- if the parent's `__init__` implementation changes, this could break
- it's not immediately clear to future maintainers why this override is needed

consider either:
- not passing the parameter to the parent if it will be overridden
- or better yet, modifying the parent class to accept this as a parameter so the override isn't needed (see the sketch below)
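A sketch of the second option, assuming the parent can grow an `unpack_inference_batch` keyword:

```python
class EmbeddingModelStage(ModelStage):
    def __init__(self, model_identifier: str, unpack_inference_batch: bool = True, **kwargs) -> None:
        super().__init__(model_identifier=model_identifier, **kwargs)
        self.unpack_inference_batch = unpack_inference_batch


class SentenceTransformerEmbeddingModelStage(EmbeddingModelStage):
    def __init__(self, model_identifier: str, **kwargs) -> None:
        # SentenceTransformer expects a dictionary input, so never unpack.
        super().__init__(model_identifier=model_identifier, unpack_inference_batch=False, **kwargs)
```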
```python
if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
when use_sentence_transformer=True, the code logs a warning that embedding_pooling is ignored, but there's no validation to check if the user explicitly set a non-default pooling value. if a user explicitly sets embedding_pooling="last_token" with use_sentence_transformer=True, they'll only get a warning, which they might miss. consider validating that embedding_pooling has its default value when use_sentence_transformer=True, or raise an error if it's been explicitly changed.
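A sketch of such a validation (the "mean_pooling" default is taken from the parent class as described above):

```python
if use_sentence_transformer and embedding_pooling != "mean_pooling":
    msg = (
        "embedding_pooling is ignored when use_sentence_transformer=True; "
        "remove the explicit pooling setting or set use_sentence_transformer=False."
    )
    raise ValueError(msg)
```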
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer is loaded without the cache directory, authentication parameter, or local-only flag that are used elsewhere in the codebase (see line 95 for snapshot_download). if the tokenizer isn't already cached or requires authentication, this will fail or make unexpected network calls during processing. the tokenizer loading should include cache_dir=self.cache_dir and local_files_only=True parameters for consistency.
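Something along these lines would mirror the snapshot_download pattern (a sketch):

```python
# Match setup_on_node(): read from the stage's cache and avoid network calls.
self.tokenizer = AutoTokenizer.from_pretrained(
    self.model_identifier,
    cache_dir=self.cache_dir,
    local_files_only=True,
)
```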
```python
t0 = time.perf_counter()
vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)
metrics["vllm_embedding_time"] = time.perf_counter() - t0
```
the self.model.embed() call lacks error handling. if the embedding operation fails (GPU issues, invalid input, model errors), it will crash the entire processing pipeline. wrap this in a try-except block to provide better error messages and allow for graceful failure recovery.
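A minimal sketch of that error handling (the message format is illustrative):

```python
t0 = time.perf_counter()
try:
    vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)
except Exception as e:
    # Report which batch failed rather than letting the pipeline crash opaquely.
    msg = f"vLLM embedding failed for a batch of {len(input_data)} inputs"
    raise RuntimeError(msg) from e
metrics["vllm_embedding_time"] = time.perf_counter() - t0
```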
Signed-off-by: Praateek <praateekm@gmail.com>
```toml
[project.optional-dependencies]
cuda12 = ["gpustat", "nvidia-ml-py"]
vllm = ["vllm>=0.13; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
```
The vLLM version constraint change from ==0.11.1 (in video_cuda12) to >=0.13 may cause compatibility issues for video curation features that depend on vLLM 0.11.1 specifically. The video_cuda12 dependency previously pinned vLLM to 0.11.1, but now it will use 0.13+.
Before merging, please verify that:
- Video curation features work correctly with vLLM 0.13+
- There are no breaking API changes between vLLM 0.11.1 and 0.13 that affect video processing

If video curation requires vLLM 0.11.1 specifically, consider creating a separate vLLM version constraint for video vs. text embedding use cases (see the sketch below).
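If the pins do need to diverge, the split could look roughly like this (the extra names are hypothetical):

```toml
[project.optional-dependencies]
# Hypothetical: keep video pinned while letting text embedding track newer vLLM.
vllm_text = ["vllm>=0.13; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
vllm_video = ["vllm==0.11.1; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
```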
Additional Comments (2)
Currently, when … This is a pre-existing issue in EmbeddingModelStage, but SentenceTransformerEmbeddingModelStage inherits it. Consider refactoring to implement …
Signed-off-by: Praateek <praateekm@gmail.com>
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer is loaded without proper parameters for distributed execution. it's missing local_files_only=True which could cause failures if model files aren't already cached locally. it should match the pattern used in setup_on_node() at line 95 where snapshot_download is called with local_files_only=False to download, and then this setup should use local_files_only=True.
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
    self.model.eval().to("cuda")
```
SentenceTransformerEmbeddingModelStage has the same issue as EmbeddingModelStage - it overrides setup() but doesn't implement the _setup(local_files_only) method expected by the parent ModelStage class. this will cause issues with model downloading in distributed settings where setup_on_node() is called first to download models to each node.
the parent class ModelStage's setup_on_node() method (model.py lines 85-102) expects subclasses to implement _setup so it can be called with local_files_only=False during download and local_files_only=True during worker setup.
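A sketch of the expected contract, following the `_setup(local_files_only)` pattern the comment describes:

```python
def _setup(self, local_files_only: bool = True) -> None:
    # Called by setup_on_node() with local_files_only=False (download),
    # and by setup() with local_files_only=True (load from cache).
    self.model = SentenceTransformer(self.model_identifier, local_files_only=local_files_only)
    self.model.eval().to("cuda")

def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self._setup(local_files_only=True)
```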
```python
model_class = SentenceTransformerEmbeddingModelStage if self.use_sentence_transformer else EmbeddingModelStage

if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
the warning message says "ignoring embedding_pooling" but should be clearer about the expected behavior. consider rephrasing to: "SentenceTransformer uses its own internal pooling configuration; the embedding_pooling parameter will be ignored"
Additional Comments (3)
- … see how other ModelStage subclasses implement this pattern (e.g., nemo_curator/stages/text/classifiers/base.py line 115, nemo_curator/stages/text/models/tokenizer.py line 114).
- … while this is unlikely in practice (there should always be at least one real token), consider adding a check or documenting this assumption.
- … without proper cleanup, GPU memory may not be released between processing stages, potentially causing out-of-memory errors in pipelines with multiple stages (a teardown sketch follows).
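A common cleanup pattern for the last point looks like this (whether the stage interface defines `teardown` is an assumption):

```python
import torch

def teardown(self) -> None:
    # Drop the model reference and return cached blocks to the CUDA allocator
    # so the next stage in the pipeline starts with free GPU memory.
    self.model = None
    torch.cuda.empty_cache()
```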
Signed-off-by: Praateek <praateekm@gmail.com>
| metrics["tokenization_time"] = time.perf_counter() - t0 | ||
|
|
||
| t0 = time.perf_counter() | ||
| vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose) |
[P1] When self.model is None at line 129, calling self.model.embed() will cause AttributeError. Add model initialization check before this line to fail fast with a clear error message.
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
[P1] Accessing model_config when model is None will cause AttributeError. This can happen if process() is called before setup() completes. Add check before line 123 to ensure model is initialized.
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
    self.model.eval().to("cuda")
```
[P2] SentenceTransformerEmbeddingModelStage.setup() also overrides parent without implementing _setup() method. Same distributed setup issue as EmbeddingModelStage - model won't load when setup_on_node() is called in distributed environments.
Additional Comments (1)
Signed-off-by: Praateek <praateekm@gmail.com>
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
Accessing model.model_config without checking if model is None. When pretokenize is True and process is called before setup, this line will crash with AttributeError. Should validate self.model is initialized like the tokenizer check above.
VibhuJawa left a comment
We discussed offline. The main concerns will be addressed by:
- A fast follow to make vLLM the default for text curation use cases.
- Aligning on an abstract model class to inherit from and verifying the class signatures being used.
Signed-off-by: Praateek <praateekm@gmail.com>
```diff
  def setup(self, _: WorkerMetadata | None = None) -> None:
      """Load the model for inference."""
-     self.model = AutoModel.from_pretrained(self.model_identifier, local_files_only=True)
+     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
      self.model.eval().to("cuda")
```
EmbeddingModelStage overrides setup() but doesn't implement _setup(local_files_only) expected by parent ModelStage. The parent's setup_on_node() (lines 85-102 in model.py) downloads the model then calls _setup() to load it. Without _setup(), distributed model downloading won't work - model downloads but never loads.
see how other ModelStage subclasses implement this (e.g., DebertaClassificationStage in classifiers/base.py line 115, TokenizerStage in models/tokenizer.py line 114)
```diff
- def setup(self, _: WorkerMetadata | None = None) -> None:
-     """Load the model for inference."""
-     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
-     self.model.eval().to("cuda")
+ def _setup(self, local_files_only: bool = True) -> None:
+     """Load the model for inference."""
+     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=local_files_only)
+     self.model.eval().to("cuda")
+
+ def setup(self, _: WorkerMetadata | None = None) -> None:
+     """Load the model for inference."""
+     self._setup(local_files_only=True)
```
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
accessing self.model.model_config before checking if self.model is None could cause AttributeError if setup() wasn't called. add a check:
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     msg = "Model is not initialized. Please call setup() before processing."
+     raise ValueError(msg)
+ max_model_len = self.model.model_config.max_model_len
```
```python
if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
the warning message "ignoring embedding_pooling" reads awkwardly; it should be "ignoring the embedding_pooling parameter"
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(
        self.model_identifier,
        cache_folder=self.cache_dir,
        use_auth_token=self.hf_token,
        local_files_only=True,
    )
    self.model.eval().to("cuda")
```
SentenceTransformerEmbeddingModelStage has the same issue - overrides setup() but doesn't implement _setup(local_files_only). distributed model downloading via setup_on_node() won't work correctly.
```diff
- def setup(self, _: WorkerMetadata | None = None) -> None:
-     """Load the model for inference."""
-     self.model = SentenceTransformer(
-         self.model_identifier,
-         cache_folder=self.cache_dir,
-         use_auth_token=self.hf_token,
-         local_files_only=True,
-     )
-     self.model.eval().to("cuda")
+ def _setup(self, local_files_only: bool = True) -> None:
+     """Load the model for inference."""
+     self.model = SentenceTransformer(
+         self.model_identifier,
+         cache_folder=self.cache_dir,
+         local_files_only=local_files_only,
+     )
+     self.model.eval().to("cuda")
+
+ def setup(self, _: WorkerMetadata | None = None) -> None:
+     """Load the model for inference."""
+     self._setup(local_files_only=True)
```
Additional Comments (2)
while unlikely in practice, add a safety check:
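The collapsed suggestion isn't shown, but given the earlier note about "at least one real token", the check presumably guards the mean-pooling denominator; a guess at its shape (variable names are assumptions):

```python
# Avoid dividing by zero if a sequence somehow contains only padding tokens.
token_counts = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)
mean_embeddings = masked_hidden_states.sum(dim=1) / token_counts
```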
Signed-off-by: Praateek <praateekm@gmail.com>
…/NeMo-Curator into praateek/vllm-embeddings
Signed-off-by: Praateek <praateekm@gmail.com>
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
logic: validate model initialization before accessing config property, similar to tokenizer validation at line 123
Signed-off-by: Praateek <praateekm@gmail.com>
```python
import torch
import torch.nn.functional as F  # noqa: N812
from loguru import logger
from sentence_transformers import SentenceTransformer
```
logic: unconditional import breaks users without text_cpu/text_cuda12 extras. since __init__.py imports EmbeddingCreatorStage from this file, importing the embedders package will fail with ImportError: No module named 'sentence_transformers'
wrap in try/except or use conditional import:
```diff
- from sentence_transformers import SentenceTransformer
+ try:
+     from sentence_transformers import SentenceTransformer
+ except ImportError:
+     SentenceTransformer = None
```
then add validation in SentenceTransformerEmbeddingModelStage.__init__:
```python
if SentenceTransformer is None:
    raise ImportError("sentence-transformers required. Install with: pip install nemo-curator[text_cpu]")
```

```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
logic: accessing model.model_config without checking if model is initialized. if process() called before setup(), raises AttributeError: 'NoneType' object has no attribute 'model_config'
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     raise ValueError("Model is not initialized. Please call setup() before processing.")
+ max_model_len = self.model.model_config.max_model_len
```
Description
Tried out a few models:

- `sentence_transformer`: Current implementation of TokenizerStage + ModelStage
- `vllm_text`: vLLM with text as input - single stage of vLLM only
- `vllm_text_4cpus`: Same as above, except we dedicate 4 CPUs to the stage
- `vllm_tokens`: vLLM with tokens as input, tokenization done by TokenizerStage
- `vllm_text_pretokenized`: vLLM with text as input, tokenization done within the stage

The experiment ran on 5 GB of Common Crawl data, and the findings were:

- … `vllm_text_pretokenized`. This suggests that, given a large number of tasks and an amortized startup cost, we should see `vllm_text_pretokenized` come out ahead.
- `vllm_text` is slowest, likely due to tokenization happening sentence by sentence, leading to higher GPU idle time and not justifying the small model runtime on GPU. Increasing CPU allocation for the stage doesn't improve runtimes.
- `vllm_text_pretokenized`; …

Usage
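A hypothetical snippet for the usage placeholder below (the class name, import path, and parameters are assumptions based on the stages discussed in this PR):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# Roughly the "vllm_text_pretokenized" configuration from the benchmarks:
# a single vLLM stage that tokenizes text internally before embedding.
stage = EmbeddingCreatorStage(
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",  # hypothetical model
    use_sentence_transformer=False,  # use the vLLM backend
    pretokenize=True,  # tokenize within the stage
)
```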
```python
# Add snippet demonstrating usage
```

Checklist