
Add vLLM and Sentence Transformers support for embedding generation#1346

Merged
praateekmahajan merged 22 commits into NVIDIA-NeMo:main from praateekmahajan:praateek/vllm-embeddings
Jan 14, 2026

Conversation

@praateekmahajan
Contributor

@praateekmahajan praateekmahajan commented Dec 31, 2025

Description

Tried out a few models:

  1. sentence_transformer: the current implementation (TokenizerStage + ModelStage)
  2. vllm_text: vLLM with text as input; a single vLLM stage only
    1. vllm_text_4cpus: same as above, except we dedicate 4 CPUs to the stage
  3. vllm_tokens: vLLM with tokens as input; tokenization done by TokenizerStage
  4. vllm_text_pretokenized: vLLM with text as input; tokenization done within the stage

The experiment ran on 5 GB of Common Crawl data, and the findings were:

  1. Results depend on model size:
    1. Larger model (EmbeddingGemma): vLLM wins.
    2. Smaller model (MiniLM): the current Sentence Transformers implementation wins.
  2. Measuring average time per task/partition (combining tokenization and modeling), the fastest variant irrespective of model size is vllm_text_pretokenized. This suggests that with a large number of tasks, once startup cost is amortized, vllm_text_pretokenized should come out ahead.
  3. For the large model:
    1. Among the vLLM variations, total time taken is similar, but that includes startup time, so we should measure average time spent per task instead.
  4. For the small model:
    1. Among the vLLM variations, vllm_text is the slowest, likely because tokenization happens sentence by sentence, leading to higher GPU idle time that the small model's short GPU runtime cannot justify. Increasing the CPU allocation for the stage doesn't improve runtimes.
  5. vllm_tokens is also fast for both small and large models, but compared to vllm_text_pretokenized it adds the following costs for negligible benefit:
    1. the maintenance overhead of a separate tokenization stage
    2. the serialization overhead of transferring data from one actor to another
[Benchmark timing charts attached as images]

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <praateekm@gmail.com>
@greptile-apps

greptile-apps Bot commented Dec 31, 2025

Greptile Summary

This PR adds vLLM and Sentence Transformers support for embedding generation, providing performance improvements for large models based on experimental benchmarks with 5GB of Common Crawl data.

Key Changes:

  • New VLLMEmbeddingModelStage for vLLM-based embedding generation with optional pretokenization
  • New SentenceTransformerEmbeddingModelStage for sentence-transformers library integration
  • Updated EmbeddingCreatorStage with use_sentence_transformer flag to choose between implementations
  • Added cache_dir parameter support across embedding stages
  • Updated dependencies: vLLM version constraint changed to >=0.13, added sentence-transformers, added scikit-learn version constraint
  • Comprehensive test coverage validating embeddings against reference implementations

Performance Findings:
According to the PR description, the vLLM text-based approach without separate pretokenization shows the fastest average time per task when amortized across many tasks, making it recommended for large-scale embedding generation.

Confidence Score: 4/5

  • This PR is safe to merge with minor fixes needed for the identified import and initialization issues
  • The implementation is well-tested with comprehensive test coverage validating against reference implementations. The code follows existing patterns in the codebase. Two critical issues need addressing: (1) unconditional sentence_transformers import will break installations without text extras, and (2) missing model initialization check before accessing model properties. These are straightforward fixes that don't affect the core logic.
  • Pay close attention to nemo_curator/stages/text/embedders/base.py (import issue) and nemo_curator/stages/text/embedders/vllm.py (initialization check)

Important Files Changed

Filename Overview
nemo_curator/stages/text/embedders/vllm.py New vLLM embedding stage implementation with pretokenization support, properly handles cache directory configuration
nemo_curator/stages/text/embedders/base.py Added SentenceTransformer support and cache_dir parameter to embedding stages
pyproject.toml Added vLLM and sentence-transformers dependencies with proper versioning constraints

Sequence Diagram

sequenceDiagram
    participant User
    participant VLLMStage as VLLMEmbeddingModelStage
    participant vLLM
    participant GPU

    User->>VLLMStage: initialize(model_identifier, pretokenize)
    
    User->>VLLMStage: setup_on_node()
    VLLMStage->>vLLM: Download model files
    VLLMStage->>vLLM: Initialize LLM instance
    vLLM->>GPU: Load model weights
    GPU-->>VLLMStage: Model ready on GPU
    
    User->>VLLMStage: setup()
    alt pretokenize enabled
        VLLMStage->>VLLMStage: Load separate tokenizer
    end
    
    User->>VLLMStage: process(batch)
    Note over VLLMStage: Extract text from batch<br/>Apply character limit
    
    alt pretokenize enabled
        VLLMStage->>VLLMStage: Tokenize text batch<br/>with truncation
        VLLMStage->>vLLM: embed(preprocessed_ids)
    else pretokenize disabled
        VLLMStage->>vLLM: embed(raw_text)
    end
    
    vLLM->>GPU: Compute embeddings
    GPU-->>vLLM: Return embedding vectors
    vLLM-->>VLLMStage: EmbeddingRequestOutput
    VLLMStage-->>User: DocumentBatch with embeddings


@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 2 comments


    import torch
    import torch.nn.functional as F  # noqa: N812
    from loguru import logger
    from sentence_transformers import SentenceTransformer

[P0] This unconditional import will cause ImportError for users who install without the text_cuda12 extra. The import should be conditional or guarded since sentence-transformers is in optional dependencies. Consider importing inside the class methods or using a try/except block.
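The guard the comment suggests can be sketched as follows (the error message and helper name are hypothetical, not taken from the PR):

```python
# Guarded optional import: defer the failure from module import time to the
# point where the optional feature is actually used.
try:
    from sentence_transformers import SentenceTransformer
except ImportError:  # optional dependency not installed
    SentenceTransformer = None


def make_sentence_transformer(model_identifier: str):
    """Instantiate the model, failing with a clear message if the extra is missing."""
    if SentenceTransformer is None:
        msg = "sentence-transformers is not installed; install the text extras to use this stage"
        raise ImportError(msg)
    return SentenceTransformer(model_identifier)
```

With this guard, importing the module succeeds in a minimal install, and only constructing the stage raises an actionable error.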

    if self.pretokenize:
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)

[P2] The tokenizer loading should respect the cache directory and authentication settings passed to the stage constructor. Currently only model_identifier is used, which may cause inconsistent behavior with model downloads.


@greptile-apps greptile-apps Bot left a comment


7 files reviewed, 7 comments


    raise ValueError(msg)

    t0 = time.perf_counter()
    max_model_len = self.model.model_config.max_model_len

the code accesses self.model.model_config.max_model_len without checking if self.model is initialized. if process() is called before setup(), this will raise an AttributeError: 'NoneType' object has no attribute 'model_config'

Suggested change

    - max_model_len = self.model.model_config.max_model_len
    + if self.model is None:
    +     msg = "Model is not initialized. Please call setup() before processing."
    +     raise ValueError(msg)
    + max_model_len = self.model.model_config.max_model_len

    if not self.verbose and "disable_log_stats" not in vllm_init_kwargs:
        vllm_init_kwargs["disable_log_stats"] = True

    self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)

the cache_dir parameter from __init__ is not passed to the LLM initialization. vLLM's LLM class accepts a download_dir parameter that should be used here. currently, vLLM will use its default cache location which may differ from where snapshot_download cached the model in setup_on_node(), potentially causing duplicate downloads or cache misses

Suggested change

    - self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)
    + self.model = LLM(model=self.model_identifier, download_dir=self.cache_dir, **vllm_init_kwargs)

Comment thread nemo_curator/stages/text/embedders/vllm.py
Comment thread nemo_curator/stages/text/embedders/base.py
Comment thread nemo_curator/stages/text/embedders/vllm.py

    def setup(self, _: WorkerMetadata | None = None) -> None:
        """Load the model for inference."""
        self.model = SentenceTransformer(self.model_identifier, local_files_only=True)

the SentenceTransformer initialization doesn't support custom cache directory. the SentenceTransformer class accepts a cache_folder parameter, but this stage doesn't accept or use it. consider adding cache_dir parameter support for consistency with other model stages:

Suggested change

    - self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
    + self.model = SentenceTransformer(
    +     self.model_identifier,
    +     cache_folder=self.cache_dir if hasattr(self, 'cache_dir') and self.cache_dir else None,
    +     local_files_only=True,
    + )

and update __init__ to accept and store cache_dir parameter

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@greptile-apps

greptile-apps Bot commented Jan 12, 2026

Additional Comments (1)

nemo_curator/stages/text/embedders/__init__.py
the new VLLMEmbeddingModelStage class is not exported from the module. users need to import it directly from .vllm instead of from the main embedders module. consider adding it to __all__ for consistency:

    from .base import EmbeddingCreatorStage
    from .vllm import VLLMEmbeddingModelStage

    __all__ = ["EmbeddingCreatorStage", "VLLMEmbeddingModelStage"]


Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 2 comments


Comment on lines +114 to +124

    super().__init__(
        model_identifier=model_identifier,
        hf_token=hf_token,
        model_inference_batch_size=model_inference_batch_size,
        has_seq_order=has_seq_order,
        padding_side=padding_side,
        autocast=autocast,
    )
    # Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
    self.unpack_inference_batch = False
    self.embedding_field = embedding_field

the parent EmbeddingModelStage.__init__ sets self.embedding_field using the default value, and then line 124 sets it again with the same value from the parameter. This is redundant. Additionally, the parent sets self.pooling = "mean_pooling" (the default) which is never used by SentenceTransformer, creating a misleading attribute.

While this doesn't cause runtime errors, it's confusing for maintainability. Consider either:

  1. Passing embedding_field and pooling to the parent constructor explicitly, or
  2. Not setting self.embedding_field again after the parent call


    if self.pretokenize:
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)

the tokenizer loading doesn't pass cache_dir, which means it will look in the default HuggingFace cache directory instead of using self.cache_dir specified in the constructor. This will fail when a custom cache directory is used, since setup_on_node downloads to self.cache_dir but the tokenizer loads from the default location.

Suggested change

    - self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
    + self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier, cache_dir=self.cache_dir)


@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 4 comments


Comment on lines +114 to +124

    super().__init__(
        model_identifier=model_identifier,
        hf_token=hf_token,
        model_inference_batch_size=model_inference_batch_size,
        has_seq_order=has_seq_order,
        padding_side=padding_side,
        autocast=autocast,
    )
    # Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
    self.unpack_inference_batch = False
    self.embedding_field = embedding_field

the initialization pattern here is fragile. super().__init__() is called with parameters that set self.unpack_inference_batch = True (via parent's init at line 53), then line 123 immediately overrides it to False. while this works, it's error-prone because:

  1. the parent class initialization does unnecessary work that's immediately overridden
  2. if the parent's __init__ implementation changes, this could break
  3. it's not immediately clear to future maintainers why this override is needed

consider either:

  • not passing the parameter to parent if it will be overridden
  • or better yet, modify the parent class to accept this as a parameter so the override isn't needed


Comment on lines +162 to +163

    if self.use_sentence_transformer:
        logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")

when use_sentence_transformer=True, the code logs a warning that embedding_pooling is ignored, but there's no validation to check if the user explicitly set a non-default pooling value. if a user explicitly sets embedding_pooling="last_token" with use_sentence_transformer=True, they'll only get a warning, which they might miss. consider validating that embedding_pooling has its default value when use_sentence_transformer=True, or raise an error if it's been explicitly changed.
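The stricter validation the comment proposes could look like this (a sketch only; the helper name, default value, and message are hypothetical, not code from the PR):

```python
_POOLING_DEFAULT = "mean_pooling"  # assumed default; check the stage's signature


def validate_pooling(embedding_pooling: str, use_sentence_transformer: bool) -> None:
    """Raise instead of warn when an explicitly set pooling would be silently ignored."""
    if use_sentence_transformer and embedding_pooling != _POOLING_DEFAULT:
        msg = (
            f"embedding_pooling={embedding_pooling!r} has no effect when "
            "use_sentence_transformer=True; SentenceTransformer applies its own pooling"
        )
        raise ValueError(msg)
```

Raising here turns a warning a user might miss into an immediate, explicit configuration error.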

    if self.pretokenize:
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)

the tokenizer is loaded without the cache directory, authentication parameter, or local-only flag that are used elsewhere in the codebase (see line 95 for snapshot_download). if the tokenizer isn't already cached or requires authentication, this will fail or make unexpected network calls during processing. the tokenizer loading should include cache_dir=self.cache_dir and local_files_only=True parameters for consistency.

Comment on lines +128 to +130

    t0 = time.perf_counter()
    vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)
    metrics["vllm_embedding_time"] = time.perf_counter() - t0

the self.model.embed() call lacks error handling. if the embedding operation fails (GPU issues, invalid input, model errors), it will crash the entire processing pipeline. wrap this in a try-except block to provide better error messages and allow for graceful failure recovery.

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 3 comments


Comment thread pyproject.toml
Comment on lines 70 to +73

[project.optional-dependencies]
cuda12 = ["gpustat", "nvidia-ml-py"]
vllm = ["vllm>=0.13; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]

The vLLM version constraint change from ==0.11.1 (in video_cuda12) to >=0.13 may cause compatibility issues for video curation features that depend on vLLM 0.11.1 specifically. The video_cuda12 dependency previously pinned vLLM to 0.11.1, but now it will use 0.13+.

Before merging, please verify that:

  1. Video curation features work correctly with vLLM 0.13+
  2. There are no breaking API changes between vLLM 0.11.1 and 0.13 that affect video processing

If video curation requires vLLM 0.11.1 specifically, consider creating a separate vLLM version constraint for video vs text embedding use cases.

@greptile-apps

greptile-apps Bot commented Jan 12, 2026

Additional Comments (2)

nemo_curator/stages/text/embedders/vllm.py
VLLMEmbeddingModelStage is missing a teardown method to properly clean up GPU resources after processing. The parent class ModelStage implements teardown with gc.collect() and torch.cuda.empty_cache(), and this stage should do the same since it manages GPU-resident vLLM models.

    def teardown(self) -> None:
        """Clean up GPU resources."""
        if self.model is not None:
            del self.model
            self.model = None
        import gc
        gc.collect()
        import torch
        torch.cuda.empty_cache()

nemo_curator/stages/text/embedders/base.py
EmbeddingModelStage and SentenceTransformerEmbeddingModelStage override setup() but don't properly implement model downloading in setup_on_node(). The parent ModelStage expects subclasses to implement a _setup(local_files_only) method that gets called by both setup_on_node() (with local_files_only=False for downloading) and setup() (with local_files_only=True for loading).

Currently, when setup_on_node() is called, it inherits the parent's implementation which tries to call _setup(), but since these classes don't implement _setup(), it only logs a warning and doesn't download the model. Then when setup() is called with local_files_only=True, it will fail if the model wasn't already cached.

This is a pre-existing issue in EmbeddingModelStage, but SentenceTransformerEmbeddingModelStage inherits it. Consider refactoring to implement _setup() instead of overriding setup() directly, similar to how ClassifierModelStage is implemented. Alternatively, override setup_on_node() to properly download models before loading them.
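The `setup_on_node()`/`setup()` split the comment describes can be sketched generically (all names here are simplified stand-ins for the real `ModelStage` API, not the PR's code):

```python
class ModelStage:
    """Minimal sketch of the download-once / load-per-worker pattern."""

    def __init__(self, model_identifier: str):
        self.model_identifier = model_identifier
        self.model = None

    def _setup(self, local_files_only: bool) -> None:
        raise NotImplementedError  # subclasses implement the actual loading

    def setup_on_node(self) -> None:
        # Runs once per node: allowed to hit the network and download weights.
        self._setup(local_files_only=False)

    def setup(self) -> None:
        # Runs per worker: must find everything in the local cache.
        self._setup(local_files_only=True)


class ToyEmbeddingStage(ModelStage):
    def _setup(self, local_files_only: bool) -> None:
        # Stand-in for AutoModel.from_pretrained(..., local_files_only=...)
        self.model = ("loaded", self.model_identifier, local_files_only)
```

Because both entry points funnel through `_setup`, a subclass only has to implement loading once, and the download/load flag stays consistent.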

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, 6 comments


    if self.pretokenize:
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)

the tokenizer is loaded without proper parameters for distributed execution. it's missing local_files_only=True which could cause failures if model files aren't already cached locally. it should match the pattern used in setup_on_node() at line 95 where snapshot_download is called with local_files_only=False to download, and then this setup should use local_files_only=True.

Comment on lines +129 to +132

    def setup(self, _: WorkerMetadata | None = None) -> None:
        """Load the model for inference."""
        self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
        self.model.eval().to("cuda")

SentenceTransformerEmbeddingModelStage has the same issue as EmbeddingModelStage - it overrides setup() but doesn't implement the _setup(local_files_only) method expected by the parent ModelStage class. this will cause issues with model downloading in distributed settings where setup_on_node() is called first to download models to each node.

the parent class ModelStage's setup_on_node() method (model.py lines 85-102) expects subclasses to implement _setup so it can be called with local_files_only=False during download and local_files_only=True during worker setup.

    model_class = SentenceTransformerEmbeddingModelStage if self.use_sentence_transformer else EmbeddingModelStage

    if self.use_sentence_transformer:
        logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")

the warning message says "ignoring embedding_pooling" but it should be more clear about what the expected behavior is. consider rephrasing to: "SentenceTransformer uses its own internal pooling configuration; the embedding_pooling parameter will be ignored"


@greptile-apps

greptile-apps Bot commented Jan 12, 2026

Additional Comments (3)

nemo_curator/stages/text/embedders/base.py
EmbeddingModelStage overrides setup() but doesn't follow the parent ModelStage's expected pattern of implementing a _setup(local_files_only) method. the parent ModelStage's setup_on_node() (lines 85-102 in model.py) calls snapshot_download then looks for a _setup method to complete initialization. without implementing _setup, model downloading in distributed settings won't work correctly.

see how other ModelStage subclasses implement this pattern (e.g., nemo_curator/stages/text/classifiers/base.py line 115, nemo_curator/stages/text/models/tokenizer.py line 114).


nemo_curator/stages/text/embedders/base.py
edge case: _get_last_token could fail if a sequence has all padding tokens (attention_mask all zeros). in this case, attention_mask.sum(dim=1) - 1 would be -1, causing an invalid index error at line 99.

while this is unlikely in practice (there should always be at least one real token), consider adding a check or documenting this assumption.
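The edge case can be illustrated with a tiny standalone version of the index computation (a sketch with plain Python lists; the real code operates on torch tensors):

```python
def last_token_indices(attention_mask: list[list[int]]) -> list[int]:
    """Index of the last non-padded token per sequence (0-based).

    Clamping to 0 means an all-padding row (mask sum 0) maps to index 0
    instead of -1, which would otherwise wrap around or raise depending
    on the indexing op used downstream.
    """
    return [max(sum(row) - 1, 0) for row in attention_mask]
```

For masks `[1,1,1,0]`, `[1,0,0,0]`, and `[0,0,0,0]` this yields indices 2, 0, and 0; without the clamp the last row would produce -1.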


nemo_curator/stages/text/embedders/vllm.py
VLLMEmbeddingModelStage is missing a teardown() method to properly clean up GPU resources after processing. the parent ModelStage class implements teardown with gc.collect() and torch.cuda.empty_cache() (see nemo_curator/stages/text/models/model.py lines 195-197), and this stage should do the same since it manages GPU-resident vLLM models.

without proper cleanup, GPU memory may not be released between processing stages, potentially causing out-of-memory errors in pipelines with multiple stages.

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


4 files reviewed, 4 comments


    metrics["tokenization_time"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)

[P1] When self.model is None at line 129, calling self.model.embed() will cause AttributeError. Add model initialization check before this line to fail fast with a clear error message.

    raise ValueError(msg)

    t0 = time.perf_counter()
    max_model_len = self.model.model_config.max_model_len

[P1] Accessing model_config when model is None will cause AttributeError. This can happen if process() is called before setup() completes. Add check before line 123 to ensure model is initialized.

Comment on lines +129 to +132

    def setup(self, _: WorkerMetadata | None = None) -> None:
        """Load the model for inference."""
        self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
        self.model.eval().to("cuda")

[P2] SentenceTransformerEmbeddingModelStage.setup() also overrides parent without implementing _setup() method. Same distributed setup issue as EmbeddingModelStage - model won't load when setup_on_node() is called in distributed environments.

@greptile-apps

greptile-apps Bot commented Jan 12, 2026

Additional Comments (1)

nemo_curator/stages/text/embedders/base.py
[P2] Both EmbeddingModelStage and SentenceTransformerEmbeddingModelStage override setup() but don't implement the _setup(local_files_only) method expected by parent ModelStage. The parent's setup_on_node() (lines 85-102 in model.py) calls snapshot_download then looks for a _setup method to complete initialization. Without implementing _setup, distributed model downloading via setup_on_node() won't work correctly - the model will be downloaded but never loaded. See how other ModelStage subclasses implement this (e.g., DebertaClassificationStage in classifiers/base.py line 115).

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment


    raise ValueError(msg)

    t0 = time.perf_counter()
    max_model_len = self.model.model_config.max_model_len

Accessing model.model_config without checking if model is None. When pretokenize is True and process is called before setup, this line will crash with AttributeError. Should validate self.model is initialized like the tokenizer check above.


@VibhuJawa VibhuJawa left a comment


We discussed offline. The main concerns will be addressed by:

  1. A fast follow-up to make vLLM the default for text curation use cases.
  2. Aligning on an abstract class model to inherit from and verifying the class signatures being used.

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


6 files reviewed, 6 comments


Comment on lines 64 to 67

    def setup(self, _: WorkerMetadata | None = None) -> None:
        """Load the model for inference."""
    -   self.model = AutoModel.from_pretrained(self.model_identifier, local_files_only=True)
    +   self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
        self.model.eval().to("cuda")

EmbeddingModelStage overrides setup() but doesn't implement _setup(local_files_only) expected by parent ModelStage. The parent's setup_on_node() (lines 85-102 in model.py) downloads the model then calls _setup() to load it. Without _setup(), distributed model downloading won't work - model downloads but never loads.

see how other ModelStage subclasses implement this (e.g., DebertaClassificationStage in classifiers/base.py line 115, TokenizerStage in models/tokenizer.py line 114)

Suggested change

    - def setup(self, _: WorkerMetadata | None = None) -> None:
    -     """Load the model for inference."""
    -     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
    -     self.model.eval().to("cuda")
    + def _setup(self, local_files_only: bool = True) -> None:
    +     """Load the model for inference."""
    +     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=local_files_only)
    +     self.model.eval().to("cuda")
    +
    + def setup(self, _: WorkerMetadata | None = None) -> None:
    +     """Load the model for inference."""
    +     self._setup(local_files_only=True)

    raise ValueError(msg)

    t0 = time.perf_counter()
    max_model_len = self.model.model_config.max_model_len

accessing self.model.model_config before checking if self.model is None could cause AttributeError if setup() wasn't called. add a check:

Suggested change

    - max_model_len = self.model.model_config.max_model_len
    + if self.model is None:
    +     msg = "Model is not initialized. Please call setup() before processing."
    +     raise ValueError(msg)
    + max_model_len = self.model.model_config.max_model_len

Comment on lines +172 to +173

    if self.use_sentence_transformer:
        logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")

warning message says "ignoring embedding_pooling" but uses incorrect grammar. should be "ignoring embedding_pooling parameter"


Comment on lines +133 to +141

    def setup(self, _: WorkerMetadata | None = None) -> None:
        """Load the model for inference."""
        self.model = SentenceTransformer(
            self.model_identifier,
            cache_folder=self.cache_dir,
            use_auth_token=self.hf_token,
            local_files_only=True,
        )
        self.model.eval().to("cuda")

SentenceTransformerEmbeddingModelStage has the same issue - it overrides setup() but doesn't implement _setup(local_files_only), so distributed model downloading via setup_on_node() won't work correctly.

Suggested change
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(
        self.model_identifier,
        cache_folder=self.cache_dir,
        use_auth_token=self.hf_token,
        local_files_only=True,
    )
    self.model.eval().to("cuda")

def _setup(self, local_files_only: bool = True) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(
        self.model_identifier,
        cache_folder=self.cache_dir,
        local_files_only=local_files_only,
    )
    self.model.eval().to("cuda")

def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self._setup(local_files_only=True)
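The _setup/setup split the review asks for can be sketched roughly as follows (setup_on_node and the exact base-class contract are assumptions taken from the comment, not verified against the NeMo Curator source):

```python
class ModelStage:
    """Illustrative sketch of node-level download vs. worker-level load."""

    def __init__(self):
        self.last_local_files_only = None

    def _setup(self, local_files_only: bool = True) -> None:
        # Stand-in for the real model load; records how it was invoked.
        self.last_local_files_only = local_files_only

    def setup_on_node(self) -> None:
        # Runs once per node: allowed to download weights from the hub.
        self._setup(local_files_only=False)

    def setup(self, worker_metadata=None) -> None:
        # Runs per worker: weights must already be in the local cache.
        self._setup(local_files_only=True)
```

With this split, overriding only _setup keeps both code paths consistent, which is what the suggested change above restores.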

@greptile-apps

greptile-apps Bot commented Jan 13, 2026

Additional Comments (2)

nemo_curator/stages/text/embedders/base.py
_get_last_token misbehaves if a sequence is all padding tokens (attention_mask all zeros): attention_mask.sum(dim=1) - 1 evaluates to -1, which wraps to the final (padding) position at line 101 and silently selects a padding-token embedding.

while unlikely in practice, add a safety check:

    def _get_last_token(self, model_output: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        token_embeddings = model_output[0]
        # Get indices of last non-padded tokens for each sequence in batch
        last_token_indices = attention_mask.sum(dim=1) - 1  # -1 for 0-based indexing
        last_token_indices = last_token_indices.clamp(min=0)  # Ensure non-negative
        last_token_indices = last_token_indices.to(torch.long)  # Ensure indices are of type long
        batch_size = attention_mask.size(0)
        batch_indices = torch.arange(batch_size, device=attention_mask.device)
        # Get embeddings of last non-padded tokens
        last_token_embeddings = token_embeddings[batch_indices, last_token_indices]
        return F.normalize(last_token_embeddings, dim=1)
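The all-padding edge case is easy to see from the index arithmetic alone; here is a torch-free, plain-Python sketch of the per-row index computation, for illustration only:

```python
def last_token_index(attention_mask_row):
    """Index of the last non-padded token in one sequence.

    A row of all zeros yields sum - 1 == -1, which in torch would wrap
    to the final (padding) position; clamping to 0 mirrors the suggested fix.
    """
    idx = sum(attention_mask_row) - 1
    return max(idx, 0)

print(last_token_index([1, 1, 1, 0]))  # 2: the third token is the last real one
print(last_token_index([0, 0, 0, 0]))  # 0 after clamping an all-padding row
```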

nemo_curator/stages/text/embedders/__init__.py
VLLMEmbeddingModelStage is not exported. users need to import it directly from .vllm instead of from the main embedders module.

from .base import EmbeddingCreatorStage
from .vllm import VLLMEmbeddingModelStage

__all__ = ["EmbeddingCreatorStage", "VLLMEmbeddingModelStage"]

@greptile-apps

greptile-apps Bot commented Jan 13, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

Signed-off-by: Praateek <praateekm@gmail.com>
…/NeMo-Curator into praateek/vllm-embeddings

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


7 files reviewed, 1 comment


raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len

logic: validate model initialization before accessing config property, similar to tokenizer validation at line 123

Signed-off-by: Praateek <praateekm@gmail.com>

@greptile-apps greptile-apps Bot left a comment


7 files reviewed, 2 comments


import torch
import torch.nn.functional as F # noqa: N812
from loguru import logger
from sentence_transformers import SentenceTransformer

logic: unconditional import breaks users without text_cpu/text_cuda12 extras. since __init__.py imports EmbeddingCreatorStage from this file, importing the embedders package will fail with ImportError: No module named 'sentence_transformers'

wrap in try/except or use conditional import:

Suggested change
from sentence_transformers import SentenceTransformer
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None

then add validation in SentenceTransformerEmbeddingModelStage.__init__:

if SentenceTransformer is None:
    raise ImportError("sentence-transformers required. Install with: pip install nemo-curator[text_cpu]")
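Put together, the optional-import pattern the comment describes looks roughly like this (the module name in the try branch is deliberately one that is not installed, so the fallback path is exercised here; the install-hint string is taken from the comment):

```python
try:
    # Stand-in for `from sentence_transformers import SentenceTransformer`;
    # this module does not exist, so the fallback below is taken.
    import sentence_transformers_not_installed as SentenceTransformer
except ImportError:
    SentenceTransformer = None


class SentenceTransformerEmbeddingModelStage:
    def __init__(self):
        # Validate the optional dependency up front, per the review suggestion.
        if SentenceTransformer is None:
            msg = "sentence-transformers required. Install with: pip install nemo-curator[text_cpu]"
            raise ImportError(msg)
```

This way, importing the embedders package succeeds without the extra installed, and users only hit the ImportError when they actually construct the stage.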

raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len

logic: accessing model.model_config without checking if model is initialized. if process() called before setup(), raises AttributeError: 'NoneType' object has no attribute 'model_config'

Suggested change
max_model_len = self.model.model_config.max_model_len
if self.model is None:
    raise ValueError("Model is not initialized. Please call setup() before processing.")
max_model_len = self.model.model_config.max_model_len
