Add vLLM and Sentence Transformers support for embedding generation #1346
Conversation
Greptile Summary

This PR adds vLLM and Sentence Transformers support for embedding generation, providing performance improvements for large models based on experimental benchmarks with 5 GB of Common Crawl data.

Key Changes:
Performance Findings:

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant VLLMStage as VLLMEmbeddingModelStage
participant vLLM
participant GPU
User->>VLLMStage: initialize(model_identifier, pretokenize)
User->>VLLMStage: setup_on_node()
VLLMStage->>vLLM: Download model files
VLLMStage->>vLLM: Initialize LLM instance
vLLM->>GPU: Load model weights
GPU-->>VLLMStage: Model ready on GPU
User->>VLLMStage: setup()
alt pretokenize enabled
VLLMStage->>VLLMStage: Load separate tokenizer
end
User->>VLLMStage: process(batch)
Note over VLLMStage: Extract text from batch<br/>Apply character limit
alt pretokenize enabled
VLLMStage->>VLLMStage: Tokenize text batch<br/>with truncation
VLLMStage->>vLLM: embed(preprocessed_ids)
else pretokenize disabled
VLLMStage->>vLLM: embed(raw_text)
end
vLLM->>GPU: Compute embeddings
GPU-->>vLLM: Return embedding vectors
vLLM-->>VLLMStage: EmbeddingRequestOutput
VLLMStage-->>User: DocumentBatch with embeddings
```
```python
import torch
import torch.nn.functional as F  # noqa: N812
from loguru import logger
from sentence_transformers import SentenceTransformer
```
[P0] This unconditional import will cause ImportError for users who install without the text_cuda12 extra. The import should be conditional or guarded since sentence-transformers is in optional dependencies. Consider importing inside the class methods or using a try/except block.
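A minimal sketch of one possible guard (deferring the failure until the class is used; the sentinel pattern is illustrative):

```python
# Guarded import: sentence-transformers is an optional dependency, so the
# failure is deferred until the SentenceTransformer stage is actually used.
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None  # checked in __init__ before use
```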
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
[P2] The tokenizer loading should respect the cache directory and authentication settings passed to the stage constructor. Currently only model_identifier is used, which may cause inconsistent behavior with model downloads.
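A hedged sketch of the suggested fix, assuming the stage stores `cache_dir` and `hf_token` from its constructor:

```python
from transformers import AutoTokenizer

# Resolve the tokenizer against the same cache and credentials used for the
# model download, so both point at the files fetched in setup_on_node().
self.tokenizer = AutoTokenizer.from_pretrained(
    self.model_identifier,
    cache_dir=self.cache_dir,
    token=self.hf_token,
)
```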
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
the code accesses self.model.model_config.max_model_len without checking if self.model is initialized. if process() is called before setup(), this will raise an AttributeError: 'NoneType' object has no attribute 'model_config'
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     msg = "Model is not initialized. Please call setup() before processing."
+     raise ValueError(msg)
+ max_model_len = self.model.model_config.max_model_len
```
```python
if not self.verbose and "disable_log_stats" not in vllm_init_kwargs:
    vllm_init_kwargs["disable_log_stats"] = True

self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)
```
the cache_dir parameter from __init__ is not passed to the LLM initialization. vLLM's LLM class accepts a download_dir parameter that should be used here. currently, vLLM will use its default cache location which may differ from where snapshot_download cached the model in setup_on_node(), potentially causing duplicate downloads or cache misses
```diff
- self.model = LLM(model=self.model_identifier, **vllm_init_kwargs)
+ self.model = LLM(model=self.model_identifier, download_dir=self.cache_dir, **vllm_init_kwargs)
```
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
```
the SentenceTransformer initialization doesn't support a custom cache directory. The SentenceTransformer class accepts a cache_folder parameter, but this stage doesn't accept or use it. Consider adding cache_dir parameter support for consistency with other model stages:
```diff
- self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
+ self.model = SentenceTransformer(
+     self.model_identifier,
+     cache_folder=self.cache_dir if hasattr(self, 'cache_dir') and self.cache_dir else None,
+     local_files_only=True
+ )
```
and update `__init__` to accept and store a `cache_dir` parameter.
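One possible shape for that change (a sketch; the exact parent signature is assumed):

```python
def __init__(self, model_identifier: str, cache_dir: str | None = None, **kwargs) -> None:
    super().__init__(model_identifier=model_identifier, **kwargs)
    # Stored so setup() can forward it as SentenceTransformer(cache_folder=...)
    self.cache_dir = cache_dir
```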
Additional Comments (1)
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
)
# Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
self.unpack_inference_batch = False
self.embedding_field = embedding_field
```
the parent EmbeddingModelStage.__init__ sets self.embedding_field using the default value, and then line 124 sets it again with the same value from the parameter. This is redundant. Additionally, the parent sets self.pooling = "mean_pooling" (the default), which is never used by SentenceTransformer, creating a misleading attribute.
While this doesn't cause runtime errors, it's confusing for maintainability. Consider either:
- passing `embedding_field` and `pooling` to the parent constructor explicitly (see the sketch below), or
- not setting `self.embedding_field` again after the parent call
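The first option might look like this (a sketch; whether the parent accepts these keywords is an assumption):

```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
    embedding_field=embedding_field,  # set once, in the parent
    pooling=None,  # explicit: SentenceTransformer handles pooling internally
)
```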
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer loading doesn't pass cache_dir, which means it will look in the default HuggingFace cache directory instead of using self.cache_dir specified in the constructor. This will fail when a custom cache directory is used, since setup_on_node downloads to self.cache_dir but the tokenizer loads from the default location.
```diff
- self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier, cache_dir=self.cache_dir)
```
```python
super().__init__(
    model_identifier=model_identifier,
    hf_token=hf_token,
    model_inference_batch_size=model_inference_batch_size,
    has_seq_order=has_seq_order,
    padding_side=padding_side,
    autocast=autocast,
)
# Override unpack_inference_batch to False as SentenceTransformer expects a dictionary input
self.unpack_inference_batch = False
self.embedding_field = embedding_field
```
the initialization pattern here is fragile. super().__init__() is called with parameters that set self.unpack_inference_batch = True (via parent's init at line 53), then line 123 immediately overrides it to False. while this works, it's error-prone because:
- the parent class initialization does unnecessary work that's immediately overridden
- if the parent's `__init__` implementation changes, this could break
- it's not immediately clear to future maintainers why this override is needed

consider either:
- not passing the parameter to the parent if it will be overridden
- or better yet, modifying the parent class to accept this as a parameter so the override isn't needed (see the sketch below)
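A sketch of the second option, assuming the parent can grow an `unpack_inference_batch` keyword:

```python
class EmbeddingModelStage(ModelStage):
    def __init__(self, model_identifier: str, unpack_inference_batch: bool = True, **kwargs) -> None:
        super().__init__(model_identifier=model_identifier, **kwargs)
        self.unpack_inference_batch = unpack_inference_batch


class SentenceTransformerEmbeddingModelStage(EmbeddingModelStage):
    def __init__(self, model_identifier: str, **kwargs) -> None:
        # SentenceTransformer expects a dictionary input, so never unpack.
        super().__init__(model_identifier=model_identifier, unpack_inference_batch=False, **kwargs)
```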
```python
if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
when use_sentence_transformer=True, the code logs a warning that embedding_pooling is ignored, but there's no validation to check if the user explicitly set a non-default pooling value. if a user explicitly sets embedding_pooling="last_token" with use_sentence_transformer=True, they'll only get a warning, which they might miss. consider validating that embedding_pooling has its default value when use_sentence_transformer=True, or raise an error if it's been explicitly changed.
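A sketch of such a validation (the "mean_pooling" default is taken from the parent class as described above):

```python
if use_sentence_transformer and embedding_pooling != "mean_pooling":
    msg = (
        "embedding_pooling is ignored when use_sentence_transformer=True; "
        "remove the explicit pooling setting or set use_sentence_transformer=False."
    )
    raise ValueError(msg)
```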
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer is loaded without the cache directory, authentication parameter, or local-only flag that are used elsewhere in the codebase (see line 95 for snapshot_download). if the tokenizer isn't already cached or requires authentication, this will fail or make unexpected network calls during processing. the tokenizer loading should include cache_dir=self.cache_dir and local_files_only=True parameters for consistency.
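Something along these lines would mirror the snapshot_download pattern (a sketch):

```python
# Match setup_on_node(): read from the stage's cache and avoid network calls.
self.tokenizer = AutoTokenizer.from_pretrained(
    self.model_identifier,
    cache_dir=self.cache_dir,
    local_files_only=True,
)
```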
```python
t0 = time.perf_counter()
vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)
metrics["vllm_embedding_time"] = time.perf_counter() - t0
```
the self.model.embed() call lacks error handling. if the embedding operation fails (GPU issues, invalid input, model errors), it will crash the entire processing pipeline. wrap this in a try-except block to provide better error messages and allow for graceful failure recovery.
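A minimal sketch of that error handling (the message format is illustrative):

```python
t0 = time.perf_counter()
try:
    vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose)
except Exception as e:
    # Report which batch failed rather than letting the pipeline crash opaquely.
    msg = f"vLLM embedding failed for a batch of {len(input_data)} inputs"
    raise RuntimeError(msg) from e
metrics["vllm_embedding_time"] = time.perf_counter() - t0
```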
Signed-off-by: Praateek <praateekm@gmail.com>
```toml
[project.optional-dependencies]
cuda12 = ["gpustat", "nvidia-ml-py"]
vllm = ["vllm>=0.13; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
```
The vLLM version constraint change from ==0.11.1 (in video_cuda12) to >=0.13 may cause compatibility issues for video curation features that depend on vLLM 0.11.1 specifically. The video_cuda12 dependency previously pinned vLLM to 0.11.1, but now it will use 0.13+.
Before merging, please verify that:
- Video curation features work correctly with vLLM 0.13+
- There are no breaking API changes between vLLM 0.11.1 and 0.13 that affect video processing

If video curation requires vLLM 0.11.1 specifically, consider creating a separate vLLM version constraint for video vs. text embedding use cases (see the sketch below).
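If the pins do need to diverge, the split could look roughly like this (the extra names are hypothetical):

```toml
[project.optional-dependencies]
# Hypothetical: keep video pinned while letting text embedding track newer vLLM.
vllm_text = ["vllm>=0.13; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
vllm_video = ["vllm==0.11.1; (platform_machine == 'x86_64' and platform_system != 'Darwin')"]
```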
Additional Comments (2)
Currently, when … This is a pre-existing issue in EmbeddingModelStage, but SentenceTransformerEmbeddingModelStage inherits it. Consider refactoring to implement …
Signed-off-by: Praateek <praateekm@gmail.com>
```python
if self.pretokenize:
    from transformers import AutoTokenizer

    self.tokenizer = AutoTokenizer.from_pretrained(self.model_identifier)
```
the tokenizer is loaded without proper parameters for distributed execution. it's missing local_files_only=True which could cause failures if model files aren't already cached locally. it should match the pattern used in setup_on_node() at line 95 where snapshot_download is called with local_files_only=False to download, and then this setup should use local_files_only=True.
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
    self.model.eval().to("cuda")
```
SentenceTransformerEmbeddingModelStage has the same issue as EmbeddingModelStage - it overrides setup() but doesn't implement the _setup(local_files_only) method expected by the parent ModelStage class. this will cause issues with model downloading in distributed settings where setup_on_node() is called first to download models to each node.
the parent class ModelStage's setup_on_node() method (model.py lines 85-102) expects subclasses to implement _setup so it can be called with local_files_only=False during download and local_files_only=True during worker setup.
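A sketch of the expected contract, following the `_setup(local_files_only)` pattern the comment describes:

```python
def _setup(self, local_files_only: bool = True) -> None:
    # Called by setup_on_node() with local_files_only=False (download),
    # and by setup() with local_files_only=True (load from cache).
    self.model = SentenceTransformer(self.model_identifier, local_files_only=local_files_only)
    self.model.eval().to("cuda")

def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self._setup(local_files_only=True)
```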
```python
model_class = SentenceTransformerEmbeddingModelStage if self.use_sentence_transformer else EmbeddingModelStage

if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
the warning message says "ignoring embedding_pooling" but should be clearer about the expected behavior. consider rephrasing to: "SentenceTransformer uses its own internal pooling configuration; the embedding_pooling parameter will be ignored"
Additional Comments (3)
- … see how other ModelStage subclasses implement this pattern (e.g., nemo_curator/stages/text/classifiers/base.py line 115, nemo_curator/stages/text/models/tokenizer.py line 114).
- … while this is unlikely in practice (there should always be at least one real token), consider adding a check or documenting this assumption.
- … without proper cleanup, GPU memory may not be released between processing stages, potentially causing out-of-memory errors in pipelines with multiple stages (a teardown sketch follows).
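A common cleanup pattern for the last point looks like this (whether the stage interface defines `teardown` is an assumption):

```python
import torch

def teardown(self) -> None:
    # Drop the model reference and return cached blocks to the CUDA allocator
    # so the next stage in the pipeline starts with free GPU memory.
    self.model = None
    torch.cuda.empty_cache()
```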
Signed-off-by: Praateek <praateekm@gmail.com>
| metrics["tokenization_time"] = time.perf_counter() - t0 | ||
|
|
||
| t0 = time.perf_counter() | ||
| vllm_output = self.model.embed(input_data, truncate_prompt_tokens=-1, use_tqdm=self.verbose) |
[P1] When self.model is None at line 129, calling self.model.embed() will cause AttributeError. Add model initialization check before this line to fail fast with a clear error message.
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
[P1] Accessing model_config when model is None will cause AttributeError. This can happen if process() is called before setup() completes. Add check before line 123 to ensure model is initialized.
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(self.model_identifier, local_files_only=True)
    self.model.eval().to("cuda")
```
[P2] SentenceTransformerEmbeddingModelStage.setup() also overrides parent without implementing _setup() method. Same distributed setup issue as EmbeddingModelStage - model won't load when setup_on_node() is called in distributed environments.
Additional Comments (1)
Signed-off-by: Praateek <praateekm@gmail.com>
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
Accessing model.model_config without checking if model is None. When pretokenize is True and process is called before setup, this line will crash with AttributeError. Should validate self.model is initialized like the tokenizer check above.
VibhuJawa left a comment
We discussed offline. The main concerns will be addressed by:
- A fast follow to make vLLM the default for text curation use cases.
- Aligning on an abstract model class to inherit from and verifying the class signatures being used.
Signed-off-by: Praateek <praateekm@gmail.com>
```diff
  def setup(self, _: WorkerMetadata | None = None) -> None:
      """Load the model for inference."""
-     self.model = AutoModel.from_pretrained(self.model_identifier, local_files_only=True)
+     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
      self.model.eval().to("cuda")
```
EmbeddingModelStage overrides setup() but doesn't implement _setup(local_files_only) expected by parent ModelStage. The parent's setup_on_node() (lines 85-102 in model.py) downloads the model then calls _setup() to load it. Without _setup(), distributed model downloading won't work - model downloads but never loads.
see how other ModelStage subclasses implement this (e.g., DebertaClassificationStage in classifiers/base.py line 115, TokenizerStage in models/tokenizer.py line 114)
```diff
- def setup(self, _: WorkerMetadata | None = None) -> None:
-     """Load the model for inference."""
-     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=True)
-     self.model.eval().to("cuda")
+ def _setup(self, local_files_only: bool = True) -> None:
+     """Load the model for inference."""
+     self.model = AutoModel.from_pretrained(self.model_identifier, cache_dir=self.cache_dir, local_files_only=local_files_only)
+     self.model.eval().to("cuda")
+
+ def setup(self, _: WorkerMetadata | None = None) -> None:
+     """Load the model for inference."""
+     self._setup(local_files_only=True)
```
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
accessing self.model.model_config before checking if self.model is None could cause AttributeError if setup() wasn't called. add a check:
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     msg = "Model is not initialized. Please call setup() before processing."
+     raise ValueError(msg)
+ max_model_len = self.model.model_config.max_model_len
```
```python
if self.use_sentence_transformer:
    logger.warning("Using SentenceTransformer for embedding model ignoring embedding_pooling")
```
the warning message "ignoring embedding_pooling" reads awkwardly; it should be "ignoring the embedding_pooling parameter"
```python
def setup(self, _: WorkerMetadata | None = None) -> None:
    """Load the model for inference."""
    self.model = SentenceTransformer(
        self.model_identifier,
        cache_folder=self.cache_dir,
        use_auth_token=self.hf_token,
        local_files_only=True,
    )
    self.model.eval().to("cuda")
```
SentenceTransformerEmbeddingModelStage has the same issue - overrides setup() but doesn't implement _setup(local_files_only). distributed model downloading via setup_on_node() won't work correctly.
```diff
- def setup(self, _: WorkerMetadata | None = None) -> None:
-     """Load the model for inference."""
-     self.model = SentenceTransformer(
-         self.model_identifier,
-         cache_folder=self.cache_dir,
-         use_auth_token=self.hf_token,
-         local_files_only=True,
-     )
-     self.model.eval().to("cuda")
+ def _setup(self, local_files_only: bool = True) -> None:
+     """Load the model for inference."""
+     self.model = SentenceTransformer(
+         self.model_identifier,
+         cache_folder=self.cache_dir,
+         local_files_only=local_files_only,
+     )
+     self.model.eval().to("cuda")
+
+ def setup(self, _: WorkerMetadata | None = None) -> None:
+     """Load the model for inference."""
+     self._setup(local_files_only=True)
```
Additional Comments (2)
while unlikely in practice, add a safety check:
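The collapsed suggestion isn't shown, but given the earlier note about "at least one real token", the check presumably guards the mean-pooling denominator; a guess at its shape (variable names are assumptions):

```python
# Avoid dividing by zero if a sequence somehow contains only padding tokens.
token_counts = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)
mean_embeddings = masked_hidden_states.sum(dim=1) / token_counts
```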
Signed-off-by: Praateek <praateekm@gmail.com>
…/NeMo-Curator into praateek/vllm-embeddings
Signed-off-by: Praateek <praateekm@gmail.com>
```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
logic: validate model initialization before accessing config property, similar to tokenizer validation at line 123
Signed-off-by: Praateek <praateekm@gmail.com>
```python
import torch
import torch.nn.functional as F  # noqa: N812
from loguru import logger
from sentence_transformers import SentenceTransformer
```
logic: unconditional import breaks users without text_cpu/text_cuda12 extras. since __init__.py imports EmbeddingCreatorStage from this file, importing the embedders package will fail with ImportError: No module named 'sentence_transformers'
wrap in try/except or use conditional import:
```diff
- from sentence_transformers import SentenceTransformer
+ try:
+     from sentence_transformers import SentenceTransformer
+ except ImportError:
+     SentenceTransformer = None
```
then add validation in SentenceTransformerEmbeddingModelStage.__init__:
```python
if SentenceTransformer is None:
    raise ImportError("sentence-transformers required. Install with: pip install nemo-curator[text_cpu]")
```

```python
raise ValueError(msg)

t0 = time.perf_counter()
max_model_len = self.model.model_config.max_model_len
```
logic: accessing model.model_config without checking if model is initialized. if process() called before setup(), raises AttributeError: 'NoneType' object has no attribute 'model_config'
```diff
- max_model_len = self.model.model_config.max_model_len
+ if self.model is None:
+     raise ValueError("Model is not initialized. Please call setup() before processing.")
+ max_model_len = self.model.model_config.max_model_len
```
Description
Tried out a few models:

- `sentence_transformer`: Current implementation of TokenizerStage + ModelStage
- `vllm_text`: vLLM with text as input - single stage of vLLM only
- `vllm_text_4cpus`: Same as above, except we dedicate 4 CPUs to the stage
- `vllm_tokens`: vLLM with tokens as input, tokenization done by TokenizerStage
- `vllm_text_pretokenized`: vLLM with text as input, tokenization done within the stage

The experiment ran on 5 GB of Common Crawl data, and the findings were:

- … `vllm_text_pretokenized`. This suggests that, given a large number of tasks and an amortized startup cost, we should see `vllm_text_pretokenized` come out ahead.
- `vllm_text` is slowest, likely due to tokenization happening sentence by sentence, leading to higher GPU idle time and not justifying the small model runtime on GPU. Increasing CPU allocation for the stage doesn't improve runtimes.
- `vllm_text_pretokenized`; …

Usage
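A hypothetical snippet for the usage placeholder below (the class name, import path, and parameters are assumptions based on the stages discussed in this PR):

```python
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage

# Roughly the "vllm_text_pretokenized" configuration from the benchmarks:
# a single vLLM stage that tokenizes text internally before embedding.
stage = EmbeddingCreatorStage(
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",  # hypothetical model
    use_sentence_transformer=False,  # use the vLLM backend
    pretokenize=True,  # tokenize within the stage
)
```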
```python
# Add snippet demonstrating usage
```

Checklist