Pull request overview
Adds CLI options to control which MTEB splits/subsets to evaluate and optionally cap dataset size for faster smoke-test runs.
Changes:
- Added `--eval_splits`, `--hf_subsets`, and `--max_samples` CLI parameters.
- Implemented in-place dataset truncation after `task.load_data()` when `--max_samples` is set.
- Tightened the `OVMSModel.encode()` return type annotation to `np.ndarray`.
demos/embeddings/ovms_mteb.py
```python
rng = random.Random(seed)


def _truncate_split(dataset, n):
    if len(dataset) <= n:
        return dataset
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    return dataset.select(sorted(indices[:n]))
```
This truncation path materializes a full indices list of size `len(dataset)`, shuffles it, then sorts a slice. For large datasets this is unnecessarily memory- and CPU-heavy. Prefer the Hugging Face `datasets` primitives (e.g., `dataset.shuffle(seed=...).select(range(n))` or equivalent) to avoid building and shuffling a full index list and to remove the extra `sorted(...)` cost.
Suggested change:

```diff
-rng = random.Random(seed)
-def _truncate_split(dataset, n):
-    if len(dataset) <= n:
-        return dataset
-    indices = list(range(len(dataset)))
-    rng.shuffle(indices)
-    return dataset.select(sorted(indices[:n]))
+def _truncate_split(dataset, n):
+    if len(dataset) <= n:
+        return dataset
+    # Use Hugging Face datasets primitives for efficient shuffling and selection
+    shuffled = dataset.shuffle(seed=seed)
+    return shuffled.select(range(n))
```
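As a rough illustration of the reviewer's point, the same truncation behavior can be had by drawing only `n` indices rather than shuffling a full index list. The sketch below uses a hypothetical `ToyDataset` stand-in (plain Python, no Hugging Face dependency) purely to show the shape of the logic; it is not the PR's actual code:

```python
import random


class ToyDataset:
    """Minimal, hypothetical stand-in for a Hugging Face Dataset."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def select(self, indices):
        # Mirrors Dataset.select: keep only the given indices, in order
        return ToyDataset(self.rows[i] for i in indices)


def truncate_split(dataset, n, seed=42):
    """Keep at most n rows without materializing a full index list."""
    if len(dataset) <= n:
        return dataset
    # random.sample over a range draws n distinct indices directly;
    # sorting them keeps the surviving rows in their original order
    indices = sorted(random.Random(seed).sample(range(len(dataset)), n))
    return dataset.select(indices)


ds = ToyDataset(range(1000))
small = truncate_split(ds, 20)
assert len(small) == 20
assert small.rows == sorted(small.rows)  # original row order preserved
```

This keeps the deterministic-sampling behavior of the original helper while avoiding both the full shuffle and the `sorted(...)` over a sliced copy.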
demos/embeddings/ovms_mteb.py
```python
for key in task.dataset:
    value = task.dataset[key]
    if isinstance(value, DatasetDict):
        # Multilingual: subset_name -> DatasetDict(split -> Dataset)
        for split in value:
            value[split] = _truncate_split(value[split], max_samples)
    else:
        # Flat: split -> Dataset
        task.dataset[key] = _truncate_split(value, max_samples)
```
The `key`/`value` naming here makes it harder to follow the two supported shapes (multilingual subset → splits vs. flat split → dataset). Renaming to something shape-specific (e.g., `subset_name`/`subset_data` and `split_name`/`split_ds`) would make the control flow and data model much clearer, especially since both levels use dictionary iteration.
Suggested change:

```diff
-for key in task.dataset:
-    value = task.dataset[key]
-    if isinstance(value, DatasetDict):
-        # Multilingual: subset_name -> DatasetDict(split -> Dataset)
-        for split in value:
-            value[split] = _truncate_split(value[split], max_samples)
-    else:
-        # Flat: split -> Dataset
-        task.dataset[key] = _truncate_split(value, max_samples)
+for subset_name in task.dataset:
+    subset_data = task.dataset[subset_name]
+    if isinstance(subset_data, DatasetDict):
+        # Multilingual: subset_name -> DatasetDict(split_name -> Dataset)
+        for split_name, split_ds in subset_data.items():
+            subset_data[split_name] = _truncate_split(split_ds, max_samples)
+    else:
+        # Flat: split_name -> Dataset
+        task.dataset[subset_name] = _truncate_split(subset_data, max_samples)
```
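To make the two shapes concrete, here is a self-contained sketch of the traversal using plain dicts in place of `DatasetDict` and list slicing in place of the real `_truncate_split` (names and structures are illustrative assumptions, not the PR's actual code):

```python
def _truncate_split(split_ds, n):
    # Stand-in for the real helper: keep at most n examples
    return split_ds[:n] if len(split_ds) > n else split_ds


def truncate_task_dataset(dataset, max_samples):
    # dataset is either subset_name -> {split_name -> examples} (multilingual)
    # or split_name -> examples (flat); a plain dict stands in for DatasetDict
    for subset_name in dataset:
        subset_data = dataset[subset_name]
        if isinstance(subset_data, dict):
            # Multilingual: truncate every split within this subset
            for split_name, split_ds in subset_data.items():
                subset_data[split_name] = _truncate_split(split_ds, max_samples)
        else:
            # Flat: the top-level key is itself a split
            dataset[subset_name] = _truncate_split(subset_data, max_samples)


multilingual = {"en": {"test": list(range(100))}, "de": {"test": list(range(10))}}
flat = {"test": list(range(100)), "validation": list(range(10))}
truncate_task_dataset(multilingual, max_samples=20)
truncate_task_dataset(flat, max_samples=20)
assert len(multilingual["en"]["test"]) == 20
assert len(multilingual["de"]["test"]) == 10  # already under the cap, untouched
assert len(flat["test"]) == 20
```

The shape-specific names make it immediately visible which dictionary level is a subset and which is a split, which is the point of the suggestion above.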
Force-pushed from `1ab48e4` to `c202f29`.
🛠 Summary
Requested on standup
To test only 20 samples, run:
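A plausible invocation, assuming the `--max_samples` flag added in this PR and the demo path above (the exact arguments may differ from the author's intended command):

```shell
python demos/embeddings/ovms_mteb.py --max_samples 20
```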