This repository provides a PyTorch-based implementation of a greedy coreset selection algorithm. The method incrementally selects samples that maximize diversity in feature space, optionally weighted by a per-sample quality score.
Given:
- A set of feature vectors
- (Optionally) a quality score for each sample
The algorithm:
- Starts from an initial sample (index 0)
- Iteratively selects the next sample that maximizes the objective function
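The steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the README does not spell out the objective function, so the sketch assumes one common choice, the distance to the nearest already-selected sample multiplied by the quality score. The name `greedy_select_sketch` and the toy data are hypothetical, and plain lists stand in for feature tensors so the block has no dependencies.

```python
import math
from typing import Dict, List, Optional, Sequence


def greedy_select_sketch(
    features: Dict[str, Sequence[float]],
    quality: Optional[Dict[str, float]] = None,
    size: int = 2,
) -> List[str]:
    """Illustrative greedy selection (assumed objective, see lead-in)."""
    keys = list(features)
    if not keys:
        return []
    size = min(size, len(keys))

    # Start from the first sample (index 0).
    selected = [0]
    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = [math.dist(features[k], features[keys[0]]) for k in keys]

    while len(selected) < size:
        best_idx, best_score = -1, -math.inf
        for i, k in enumerate(keys):
            if i in selected:
                continue
            # Assumed objective: diversity (distance to the selected set),
            # scaled by the quality score when one is provided.
            weight = quality[k] if quality is not None else 1.0
            score = weight * min_dist[i]
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        # Refresh nearest-selected distances using the newly added sample.
        new_feat = features[keys[best_idx]]
        for i, k in enumerate(keys):
            min_dist[i] = min(min_dist[i], math.dist(features[k], new_feat))

    return [keys[i] for i in selected]


toy_features = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [10.0, 0.0]}
print(greedy_select_sketch(toy_features, size=2))  # ['a', 'c']
```

Under this assumed objective, a low quality score can outweigh a large distance: with `quality={"a": 1.0, "b": 1.0, "c": 0.05}` the same call returns `['a', 'b']` instead.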
Typical use cases:
- Dataset subset selection
- Diversity-aware sampling
- Quality-aware coreset construction
```python
from coreset import execute_coreset_selection

selected_files = execute_coreset_selection(
    feature_dict_path="diversity.pt",
    quality_dict_path="nisqa.pt",
    size=4000,
    device=None  # automatically uses CUDA if available
)
```

| Argument | Type | Description |
|---|---|---|
| `feature_dict_path` | `str` | Path to a `.pt` file containing `{key: feature_tensor}` |
| `quality_dict_path` | `str` | Path to a `.pt` file containing `{key: quality_score}` |
| `size` | `int` | Number of samples to select |
| `device` | `torch.device` or `None` | Computation device |

Returns `List[str]`: the keys corresponding to the selected samples, in the order they were selected.
The feature file (`feature_dict_path`) maps each key to a feature tensor:

```python
{
    "file_001": torch.Tensor([...]),
    "file_002": torch.Tensor([...]),
    ...
}
```

All feature tensors must have the same shape.
The quality file (`quality_dict_path`) maps each key to a quality score:

```python
{
    "file_001": float,
    "file_002": float,
    ...
}
```

All keys must match those in the feature dictionary.
MIT License