Merged
35 changes: 35 additions & 0 deletions docs/src/cli/sample_inverted_index.md
@@ -5,3 +5,38 @@
```
<!-- cmdrun ../../../build/bin/sample_inverted_index --help -->
```

## Description

Creates a smaller inverted index from an existing one by sampling either
postings or documents. The sampled index takes less time and space to process
while aiming to preserve the main statistical properties of the original
collection, which makes it useful for faster experiments and debugging.

### Sampling strategy (`-t, --type`)

- `random_postings`: keep a random subset of the postings within each posting
list, rather than keeping or dropping whole posting lists.
- `random_docids`: keep all postings belonging to a random subset of documents.
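
The difference between the two strategies can be illustrated on a toy in-memory index. The sketch below is only a conceptual model, not the tool's implementation: the `index` dictionary, the function names, the rounding rule, and the choice to keep at least one posting per list are all assumptions made for illustration.

```python
import random

# Toy inverted index: term -> list of (doc_id, term_frequency) postings,
# sorted by doc_id. Illustrative only; not the format the tool reads.
index = {
    "apple": [(0, 2), (1, 1), (3, 4), (5, 1)],
    "banana": [(1, 3), (2, 1), (5, 2)],
    "cherry": [(0, 1), (2, 2), (3, 1), (4, 1)],
}


def sample_random_postings(index, rate, rng):
    """Keep a random subset of postings from every posting list."""
    sampled = {}
    for term, postings in index.items():
        # Assumption: round the target size and keep at least one posting.
        keep = max(1, round(len(postings) * rate))
        chosen = sorted(rng.sample(range(len(postings)), keep))
        sampled[term] = [postings[i] for i in chosen]
    return sampled


def sample_random_docids(index, rate, rng):
    """Keep every posting that belongs to a random subset of documents."""
    doc_ids = sorted({doc for postings in index.values() for doc, _ in postings})
    keep = max(1, round(len(doc_ids) * rate))
    kept_docs = set(rng.sample(doc_ids, keep))
    return {
        term: [p for p in postings if p[0] in kept_docs]
        for term, postings in index.items()
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs agree
    for name, sampler in [
        ("random_postings", sample_random_postings),
        ("random_docids", sample_random_docids),
    ]:
        print(name)
        for term, postings in sampler(index, 0.5, rng).items():
            print(f"  {term}: {postings}")
```

In this model, `random_postings` thins every list independently, so a document may survive in one list and disappear from another. `random_docids` treats each document as a unit: a surviving document keeps all of its postings, and a list can become empty if none of its documents are selected.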

## Examples

### Keep ~25% of the postings

```bash
sample_inverted_index \
-c path/to/inverted \
-o path/to/inverted.sampled \
-r 0.25 \
-t random_postings
```

### Keep ~25% of the documents

```bash
sample_inverted_index \
-c path/to/inverted \
-o path/to/inverted.sampled \
-r 0.25 \
-t random_docids
```
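
For experiments that compare several sampling rates, the invocations above can be generated programmatically. The following is a minimal sketch that only builds and prints the command lines. It uses the same `-c`, `-o`, `-r`, and `-t` flags shown in the examples; the helper name `build_sample_command` and the output naming scheme are hypothetical.

```python
import shlex

# The two sampling types documented above.
SAMPLE_TYPES = ("random_postings", "random_docids")


def build_sample_command(collection, output, rate, sample_type):
    """Return the argument list for one sample_inverted_index invocation."""
    if sample_type not in SAMPLE_TYPES:
        raise ValueError(f"unknown sampling type: {sample_type}")
    return [
        "sample_inverted_index",
        "-c", collection,
        "-o", output,
        "-r", str(rate),
        "-t", sample_type,
    ]


# Hypothetical naming scheme: one output path per (type, rate) pair.
commands = [
    build_sample_command(
        "path/to/inverted",
        f"path/to/inverted.sampled.{sample_type}.{rate}",
        rate,
        sample_type,
    )
    for sample_type in SAMPLE_TYPES
    for rate in (0.1, 0.25)
]

if __name__ == "__main__":
    for command in commands:
        # Print a shell-ready line; pass the list to subprocess.run to execute it.
        print(shlex.join(command))
```

Building each command as an argument list rather than a single string avoids quoting problems if the paths contain spaces.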