Merged
37 commits
dd5b162
fix: Unable to run vebbench and cli
XuanYang-cn Jan 10, 2025
0095bd7
enhance: Unify optimize and remove ready_to_load
XuanYang-cn Jan 13, 2025
d9fc5e1
add mongodb client
zhuwenxing Jan 14, 2025
811564a
add mongodb client in readme
zhuwenxing Jan 14, 2025
4f21fcf
add some risk warnings for custom dataset
alwayslove2013 Jan 19, 2025
491ef6b
Bump grpcio from 1.53.0 to 1.53.2 in /install
dependabot[bot] Jan 20, 2025
5eeab7e
add mongodb config
zhuwenxing Jan 14, 2025
111048d
Opensearch interal configuration parameters (#463)
Xavierantony1982 Jan 31, 2025
0756516
ui control num of concurrencies
Caroline-an777 Feb 10, 2025
62454b3
Update README.md
xiaofan-luan Feb 12, 2025
6832120
environs version should <14.1.0
alwayslove2013 Feb 13, 2025
220038e
Support GPU_BRUTE_FORCE index for Milvus (#476)
Rachit-Chaudhary11 Feb 24, 2025
7bda989
Add table quantization type
lucagiac81 Nov 5, 2024
7f50104
Support MariaDB database (#375)
HugoWenTD Mar 11, 2025
b8221d1
Add TiDB backend (#484)
breezewish Mar 13, 2025
dba738b
CLI fix for GPU index (#485)
Rachit-Chaudhary11 Mar 14, 2025
4cbfef7
remove duplicated code
yuyuankang Mar 25, 2025
a39fe83
feat: initial commit
MansorY23 Apr 8, 2025
1446c6e
Add vespa integration
nuvotex-tk Apr 8, 2025
1ab2627
remove redundant empty_field config check for qdrant and tidb
alwayslove2013 Apr 14, 2025
05203c0
reformat all
alwayslove2013 Apr 14, 2025
1a9aa48
fix cli crush
alwayslove2013 Apr 16, 2025
90879f7
downgrade streamlit version
pauvez Apr 17, 2025
1a1ba0d
add more milvus index types: hnsw sq/pq/prq; ivf rabitq
alwayslove2013 Apr 18, 2025
e42845f
add more milvus index types: ivf_pq
alwayslove2013 Apr 23, 2025
7f83936
Add HNSW support for Clickhouse client (#500)
MansorY23 Apr 24, 2025
b7bad93
fix bugs when use custom_dataset without groundtruth file
alwayslove2013 Apr 30, 2025
024455f
fix: prevent the frontend from crashing on invalid indexes in results
s-h-a-d-o-w May 3, 2025
4ef378b
fix ruff warnings
s-h-a-d-o-w May 6, 2025
b1e5cb7
Fix formatting
s-h-a-d-o-w May 6, 2025
617e57e
Add lancedb
s-h-a-d-o-w Apr 26, 2025
029666d
Add --task-label option for cli (#517)
LoveYou3000 May 7, 2025
31b8cbd
Add qdrant cli
s-h-a-d-o-w May 6, 2025
7d8464c
Update README.md
yuyuankang May 12, 2025
975ba84
Fixing Bugs in Benchmarking ClickHouse with vectordbbench (#523)
yuyuankang May 13, 2025
556b703
Add --concurrency-timeout option to avoid long time waiting (#521)
LoveYou3000 May 14, 2025
b893bde
Merge branch 'main-yb' into sync-upstream-main
shaharuk-yb May 15, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,5 +8,7 @@ __MACOSX
.DS_Store
build/
venv/
.venv/
.idea/
results/
results/
logs/
71 changes: 66 additions & 5 deletions README.md
@@ -13,6 +13,8 @@ Closely mimicking real-world production environments, we've set up diverse testi

Prepare to delve into the world of VectorDBBench, and let it guide you in uncovering your perfect vector database match.

VectorDBBench is sponsored by Zilliz, the leading open-source vector database company behind Milvus. Choose smarter with VectorDBBench; start your free test on [Zilliz Cloud](https://zilliz.com/) today!

**Leaderboard:** https://zilliz.com/benchmark
## Quick Start
### Prerequisites
@@ -53,6 +55,8 @@ All the database client supported
| awsopensearch | `pip install vectordb-bench[opensearch]` |
| aliyun_opensearch | `pip install vectordb-bench[aliyun_opensearch]` |
| mongodb | `pip install vectordb-bench[mongodb]` |
| tidb | `pip install vectordb-bench[tidb]` |
| vespa | `pip install vectordb-bench[vespa]` |

### Run

@@ -110,6 +114,10 @@ Options:
--num-concurrency TEXT Comma-separated list of concurrency values
to test during concurrent search [default:
1,10,20]
--concurrency-timeout INTEGER Timeout (in seconds) to wait for a
concurrency slot before failing. Set to a
negative value to wait indefinitely.
[default: 3600]
--user-name TEXT Db username [required]
--password TEXT Db password [required]
--host TEXT Db host [required]
@@ -129,7 +137,11 @@ Options:
--ef-construction INTEGER hnsw ef-construction
--ef-search INTEGER hnsw ef-search
--quantization-type [none|bit|halfvec]
quantization type for vectors
quantization type for vectors (in index)
--table-quantization-type [none|bit|halfvec]
quantization type for vectors (in table). If
equal to bit, the parameter
quantization_type will be set to bit too.
--custom-case-name TEXT Custom case name i.e. PerformanceCase1536D50K
--custom-case-description TEXT Custom name description
--custom-case-load-timeout INTEGER
@@ -153,6 +165,48 @@ Options:
with-gt]
--help Show this message and exit.
```
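The `--concurrency-timeout` semantics described above (a negative value means wait indefinitely for a concurrency slot) map naturally onto a semaphore acquire. A minimal sketch of that interpretation — not the project's actual implementation:

```python
import threading

def acquire_slot(sem: threading.Semaphore, concurrency_timeout: float) -> bool:
    """Try to take a concurrency slot.

    A negative concurrency_timeout means wait indefinitely, mirroring the
    CLI option's documented behavior; otherwise give up after the timeout.
    """
    timeout = None if concurrency_timeout < 0 else concurrency_timeout
    return sem.acquire(timeout=timeout)

sem = threading.Semaphore(1)
print(acquire_slot(sem, -1))    # slot free, taken immediately → True
print(acquire_slot(sem, 0.01))  # slot busy, short timeout expires → False
```

With the default of 3600, a worker that cannot get a slot within an hour fails instead of hanging forever.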

### Run awsopensearch from command line

```shell
vectordbbench awsopensearch --db-label awsopensearch \
--m 16 --ef-construction 256 \
--host search-vector-db-prod-h4f6m4of6x7yp2rz7gdmots7w4.us-west-2.es.amazonaws.com --port 443 \
--user vector --password '<password>' \
--case-type Performance1536D5M --num-insert-workers 10 \
--skip-load --num-concurrency 75
```

To list the options for awsopensearch, execute `vectordbbench awsopensearch --help`.

```text
$ vectordbbench awsopensearch --help
Usage: vectordbbench awsopensearch [OPTIONS]

Options:
# Sharding and Replication
--number-of-shards INTEGER Number of primary shards for the index
--number-of-replicas INTEGER Number of replica copies for each primary
shard
# Indexing Performance
--index-thread-qty INTEGER Thread count for native engine indexing
--index-thread-qty-during-force-merge INTEGER
Thread count during force merge operations
--number-of-indexing-clients INTEGER
Number of concurrent indexing clients
# Index Management
--number-of-segments INTEGER Target number of segments after merging
--refresh-interval TEXT How often to make new data available for
search
--force-merge-enabled BOOLEAN Whether to perform force merge operation
--flush-threshold-size TEXT Size threshold for flushing the transaction
log
# Memory Management
--cb-threshold TEXT k-NN Memory circuit breaker threshold

--help Show this message and exit.
```

#### Using a configuration file

The `vectordbbench` command can optionally read some or all of its options from a YAML-formatted configuration file.
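As a hypothetical illustration only — the key names below mirror the CLI flags shown above, but the real file's schema and key casing may differ, so check the project documentation:

```yaml
# Hypothetical configuration sketch; keys mirror the CLI flags above.
db_label: my-benchmark-run
case_type: Performance1536D50K
num_concurrency: "1,10,20"
concurrency_timeout: 3600
drop_old: true
```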
@@ -218,13 +272,13 @@ pip install -e '.[pinecone]'
```
### Run test server
```
$ python -m vectordb_bench
python -m vectordb_bench
```

OR:

```shell
$ init_bench
init_bench
```

OR:
@@ -241,13 +295,13 @@ After reopening the repository in the container, run `python -m vectordb_bench` in the

### Check coding styles
```shell
$ make lint
make lint
```

To fix the coding styles automatically

```shell
$ make format
make format
```

## How does it work?
@@ -319,6 +373,13 @@ We have strict requirements for the dataset format; please follow them.
- `Folder Path` - The path to the folder containing all the files. Please ensure that all files in the folder are in the `Parquet` format.
- Vectors data files: The file must be named `train.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- Query test vectors: The file must be named `test.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- We recommend limiting the number of test query vectors (e.g., to 1,000).
When running concurrent query tests, VectorDBBench spawns a large number of processes.
To minimize communication overhead during testing, each process is given a complete copy
of the test queries so it can run independently. As the number of concurrent processes
grows, so does the number of copied query vectors, which can place substantial pressure
on memory.
- Ground truth file: The file must be named `neighbors.parquet` and should have two columns: `id` corresponding to query vectors and `neighbors_id` as an array of `int`.

- `Train File Count` - If the vector file is too large, you can consider splitting it into multiple files. The naming format for the split files should be `train-[index]-of-[file_count].parquet`. For example, `train-01-of-10.parquet` represents the second file (0-indexed) among 10 split files.
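The memory-pressure point above can be made concrete with a quick back-of-envelope estimate (a sketch: 4 bytes per float32 component, ignoring Python object overhead):

```python
def query_copy_bytes(num_processes: int, num_queries: int, dim: int) -> int:
    """Rough size of the duplicated float32 query vectors across worker processes."""
    bytes_per_float = 4  # float32
    return num_processes * num_queries * dim * bytes_per_float

# 20 concurrent processes, each holding 1,000 queries of 1,536 dimensions:
gb = query_copy_bytes(20, 1_000, 1536) / 1024**3
print(f"{gb:.2f} GiB")  # → 0.11 GiB
```

With 10,000 queries instead of 1,000, the same run would need roughly ten times as much, which is why capping the test set size matters.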
3 changes: 2 additions & 1 deletion install.py
@@ -1,7 +1,8 @@
import os
import argparse
import os
import subprocess


def docker_tag_base():
return 'vdbbench'

4 changes: 3 additions & 1 deletion install/requirements_py3.11.txt
@@ -1,4 +1,4 @@
grpcio==1.53.0
grpcio==1.53.2
grpcio-tools==1.53.0
qdrant-client
pinecone-client
@@ -22,3 +22,5 @@ environs
pydantic<v2
scikit-learn
pymilvus
clickhouse_connect
pyvespa
12 changes: 11 additions & 1 deletion pyproject.toml
@@ -27,7 +27,7 @@ dependencies = [
"click",
"pytz",
"streamlit-autorefresh",
"streamlit!=1.34.0",
"streamlit<1.44,!=1.34.0", # There is a breaking change in 1.44 related to get_page https://discuss.streamlit.io/t/from-streamlit-source-util-import-get-pages-gone-in-v-1-44-0-need-urgent-help/98399
"streamlit_extras",
"tqdm",
"s3fs",
@@ -68,6 +68,11 @@ all = [
"memorydb",
"alibabacloud_ha3engine_vector",
"alibabacloud_searchengine20211025",
"mariadb",
"PyMySQL",
"clickhouse-connect",
"pyvespa",
"lancedb",
]

qdrant = [ "qdrant-client" ]
@@ -86,6 +91,11 @@ chromadb = [ "chromadb" ]
opensearch = [ "opensearch-py" ]
aliyun_opensearch = [ "alibabacloud_ha3engine_vector", "alibabacloud_searchengine20211025"]
mongodb = [ "pymongo" ]
mariadb = [ "mariadb" ]
tidb = [ "PyMySQL" ]
clickhouse = [ "clickhouse-connect" ]
vespa = [ "pyvespa" ]
lancedb = [ "lancedb" ]

[project.urls]
"repository" = "https://github.com/zilliztech/VectorDBBench"
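The streamlit pin above (`<1.44,!=1.34.0`) excludes one broken release and everything from the 1.44 breaking change onward. A small sketch of checking a version against that constraint without external packages — naive tuple comparison, assuming plain numeric versions (a real check should use the `packaging` library's specifier support):

```python
def allowed(version: str) -> bool:
    """Check streamlit's '<1.44,!=1.34.0' pin for simple numeric versions."""
    parts = tuple(int(p) for p in version.split("."))
    return parts != (1, 34, 0) and parts < (1, 44)

print(allowed("1.43.2"))  # → True  (below 1.44, not the excluded release)
print(allowed("1.34.0"))  # → False (explicitly excluded)
print(allowed("1.44.0"))  # → False (breaking get_pages change)
```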
4 changes: 3 additions & 1 deletion vectordb_bench/__init__.py
@@ -6,7 +6,7 @@
from . import log_util

env = environs.Env()
env.read_env(".env", False)
env.read_env(path=".env", recurse=False)


class config:
@@ -52,6 +52,8 @@ class config:

CONCURRENCY_DURATION = 30

CONCURRENCY_TIMEOUT = 3600

RESULTS_LOCAL_DIR = env.path(
"RESULTS_LOCAL_DIR",
pathlib.Path(__file__).parent.joinpath("results"),
86 changes: 83 additions & 3 deletions vectordb_bench/backend/clients/__init__.py
@@ -38,12 +38,17 @@ class DB(Enum):
Chroma = "Chroma"
AWSOpenSearch = "OpenSearch"
AliyunElasticsearch = "AliyunElasticsearch"
MariaDB = "MariaDB"
Test = "test"
AliyunOpenSearch = "AliyunOpenSearch"
MongoDB = "MongoDB"
TiDB = "TiDB"
Clickhouse = "Clickhouse"
Vespa = "Vespa"
LanceDB = "LanceDB"

@property
def init_cls(self) -> type[VectorDB]: # noqa: PLR0911, PLR0912, C901
def init_cls(self) -> type[VectorDB]: # noqa: PLR0911, PLR0912, C901, PLR0915
"""Import while in use"""
if self == DB.Milvus:
from .milvus.milvus import Milvus
@@ -115,6 +120,11 @@ def init_cls(self) -> type[VectorDB]:  # noqa: PLR0911, PLR0912, C901

return AWSOpenSearch

if self == DB.Clickhouse:
from .clickhouse.clickhouse import Clickhouse

return Clickhouse

if self == DB.AlloyDB:
from .alloydb.alloydb import AlloyDB

@@ -135,16 +145,36 @@

return MongoDB

if self == DB.MariaDB:
from .mariadb.mariadb import MariaDB

return MariaDB

if self == DB.TiDB:
from .tidb.tidb import TiDB

return TiDB

if self == DB.Test:
from .test.test import Test

return Test

if self == DB.Vespa:
from .vespa.vespa import Vespa

return Vespa

if self == DB.LanceDB:
from .lancedb.lancedb import LanceDB

return LanceDB

msg = f"Unknown DB: {self.name}"
raise ValueError(msg)

@property
def config_cls(self) -> type[DBConfig]: # noqa: PLR0911, PLR0912, C901
def config_cls(self) -> type[DBConfig]: # noqa: PLR0911, PLR0912, C901, PLR0915
"""Import while in use"""
if self == DB.Milvus:
from .milvus.config import MilvusConfig
@@ -216,6 +246,11 @@ def config_cls(self) -> type[DBConfig]:  # noqa: PLR0911, PLR0912, C901

return AWSOpenSearchConfig

if self == DB.Clickhouse:
from .clickhouse.config import ClickhouseConfig

return ClickhouseConfig

if self == DB.AlloyDB:
from .alloydb.config import AlloyDBConfig

@@ -236,15 +271,35 @@ def config_cls(self) -> type[DBConfig]:  # noqa: PLR0911, PLR0912, C901

return MongoDBConfig

if self == DB.MariaDB:
from .mariadb.config import MariaDBConfig

return MariaDBConfig

if self == DB.TiDB:
from .tidb.config import TiDBConfig

return TiDBConfig

if self == DB.Test:
from .test.config import TestConfig

return TestConfig

if self == DB.Vespa:
from .vespa.config import VespaConfig

return VespaConfig

if self == DB.LanceDB:
from .lancedb.config import LanceDBConfig

return LanceDBConfig

msg = f"Unknown DB: {self.name}"
raise ValueError(msg)

def case_config_cls( # noqa: PLR0911
def case_config_cls( # noqa: C901, PLR0911, PLR0912
self,
index_type: IndexType | None = None,
) -> type[DBCaseConfig]:
@@ -288,6 +343,11 @@ def case_config_cls( # noqa: PLR0911

return AWSOpenSearchIndexConfig

if self == DB.Clickhouse:
from .clickhouse.config import ClickhouseHNSWConfig

return ClickhouseHNSWConfig

if self == DB.PgVectorScale:
from .pgvectorscale.config import _pgvectorscale_case_config

@@ -318,6 +378,26 @@ def case_config_cls( # noqa: PLR0911

return MongoDBIndexConfig

if self == DB.MariaDB:
from .mariadb.config import _mariadb_case_config

return _mariadb_case_config.get(index_type)

if self == DB.TiDB:
from .tidb.config import TiDBIndexConfig

return TiDBIndexConfig

if self == DB.Vespa:
from .vespa.config import VespaHNSWConfig

return VespaHNSWConfig

if self == DB.LanceDB:
from .lancedb.config import _lancedb_case_config

return _lancedb_case_config.get(index_type)

# DB.Pinecone, DB.Chroma, DB.Redis
return EmptyDBCaseConfig
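The pattern in this file — an enum whose properties import the client module only when that backend is selected ("import while in use") — keeps optional dependencies optional: installing `vectordb-bench[tidb]` alone never triggers, say, a `pyvespa` import. A minimal stand-in sketch using stdlib modules (the `Backend` name and members are illustrative, not the project's):

```python
import importlib
from enum import Enum

class Backend(Enum):
    # Stand-ins for real client backends; each value is a module name
    # that is imported only when the backend is actually used.
    JSON = "json"
    CSV = "csv"

    @property
    def client_module(self):
        """Import while in use, mirroring DB.init_cls / config_cls above."""
        return importlib.import_module(self.value)

print(Backend.JSON.client_module.dumps({"k": 1}))  # → {"k": 1}
```

Each new backend in this PR (MariaDB, TiDB, Clickhouse, Vespa, LanceDB) adds one enum member plus a guarded import in each property, which is also why the `noqa` complexity suppressions had to grow.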
