omop-emb

Vector embedding layer for OMOP CDM concepts.

omop-emb generates, stores, and retrieves embeddings for OMOP concepts. It works out of the box with sqlite-vec (no external database required) and scales to PostgreSQL/pgvector for larger deployments. The database is the source of truth; FAISS is an optional read-acceleration sidecar, not a primary store.

Installation

pip install omop-emb                         # sqlite-vec backend (default, no extras needed)
pip install "omop-emb[pgvector]"             # adds PostgreSQL/pgvector support
pip install "omop-emb[faiss-cpu]"            # adds FAISS sidecar support
pip install "omop-emb[pgvector,faiss-cpu]"   # everything

Quick start

Ingest concepts (sqlite-vec, no external service):

export OMOP_EMB_BACKEND=sqlitevec
export OMOP_EMB_SQLITE_PATH=/data/omop_emb.db
export OMOP_CDM_DB_URL=postgresql+psycopg://user:pass@host:5432/omop_cdm

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --model nomic-embed-text:v1.5

Search:

omop-emb embeddings search --api-base http://localhost:11434/v1 --api-key ollama \
    --model nomic-embed-text:v1.5 \
    --query "hypertension" --query "type 2 diabetes" \
    --standard-only --domain Condition --k 5
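Conceptually, a search like the one above embeds each query, optionally narrows the candidate concepts (here by domain and standard status), and returns the k most similar vectors. The following is a minimal brute-force sketch of that idea in plain Python, not omop-emb's implementation: the `top_k` helper, the record layout, and the toy 2-D vectors and concept IDs are made up for illustration, and cosine similarity is assumed as the metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, concepts, k, domain=None, standard_only=False):
    """Filter concepts first, then rank the survivors by similarity.

    `concepts` is a list of dicts with hypothetical keys:
    concept_id, domain, standard (bool), vector (list of floats).
    """
    candidates = [
        c for c in concepts
        if (domain is None or c["domain"] == domain)
        and (not standard_only or c["standard"])
    ]
    scored = [(cosine_similarity(query_vec, c["vector"]), c["concept_id"])
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [concept_id for _, concept_id in scored[:k]]


# Toy data: IDs and vectors are illustrative, not real OMOP concepts.
concepts = [
    {"concept_id": 1, "domain": "Condition", "standard": True,  "vector": [1.0, 0.0]},
    {"concept_id": 2, "domain": "Condition", "standard": False, "vector": [0.9, 0.1]},
    {"concept_id": 3, "domain": "Condition", "standard": True,  "vector": [0.0, 1.0]},
    {"concept_id": 4, "domain": "Drug",      "standard": True,  "vector": [1.0, 0.05]},
]

result = top_k([1.0, 0.0], concepts, k=2, domain="Condition", standard_only=True)
print(result)
```

Filtering before ranking means k results are still returned when close matches are excluded by the filter, which is the practical benefit of doing concept filtering inside the database rather than on the returned hits.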

pgvector with HNSW index:

export OMOP_EMB_BACKEND=pgvector
export OMOP_EMB_DB_HOST=localhost
export OMOP_EMB_DB_USER=omop_emb
export OMOP_EMB_DB_PASSWORD=omop_emb
export OMOP_EMB_DB_NAME=omop_emb

omop-emb embeddings add-embeddings --api-base http://localhost:11434/v1 --api-key ollama \
    --model nomic-embed-text:v1.5
omop-emb maintenance rebuild-index --model nomic-embed-text:v1.5 --index-type hnsw --metric-type cosine
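For orientation, pgvector creates an HNSW index with a `CREATE INDEX ... USING hnsw` statement whose operator class selects the distance metric (`vector_cosine_ops` for cosine). The sketch below only builds such a statement as a string. The table name, column name, and index-naming scheme are hypothetical, the `l2` and `ip` keys are illustrative labels rather than documented `--metric-type` values, and whether `rebuild-index` issues exactly this SQL is an assumption.

```python
# pgvector operator classes for each distance metric (per the pgvector docs).
OPCLASS_BY_METRIC = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "ip": "vector_ip_ops",
}


def hnsw_index_sql(table, column, metric="cosine"):
    """Build a pgvector HNSW index statement.

    `table` and `column` are interpolated directly, so they must come
    from trusted configuration, never from user input.
    """
    try:
        opclass = OPCLASS_BY_METRIC[metric]
    except KeyError:
        raise ValueError(f"unsupported metric: {metric!r}") from None
    index_name = f"{table}_{column}_hnsw_idx"  # hypothetical naming scheme
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING hnsw ({column} {opclass});"
    )


# Hypothetical table and column names, for illustration only.
sql = hnsw_index_sql("concept_embedding", "embedding", metric="cosine")
print(sql)
```

The index metric should match the metric used at query time, since pgvector only uses an index when the query's distance operator corresponds to the index's operator class.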

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| OMOP_EMB_BACKEND | sqlitevec | Backend: sqlitevec or pgvector. |
| OMOP_EMB_SQLITE_PATH | | sqlite-vec database file path (or :memory:). |
| OMOP_EMB_DB_HOST | | pgvector: PostgreSQL host. |
| OMOP_EMB_DB_PORT | 5432 | pgvector: PostgreSQL port. |
| OMOP_EMB_DB_USER | | pgvector: database user. |
| OMOP_EMB_DB_PASSWORD | | pgvector: database password. |
| OMOP_EMB_DB_NAME | | pgvector: database name. |
| OMOP_EMB_DB_URL | | pgvector: full SQLAlchemy URL (overrides individual vars). |
| OMOP_CDM_DB_URL | | OMOP CDM connection (required for ingestion commands only). |
| OMOP_EMB_FAISS_CACHE_DIR | | Default FAISS cache directory (alternative to --faiss-cache-dir). |

See the Configuration Reference for the complete list including asymmetric embedding prefixes and driver overrides.
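The precedence described for the pgvector variables (a full `OMOP_EMB_DB_URL` wins, otherwise the URL is assembled from the individual variables with port 5432 as the default) can be sketched as follows. This is an illustration of the rule, not omop-emb's own code: the `resolve_pgvector_url` helper is hypothetical, and the `postgresql+psycopg` driver prefix is an assumption borrowed from the `OMOP_CDM_DB_URL` example in the quick start.

```python
import os
from urllib.parse import quote_plus

REQUIRED_VARS = (
    "OMOP_EMB_DB_HOST",
    "OMOP_EMB_DB_USER",
    "OMOP_EMB_DB_PASSWORD",
    "OMOP_EMB_DB_NAME",
)


def resolve_pgvector_url(env=None):
    """Return a SQLAlchemy-style URL for the pgvector backend.

    A full OMOP_EMB_DB_URL takes precedence; otherwise the URL is built
    from the individual OMOP_EMB_DB_* variables.
    """
    env = os.environ if env is None else env

    full_url = env.get("OMOP_EMB_DB_URL")
    if full_url:
        return full_url

    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    # Escape credentials so characters such as '@' or ':' do not break the URL.
    user = quote_plus(env["OMOP_EMB_DB_USER"])
    password = quote_plus(env["OMOP_EMB_DB_PASSWORD"])
    host = env["OMOP_EMB_DB_HOST"]
    port = env.get("OMOP_EMB_DB_PORT", "5432")
    name = env["OMOP_EMB_DB_NAME"]
    # Driver prefix is an assumption; adjust to the driver actually installed.
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{name}"


# Values taken from the pgvector quick-start example.
example_env = {
    "OMOP_EMB_DB_HOST": "localhost",
    "OMOP_EMB_DB_USER": "omop_emb",
    "OMOP_EMB_DB_PASSWORD": "omop_emb",
    "OMOP_EMB_DB_NAME": "omop_emb",
}
print(resolve_pgvector_url(example_env))
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test without touching the real environment.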

Documentation

Full documentation: https://AustralianCancerDataNetwork.github.io/omop-emb

Roadmap

  • sqlite-vec backend (default, zero-config)
  • pgvector backend (PostgreSQL)
  • HNSW index support for pgvector
  • FAISS sidecar (approximate nearest-neighbour read acceleration)
  • FAISS export / import CLI (export-faiss-cache, import-faiss-cache)
  • In-DB concept filtering (domain, vocabulary, standard status, active status)
  • Transparent FAISS fast path in EmbeddingReaderInterface
  • Extensive backend and registry testing
  • FAISS GPU support
  • pgvectorscale support
  • Vector quantisation for more efficient storage
