DF Embedder is a blazing-fast Python library (with a Rust backend) that embeds and indexes your dataframes, turning them into fast vector stores based on the Lance format in a few lines of code. It is aimed at use cases in which you have a dataframe with textual data that you want to embed and load into a vector database in order to run vector search. It is opinionated and specifically designed to handle huge tables that need to be embedded fast. It is fast and efficient, but it uses its own embedding model and textual representation method. Read on for the details.
(Requires Python >= 3.10)
```bash
pip install dfembed
```

```python
from dfembed import DfEmbedder
import polars as pl

# read a dataset using polars or pandas
df = pl.read_csv("tmdb.csv")
# turn into an arrow dataset
arrow_table = df.to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")
# embed and index the dataframe to a lance table
embedder.index_table(arrow_table, table_name="films_table")
# run similarity queries
similar_movies = embedder.find_similar("adventures jungle animals", "films_table", 10)
```

See more usage examples in the notebook here.
DfEmbedder is still an early version and a work in progress. Feedback and comments are highly appreciated.
- Rust Backend: For blazing-fast, multi-threaded embedding and indexing.
- Apache Arrow: To seamlessly work with data from libraries like Polars and Pandas (via PyArrow); a pandas example follows this list.
- Static Embeddings: Uses an efficient static embedding model that generates text embeddings roughly 100x faster.
- Lance Format: For optimized storage and fast vector similarity searches.
- PyO3: To provide a clean and easy-to-use Python API.
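Because DfEmbedder consumes Arrow tables, any library that can produce one works. As a minimal sketch, here is how the pandas path might look via PyArrow (the file name and table name are placeholders):

```python
import pandas as pd
import pyarrow as pa
from dfembed import DfEmbedder

# load data with pandas instead of polars
df = pd.read_csv("tmdb.csv")
# convert the DataFrame to a PyArrow Table
arrow_table = pa.Table.from_pandas(df)

embedder = DfEmbedder(database_name="tmdb_db")
embedder.index_table(arrow_table, table_name="films_table")
```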
How fast is DF Embedder? Benchmarks are often misleading, and users should run their own analysis. To give a general idea, I was able to index about 1.2M rows from the TMDB movie dataset in about 100 seconds, using a machine with 10 CPU cores. That's reading, embedding, indexing, and writing more than 10K rows per second. And there are still ways to improve its performance by further tuning its parameters.
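If you want to measure throughput on your own data, a simple wall-clock timing around `index_table` is enough; this is a minimal sketch (file and table names are placeholders):

```python
import time
import polars as pl
from dfembed import DfEmbedder

arrow_table = pl.read_csv("tmdb.csv").to_arrow()
embedder = DfEmbedder(database_name="tmdb_db")

# time the full read-embed-index-write pipeline
start = time.perf_counter()
embedder.index_table(arrow_table, table_name="films_table")
elapsed = time.perf_counter() - start
print(f"{arrow_table.num_rows / elapsed:,.0f} rows/sec")
```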
There are quite a few tabular data embedding methods (e.g., TabNet, TABBIE). However, many of them assume that the data has a specific structure and type, whereas RAG use cases involve embedding unstructured free-text queries into the same vector space. To resolve this, I followed an approach similar to the one taken by Koloski et al., with some minor changes in order to be agnostic to the field type. Accordingly, indexing a dataframe using DfEmbedder starts by representing each row in the dataframe as a string that follows the format: `col0_name is col0_value; col1_name is col1_value` (Koloski et al. were working with a known schema and thus offered a more "typed" approach). Next, all strings are embedded using a static embedding model (an embedding method that can generate embeddings on CPU at blazing speed with very little loss of quality). Finally, the data is written as a table in Lance format.
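To make the textual representation concrete, here is a sketch of the format in plain Python (this illustrates the scheme described above, not DfEmbedder's internal Rust code):

```python
import pyarrow as pa

def row_to_string(table: pa.Table, row_idx: int) -> str:
    """Render one row as 'col0_name is col0_value; col1_name is col1_value; ...'."""
    return "; ".join(
        f"{name} is {table.column(name)[row_idx].as_py()}"
        for name in table.column_names
    )

table = pa.table({"title": ["Jumanji"], "genre": ["Adventure"]})
print(row_to_string(table, 0))  # -> "title is Jumanji; genre is Adventure"
```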
There are several ways to search and query Lance tables created using DfEmbedder:

- You can use DfEmbedder's `find_similar` method.
- You can use LanceDB:
```python
import lancedb

db = lancedb.connect("tmdb_db")
tbl = db.open_table("films_table")
# you need the embedder to embed the query text
text = "adventures jungle animals"
vector = embedder.embed_string(text)
# run a vector search
tbl.search(vector).limit(10).to_list()
```

- You can use its LlamaIndex `VectorStore` interface:
```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.embeddings import MockEmbedding
from dfembed import DfEmbedder, DfEmbedVectorStore

# because we use our own embedding model, LlamaIndex only needs a
# mock embedding with the matching dimension
Settings.embed_model = MockEmbedding(embed_dim=1024)

# embedder and table_name as created above; llm is your configured LLM
vector_store = DfEmbedVectorStore(
    df_embedder=embedder,
    table_name=table_name,
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine(similarity_top_k=5, llm=llm)
```

See more usage examples in the notebook here.
The DfEmbedder constructor accepts the following parameters:
- `num_threads` (default: CPU count): Number of parallel worker threads used for embedding. Setting this to the number of available CPU cores typically gives the best performance.
- `embedding_chunk_size` (default: 500): Number of records to process in each embedding batch. Larger values may improve throughput but require more memory.
- `write_buffer_size` (default: 2000): Number of embeddings to buffer before writing to storage. Increasing this reduces the number of write operations, potentially improving performance for large datasets.
- `database_name` (default: `"./lance_db"`): Path to the Lance database directory where tables will be stored.
- `table_name` (default: `"embeddings"`): Default name for tables created in the database. Can be overridden in `index_table()`.
- `vector_dim` (default: 1024): Dimensionality of the embedding vectors produced by the static embedder. Please keep the default in this version.
```python
import polars as pl  # could also use Pandas or DuckDB
import pyarrow as pa  # not used directly here, but DfEmbedder consumes PyArrow tables
from dfembed import DfEmbedder

# load data from a CSV using Polars
df = pl.read_csv("tmdb.csv")
# transform to PyArrow Table format
arrow_table = df.to_arrow()

# configure the database path and optional performance params
embedder = DfEmbedder(
    num_threads=8,            # use 8 threads for embedding (defaults to the available core count)
    write_buffer_size=3500,   # buffer 3500 embeddings before writing
    database_name="tmdb_db",  # path to the Lance database directory
)

table_name = "tmdb_table"
embedder.index_table(arrow_table, table_name=table_name)

# get the 10 most similar items
query = "adventures jungle animals"
results = embedder.find_similar(query=query, table_name=table_name, k=10)
```
- `index_table(table, table_name=None)`: Embeds and indexes an Arrow table.
  - `table`: A PyArrow Table object containing the data to index.
  - `table_name`: Name for the created Lance table. If `None`, uses the default name from the constructor.
- `find_similar(query, table_name, k)`: Performs semantic search for similar items. Returns a list of the `k` most similar text records.
  - `query`: String query to search for.
  - `table_name`: Name of the Lance table to search in.
  - `k`: Number of results to return.
- `embed_string(text)`: Directly access the static embedder to encode a single string. Returns a vector of floats (the embedding); see the sketch after this list.
  - `text`: String to embed.
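As a quick sanity check, you can embed a single string and inspect its dimensionality (this assumes the returned vector is a Python list or array and that `vector_dim` is left at its default of 1024):

```python
from dfembed import DfEmbedder

embedder = DfEmbedder(database_name="tmdb_db")
vector = embedder.embed_string("adventures jungle animals")
print(len(vector))  # 1024 with the default vector_dim
```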
- For large datasets, increase `write_buffer_size` to reduce write operations.
- Adjust `embedding_chunk_size` based on your available memory and dataset characteristics.
- The `num_threads` parameter should typically match your CPU core count for optimal performance (a tuned example follows this list).
- For production use, consider using a fast SSD for the database storage location.
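Putting those tips together, a configuration tuned for a large dataset on a many-core machine might look like this (the values and the SSD path are illustrative, not recommendations):

```python
import os
from dfembed import DfEmbedder

embedder = DfEmbedder(
    num_threads=os.cpu_count(),   # match the available core count
    embedding_chunk_size=1000,    # larger batches if memory allows
    write_buffer_size=5000,       # fewer, larger writes for big datasets
    database_name="/mnt/fast_ssd/tmdb_db",  # fast SSD storage location
)
```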
MIT
To release a new version:
- Update the version in `Cargo.toml` and `pyproject.toml`
- Commit and push your changes
- Create and push a new tag with the format `v{version}` (e.g., `v0.1.2`):

  ```bash
  git tag v0.1.2
  git push origin v0.1.2
  ```

- The GitHub Actions workflow will automatically build wheels and publish them to PyPI
You can also manually trigger the build workflow from the GitHub Actions tab in your repository.
