Skip to content

gjeltep/benchtop-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Benchtop Data

A few ideas have been of interest recently, and wanted to explore a bit:

  • what's possible to set up locally for analyzing credit transactions?
  • small (~10B) parameter models can certainly provide interesting behavior - what sort of mileage can they provide?
  • generalizing specific need to a simple E2E data analysis pipeline

Those explorations ultimately yielded a fairly general-purpose, local AI data analysis pipeline capable of processing datasets, validating them against schemas, storing them in DuckDB, capturing semantic embeddings, and offering a natural language querying interface. And - should you choose - without token burn!

Features

  • Schema-based validation: Define datasets using YAML/JSON schemas
  • DuckDB storage: Fast, local SQL database for structured data
  • Semantic embeddings: Automatic text generation and Chroma vector storage
  • Natural language queries: Ask questions using LlamaIndex + Ollama (I've been defaulting to 8B Mistral)

Installation

Setup

  1. Install dependencies:
uv pip install -e .
  1. Verify Ollama is running:
ollama list

Make sure you've pulled the model of your choice.

Quick Start

Take a look in examples/usage.py to get going.

Web UI (Streamlit)

The easiest way to use the pipeline is through the Streamlit web interface:

uv run streamlit run src/data_pipeline/ui.py

Or after installation:

streamlit-ui

CLI Usage

For command-line usage:

# Process a dataset
data-pipeline examples/products.csv examples/products_schema.yaml \
    --db-path data.db \
    --chroma-path chroma_db

# Process and ask a question
data-pipeline examples/products.csv examples/products_schema.yaml \
    --query "What products are out of stock?"

Schema Format

Schemas support:

  • Field types: string, integer, float, boolean, date, datetime
  • Constraints: nullable, primary_key
  • Metadata: description for semantic meaning
  • Relationships: Foreign key relationships between tables

License

MIT

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages