lea is a simple SQL runner. You write SQL scripts, organize them in folders, and lea takes care of the rest: it figures out the dependency order, runs them concurrently, and materializes the results as tables in your data warehouse.
Think of it as a minimalist alternative to dbt or SQLMesh. We use it every day at Carbonfact to manage our BigQuery data warehouse.
You organize SQL scripts in a `scripts/` directory:

```
scripts/
    staging/
        customers.sql
        orders.sql
    core/
        revenue.sql
```
Each script is a SELECT statement:

```sql
-- scripts/core/revenue.sql
SELECT
    customers.name,
    SUM(orders.amount) AS total
FROM staging.customers
JOIN staging.orders ON orders.customer_id = customers.id
GROUP BY 1
```

Then you run them:

```sh
lea run
```

lea parses the SQL, sees that `core.revenue` depends on `staging.customers` and `staging.orders`, and runs them in the right order. Each script becomes a table in your warehouse.
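Under the hood this amounts to a topological sort of the dependency graph. As a rough illustration in plain Python (this is not lea's internal code):

```python
from graphlib import TopologicalSorter

# The dependency graph lea would infer from the scripts above:
# each key maps a script to the tables it reads from.
graph = {
    "core.revenue": {"staging.customers", "staging.orders"},
    "staging.customers": set(),
    "staging.orders": set(),
}

# static_order() yields each script only after all of its dependencies
order = list(TopologicalSorter(graph).static_order())
print(order)  # the staging scripts come first, core.revenue last
```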
- TLDR
- Table of contents
- Installation
- Configuration
- Usage
- Warehouse specific features
- Examples
- Contributing
- License
## Installation

```sh
pip install lea-cli
```

This installs the `lea` command. It also makes the `lea` Python library available.
## Configuration

lea is configured via environment variables. They can be provided in a `.env` file, or directly in the shell. Each command also has an `--env` flag to provide a path to a `.env` file.
### BigQuery

```sh
# Required
LEA_WAREHOUSE=bigquery
# Required
LEA_BQ_LOCATION=EU
# Required
LEA_BQ_DATASET_NAME=kaya
# Required, the project where the dataset is located
LEA_BQ_PROJECT_ID=carbonfact-dwh
# Optional, allows using a different project for compute
LEA_BQ_COMPUTE_PROJECT_ID=carbonfact-dwh-compute
# Not necessary if you're logged in with the gcloud CLI
LEA_BQ_SERVICE_ACCOUNT=<JSON dump of the service account file> # not a path
# Defaults to https://www.googleapis.com/auth/bigquery
LEA_BQ_SCOPES=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
# LOGICAL or PHYSICAL, defaults to PHYSICAL
LEA_BQ_STORAGE_BILLING_MODEL=PHYSICAL
```

### DuckDB

```sh
# Required
LEA_WAREHOUSE=duckdb
# Required
LEA_DUCKDB_PATH=duckdb.db
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```

### MotherDuck

```sh
# Required
LEA_WAREHOUSE=motherduck
# Required
MOTHERDUCK_TOKEN=<get this from https://app.motherduck.com/settings/tokens>
# Required
LEA_MOTHERDUCK_DATABASE=bike_sharing
# Optional
LEA_DUCKDB_EXTENSIONS=parquet,httpfs
```

### DuckLake

DuckLake needs a database to manage metadata, which is what `LEA_DUCKLAKE_CATALOG_DATABASE` is for.

Local storage:

```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=/path/to/data
```

S3:

```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=s3://my-bucket/data
LEA_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
```

GCS (requires HMAC keys):

```sh
LEA_WAREHOUSE=ducklake
LEA_DUCKLAKE_CATALOG_DATABASE=metadata.ducklake
LEA_DUCKLAKE_DATA_PATH=gcs://my-bucket/data
LEA_DUCKLAKE_GCS_KEY_ID=GOOG1E...
LEA_DUCKLAKE_GCS_SECRET=...
```

## Usage

By default, `lea run` runs all scripts. You can select specific scripts:
```sh
lea run --select core.revenue
lea run --select core.revenue --select core.users
```

You can select all scripts in a schema:
```sh
lea run --select core/
```

Graph operators let you include dependencies or dependents:
```sh
lea run --select core.revenue+  # revenue and everything that depends on it
lea run --select +core.revenue  # revenue and everything it depends on
lea run --select +core.revenue+ # both directions
```

You can also select scripts that have been modified in the current Git branch:
```sh
lea run --select git
lea run --select git+  # modified scripts and their dependents
```

This is handy in continuous integration.
You can exclude scripts with --unselect:
```sh
lea run --unselect tests/
```

By default, lea creates a development environment isolated from production by appending `_<user>` to dataset names. Use `--production` to target the production environment:
```sh
lea run --production
```

The `<user>` is determined from the login name. It can be overridden with the `LEA_USERNAME` environment variable. For instance, with `LEA_BQ_DATASET_NAME=kaya` and a login name of `max`, development runs write to a `kaya_max` dataset, while `--production` runs write to `kaya`.
SQL queries can be templated with Jinja. Use a `.sql.jinja` extension.

You have access to `env` (i.e. `os.environ`) and a `load_yaml` function:
```sql
{% set taxonomy = load_yaml('core/taxonomies/product.yaml') %}
SELECT
    {% for dim in taxonomy.dimensions %}
    MAX(IF(key = '{{ dim.key }}', value, NULL)) AS {{ dim.column }},
    {% endfor %}
    account_slug
FROM core.raw_attributes
```

There is no separate `lea test` command. Tests run alongside regular scripts during `lea run`. The run stops whenever a test fails.
There are two types of tests:

- Singular tests are queries stored in a `tests/` directory. They fail if they return any rows.
- Assertion tests are comment annotations in the queries themselves:
```sql
SELECT
    -- #UNIQUE
    -- #NO_NULLS
    user_id,
    -- #NO_NULLS
    address,
    -- #UNIQUE_BY(address)
    full_name,
    -- #SET{'A', 'B', 'AB', 'O'}
    blood_type
FROM core.users
```

Available tags: `#NO_NULLS`, `#UNIQUE`, `#UNIQUE_BY(<by>)`, `#SET{<elements>}`.
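As for singular tests, a query like the following (the file name and tables here are hypothetical) would fail whenever an order points at a missing customer:

```sql
-- tests/orders_have_customers.sql (hypothetical example)
-- Returning any row makes the test fail
SELECT orders.customer_id
FROM staging.orders
LEFT JOIN staging.customers ON orders.customer_id = customers.id
WHERE customers.id IS NULL
```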
When you run a script, all applicable tests are run as well.
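The exact SQL lea generates is up to lea, but conceptually the `#UNIQUE` tag on `user_id` above corresponds to a check along these lines:

```sql
-- Hypothetical uniqueness check derived from the #UNIQUE tag
SELECT user_id, COUNT(*) AS n
FROM core.users
GROUP BY user_id
HAVING COUNT(*) > 1  -- any row returned means the test fails
```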
lea follows the WAP (Write-Audit-Publish) pattern by default. When you execute `lea run`, scripts are first materialized into temporary `___audit` tables. These are promoted to replace the real tables only once everything has succeeded without errors.

This prevents partial updates. If script `bar` depends on `foo` and `bar` fails, `foo` is not modified either.
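As a sketch of the flow (not lea's literal SQL), using the revenue script from earlier:

```sql
-- Write: materialize the script into an audit table
CREATE OR REPLACE TABLE core.revenue___audit AS
SELECT ...;  -- the script's SELECT statement

-- Audit: run the applicable tests against the audit table

-- Publish: once every script and test has succeeded,
-- promote each audit table to replace the real one
CREATE OR REPLACE TABLE core.revenue AS
SELECT * FROM core.revenue___audit;
```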
lea doesn't re-run scripts when the audit table already exists and the script hasn't been modified since. This avoids unnecessary work during development. You can disable this with:
```sh
lea run --restart
```

Quack mode runs your scripts locally with DuckDB instead of your cloud warehouse. This makes local iteration much faster and doesn't incur any cloud costs.
```sh
lea run --select core.users --quack
```

lea automatically pulls the necessary upstream tables from your warehouse into a DuckLake instance, and only pulls what's missing. SQL is transpiled to DuckDB automatically.
You'll need to configure a DuckLake instance for storage, in addition to your regular warehouse configuration.
Local storage:

```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=/path/to/quack/data
```

S3:

```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=s3://my-bucket/quack/data
LEA_QUACK_DUCKLAKE_S3_ENDPOINT=storage.googleapis.com
```

GCS (requires HMAC keys):

```sh
LEA_QUACK_DUCKLAKE_CATALOG_DATABASE=quack.ducklake
LEA_QUACK_DUCKLAKE_DATA_PATH=gcs://my-bucket/quack/data
LEA_QUACK_DUCKLAKE_GCS_KEY_ID=GOOG1E...
LEA_QUACK_DUCKLAKE_GCS_SECRET=...
```

You can push the DuckLake tables back to your warehouse with `--quack-push`:
```sh
lea run --quack --quack-push
```

To inspect the resulting DuckLake tables in a browser, use the built-in UI command:
```sh
lea quack-ui
```

This opens the DuckDB UI with your DuckLake catalog attached, so you can browse the tables easily.
"Quack mode" is mostly intended for cheap local development with a rapid feedback loop. But you could also run your production refreshes with DuckDB and upload the artifacts to your warehouse with --quack-push. This can be a great way to save on compute costs.
## Warehouse specific features

lea can automatically cluster tables that contain given fields:
```sh
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug
LEA_BQ_DEFAULT_CLUSTERING_FIELDS=account_slug,brand_slug
```

For each table, lea uses whichever configured clustering fields are present and ignores the rest.
You can also define clustering fields for a specific table:
```sql
SELECT
    account_slug,
    -- #CLUSTERING_FIELD
    object_kind,
    value
FROM my_table
```

Table-specific clustering fields are added in addition to the default ones.
You can route specific scripts to different compute projects:
```sh
LEA_BQ_SCRIPT_SPECIFIC_COMPUTE_PROJECT_IDS={"dataset.schema__table": "reservation-project-id"}
```

Scripts not listed use the default `LEA_BQ_COMPUTE_PROJECT_ID`.
Big Blue provides a Pick API that suggests whether to run a query on-demand or on a reservation. lea supports this out of the box:
```sh
LEA_BQ_BIG_BLUE_PICK_API_KEY=<get from https://your-company.biq.blue/settings.html>
LEA_BQ_BIG_BLUE_PICK_API_URL=https://pick.biq.blue
LEA_BQ_BIG_BLUE_PICK_API_ON_DEMAND_PROJECT_ID=on-demand-compute-project-id
LEA_BQ_BIG_BLUE_PICK_API_REVERVATION_PROJECT_ID=reservation-compute-project-id
```

## Contributing

Feel free to reach out to max@carbonfact.com if you want to know more and/or contribute.
We have suggested some issues as good places to get started.
## License

lea is free and open-source software licensed under the Apache License, Version 2.0.