GPU-accelerated DataFrames — Python ergonomics, Mojo speed, every GPU.
MXFrame is a DataFrame query engine that pairs a Polars-style Python API with pre-compiled Mojo AOT kernels. The same code runs on NVIDIA, AMD, and Apple Silicon — no CUDA required, no JIT compilation at query time.
| | pandas | Polars | cuDF (RAPIDS) | MXFrame |
|---|---|---|---|---|
| GPU support | ❌ | ❌ | ✅ NVIDIA only | ✅ Any GPU |
| Compiled kernels | ❌ | ✅ Rust | ✅ CUDA | ✅ Mojo AOT |
| Install complexity | pip | pip | CUDA + Rapids stack | pixi install |
| TPC-H competitive | ❌ | ✅ | ✅ | ✅ |
| Cross-vendor | ❌ | ❌ | ❌ | ✅ NVIDIA/AMD/Apple |
MXFrame is the cuDF architecture without the CUDA lock-in.
The kernels are compiled once to a .so at build time — loaded in ~1 ms at process start,
then pure dispatch on every query. No per-query JIT tax.
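To make the dispatch model concrete, the following is a rough sketch of how an AOT kernel can be called through ctypes. The library path, the `group_sum_f32` symbol, and its argument layout are illustrative assumptions, not MXFrame's actual kernel ABI.

```python
import ctypes
import numpy as np

# Load the pre-compiled shared library once at process start (~1 ms).
# Path and symbol name are hypothetical placeholders for illustration.
_lib = ctypes.CDLL("mxframe/kernels_aot/libmxkernels_aot.so")
_lib.group_sum_f32.restype = None

def group_sum(values: np.ndarray, group_ids: np.ndarray, n_groups: int) -> np.ndarray:
    """Per-query cost is pure dispatch: pass raw buffer pointers, no compilation."""
    out = np.zeros(n_groups, dtype=np.float32)
    _lib.group_sum_f32(
        values.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        group_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
        out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        ctypes.c_int64(values.size),
        ctypes.c_int64(n_groups),
    )
    return out
```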
# 1. Install pixi (Modular's package manager)
curl -fsSL https://pixi.sh/install.sh | bash
# 2. Clone and set up
git clone https://github.com/abhisheksreesaila/mxframe
cd mxframe
pixi install
# 3. Verify GPU is working
pixi run python3 scripts/_check_gpu.py

import pyarrow as pa
from mxframe import LazyFrame, Scan, col, lit
# Create an Arrow table (or load from Parquet/CSV)
data = pa.table({
    "dept": pa.array(["eng", "eng", "mkt", "mkt", "eng"]),
    "salary": pa.array([120.0, 95.0, 80.0, 110.0, 130.0], pa.float32()),
    "age": pa.array([32, 28, 35, 29, 40], pa.int32()),
})

# Build a lazy query plan — nothing executes yet
result = (
    LazyFrame(Scan(data))
    .filter(col("age") > lit(28))
    .groupby("dept")
    .agg(
        col("salary").sum().alias("total_salary"),
        col("salary").mean().alias("avg_salary"),
        col("age").count().alias("headcount"),
    )
    .sort(col("total_salary"), descending=True)
    .compute(device="gpu")  # or "cpu"
)
print(result.to_pandas())

Output:
  dept  total_salary  avg_salary  headcount
0  eng         250.0       125.0          2
1  mkt         190.0        95.0          2
- Hardware: NVIDIA RTX 3090 (sm_86) · AMD 12-core CPU · Mojo 0.26.2 AOT kernels
- Baselines: Polars 1.29+ · Pandas 3.0 · MXFrame CPU path · MXFrame GPU path
- Data: TPC-H schema synthetic data (numpy RNG, fixed seed)
- Methodology: 1 warmup run + median of 3 timed runs, per engine. Warmup primes every cache an app would have primed on query #2 in production — so these numbers reflect steady-state dispatch, not first-call JIT cost.
- Source of truth: scripts/bench_results_1M.csv and scripts/bench_results_10M.csv — committed in repo, reproducible via scripts/benchmark_all_22.py
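The warmup-plus-median protocol amounts to a harness like the sketch below, where `run_query` stands in for any one of the four engines (a generic illustration, not the repo's benchmark script):

```python
import statistics
import time

def time_query(run_query, runs: int = 3) -> float:
    """One untimed warmup run, then the median of `runs` timed runs, in milliseconds."""
    run_query()  # warmup: primes caches and shape-cached GPU models, not timed
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)
```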
| Device | Path | Coverage |
|---|---|---|
| CPU | 100% ctypes into pre-compiled `libmxkernels_aot.so` | All 22 queries — group aggs, masked aggs, inner + left joins, gather, filter, sort, unique |
| GPU | ctypes into pre-compiled `libmxkernels_aot_gpu.so` | All grouped aggs (sum/min/max/count) + masked global aggs |
| GPU | MAX Graph, shape-cached model | Hash joins only — compiled once per (n_left, n_right) shape, cached for the session |
No per-query JIT on CPU. GPU aggregations skip MAX Graph entirely. GPU joins compile a model once per shape and reuse it — the warmup run primes that cache.
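The shape cache can be pictured as a plain dictionary keyed by input sizes. The sketch below shows the idea only; `compile_join_model` is a hypothetical stand-in for the expensive MAX Graph build step, not the real internal API.

```python
# Sketch of a shape-keyed model cache; not MXFrame's actual internals.
_join_model_cache: dict[tuple[int, int], object] = {}

def compile_join_model(n_left: int, n_right: int):
    """Hypothetical placeholder for the expensive MAX Graph build + compile step."""
    return object()  # stands in for a compiled, reusable model

def get_join_model(n_left: int, n_right: int):
    key = (n_left, n_right)
    if key not in _join_model_cache:
        # Happens once per (n_left, n_right) shape — typically during the warmup run.
        _join_model_cache[key] = compile_join_model(n_left, n_right)
    # Every later query with the same shape reuses the compiled model: pure dispatch.
    return _join_model_cache[key]
```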
All times in milliseconds · lower is better. Speedup columns = Polars / MXFrame; bold = MXFrame wins.
| Query | Description | MX CPU | MX GPU | Polars | Pandas | CPU vs Polars | GPU vs Polars |
|---|---|---|---|---|---|---|---|
| Q1 | Filter + 8 aggregations | 11.2 | 94.4 | 35.6 | 117.4 | 3.2× | 0.4× |
| Q2 | Min-cost supplier | 6.6 | 15.6 | 16.3 | 9.5 | 2.5× | 1.0× |
| Q3 | 3-table join + agg | 5.6 | 15.9 | 19.2 | 21.8 | 3.4× | 1.2× |
| Q4 | Order priority | 15.0 | 192.5 | 15.0 | 29.9 | 1.0× | 0.1× |
| Q5 | Multi-join + groupby | 0.6 | 4.1 | 22.6 | 22.0 | 37.7× | 5.5× |
| Q6 | Masked global agg | 7.9 | 13.2 | 10.4 | 7.7 | 1.3× | 0.8× |
| Q7 | Shipping volume | 0.6 | 7.3 | 29.9 | 19.1 | 49.8× | 4.1× |
| Q8 | Market share | 0.9 | 4.7 | 20.2 | 10.3 | 22.4× | 4.3× |
| Q9 | Product profit (6-table join) | 0.6 | 6.6 | 39.9 | 17.8 | 66.5× | 6.0× |
| Q10 | Customer revenue | 3.6 | 11.0 | 32.6 | 19.3 | 9.1× | 3.0× |
| Q11 | Important stock | 0.5 | 2.7 | 7.4 | 3.0 | 14.8× | 2.7× |
| Q12 | 2-table join + agg | 0.8 | 3.5 | 24.5 | 586.2 | 30.6× | 7.0× |
| Q13 | Customer distribution | 16.2 | 30.2 | 27.5 | 33.5 | 1.7× | 0.9× |
| Q14 | Promo revenue | 1.4 | 1.2 | 6.9 | 244.0 | 4.9× | 5.8× |
| Q15 | Top-supplier revenue | 1.3 | 11.8 | 9.9 | 6.4 | 7.6× | 0.8× |
| Q16 | Part/supplier relationships | 2.1 | 6.3 | 16.9 | 6.8 | 8.0× | 2.7× |
| Q17 | Small-qty order | 0.3 | 2.7 | 7.9 | 4.7 | 26.3× | 2.9× |
| Q18 | Large-volume customers | 4.2 | 22.3 | 33.4 | 16.8 | 8.0× | 1.5× |
| Q19 | Discounted revenue | 10.9 | 11.2 | 19.6 | 22.4 | 1.8× | 1.8× |
| Q20 | Potential part promo | 4.3 | 5.6 | 30.0 | 9.6 | 7.0× | 5.4× |
| Q21 | Suppliers who kept orders waiting (EXISTS) | 26.0 | 64.1 | 31.3 | 28.6 | 1.2× | 0.5× |
| Q22 | Global sales opportunity | 7.6 | 16.5 | 25.4 | 56.7 | 3.3× | 1.5× |
1 M summary: MX CPU beats Polars on 21/22 queries (Q4 tied); MX GPU beats Polars on 16/22 queries. Headline CPU wins: Q9 66×, Q7 50×, Q5 38×, Q12 31×, Q17 26×. Headline GPU wins: Q12 7×, Q9 6×, Q14 5.8×, Q5 5.5×, Q20 5.4×.
| Query | Description | MX CPU | MX GPU | Polars | Pandas | CPU vs Polars | GPU vs Polars |
|---|---|---|---|---|---|---|---|
| Q1 | Filter + 8 aggregations | 361.0 | 1190.7 | 946.5 | 1771.7 | 2.6× | 0.8× |
| Q2 | Min-cost supplier | 5.7 | 11.9 | 15.4 | 7.6 | 2.7× | 1.3× |
| Q3 | 3-table join + agg | 57.7 | 67.2 | 72.2 | 581.8 | 1.3× | 1.1× |
| Q4 | Order priority | 301.5 | 492.2 | 113.8 | 807.6 | 0.4× | 0.2× |
| Q5 | Multi-join + groupby | 2.8 | 13.8 | 60.7 | 332.9 | 21.7× | 4.4× |
| Q6 | Masked global agg | 399.6 | 523.2 | 92.0 | 246.4 | 0.2× | 0.2× |
| Q7 | Shipping volume | 1.8 | 16.4 | 76.4 | 392.5 | 42.4× | 4.7× |
| Q8 | Market share | 1.3 | 3.8 | 40.9 | 55.1 | 31.5× | 10.8× |
| Q9 | Product profit | 0.7 | 7.4 | 89.7 | 431.3 | 128.1× | 12.1× |
| Q10 | Customer revenue | 39.2 | 45.0 | 131.1 | 216.2 | 3.3× | 2.9× |
| Q11 | Important stock | 0.4 | 3.1 | 6.6 | 2.5 | 16.5× | 2.1× |
| Q12 | 2-table join + agg | 1.3 | 4.4 | 116.4 | 6853.6 | 89.5× | 26.5× |
| Q13 | Customer distribution | 385.1 | 396.1 | 285.8 | 463.9 | 0.7× | 0.7× |
| Q14 | Promo revenue | 14.4 | 16.9 | 29.7 | 2719.2 | 2.1× | 1.8× |
| Q15 | Top-supplier revenue | 2.7 | 57.4 | 16.1 | 30.0 | 6.0× | 0.3× |
| Q16 | Part/supplier relationships | 2.7 | 6.5 | 16.3 | 6.8 | 6.0× | 2.5× |
| Q17 | Small-qty order | 0.6 | 1.5 | 14.6 | 17.0 | 24.3× | 9.7× |
| Q18 | Large-volume customers | 46.8 | 69.4 | 63.4 | 242.8 | 1.4× | 0.9× |
| Q19 | Discounted revenue | 100.4 | 97.9 | 112.9 | 234.8 | 1.1× | 1.2× |
| Q20 | Potential part promo | 32.3 | 37.1 | 39.9 | 52.9 | 1.2× | 1.1× |
| Q21 | Suppliers who kept orders waiting | 756.0 | 705.3 | 85.9 | 216.6 | 0.1× | 0.1× |
| Q22 | Global sales opportunity | 54.1 | 59.2 | 132.8 | 1292.1 | 2.5× | 2.2× |
10 M summary: MX CPU beats Polars on 18/22 queries; MX GPU beats Polars on 15/22. Headline CPU wins scale beautifully: Q9 128×, Q12 89×, Q7 42×, Q8 32×, Q17 24×, Q5 22×. Headline GPU wins: Q12 26.5×, Q9 12×, Q8 10.8×, Q17 9.7×, Q7 4.7×.
- Q4, Q6, Q13, Q21 — operations where our kernel path falls back to PyArrow compute or does extra passes. These are the focus of the next milestone (see roadmap.md).
- Correctness ✅ — all 22 queries return results that round-trip through Pandas and match Polars output.
- Coverage ✅ — every TPC-H query has a CPU AOT path; all group aggs/masked aggs have GPU AOT paths; GPU joins use shape-cached MAX Graph models.
- No JIT tax in steady state — after the first query of each shape warms the GPU join model cache, every subsequent call is pure dispatch. The CPU path has no JIT at all.
- Why GPU doesn't always win — GPU wins scale with workload size and kernel coverage. At 10 M, GPU crushes Polars on the join-heavy queries (Q8/Q9/Q12) where Mojo's shape-cached kernels pay off. Where GPU loses, it's either PCIe overhead on tiny outputs (Q1, Q6) or ops that still route through PyArrow fallback (Q4, Q13, Q21).
To run the benchmark with official TPC-H data (generated by DuckDB's
faithful port of the TPC-H dbgen tool):
# Step 1 — generate TPC-H data (requires: pip install duckdb)
# SF=1 → ~6M lineitem rows, ~200 MB Parquet
# SF=0.1 → ~600K rows, quick sanity check
pixi run python3 scripts/gen_tpch_parquet.py --sf 1
# Step 2 — run the 22-query benchmark against real data
pixi run python3 scripts/bench_real_tpch.py --data-dir data/tpch_sf1 --runs 3

The generated data/tpch_sf1/ directory contains 8 Parquet files (one per
TPC-H table) that you can inspect, share, or version-control.
| `--sf` | lineitem rows | approx size | use case |
|---|---|---|---|
| 0.01 | ~60K | 2 MB | smoke test / CI |
| 0.1 | ~600K | 20 MB | local dev |
| 1 | ~6M | 200 MB | standard published benchmark |
| 10 | ~60M | 2 GB | stress test |
- Data is generated by DuckDB's TPC-H extension — a faithful port of the official TPC-H dbgen v3.0.1 with the same value distributions (uniform, Zipfian, pseudo-random vocab).
- TPC-H® is a trademark of the Transaction Processing Performance Council. These results are an independent, non-certified benchmark. They may not be reported as "TPC-H results" without formal TPC certification.
- Reference: https://www.tpc.org/tpch/
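For reference, generating the same Parquet layout with DuckDB's tpch extension looks roughly like the sketch below (presumably what scripts/gen_tpch_parquet.py does internally; the actual script may differ):

```python
import os
import duckdb

def gen_tpch_parquet(sf: float, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    con = duckdb.connect()
    con.execute("INSTALL tpch")
    con.execute("LOAD tpch")
    con.execute(f"CALL dbgen(sf={sf})")  # populates the 8 TPC-H tables in memory
    for table in ("lineitem", "orders", "customer", "supplier",
                  "part", "partsupp", "nation", "region"):
        con.execute(f"COPY {table} TO '{out_dir}/{table}.parquet' (FORMAT PARQUET)")

gen_tpch_parquet(sf=1, out_dir="data/tpch_sf1")
```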
from mxframe import LazyFrame, Scan, col, lit, when
lf = LazyFrame(Scan(arrow_table))

| Method | Description | Example |
|---|---|---|
| `.filter(expr)` | Row filter | `.filter(col("x") > lit(10))` |
| `.select(*cols)` | Column projection | `.select("a", "b", col("c").alias("d"))` |
| `.with_columns(*exprs)` | Add/replace columns | `.with_columns((col("a") * lit(2)).alias("a2"))` |
| `.groupby(*keys)` | Start grouped agg | `.groupby("dept", "region")` |
| `.join(other, left_on, right_on, how)` | Hash join | `.join(lf2, "id", "fk_id", how="inner")` |
| `.sort(expr, descending)` | Sort rows | `.sort(col("revenue"), descending=True)` |
| `.limit(n)` | Take first N rows | `.limit(100)` |
| `.distinct()` | Deduplicate rows | `.distinct()` |
| `.compute(device)` | Execute the plan | `.compute(device="gpu")` |
# Arithmetic
col("price") * (lit(1.0) - col("discount"))
# Comparison
col("date") >= lit(19940101)
# Boolean combine
(col("x") > lit(0)) & (col("y") < lit(100))
# Conditional
when(col("nation") == lit("BRAZIL"), col("revenue"), lit(0.0))
# String
col("phone").startswith("13")
# Date parts
col("orderdate").year() # extract year as int32
# Aggregations (inside .agg())
col("salary").sum()
col("salary").mean()
col("salary").min()
col("salary").max()
col("id").count()from mxframe.sql_frontend import sql
result = sql("""
SELECT dept, SUM(salary) AS total, COUNT(*) AS n
FROM employees
WHERE age > 30
GROUP BY dept
ORDER BY total DESC
""", employees=arrow_table)| Category | Operations |
|---|---|
| Filter | >, >=, <, <=, ==, !=, &, \|, ~, isin, startswith, contains |
| Aggregation | sum, mean, min, max, count |
| Groupby | Single key, multi-key, composite key |
| Join | Inner, Left outer |
| Sort | Single/multi column, ascending/descending |
| Window | year() date part extraction |
| Projection | select, with_columns, alias, arithmetic expressions |
| Semi-join | Via unique-key inner join |
| Anti-join | Via pc.is_in + pc.invert |
| Distinct | Full row deduplication |
| SQL | SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT, JOIN |
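For the anti-join row, the pc.is_in + pc.invert pattern looks roughly like this standalone PyArrow sketch (generic PyArrow code, not MXFrame's exact internals):

```python
import pyarrow as pa
import pyarrow.compute as pc

left = pa.table({"custkey": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
exclude_keys = pa.array([2, 4])

# Anti-join: keep left rows whose key does NOT appear in the exclusion set.
mask = pc.invert(pc.is_in(left["custkey"], value_set=exclude_keys))
anti = left.filter(mask)  # rows with custkey 1 and 3 remain
```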
mxframe/
├── __init__.py ← Public API (LazyFrame, Scan, col, lit, when, sql)
├── lazy_frame.py ← LazyFrame, LazyGroupBy, Scan
├── lazy_expr.py ← Expr, col(), lit(), when()
├── compiler.py ← LogicalPlan → MAX Graph compiler
├── custom_ops.py ← Dispatch: AOT kernels / MAX Graph / PyArrow fallback
├── optimizer.py ← Plan rewrites (filter pushdown, join reordering)
├── plan_validation.py ← Pre-execution plan checks
├── sql_frontend.py ← SQL → LogicalPlan via sqlglot
│
├── kernels_aot/ ← Pre-compiled AOT shared libraries
│ ├── libmxkernels_aot.so ← CPU kernels (ctypes-callable)
│ └── libmxkernels_aot_gpu.so ← GPU kernels (CUDA/ROCm/Metal)
│
├── kernels_v261/ ← Mojo kernel source (build time only)
│ ├── group_sum.mojo, group_min.mojo, group_max.mojo ...
│ ├── join_scatter.mojo, join_count.mojo
│ ├── join_scatter_left.mojo, join_count_left.mojo
│ └── filter_gather.mojo, gather_rows.mojo, unique_mask.mojo ...
│
├── kernels_aot/
│ ├── kernels_aot.mojo ← CPU AOT entry points
│ └── kernels_aot_gpu.mojo ← GPU AOT entry points
│
├── scripts/
│ ├── bench_simple.py ← Clean 4-col benchmark (Pandas|Polars|MX CPU|MX GPU)
│ ├── benchmark_tpch.py ← All 22 TPC-H query implementations
│ ├── _test_aot_smoke.py ← AOT kernel smoke tests
│ └── quickstart.py ← Minimal hello-world example
│
└── docs/
├── vision-and-architecture.md
├── CONTRIBUTING.md ← Developer guide
└── PUBLISHING.md ← pip release steps
# CPU (default, works everywhere)
result = lf.compute(device="cpu")
# GPU (requires NVIDIA/AMD/Apple Silicon with MAX runtime)
result = lf.compute(device="gpu")The GPU path uses Mojo's DeviceContext — the same source compiles to:
- PTX on NVIDIA (CUDA-compatible)
- HSA/ROCm on AMD
- Metal on Apple Silicon
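Because the same lazy plan can execute on either device, a simple usage pattern is to prefer the GPU and fall back to the CPU path when no GPU or MAX runtime is available (a sketch of application-side code; MXFrame does not necessarily ship such a helper):

```python
def compute_any_device(lf):
    """Prefer the GPU path; fall back to the always-available CPU AOT path."""
    try:
        return lf.compute(device="gpu")
    except Exception:
        # No GPU or no MAX runtime installed — the CPU .so works everywhere.
        return lf.compute(device="cpu")
```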
# Smoke tests — AOT kernels
pixi run python3 scripts/_test_aot_smoke.py
# All TPC-H correctness checks
pixi run python3 scripts/_test_phase6_tpch_tier2.py
# GPU check
pixi run python3 scripts/_check_gpu.py
# Full 22-query benchmark
pixi run python3 scripts/bench_simple.py --rows 1000000 --runs 3

| Package | Required | Purpose |
|---|---|---|
| `pyarrow >= 14` | ✅ | Column storage, zero-copy NumPy bridge |
| `numpy >= 1.24` | ✅ | Vectorized pre/post processing |
| `pandas >= 2.0` | ✅ | Reference implementations, Pandas bridge |
| `modular >= 26.2` | GPU only | MAX Engine runtime, Mojo GPU dispatch |
| `polars >= 0.20` | optional | Polars bridge + benchmark comparison |
| `sqlglot >= 25` | optional | SQL frontend parsing |
See CONTRIBUTING.md for the developer guide, kernel writing guidelines, and how to add new TPC-H queries.
Apache 2.0 — see LICENSE.