lumen-xarray

Native xarray support for Lumen - SQL-queryable N-dimensional scientific data

Part of the HoloViz ecosystem - extends Lumen to work with N-dimensional scientific data.

The Problem

Lumen is a framework for building AI-powered data applications. It currently operates on tabular data (CSV, Parquet, SQL) via DuckDB. Scientists and researchers, however, work with N-dimensional labeled datasets - temperature grids across time/lat/lon, satellite imagery, genomics matrices - stored in NetCDF, Zarr, HDF5, and GRIB.

lumen-xarray bridges this gap: it registers xarray datasets with Apache DataFusion (via xarray-sql) and exposes them through Lumen's Source API. Lumen AI agents can then generate SQL queries against scientific data, and Lumen's pipelines, transforms, and dashboards work with multidimensional data.

Features

Category	Feature
Sources	`XArraySQLSource` (SQL via DataFusion), `XArraySource` (native xarray)
Transforms	10 scientific transforms: slice, bbox, aggregate, resample, anomaly, rolling, trend, climatology, percentile, spatial gradient
CF Conventions	Auto-detect lat/lon/time/vertical via `cf-xarray` with heuristic fallback
Multi-File	Load file lists and glob patterns via `xr.open_mfdataset`
Geographic Maps	GeoViews + Cartopy projected maps with coastlines (optional)
AI Integration	Lumen AI hooks, context builder, suggested queries, 4 Analysis subclasses
Dashboard	Adaptive Panel dashboard with grouped tabs, Tabulator, export, SQL explorer
Performance	SQL-level spatial binning - 68M cells in < 5s (34x faster than the naive per-cell query)
Formats	NetCDF, Zarr, HDF5, GRIB - local and remote (S3, GCS, OpenDAP)

Interactive Dashboard

The dashboard auto-adapts to any xarray dataset. Widgets, tabs, and SQL queries are generated dynamically from the data's dimensions and variables. Upload files or enter paths/URLs at runtime.

Dashboard Overview

Navigate through all tabs - Explore, Time Analysis, Analysis, Compare, and Tools

Explore - Spatial Maps & Profiles

Spatial heatmaps, lat/lon profiles, vertical profiles, and distribution analysis

Time Analysis

Time series, anomaly detection, rolling mean, linear trend, and Hovmoller diagrams

Advanced Analysis

Difference maps, cross-sections, correlation maps, and region statistics

SQL Explorer

Write raw SQL queries against N-dimensional scientific data via Apache DataFusion

Static Screenshots

Spatial Map

Time Series & Anomaly Detection

Screenshots: time series with aggregation; anomaly detection from the climatological mean.

Hovmoller Diagram & Distribution

Screenshots: Hovmoller diagram (time vs. dimension); histogram + KDE + summary stats.

Advanced Analysis

Screenshots: difference map between variables/time slices; vertical cross-section.

SQL Explorer & Dataset Info

Screenshots: raw SQL against scientific data; CF metadata, attributes, and dimensions.

Dashboard Tabs

Explore: Spatial Map (GeoViews/heatmap), Lat/Lon Profiles, Vertical Profile, Distribution (histogram + KDE)
Time Analysis: Time Series, Anomaly, Rolling Mean, Linear Trend, Monthly Climatology, Hovmoller
Analysis: Difference Map, Cross-Section, Correlation Map, Region Statistics
Compare: Cross-Variable scatter (with correlation), Statistics (Tabulator)
Tools: SQL Explorer (with pagination), Data Export (CSV/Parquet/JSON), Dataset Info (CF roles, attributes), Data Coverage

Run the Dashboard

# Demo dataset (NOAA air temperature)
PYTHONPATH=. panel serve examples/dashboard.py --show

# Your own NetCDF / Zarr / HDF5 / GRIB file
PYTHONPATH=. panel serve examples/dashboard.py --show --args my_data.nc

# Multi-file glob pattern
PYTHONPATH=. panel serve examples/dashboard.py --show --args "data/*.nc"

Quick Start

import xarray as xr
from lumen_xarray import XArraySQLSource

# Load any xarray dataset
ds = xr.tutorial.open_dataset("air_temperature")
source = XArraySQLSource(_dataset=ds)

# SQL queries over scientific data
df = source.execute("""
    SELECT lat, AVG(air) as avg_temp
    FROM air
    WHERE lat > 60
    GROUP BY lat
    ORDER BY lat
""")

# From files (single, list, or glob)
source = XArraySQLSource(uri="climate_data.nc")
source = XArraySQLSource(uri=["data_01.nc", "data_02.nc", "data_03.nc"])
source = XArraySQLSource(uri="data/*.nc")

# Remote data
source = XArraySQLSource(uri="s3://bucket/data.zarr", engine="zarr")

# Lumen Source API
source.get_tables()           # ['air']
source.get_schema("air")      # {column: {type, min, max, ...}, __len__: N}
source.get_metadata("air")    # {description, columns, dimensions, shape, ...}
source.get_dimension_info()   # {time: {type, min, max, size, role}, ...}
source.get("air", lat=75.0)   # Filtered DataFrame
source.estimate_size("air")   # {rows, estimated_mb, exceeds_warning}

CF Conventions Auto-Detection

from lumen_xarray import detect_coordinates, get_coordinate_metadata

coords = detect_coordinates(ds)
# {'latitude': 'lat', 'longitude': 'lon', 'time': 'time', 'vertical': None}

meta = get_coordinate_metadata(ds)
# {'lat': {'units': 'degrees_north', 'standard_name': 'latitude'}, ...}
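
When `cf-xarray` is not installed, detection falls back to name heuristics. The sketch below shows the general idea of such a fallback using plain name matching. The function `guess_coordinate_roles` and its candidate-name lists are hypothetical, written for illustration only, and are not taken from the lumen-xarray source.

```python
# Hypothetical name-based fallback; the candidate names are illustrative assumptions.
ROLE_CANDIDATES = {
    "latitude": ("lat", "latitude", "y"),
    "longitude": ("lon", "longitude", "x"),
    "time": ("time", "date", "valid_time"),
    "vertical": ("level", "lev", "depth", "pressure", "plev"),
}


def guess_coordinate_roles(coord_names):
    """Map each role to the first coordinate whose lowercased name matches."""
    lowered = {name.lower(): name for name in coord_names}
    roles = {}
    for role, candidates in ROLE_CANDIDATES.items():
        # Candidates are tried in order; None means no name matched.
        roles[role] = next((lowered[c] for c in candidates if c in lowered), None)
    return roles


roles = guess_coordinate_roles(["time", "lat", "lon"])
print(roles)
```

For the coordinate names of the demo dataset this yields the same shape of mapping as `detect_coordinates`, with `None` for roles that have no match.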

Transforms (10 total)

from lumen_xarray import (
    DimensionSlice, SpatialBBox, DimensionAggregate, TimeResample,
    Anomaly, RollingWindow, LinearTrend, Climatology, Percentile, SpatialGradient,
)

df = source.execute("SELECT * FROM air")

# Slice, filter, resample
df = DimensionSlice(dimension="time", start="2013-06-01", stop="2013-12-31").apply(df)
df = SpatialBBox(lat_min=30, lat_max=60, lon_min=200, lon_max=280).apply(df)
df = TimeResample(time_col="time", freq="MS").apply(df)

# Scientific analysis
df = Anomaly(time_col="time", value_col="air", groupby="month").apply(df)
df = LinearTrend(time_col="time", value_col="air").apply(df)
df = Climatology(time_col="time", value_col="air", groupby="month").apply(df)
df = Percentile(column="air", percentiles=[10, 50, 90]).apply(df)
df = SpatialGradient(value_col="air", lat_col="lat", lon_col="lon").apply(df)
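
To make the `Anomaly` step concrete, here is a minimal standard-library sketch of a monthly anomaly: each value minus the mean of all values that share its calendar month. The helper `monthly_anomaly` and the sample records are hypothetical; the actual transform operates on DataFrames and may differ in detail.

```python
from collections import defaultdict
from datetime import date


def monthly_anomaly(records):
    """Return (day, value - mean of that calendar month) for each record."""
    by_month = defaultdict(list)
    for day, value in records:
        by_month[day.month].append(value)
    # Climatology: one mean per calendar month, pooled across years.
    climatology = {m: sum(v) / len(v) for m, v in by_month.items()}
    return [(day, value - climatology[day.month]) for day, value in records]


# Made-up sample: two Januaries and two Julys.
records = [
    (date(2013, 1, 1), 270.0),
    (date(2014, 1, 1), 274.0),
    (date(2013, 7, 1), 290.0),
    (date(2014, 7, 1), 294.0),
]
anomalies = monthly_anomaly(records)
print(anomalies)
```

Removing the per-month mean strips out the seasonal cycle, so the remaining values show how each observation departs from what is typical for that time of year.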

AI Context for LLM Agents

from lumen_xarray import build_ai_context, get_suggested_queries

# Structured context for LLM system prompts
context = build_ai_context(source, "air")
# Describes dimensions, roles, units, pitfalls, and SQL patterns

# Auto-generated queries based on data structure
queries = get_suggested_queries(source, "air")
# ['SELECT * FROM air LIMIT 10',
#  'SELECT EXTRACT(MONTH FROM time) as month, AVG(air) ...',
#  'SELECT lat, lon, AVG(air) ... GROUP BY lat, lon ...', ...]

Lumen AI Analysis Subclasses

from lumen_xarray import ClimateTimeSeries, SpatialMap, VerticalProfile, DistributionAnalysis

# Auto-detect applicability and render interactive plots
ClimateTimeSeries.applies(pipeline)  # True if data has time + numeric cols
SpatialMap.applies(pipeline)         # True if data has lat/lon + numeric cols
VerticalProfile.applies(pipeline)    # True if data has level/depth/pressure
DistributionAnalysis.applies(pipeline)  # True if data has any numeric col

Architecture

NetCDF / Zarr / HDF5 / GRIB / Remote URLs / Multi-file globs
    |
    v
xarray.open_dataset() / open_mfdataset()  (lazy, dask-chunked)
    |
    +---> cf-xarray: auto-detect coordinate roles (lat/lon/time/vertical)
    |
    +---> XArraySQLSource (BaseSQLSource)
    |       |
    |       v
    |   xarray-sql: XarrayContext (Apache DataFusion)
    |       |
    |       v
    |   SQL queries --> pandas DataFrames
    |       |
    |       v
    |   Transforms --> Lumen Pipeline / AI Agents / Dashboard
    |
    +---> XArraySource (Source)
            |
            v
        Native xarray ops --> pandas DataFrames --> Lumen Pipeline
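
The "SQL queries --> pandas DataFrames" step relies on a tabular view of each variable: one row per grid cell, with the coordinates as columns next to the value. Below is a rough standard-library sketch of that flattening on a small made-up 2 x 2 x 2 grid. The helper `flatten_variable` is hypothetical; in the library the conversion is handled by xarray-sql.

```python
from itertools import product


def flatten_variable(name, coords, values):
    """Yield one dict per grid cell: coordinate columns plus the variable."""
    dims = list(coords)
    for index in product(*(range(len(coords[d])) for d in dims)):
        cell = values
        for i in index:  # walk the nested lists down to a scalar
            cell = cell[i]
        row = {d: coords[d][i] for d, i in zip(dims, index)}
        row[name] = cell
        yield row


# Made-up coordinates and values, for illustration only.
coords = {
    "time": ["2013-01-01", "2013-01-02"],
    "lat": [75.0, 72.5],
    "lon": [200.0, 202.5],
}
air = [
    [[241.2, 242.5], [243.8, 244.1]],
    [[242.1, 242.7], [243.6, 244.0]],
]
rows = list(flatten_variable("air", coords, air))
print(len(rows), rows[0])
```

The row count is the product of the dimension sizes, which is why large grids produce very large tables when queried without aggregation.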

Performance

The dashboard uses SQL-level spatial binning to handle large datasets efficiently. Instead of returning one row per grid cell (which produces 68M rows for a 6336x10800 grid), queries use FLOOR(coord / bin_width) * bin_width to aggregate into a configurable number of spatial bins.

Dataset	Grid Size	Naive Query	Binned Query	Speedup
Smith & Sandwell Topography	6336 x 10800 (68M cells)	146s	4.3s	34x
NCEP/NCAR Reanalysis	73 x 144 (10K cells)	0.8s	0.8s (no binning)	-

A resolution control widget lets users switch between Auto, Low, Medium, High, and Full resolution.
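
As an illustration of the binning expression, the sketch below builds such a query string and applies the same `FLOOR(coord / bin_width) * bin_width` arithmetic in plain Python. The function names, the table name `elevation`, the column `z`, and the choice of `AVG` as the aggregate are assumptions made for the example, not the dashboard's actual code.

```python
import math


def bin_coord(value, bin_width):
    """Python equivalent of FLOOR(value / bin_width) * bin_width."""
    return math.floor(value / bin_width) * bin_width


def binned_query(table, value_col, lat_col, lon_col, bin_width):
    """Build a GROUP BY query that averages value_col within square bins.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    lat_bin = f"FLOOR({lat_col} / {bin_width}) * {bin_width}"
    lon_bin = f"FLOOR({lon_col} / {bin_width}) * {bin_width}"
    return (
        f"SELECT {lat_bin} AS lat_bin, {lon_bin} AS lon_bin, "
        f"AVG({value_col}) AS {value_col} "
        f"FROM {table} GROUP BY {lat_bin}, {lon_bin}"
    )


print(bin_coord(47.3, 5))   # 45
print(bin_coord(-2.5, 5))   # -5 (floor rounds toward negative infinity)
print(binned_query("elevation", "z", "lat", "lon", 5))
```

Because the grouping happens inside the query, the number of returned rows depends on the bin width rather than on the native grid resolution.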

API Reference

Sources

Component	Base Class	SQL	Use Case
`XArraySQLSource`	`BaseSQLSource`	DataFusion	Lumen AI, SQL queries, full pipeline integration
`XArraySource`	`Source`	No	Programmatic access, native xarray operations

Transforms

Transform	Description
`DimensionSlice`	Slice by range, values, or nearest match along any dimension
`SpatialBBox`	Filter to a lat/lon bounding box
`DimensionAggregate`	Reduce dimensions - auto-detects coordinates vs. data columns
`TimeResample`	Resample time series (daily to monthly, etc.) with spatial grouping
`Anomaly`	Deviations from climatological mean (monthly, seasonal, overall)
`RollingWindow`	Moving average/sum/std for time series smoothing
`LinearTrend`	Polynomial trend fitting with detrended residuals
`Climatology`	Long-term grouped mean (seasonal cycle baseline)
`Percentile`	Global or grouped percentile computation
`SpatialGradient`	Finite-difference lat/lon gradients on gridded data

AI Integration

Component	Purpose
`build_ai_context()`	Structured dataset description for LLM system prompts
`get_suggested_queries()`	Auto-generated SQL queries based on data structure
`ClimateTimeSeries`	Analysis: monthly mean + trend + anomaly overlay
`SpatialMap`	Analysis: geographic heatmap with GeoViews/fallback
`VerticalProfile`	Analysis: value vs. pressure/depth with inverted y-axis
`DistributionAnalysis`	Analysis: histogram + KDE + summary statistics
`is_xarray_path()`	Detect xarray file extensions and URLs
`resolve_xarray_source()`	Create source from path (`lumen-ai serve data.nc`)
`handle_xarray_upload()`	Process uploaded files in Lumen AI UI
`register_xarray_handlers()`	Patch Lumen AI to recognize xarray file types

Supported Formats

Format	Extensions	Engine	Remote
NetCDF	`.nc`, `.nc4`, `.netcdf`	`netcdf4`	OpenDAP URLs
Zarr	`.zarr`	`zarr`	S3, GCS, HTTP via fsspec
HDF5	`.h5`, `.hdf5`, `.he5`	`h5netcdf`	-
GRIB	`.grib`, `.grib2`, `.grb`	`cfgrib`	-
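
The table above can be read as a lookup from file extension to xarray engine. A small sketch of that lookup follows. `guess_engine` is a hypothetical helper written for this example; it is not the library's `is_xarray_path()` or `resolve_xarray_source()`, whose behavior may differ.

```python
from pathlib import PurePosixPath

# Extension -> engine, copied from the Supported Formats table.
ENGINES = {
    ".nc": "netcdf4", ".nc4": "netcdf4", ".netcdf": "netcdf4",
    ".zarr": "zarr",
    ".h5": "h5netcdf", ".hdf5": "h5netcdf", ".he5": "h5netcdf",
    ".grib": "cfgrib", ".grib2": "cfgrib", ".grb": "cfgrib",
}


def guess_engine(path):
    """Return the engine for a path or URL, or None if the suffix is unknown."""
    # Strip a trailing slash so directory-style Zarr stores still match.
    suffix = PurePosixPath(path.rstrip("/")).suffix.lower()
    return ENGINES.get(suffix)


print(guess_engine("climate_data.nc"))
print(guess_engine("s3://bucket/data.zarr"))
print(guess_engine("notes.txt"))
```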

Test Suite

$ pytest tests/ -v
======================= 206 passed in 12.87s ========================

Module	Tests	Covers
`test_sql_source.py`	50	Construction, SQL, schema, metadata, normalize_table, estimate_size, async, serialization
`test_basic_source.py`	27	Source API, filtering, native xarray ops, file I/O
`test_transforms.py`	52	All 10 transforms + integration (chaining) tests
`test_ai_integration.py`	25	Path detection, source resolution, upload, code gen, AI context, suggested queries
`test_cf.py`	15	CF coordinate detection, heuristic fallback, metadata extraction
`test_multifile.py`	13	Multi-file detection, list/glob loading, time continuity, schema
`test_analyses.py`	24	Analysis applicability, output types, helpers

Examples

Example	Description
`examples/quickstart.py`	Basic XArraySQLSource usage and Lumen API
`examples/sql_queries.py`	SQL patterns for scientific data
`examples/lumen_pipeline.py`	Lumen Pipeline integration
`examples/dashboard.py`	Interactive Panel dashboard (2300+ lines)
`examples/demo_era5.py`	ERSSTv5 sea surface temperature analysis
`examples/demo_multimodel.py`	Multi-file dataset loading and analysis

Installation

git clone https://github.com/ghostiee-11/lumen-xarray.git
cd lumen-xarray
pip install -e ".[all,test,examples]"

Core dependencies: lumen, xarray, xarray-sql, pandas, numpy, param

Optional: cf-xarray (CF conventions), geoviews + cartopy (geographic maps), netCDF4, zarr, cfgrib

Project Structure

lumen-xarray/
├── lumen_xarray/
│   ├── __init__.py           # Public API (20+ exports)
│   ├── _base.py              # Shared mixin, multi-file support, format detection
│   ├── source.py             # XArraySQLSource - SQL via DataFusion
│   ├── basic_source.py       # XArraySource - native xarray ops
│   ├── transforms.py         # 10 scientific data transforms
│   ├── cf.py                 # CF conventions auto-detection
│   ├── analyses.py           # 4 Lumen AI Analysis subclasses
│   └── ai.py                 # Lumen AI hooks, context builder, query suggestions
├── tests/                    # 206 tests across 7 modules
├── examples/                 # 6 runnable examples + interactive dashboard
├── .github/workflows/ci.yml  # CI pipeline (Python 3.10-3.12 + ruff)
├── pyproject.toml
└── README.md

Design Decisions

DataFusion over DuckDB - xarray-sql uses Apache DataFusion. We set dialect="postgres" for sqlglot because DataFusion's SQL dialect closely follows PostgreSQL.
Two source classes - XArraySQLSource for Lumen AI (agents need execute()), XArraySource for programmatic use with native xarray ops.
Per-variable tables - Each data variable becomes a SQL table. Coordinates (time, lat, lon) become columns in each table.
CF-first coordinate detection - Uses cf-xarray for robust coordinate role detection (standard_name, axis attributes) and falls back to name heuristics when cf-xarray is not installed.
Multi-file transparency - Pass a list or glob pattern as uri and open_mfdataset handles concatenation. Works identically to single-file loading downstream.
Coordinate-aware aggregation - DimensionAggregate auto-detects coordinate columns vs. data columns, so grouping and averaging work correctly.
Async-first for AI - execute_async() and get_async() run in thread pools for non-blocking agent workflows.
Adaptive dashboard - Widgets and tabs auto-generate from dataset dimensions and CF roles. Works with any xarray dataset.
SQL-level spatial binning - Large grids are binned at the SQL layer using FLOOR() expressions, not after loading into memory. This keeps DataFusion queries fast and avoids out-of-memory errors on 100M+ cell datasets.
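
The async-first point can be illustrated with a generic pattern: wrapping a blocking call in `asyncio.to_thread` so several queries can be awaited without blocking the event loop. This is only a sketch of the pattern, assuming the source's `execute()` is a blocking call. `FakeSource` is a stand-in class, and the library's own `execute_async()` may be implemented differently.

```python
import asyncio
import time


class FakeSource:
    """Stand-in with a blocking execute(); not the real XArraySQLSource."""

    def execute(self, sql):
        time.sleep(0.05)  # simulate query latency
        return f"result of: {sql}"

    async def execute_async(self, sql):
        # Run the blocking call in a worker thread (Python 3.9+).
        return await asyncio.to_thread(self.execute, sql)


async def main():
    source = FakeSource()
    queries = ["SELECT 1", "SELECT 2", "SELECT 3"]
    # gather() returns results in the same order as the awaitables.
    return await asyncio.gather(*(source.execute_async(q) for q in queries))


results = asyncio.run(main())
print(results)
```

With this pattern the three simulated queries overlap in worker threads instead of running back to back, which is the property an agent workflow needs when it issues several queries at once.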

License

BSD-3-Clause
