Snorkel

A fast, in-memory time-series analytics database written in Rust. Inspired by Facebook's Scuba, Snorkel is designed for real-time analytics on event data with SQL-like queries, automatic data expiration, and horizontal scaling.

Features

Columnar Storage - Cache-efficient memory layout optimized for analytics queries
String Dictionary Encoding - Compact storage for repeated string values
Time-Based Sharding - Data partitioned by time for efficient range queries and TTL expiration
SQL-Like Queries - Familiar syntax with SELECT, WHERE, GROUP BY, ORDER BY, LIMIT
Rich Aggregations - COUNT, SUM, AVG, MIN, MAX, PERCENTILE
JSON Auto-Flattening - Nested JSON objects automatically flattened to dot-notation columns
Distributed Queries - Fan-out queries across multiple nodes with result merging
Query Caching - TTL-based result cache with automatic invalidation
Bloom Filters - Fast shard pruning for equality filters
SIMD Aggregations - Vectorized single-pass statistics computation
Web UI - Visual query builder with interactive charts
No Dependencies - Pure Rust with no external database requirements

Quick Start

# Clone and build
git clone https://github.com/jamra/snorkel.git
cd snorkel
cargo build --release

# Run the server (default port 8080)
cargo run --release

# Or with custom settings
SNORKEL_PORT=9000 SNORKEL_MAX_MEMORY_MB=2048 cargo run --release

Usage

Ingest Data

curl -X POST http://localhost:9000/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "table": "events",
    "rows": [
      {"timestamp": 1708700000000, "event": "click", "user_id": 123, "latency_ms": 45},
      {"timestamp": 1708700001000, "event": "view", "user_id": 456, "latency_ms": 120}
    ]
  }'

Query Data

# Count events by type
curl -X POST http://localhost:9000/query \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT event, COUNT(*), AVG(latency_ms) FROM events GROUP BY event"}'

# Filter and aggregate
curl -X POST http://localhost:9000/query \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT event, COUNT(*) FROM events WHERE latency_ms > 100 GROUP BY event"}'

JSON Flattening

Nested JSON is automatically flattened on ingest:

# Ingest nested data
curl -X POST http://localhost:9000/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "table": "orders",
    "rows": [
      {
        "timestamp": 1708700000000,
        "customer": {"name": "Alice", "tier": "premium"},
        "order": {"total": 99.99, "items": 3}
      }
    ]
  }'

# Query with dot notation
curl -X POST http://localhost:9000/query \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT customer.tier, SUM(order.total) FROM orders GROUP BY customer.tier"}'

Web UI

Open http://localhost:9000/ in your browser for the visual query builder:

Select tables and metrics from dropdowns
Build filters with point-and-click
Visualize results as bar, line, or area charts
View sample data alongside aggregations

API Endpoints

Endpoint	Method	Description
`/`	GET	Web UI
`/health`	GET	Health check
`/ingest`	POST	Insert rows
`/query`	POST	Execute SQL query
`/tables`	GET	List all tables
`/tables`	POST	Create table with config
`/tables/:name/schema`	GET	Get table schema
`/tables/:name`	DELETE	Drop table
`/stats`	GET	Server statistics

Configuration

Environment variables:

Variable	Default	Description
`SNORKEL_HOST`	`0.0.0.0`	Bind address
`SNORKEL_PORT`	`8080`	Server port
`SNORKEL_MAX_MEMORY_MB`	`1024`	Maximum memory usage

Cluster Mode (Symmetric)

Run multiple nodes for horizontal scaling. All nodes are equal - any node can coordinate queries.

# Node 1
SNORKEL_PORT=9000 \
SNORKEL_NODE_ID=node-1 \
SNORKEL_PEERS="10.0.0.2:8080,10.0.0.3:8080" \
cargo run

# Node 2
SNORKEL_PORT=8080 \
SNORKEL_NODE_ID=node-2 \
SNORKEL_PEERS="10.0.0.1:9000,10.0.0.3:8080" \
cargo run

# Node 3
SNORKEL_PORT=8080 \
SNORKEL_NODE_ID=node-3 \
SNORKEL_PEERS="10.0.0.1:9000,10.0.0.2:8080" \
cargo run

Key points:

Each node lists all OTHER nodes as peers
Query any node - it will fan out to all peers and merge results
Put a load balancer (nginx, HAProxy, k8s Ingress) in front for production

Or use the provided scripts for local testing:

./run_cluster.sh    # Start 3-node cluster
./stop_cluster.sh   # Stop all nodes
./seed_data.sh      # Load sample data

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         HTTP API                                │
│                  (Axum + Tower middleware)                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────────────────────────────────┐  │
│  │Query Cache  │  │            Query Engine                  │  │
│  │   (moka)    │◄─┤  Parser → Planner → Executor (parallel)  │  │
│  │  TTL-based  │  │                                          │  │
│  └─────────────┘  │  Optimizations:                          │  │
│                   │  • Predicate pushdown (row masks)        │  │
│                   │  • SIMD aggregations (single-pass)       │  │
│                   │  • Bloom filter shard pruning            │  │
│                   └──────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                       Storage Engine                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │     Table 1     │  │     Table 2     │  │     Table N     │  │
│  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │  │
│  │  │   Shard   │  │  │  │   Shard   │  │  │  │   Shard   │  │  │
│  │  │ • Columns │  │  │  │ • Columns │  │  │  │ • Columns │  │  │
│  │  │ • Bloom   │  │  │  │ • Bloom   │  │  │  │ • Bloom   │  │  │
│  │  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Alert Checker   │  │ TTL Expiration  │  │ Load Balancer   │  │
│  │ (background)    │  │ (background)    │  │ (cluster mode)  │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

                         Cluster Mode
┌─────────────────────────────────────────────────────────────────┐
│                        Coordinator                              │
│              (fan-out queries, merge results)                   │
├──────────────────┬──────────────────┬──────────────────────────┤
│      Node 1      │      Node 2      │        Node N            │
│   (local data)   │   (local data)   │     (local data)         │
└──────────────────┴──────────────────┴──────────────────────────┘

SQL Support

Supported Syntax

SELECT column1, column2, AGG(column3)
FROM table_name
WHERE condition1 AND condition2
GROUP BY column1, column2
ORDER BY column1 [ASC|DESC]
LIMIT n

Aggregation Functions

COUNT(*) / COUNT(column)
SUM(column)
AVG(column)
MIN(column)
MAX(column)
PERCENTILE(column, 0.95)

Filter Operators

Comparison: =, !=, >, <, >=, <=
Pattern: LIKE
Logical: AND, OR

Performance

Snorkel is optimized for high-throughput ingestion and sub-millisecond queries.

Benchmark Results (11M rows)

Metric	Result
Ingest throughput	300-390K rows/sec
Query latency (avg)	85-100µs
Cache speedup	287x
Bloom filter speedup	9.5x
Memory efficiency	75 bytes/row

Query Latencies

Query Type	Average	Min
COUNT(*)	95µs	53µs
GROUP BY	85µs	53µs
AVG aggregation	89µs	57µs
Multiple aggregates	85µs	59µs
Top N with ORDER BY	79µs	56µs

Optimizations

Bloom filters - Prune shards that don't contain searched values (9.5x speedup for selective queries)
Predicate pushdown - Build row masks once per shard instead of per-row filter evaluation
SIMD-friendly aggregations - Single-pass statistics computation with auto-vectorization
Query result caching - TTL-based cache with 287x speedup for repeated queries
Parallel shard processing - Rayon-based parallel execution across CPU cores

Running Benchmarks

# Start server
cargo run --release &

# Run benchmark (choose scale)
cargo run --release --bin benchmark small   # 10K rows/table
cargo run --release --bin benchmark medium  # 100K rows/table
cargo run --release --bin benchmark large   # 1M rows/table
cargo run --release --bin benchmark xlarge  # 10M rows/table

Development

# Run tests
cargo test

# Run with logging
RUST_LOG=snorkel=debug cargo run

# Build release
cargo build --release

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
docs		docs
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md
fix_compile.patch		fix_compile.patch
fix_compile2.patch		fix_compile2.patch
run_cluster.sh		run_cluster.sh
run_node1.sh		run_node1.sh
run_node2.sh		run_node2.sh
run_node3.sh		run_node3.sh
seed_data.sh		seed_data.sh
stop_cluster.sh		stop_cluster.sh
stress_test.sh		stress_test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snorkel

Features

Quick Start

Usage

Ingest Data

Query Data

JSON Flattening

Web UI

API Endpoints

Configuration

Cluster Mode (Symmetric)

Architecture

SQL Support

Supported Syntax

Aggregation Functions

Filter Operators

Performance

Benchmark Results (11M rows)

Query Latencies

Optimizations

Running Benchmarks

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Snorkel

Features

Quick Start

Usage

Ingest Data

Query Data

JSON Flattening

Web UI

API Endpoints

Configuration

Cluster Mode (Symmetric)

Architecture

SQL Support

Supported Syntax

Aggregation Functions

Filter Operators

Performance

Benchmark Results (11M rows)

Query Latencies

Optimizations

Running Benchmarks

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages