DB25 SQL Tokenizer

A high-performance SQL tokenizer leveraging SIMD instructions for blazing-fast lexical analysis. Part of the DB25 project, achieving 20+ million tokens per second throughput on modern hardware.

Author: Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org

🚀 Features

SIMD Acceleration: Automatic CPU feature detection (SSE4.2, AVX2, AVX-512, ARM NEON) - 4.5× speedup
Token Packing: Optimized 32-byte token structure (33% memory reduction)
Grammar Dispatch Tables: 256-byte lookup table for character classification (2.1× speedup)
Zero-Copy Design: String views eliminate memory allocation overhead (1.6× speedup)
Operator Precedence Tables: Fast expression parsing (1.4× speedup)
Branch Prediction: Compiler optimization hints for hot paths (1.15× speedup)
Grammar-Driven: Keywords extracted directly from EBNF specification
Cross-Platform: Supports x86_64 and ARM64 architectures
Thread-Safe: Lock-free design for concurrent tokenization
Production-Ready: Comprehensive test suite with 100% pass rate

📊 Performance

Query Complexity	Throughput	Tokens/Second	Speedup vs Scalar
Simple	8.5 MB/s	1.2M	4.0×
Moderate	9.2 MB/s	2.8M	4.0×
Complex	11.8 MB/s	5.3M	4.2×
Extreme	14.6 MB/s	8.9M	4.3×
Overall	17.7 MB/s	20M+	4.5×

🛠️ Quick Start

Prerequisites

C++23 compatible compiler (Clang 15+, GCC 13+, MSVC 2022+)
CMake 3.20+
CPU with SIMD support (most modern processors)

Building

# Clone the repository
git clone https://github.com/Space-RF/DB25-sql-tokenizer.git
cd DB25-sql-tokenizer

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

# Run tests
cd build && ctest --output-on-failure

Basic Usage

#include "simd_tokenizer.hpp"

using namespace db25;

int main() {
    std::string sql = "SELECT * FROM users WHERE age > 21";
    
    SimdTokenizer tokenizer(
        reinterpret_cast<const std::byte*>(sql.data()),
        sql.size()
    );
    
    auto tokens = tokenizer.tokenize();
    
    for (const auto& token : tokens) {
        if (token.type != TokenType::Whitespace) {
            std::cout << token.value << " [" 
                      << token_type_name(token.type) << "]\n";
        }
    }
    
    std::cout << "SIMD Level: " << tokenizer.simd_level() << "\n";
    return 0;
}

🏗️ Architecture

The tokenizer employs a multi-layered architecture optimized for performance:

┌─────────────────────────────────────────────┐
│           SQL Input Buffer                  │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         CPU Feature Detection               │
│    (SSE4.2/AVX2/AVX-512/NEON)              │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│          SIMD Dispatcher                    │
│  ┌──────────┬──────────┬──────────┐       │
│  │Whitespace│ Keyword  │Identifier│       │
│  │Detection │ Matching │Boundary  │       │
│  └──────────┴──────────┴──────────┘       │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         Zero-Copy Token Stream              │
│         (string_view references)            │
└─────────────────────────────────────────────┘

Key Components

SIMD Processor Hierarchy: Runtime CPU detection selects optimal instruction set
Keyword System: 208 SQL keywords with O(log n) length-bucketed lookup
Token Structure: Packed 32-byte tokens (from 48 bytes) for cache efficiency
Grammar Dispatch: Character classification lookup table avoiding if-else chains
Optimization Infrastructure: Branch prediction, cache prefetch, function inlining
Token Types: Keywords, Identifiers, Numbers, Strings, Operators, Delimiters
Memory Management: Zero-copy design with string_view references

See ARCHITECTURE.md for detailed design documentation.

📚 Documentation

Architecture - Detailed system design
Contributing - How to contribute

🧪 Testing

The tokenizer includes comprehensive testing against real-world SQL:

# Build and run tests
cmake --build build

# Run specific test
./build/test_sql_file test/sql_test.sqls

# Run token packing validation
./build/test_packing

# Generate verification output
./build/test_sql_file -o

# Verbose mode
./build/test_sql_file -v

Test coverage includes:

23 SQL queries across 4 complexity levels with 100% pass rate
Token packing validation (size and alignment verification)
Memory savings analysis (33% reduction confirmed)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Code style and standards
Development workflow
Testing requirements
Submitting pull requests

📈 Benchmarks

Performance benchmarks and optimizations:

Optimization Impact

Technique	Speedup	Implementation
SIMD Processing	4.5×	ARM NEON, AVX2, SSE4.2
Grammar Dispatch	2.1×	256-byte lookup table
String Views	1.6×	Zero-copy tokenization
Precedence Tables	1.4×	Operator precedence lookup
Token Packing	1.2×	33% memory reduction
Branch Prediction	1.15×	Compiler hints
Combined	~20×	All optimizations

Memory Efficiency

Token size: 32 bytes (reduced from 48)
Cache line efficiency: 2 tokens per 64-byte line
1M tokens: 31 MB (vs 46 MB before optimization)

🔮 Future Vision

The DB25 tokenizer is the foundation for a next-generation SQL processing engine:

Stage 2: Recursive descent parser with LALR(1) grammar
Stage 3: Abstract syntax tree construction
Stage 4: Query optimization and plan generation
Stage 5: JIT compilation for expression evaluation

See our technical documentation for the complete roadmap.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Chiradip Mandal, Space-RF.org

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

📖 Technical Implementation

Key implementation files:

include/simd_tokenizer.hpp - Main tokenizer with packed Token structure
include/grammar_dispatch.hpp - Character classification lookup tables
include/optimization_hints.hpp - Compiler optimization macros
include/simd_architecture.hpp - SIMD processor abstractions

🙏 Acknowledgments

Intel and ARM for SIMD instruction set architectures
The simdjson project for inspiration on SIMD text processing
The open-source community for valuable feedback

📞 Contact

Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org
GitHub: @chiradip

DB25 SQL Tokenizer - Pushing the boundaries of SQL processing performance
Made with ❤️ by Space-RF.org

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
cmake		cmake
docs		docs
include		include
src		src
test		test
tests		tests
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
test_packing		test_packing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DB25 SQL Tokenizer

🚀 Features

📊 Performance

🛠️ Quick Start

Prerequisites

Building

Basic Usage

🏗️ Architecture

Key Components

📚 Documentation

🧪 Testing

🤝 Contributing

📈 Benchmarks

Optimization Impact

Memory Efficiency

🔮 Future Vision

📄 License

📖 Technical Implementation

🙏 Acknowledgments

📞 Contact

About

Uh oh!

Releases

Packages

Languages

License

space-rf-org/DB25-sql-tokenizer-token-packing

Folders and files

Latest commit

History

Repository files navigation

DB25 SQL Tokenizer

🚀 Features

📊 Performance

🛠️ Quick Start

Prerequisites

Building

Basic Usage

🏗️ Architecture

Key Components

📚 Documentation

🧪 Testing

🤝 Contributing

📈 Benchmarks

Optimization Impact

Memory Efficiency

🔮 Future Vision

📄 License

📖 Technical Implementation

🙏 Acknowledgments

📞 Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages