DB25 SQL Tokenizer

A high-performance SQL tokenizer that uses SIMD instructions for fast lexical analysis. Part of the DB25 project, it achieves a reported throughput of 20+ million tokens per second on modern hardware.

Author: Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org

🚀 Features

SIMD Acceleration: Automatic CPU feature detection (SSE4.2, AVX2, AVX-512, ARM NEON)
Zero-Copy Design: String views eliminate memory allocation overhead
4.5× Faster: Compared to traditional scalar implementations
Grammar-Driven: Keywords extracted directly from EBNF specification
Cross-Platform: Supports x86_64 and ARM64 architectures
Thread-Safe: Lock-free design for concurrent tokenization
Production-Ready: Comprehensive test suite with 100% pass rate
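
As a rough illustration of the automatic CPU feature detection listed above, the sketch below picks the widest available instruction set at runtime. It is not taken from the DB25 sources: the `SimdLevel` enum and both function names are hypothetical, and the x86 path assumes a GCC or Clang toolchain that provides `__builtin_cpu_supports`.

```cpp
#include <string_view>

// Hypothetical enum for illustration; the library's own SIMD level type may differ.
enum class SimdLevel { Scalar, Sse42, Avx2, Avx512, Neon };

// Probe the running CPU and return the widest instruction set found.
inline SimdLevel detect_simd_level() {
#if defined(__aarch64__)
    // Assumption: NEON (Advanced SIMD) is available on AArch64 application processors.
    return SimdLevel::Neon;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // GCC/Clang builtin; populates the CPU feature cache
    if (__builtin_cpu_supports("avx512f")) return SimdLevel::Avx512;
    if (__builtin_cpu_supports("avx2"))    return SimdLevel::Avx2;
    if (__builtin_cpu_supports("sse4.2"))  return SimdLevel::Sse42;
    return SimdLevel::Scalar;
#else
    // Unknown compiler or architecture: fall back to the scalar path.
    return SimdLevel::Scalar;
#endif
}

// Human-readable name for logging or diagnostics.
constexpr std::string_view simd_level_name(SimdLevel level) {
    switch (level) {
        case SimdLevel::Sse42:  return "SSE4.2";
        case SimdLevel::Avx2:   return "AVX2";
        case SimdLevel::Avx512: return "AVX-512";
        case SimdLevel::Neon:   return "NEON";
        case SimdLevel::Scalar: return "Scalar";
    }
    return "Scalar";
}
```

A dispatcher would typically call a function like `detect_simd_level()` once at startup and cache the result, so the per-token hot path never repeats the probe.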

📊 Performance

Query Complexity	Throughput	Tokens/Second	Speedup vs Scalar
Simple	8.5 MB/s	1.2M	4.0×
Moderate	9.2 MB/s	2.8M	4.0×
Complex	11.8 MB/s	5.3M	4.2×
Extreme	14.6 MB/s	8.9M	4.3×
Overall	17.7 MB/s	20M+	4.5×

🛠️ Quick Start

Prerequisites

C++23 compatible compiler (Clang 15+, GCC 13+, MSVC 2022+)
CMake 3.20+
CPU with SIMD support (most modern processors)

Building

# Clone the repository
git clone https://github.com/Space-RF/DB25-sql-tokenizer.git
cd DB25-sql-tokenizer

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

# Run tests
cd build && ctest --output-on-failure

Basic Usage

#include <cstddef>   // std::byte
#include <iostream>  // std::cout
#include <string>    // std::string

#include "simd_tokenizer.hpp"

using namespace db25;

int main() {
    std::string sql = "SELECT * FROM users WHERE age > 21";
    
    SimdTokenizer tokenizer(
        reinterpret_cast<const std::byte*>(sql.data()),
        sql.size()
    );
    
    auto tokens = tokenizer.tokenize();
    
    for (const auto& token : tokens) {
        if (token.type != TokenType::Whitespace) {
            std::cout << token.value << " [" 
                      << token_type_name(token.type) << "]\n";
        }
    }
    
    std::cout << "SIMD Level: " << tokenizer.simd_level() << "\n";
    return 0;
}
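
Since the tokenizer is described as zero-copy, the token values presumably point into the input buffer, which means the `sql` string has to outlive the tokens. The sketch below demonstrates that property with a hypothetical `ViewToken` type and a naive space splitter; neither is part of the DB25 API.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a zero-copy token: it stores a view, not a copy.
struct ViewToken {
    std::string_view text;
};

// Split on single spaces; every token is a window into `source`.
inline std::vector<ViewToken> split_views(std::string_view source) {
    std::vector<ViewToken> out;
    std::size_t start = 0;
    while (start < source.size()) {
        std::size_t end = source.find(' ', start);
        if (end == std::string_view::npos) end = source.size();
        if (end > start) out.push_back({source.substr(start, end - start)});
        start = end + 1;  // step past the separator
    }
    return out;
}

// True when the token's bytes lie inside `buffer`, i.e. nothing was copied.
inline bool points_into(const ViewToken& token, std::string_view buffer) {
    return token.text.data() >= buffer.data() &&
           token.text.data() + token.text.size() <= buffer.data() + buffer.size();
}
```

With a design like this, destroying or reallocating the source `std::string` while tokens are still in use leaves them dangling, so keep the buffer alive for as long as the token stream is read.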

🏗️ Architecture

The tokenizer employs a multi-layered architecture optimized for performance:

┌─────────────────────────────────────────────┐
│           SQL Input Buffer                  │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         CPU Feature Detection               │
│    (SSE4.2/AVX2/AVX-512/NEON)              │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│          SIMD Dispatcher                    │
│  ┌──────────┬──────────┬──────────┐       │
│  │Whitespace│ Keyword  │Identifier│       │
│  │Detection │ Matching │Boundary  │       │
│  └──────────┴──────────┴──────────┘       │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         Zero-Copy Token Stream              │
│         (string_view references)            │
└─────────────────────────────────────────────┘
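
To make the whitespace-detection box in the diagram more concrete, here is a minimal sketch of how a 16-byte SIMD scan can skip whitespace. It is an illustration under stated assumptions, not the DB25 implementation: `skip_whitespace` and `is_space` are hypothetical names, only SSE2 is used (a real dispatcher would also cover AVX2, AVX-512, and NEON), and the SIMD path relies on the GCC/Clang builtin `__builtin_ctz`.

```cpp
#include <cstddef>
#include <string_view>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Scalar definition of whitespace used by both paths.
inline bool is_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Return the index of the first non-whitespace byte at or after `pos`
// (or the end of the input if only whitespace remains).
inline std::size_t skip_whitespace(std::string_view in, std::size_t pos) {
#if defined(__SSE2__)
    const __m128i sp  = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    const __m128i nl  = _mm_set1_epi8('\n');
    const __m128i cr  = _mm_set1_epi8('\r');
    while (pos + 16 <= in.size()) {
        // Unaligned load of 16 input bytes.
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(in.data() + pos));
        // Compare against each whitespace byte and OR the results together.
        const __m128i ws = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(chunk, sp), _mm_cmpeq_epi8(chunk, tab)),
            _mm_or_si128(_mm_cmpeq_epi8(chunk, nl), _mm_cmpeq_epi8(chunk, cr)));
        // Bit i of the mask is set when byte i is whitespace.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(ws));
        if (mask != 0xFFFFu) {
            // The lowest clear bit marks the first non-whitespace byte.
            return pos + static_cast<std::size_t>(__builtin_ctz(~mask));
        }
        pos += 16;  // whole chunk was whitespace
    }
#endif
    // Scalar tail (and full fallback when SSE2 is unavailable).
    while (pos < in.size() && is_space(in[pos])) ++pos;
    return pos;
}
```

The same compare-and-mask pattern extends to wider registers: an AVX2 variant would process 32 bytes per iteration and produce a 32-bit mask.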

Key Components

SIMD Processor Hierarchy: Runtime CPU detection selects optimal instruction set
Keyword System: 208 SQL keywords with O(log n) length-bucketed lookup
Token Types: Keywords, Identifiers, Numbers, Strings, Operators, Delimiters
Memory Management: Zero-copy design with string_view references

See ARCHITECTURE.md for detailed design documentation.
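
The length-bucketed keyword lookup mentioned above can be sketched as follows. This is one plausible reading of "O(log n) length-bucketed lookup", not the project's actual code: the bucket arrays, `in_bucket`, and `is_keyword` are hypothetical, and only a handful of keywords are included instead of the full set of 208.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <cstddef>
#include <string_view>

// Illustrative subset only. Each bucket holds keywords of one length,
// kept sorted so that std::binary_search can be used.
constexpr std::array<std::string_view, 3> kLen2 = {"AS", "BY", "OR"};
constexpr std::array<std::string_view, 2> kLen4 = {"FROM", "JOIN"};
constexpr std::array<std::string_view, 2> kLen5 = {"ORDER", "WHERE"};
constexpr std::array<std::string_view, 2> kLen6 = {"INSERT", "SELECT"};

// Binary search inside a single bucket: O(log k) for k keywords of that length.
template <std::size_t N>
bool in_bucket(const std::array<std::string_view, N>& bucket,
               std::string_view upper) {
    return std::binary_search(bucket.begin(), bucket.end(), upper);
}

// Case-insensitive keyword test: pick the bucket by length, then search it.
inline bool is_keyword(std::string_view word) {
    constexpr std::size_t kMaxLen = 6;  // longest keyword in this subset
    if (word.empty() || word.size() > kMaxLen) return false;

    // Uppercase into a small stack buffer so no heap allocation is needed.
    char buf[kMaxLen];
    for (std::size_t i = 0; i < word.size(); ++i) {
        buf[i] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[i])));
    }
    const std::string_view upper(buf, word.size());

    switch (word.size()) {
        case 2:  return in_bucket(kLen2, upper);
        case 4:  return in_bucket(kLen4, upper);
        case 5:  return in_bucket(kLen5, upper);
        case 6:  return in_bucket(kLen6, upper);
        default: return false;  // no keywords of this length in the subset
    }
}
```

Bucketing by length rejects most identifiers after a single size check, and the binary search then only compares against keywords that could actually match.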

📚 Documentation

Tutorial - Step-by-step guide with examples
Architecture - Detailed system design
Visual Tutorial - Diagrams and visualizations
Contributing - How to contribute
Academic Papers - Research and performance analysis

🧪 Testing

The tokenizer includes comprehensive testing against real-world SQL:

# Build the test binaries
cmake --build build

# Run specific test
./build/test_sql_file test/sql_test.sqls

# Generate verification output
./build/test_sql_file -o

# Verbose mode
./build/test_sql_file -v

The test suite covers 23 SQL queries across 4 complexity levels, all of which pass.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Code style and standards
Development workflow
Testing requirements
Submitting pull requests

📈 Benchmarks

Performance benchmarks are detailed in the documentation:

Token distribution analysis
SIMD operation performance
Comparison with other parsers
Platform-specific results

🔮 Future Vision

The DB25 tokenizer is the foundation for a next-generation SQL processing engine:

Stage 2: Recursive descent parser with LALR(1) grammar
Stage 3: Abstract syntax tree construction
Stage 4: Query optimization and plan generation
Stage 5: JIT compilation for expression evaluation

See our Academic Paper for the complete roadmap.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Chiradip Mandal, Space-RF.org

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

📖 Academic Papers

🙏 Acknowledgments

Intel and ARM for SIMD instruction set architectures
The simdjson project for inspiration on SIMD text processing
The open-source community for valuable feedback

📞 Contact

Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org
GitHub: @chiradip

DB25 SQL Tokenizer - Pushing the boundaries of SQL processing performance
Made with ❤️ by Space-RF.org
