space-rf-org/DB25-sql-tokenizer

DB25 SQL Tokenizer

License: MIT C++23 SIMD Performance

A high-performance SQL tokenizer that uses SIMD instructions for fast lexical analysis. Part of the DB25 project, it reaches a throughput of more than 20 million tokens per second on modern hardware.

Author: Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org

🚀 Features

  • SIMD Acceleration: Automatic CPU feature detection (SSE4.2, AVX2, AVX-512, ARM NEON)
  • Zero-Copy Design: String views eliminate memory allocation overhead
  • 4.5× Faster: Compared to traditional scalar implementations
  • Grammar-Driven: Keywords extracted directly from EBNF specification
  • Cross-Platform: Supports x86_64 and ARM64 architectures
  • Thread-Safe: Lock-free design for concurrent tokenization
  • Production-Ready: Comprehensive test suite with 100% pass rate
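
To illustrate the zero-copy claim above, here is a minimal scalar sketch in which every token is a `std::string_view` into the caller's buffer. `ViewToken` and `split_zero_copy` are hypothetical names used for illustration only; they are not part of the DB25 API, and the real tokenizer classifies tokens rather than just splitting on whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical token type for illustration; not the DB25 Token definition.
struct ViewToken {
    std::string_view text;  // points into the caller's buffer, no copy
};

// Split `input` on whitespace. Each token is a view into `input`, so the
// only allocation is the vector holding the (pointer, length) pairs.
std::vector<ViewToken> split_zero_copy(std::string_view input) {
    std::vector<ViewToken> out;
    std::size_t i = 0;
    while (i < input.size()) {
        // Skip a run of whitespace.
        while (i < input.size() &&
               std::isspace(static_cast<unsigned char>(input[i]))) {
            ++i;
        }
        const std::size_t start = i;
        // Consume a run of non-whitespace characters.
        while (i < input.size() &&
               !std::isspace(static_cast<unsigned char>(input[i]))) {
            ++i;
        }
        if (i > start) {
            out.push_back({input.substr(start, i - start)});
        }
    }
    return out;
}
```

Because the tokens only reference the input, the input buffer has to stay alive and unmodified for as long as the tokens are in use.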

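📊 Performance

| Query Complexity | Throughput | Tokens/Second | Speedup vs Scalar |
|------------------|------------|---------------|-------------------|
| Simple           | 8.5 MB/s   | 1.2M          | 4.0×              |
| Moderate         | 9.2 MB/s   | 2.8M          | 4.0×              |
| Complex          | 11.8 MB/s  | 5.3M          | 4.2×              |
| Extreme          | 14.6 MB/s  | 8.9M          | 4.3×              |
| Overall          | 17.7 MB/s  | 20M+          | 4.5×              |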
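
For readers who want to reproduce figures of this kind, the two rate columns can be derived from a token count, a byte count, and an elapsed time. The helper below is a sketch under stated assumptions: `Throughput` and `compute_throughput` are hypothetical names, and it treats 1 MB as 1,000,000 bytes, which the README does not specify.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical result type; not part of the DB25 API.
struct Throughput {
    double tokens_per_second;
    double megabytes_per_second;  // assumes 1 MB = 1,000,000 bytes
};

// Convert raw counts and a wall-clock duration into the two rates.
// Returns zeros for a non-positive duration instead of dividing by zero.
Throughput compute_throughput(std::size_t tokens,
                              std::size_t bytes,
                              std::chrono::duration<double> elapsed) {
    const double seconds = elapsed.count();
    if (seconds <= 0.0) {
        return {0.0, 0.0};
    }
    return {static_cast<double>(tokens) / seconds,
            static_cast<double>(bytes) / 1e6 / seconds};
}
```

In practice the duration would come from `std::chrono::steady_clock` readings taken around repeated tokenizer runs, so that timer resolution does not dominate the measurement.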

🛠️ Quick Start

Prerequisites

  • C++23 compatible compiler (Clang 15+, GCC 13+, MSVC 2022+)
  • CMake 3.20+
  • CPU with SIMD support (most modern processors)

Building

# Clone the repository
git clone https://github.com/Space-RF/DB25-sql-tokenizer.git
cd DB25-sql-tokenizer

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

# Run tests
cd build && ctest --output-on-failure

Basic Usage

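#include <cstddef>   // std::byte
#include <iostream>
#include <string>

#include "simd_tokenizer.hpp"

using namespace db25;

int main() {
    // The tokenizer is zero-copy, so `sql` must outlive the returned tokens.
    std::string sql = "SELECT * FROM users WHERE age > 21";

    // The constructor takes a raw byte pointer and a length.
    SimdTokenizer tokenizer(
        reinterpret_cast<const std::byte*>(sql.data()),
        sql.size()
    );

    auto tokens = tokenizer.tokenize();

    // Print every non-whitespace token together with its type name.
    for (const auto& token : tokens) {
        if (token.type != TokenType::Whitespace) {
            std::cout << token.value << " ["
                      << token_type_name(token.type) << "]\n";
        }
    }

    // Report which SIMD instruction set was selected at runtime.
    std::cout << "SIMD Level: " << tokenizer.simd_level() << "\n";
    return 0;
}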

🏗️ Architecture

The tokenizer employs a multi-layered architecture optimized for performance:

┌─────────────────────────────────────────────┐
│           SQL Input Buffer                  │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         CPU Feature Detection               │
│    (SSE4.2/AVX2/AVX-512/NEON)               │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│          SIMD Dispatcher                    │
│  ┌──────────┬──────────┬──────────┐         │
│  │Whitespace│ Keyword  │Identifier│         │
│  │Detection │ Matching │Boundary  │         │
│  └──────────┴──────────┴──────────┘         │
└─────────────────┬───────────────────────────┘
                  │
┌─────────────────▼───────────────────────────┐
│         Zero-Copy Token Stream              │
│         (string_view references)            │
└─────────────────────────────────────────────┘
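
The detection and dispatch layers in the diagram can be sketched roughly as follows. This is an illustration under assumptions, not the DB25 implementation: `SimdLevel`, `detect_simd_level`, `count_ws_scalar`, `count_ws_sse2`, and `select_counter` are hypothetical names, detection relies on the GCC/Clang builtin `__builtin_cpu_supports`, and the only vector kernel shown is an SSE2 whitespace counter standing in for the wider kernels.

```cpp
#include <cstddef>
#include <string_view>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics, baseline on x86_64
#endif

// Hypothetical enum for illustration; not the DB25 type.
enum class SimdLevel { Scalar, Sse42, Avx2, Avx512, Neon };

// Probe the CPU once at startup and report the widest supported level.
SimdLevel detect_simd_level() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return SimdLevel::Avx512;
    if (__builtin_cpu_supports("avx2"))    return SimdLevel::Avx2;
    if (__builtin_cpu_supports("sse4.2"))  return SimdLevel::Sse42;
    return SimdLevel::Scalar;
#elif defined(__aarch64__)
    return SimdLevel::Neon;  // assumption: NEON is available on AArch64 targets
#else
    return SimdLevel::Scalar;
#endif
}

inline bool is_sql_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Portable fallback: one byte per iteration.
std::size_t count_ws_scalar(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) n += is_sql_space(c) ? 1 : 0;
    return n;
}

#if defined(__x86_64__) || defined(_M_X64)
// SSE2 kernel: classify 16 bytes per iteration, then finish the tail scalar.
std::size_t count_ws_sse2(std::string_view s) {
    const __m128i sp  = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    const __m128i nl  = _mm_set1_epi8('\n');
    const __m128i cr  = _mm_set1_epi8('\r');
    std::size_t n = 0;
    std::size_t i = 0;
    for (; i + 16 <= s.size(); i += 16) {
        const __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(s.data() + i));
        const __m128i hit = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(chunk, sp), _mm_cmpeq_epi8(chunk, tab)),
            _mm_or_si128(_mm_cmpeq_epi8(chunk, nl), _mm_cmpeq_epi8(chunk, cr)));
        // One mask bit per byte; count the set bits.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(hit));
        for (; mask != 0; mask &= mask - 1) ++n;
    }
    return n + count_ws_scalar(s.substr(i));
}
#endif

using CountFn = std::size_t (*)(std::string_view);

// Dispatcher: map the detected level to a kernel. A fuller version would
// return dedicated AVX2, AVX-512, and NEON kernels for those levels.
CountFn select_counter(SimdLevel level) {
#if defined(__x86_64__) || defined(_M_X64)
    if (level != SimdLevel::Scalar) return count_ws_sse2;
#else
    (void)level;
#endif
    return count_ws_scalar;
}
```

Resolving the function pointer once and reusing it keeps the per-call cost to a single indirect call, whichever kernel was chosen.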

Key Components

  1. SIMD Processor Hierarchy: Runtime CPU detection selects optimal instruction set
  2. Keyword System: 208 SQL keywords with O(log n) length-bucketed lookup
  3. Token Types: Keywords, Identifiers, Numbers, Strings, Operators, Delimiters
  4. Memory Management: Zero-copy design with string_view references
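
As a rough illustration of the length-bucketed keyword lookup in item 2, the sketch below groups a handful of keywords by length, keeps each bucket sorted, and runs a binary search inside the matching bucket. The keyword subset, the bucket layout, and the names `bucket_for_length` and `is_keyword` are assumptions made for this example; the actual table holds 208 keywords extracted from the grammar and may be organized differently.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <cstddef>
#include <string_view>

// Illustrative subset only; each bucket is sorted for binary search.
constexpr std::array<std::string_view, 3> kLen2 = {"AS", "BY", "ON"};
constexpr std::array<std::string_view, 3> kLen3 = {"AND", "ASC", "NOT"};
constexpr std::array<std::string_view, 3> kLen4 = {"FROM", "JOIN", "NULL"};
constexpr std::array<std::string_view, 3> kLen5 = {"GROUP", "ORDER", "WHERE"};
constexpr std::array<std::string_view, 3> kLen6 = {"INSERT", "SELECT", "UPDATE"};

struct Bucket {
    const std::string_view* begin;
    const std::string_view* end;
};

// Pick the bucket for a word length in O(1); an empty bucket means no
// keyword of that length exists in this sample table.
Bucket bucket_for_length(std::size_t len) {
    switch (len) {
        case 2: return {kLen2.data(), kLen2.data() + kLen2.size()};
        case 3: return {kLen3.data(), kLen3.data() + kLen3.size()};
        case 4: return {kLen4.data(), kLen4.data() + kLen4.size()};
        case 5: return {kLen5.data(), kLen5.data() + kLen5.size()};
        case 6: return {kLen6.data(), kLen6.data() + kLen6.size()};
        default: return {nullptr, nullptr};
    }
}

// Case-insensitive keyword test: O(1) bucket selection followed by an
// O(log n) binary search within the bucket.
bool is_keyword(std::string_view word) {
    const Bucket b = bucket_for_length(word.size());
    if (b.begin == b.end) return false;

    // Upper-case into a small stack buffer; bucketed lengths are at most 6.
    std::array<char, 8> buf{};
    for (std::size_t i = 0; i < word.size(); ++i) {
        buf[i] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[i])));
    }
    const std::string_view upper(buf.data(), word.size());
    return std::binary_search(b.begin, b.end, upper);
}
```

Bucketing by length first means most identifiers are rejected or narrowed to a few candidates before any string comparison runs.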

See ARCHITECTURE.md for detailed design documentation.

📚 Documentation

🧪 Testing

The tokenizer includes comprehensive testing against real-world SQL:

# Build the test binaries
cmake --build build

# Run specific test
./build/test_sql_file test/sql_test.sqls

# Generate verification output
./build/test_sql_file -o

# Verbose mode
./build/test_sql_file -v

The test suite covers 23 SQL queries across 4 complexity levels, all of which pass.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Code style and standards
  • Development workflow
  • Testing requirements
  • Submitting pull requests

📈 Benchmarks

Performance benchmarks are detailed in the documentation:

  • Token distribution analysis
  • SIMD operation performance
  • Comparison with other parsers
  • Platform-specific results

🔮 Future Vision

The DB25 tokenizer is the foundation for a next-generation SQL processing engine:

  • Stage 2: Recursive descent parser with LALR(1) grammar
  • Stage 3: Abstract syntax tree construction
  • Stage 4: Query optimization and plan generation
  • Stage 5: JIT compilation for expression evaluation

See our Academic Paper for the complete roadmap.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Chiradip Mandal, Space-RF.org

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

📖 Academic Papers

🙏 Acknowledgments

  • Intel and ARM for SIMD instruction set architectures
  • The simdjson project for inspiration on SIMD text processing
  • The open-source community for valuable feedback

📞 Contact

Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org
GitHub: @chiradip


DB25 SQL Tokenizer - Pushing the boundaries of SQL processing performance
Made with ❤️ by Space-RF.org
