A high-performance SQL tokenizer leveraging SIMD instructions for blazing-fast lexical analysis. Part of the DB25 project, achieving 20+ million tokens per second throughput on modern hardware.
Author: Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org
- SIMD Acceleration: Automatic CPU feature detection (SSE4.2, AVX2, AVX-512, ARM NEON)
- Zero-Copy Design: String views eliminate memory allocation overhead
- 4.5ร Faster: Compared to traditional scalar implementations
- Grammar-Driven: Keywords extracted directly from EBNF specification
- Cross-Platform: Supports x86_64 and ARM64 architectures
- Thread-Safe: Lock-free design for concurrent tokenization
- Production-Ready: Comprehensive test suite with 100% pass rate
| Query Complexity | Throughput | Tokens/Second | Speedup vs Scalar |
|---|---|---|---|
| Simple | 8.5 MB/s | 1.2M | 4.0ร |
| Moderate | 9.2 MB/s | 2.8M | 4.0ร |
| Complex | 11.8 MB/s | 5.3M | 4.2ร |
| Extreme | 14.6 MB/s | 8.9M | 4.3ร |
| Overall | 17.7 MB/s | 20M+ | 4.5ร |
- C++23 compatible compiler (Clang 15+, GCC 13+, MSVC 2022+)
- CMake 3.20+
- CPU with SIMD support (most modern processors)
# Clone the repository
git clone https://github.com/Space-RF/DB25-sql-tokenizer.git
cd DB25-sql-tokenizer
# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4
# Run tests
cd build && ctest --output-on-failure#include "simd_tokenizer.hpp"
using namespace db25;
int main() {
std::string sql = "SELECT * FROM users WHERE age > 21";
SimdTokenizer tokenizer(
reinterpret_cast<const std::byte*>(sql.data()),
sql.size()
);
auto tokens = tokenizer.tokenize();
for (const auto& token : tokens) {
if (token.type != TokenType::Whitespace) {
std::cout << token.value << " ["
<< token_type_name(token.type) << "]\n";
}
}
std::cout << "SIMD Level: " << tokenizer.simd_level() << "\n";
return 0;
}The tokenizer employs a multi-layered architecture optimized for performance:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SQL Input Buffer โ
โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CPU Feature Detection โ
โ (SSE4.2/AVX2/AVX-512/NEON) โ
โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SIMD Dispatcher โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโ โ
โ โWhitespaceโ Keyword โIdentifierโ โ
โ โDetection โ Matching โBoundary โ โ
โ โโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Zero-Copy Token Stream โ
โ (string_view references) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
- SIMD Processor Hierarchy: Runtime CPU detection selects optimal instruction set
- Keyword System: 208 SQL keywords with O(log n) length-bucketed lookup
- Token Types: Keywords, Identifiers, Numbers, Strings, Operators, Delimiters
- Memory Management: Zero-copy design with string_view references
See ARCHITECTURE.md for detailed design documentation.
- Tutorial - Step-by-step guide with examples
- Architecture - Detailed system design
- Visual Tutorial - Diagrams and visualizations
- Contributing - How to contribute
- Academic Papers - Research and performance analysis
The tokenizer includes comprehensive testing against real-world SQL:
# Build and run tests
cmake --build build
# Run specific test
./build/test_sql_file test/sql_test.sqls
# Generate verification output
./build/test_sql_file -o
# Verbose mode
./build/test_sql_file -vTest coverage includes 23 SQL queries across 4 complexity levels with 100% pass rate.
We welcome contributions! Please see our Contributing Guide for details on:
- Code style and standards
- Development workflow
- Testing requirements
- Submitting pull requests
Performance benchmarks are detailed in the documentation:
- Token distribution analysis
- SIMD operation performance
- Comparison with other parsers
- Platform-specific results
The DB25 tokenizer is the foundation for a next-generation SQL processing engine:
- Stage 2: Recursive descent parser with LALR(1) grammar
- Stage 3: Abstract syntax tree construction
- Stage 4: Query optimization and plan generation
- Stage 5: JIT compilation for expression evaluation
See our Academic Paper for the complete roadmap.
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2024 Chiradip Mandal, Space-RF.org
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
- Intel and ARM for SIMD instruction set architectures
- The simdjson project for inspiration on SIMD text processing
- The open-source community for valuable feedback
Chiradip Mandal
Email: chiradip@chiradip.com
Organization: Space-RF.org
GitHub: @chiradip
DB25 SQL Tokenizer - Pushing the boundaries of SQL processing performance
Made with โค๏ธ by Space-RF.org