From 59bb1aaea9608c763f25ea552a57fef8b5270e5f Mon Sep 17 00:00:00 2001
From: Arun Sharma
Date: Mon, 23 Feb 2026 21:22:43 -0800
Subject: [PATCH] docs: explain how node table scanning with masks works

---
 docs/scan_node_table_semi_mask.md | 343 ++++++++++++++++++++++++++++++
 1 file changed, 343 insertions(+)
 create mode 100644 docs/scan_node_table_semi_mask.md

diff --git a/docs/scan_node_table_semi_mask.md b/docs/scan_node_table_semi_mask.md
new file mode 100644
index 0000000000..5546e017e5
--- /dev/null
+++ b/docs/scan_node_table_semi_mask.md
@@ -0,0 +1,343 @@
+# Semi Masks and Node Table Scanning in Ladybug
+
+This document explains how `scan_node_table.cpp` interacts with semi masks, the key data structures involved, and how this optimization reduces disk I/O in hash join scenarios.
+
+## Core Data Structure Concepts
+
+### 1. Nodes
+A **node** represents an entity in the graph database. Each node has:
+- A unique **node ID** (`nodeID_t`) consisting of:
+  - `tableID`: The table the node belongs to
+  - `offset`: The position of the node within its table (a table-wide offset, not a per-group one)
+
+Nodes are stored in node tables, which are the primary storage unit for graph entities.
+
+### 2. Node Groups
+A **node group** is a physical storage unit that contains a contiguous range of nodes. Key characteristics:
+- Holds up to a fixed number of nodes (`NODE_GROUP_SIZE`, derived from `NODE_GROUP_SIZE_LOG2`)
+- Each node group is identified by a `node_group_idx_t` (0-indexed)
+- Table-wide and in-group offsets are related by: `global_offset = node_group_idx * NODE_GROUP_SIZE + offset_in_group`
+- Multiple node groups form the complete node table
+
+The scan in `scan_node_table.cpp` iterates over node groups (lines 79-90):
+```cpp
+if (currentCommittedGroupIdx < numCommittedNodeGroups) {
+    nodeScanState.nodeGroupIdx = currentCommittedGroupIdx++;
+    // ... scan this node group
+}
+```
+
+### 3. Column Chunk (ColumnChunkData)
+A **column chunk** stores all values of a single column for a node group:
+- One column chunk per property column per node group
+- Contains the actual data values (e.g., INT64, STRING, etc.)
+- Optionally contains null bitmap data
+- Uses compression for efficient storage
+
+### 4. Data Chunk
+A **data chunk** is the in-memory representation during query execution:
+- Contains multiple **value vectors** (one per column being scanned)
+- Has a selection vector that indicates which rows are valid/selected
+- Capacity is typically `DEFAULT_VECTOR_CAPACITY` (2048 rows)
+
+From `data_chunk.h`:
+```cpp
+// A DataChunk represents tuples as a set of value vectors and a selector array.
+class DataChunk {
+    std::vector<std::shared_ptr<ValueVector>> valueVectors;
+    std::shared_ptr<DataChunkState> state;
+};
+```
+
+### 5. Value Vector
+A **value vector** holds values of a single column:
+- Fixed capacity (1 for sequences, `DEFAULT_VECTOR_CAPACITY` for general use)
+- Contains:
+  - `valueBuffer`: Raw data buffer
+  - `nullMask`: Bitmap indicating null values
+  - `state`: Shared state including selection vector
+
+From `value_vector.h`:
+```cpp
+class ValueVector {
+    LogicalType dataType;
+    std::shared_ptr<DataChunkState> state;
+    std::unique_ptr<uint8_t[]> valueBuffer;
+    NullMask nullMask;
+};
+```
+
+## Semi Masks
+
+### What is a Semi Mask?
+A **semi mask** is a bitmap that tracks which node offsets are "interesting" for a given query. It is used to restrict node table scans to the relevant nodes, avoiding unnecessary disk I/O.
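To make the idea concrete, here is a minimal sketch of a semi mask backed by a plain `std::vector<bool>`. `ToySemiMask` and the half-open `[start, end)` semantics of `range` are assumptions made for illustration; this is not Ladybug's class, and a production mask would use a compressed bitmap instead.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a semi mask; not Ladybug's implementation.
class ToySemiMask {
public:
    explicit ToySemiMask(uint64_t numNodes) : bits(numNodes, false) {}

    // Mark a single node offset as relevant to the query.
    void mask(uint64_t offset) {
        assert(offset < bits.size());
        bits[offset] = true;
    }

    // Offsets outside the table are reported as unmasked.
    bool isMasked(uint64_t offset) const { return offset < bits.size() && bits[offset]; }

    // Return the masked offsets in the half-open range [start, end).
    std::vector<uint64_t> range(uint64_t start, uint64_t end) const {
        std::vector<uint64_t> result;
        for (uint64_t i = start; i < end && i < bits.size(); ++i) {
            if (bits[i]) {
                result.push_back(i);
            }
        }
        return result;
    }

private:
    std::vector<bool> bits;
};
```

A scan that only cares about offsets 3 and 7 would call `mask(3)` and `mask(7)`, then ask `range(0, 5)` to learn that offset 3 is the only row worth reading in that window.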
+
+Key interfaces (from `mask.h`):
+```cpp
+class SemiMask {
+    virtual void mask(offset_t nodeOffset) = 0;                    // Mark a node as interesting
+    virtual void maskRange(offset_t start, offset_t end) = 0;      // Mark a range
+    virtual bool isMasked(offset_t offset) = 0;                    // Check if masked
+    virtual offset_vec_t range(uint32_t start, uint32_t end) = 0;  // Get masked offsets in range
+};
+```
+
+### Implementation: Roaring Bitmaps
+Semi masks are implemented using **Roaring Bitmaps** for memory efficiency:
+- `Roaring32BitmapSemiMask`: For tables with ≤ 2^32 nodes
+- `Roaring64BitmapSemiMask`: For larger tables
+- These provide compressed bitmap storage with fast operations
+
+## How Semi Masks Work with Hash Joins
+
+### Build Side and Probe Side
+In a **hash join**, there are two sides:
+- **Build side**: The smaller table that gets hashed into a hash table
+- **Probe side**: The larger table being probed against the hash table
+
+### Semi Mask Flow in Hash Joins
+
+1. **Build to Probe SIP (Sideways Information Passing)**:
+   - Build side is scanned first
+   - Nodes that match join keys are recorded in the semi mask
+   - When probing, only masked nodes need to be checked
+
+2. **Probe to Build SIP**:
+   - Probe side is scanned first
+   - Build side uses the semi mask to filter what needs to be looked up
+
+From `acc_hash_join_optimizer.cpp`:
+```cpp
+// Try build to probe SIP first
+if (tryBuildToProbeHJSIP(op)) {
+    // Mask is populated from the build side and filters the probe side
+    sipInfo.direction = SIPDirection::BUILD_TO_PROBE;
+}
+// If that fails, try probe to build
+tryProbeToBuildHJSIP(op);
+```
+
+## Reducing Disk I/O: The Key Optimization
+
+### Without Semi Masks
+When scanning a node table during hash join probing:
+- **Every node group must be read from disk** (unless filtered by other predicates)
+- Even if only 1% of nodes match the join condition, 100% of data is read
+
+### With Semi Masks
+The optimization works as follows:
+
+1. **Mask Population Phase** (`semi_masker.cpp`):
+   - A SemiMasker operator runs on one side of the join
+   - It iterates over the result tuples and masks the node IDs:
+   ```cpp
+   bool SingleTableSemiMasker::getNextTuplesInternal(ExecutionContext* context) {
+       // ... get child tuples ...
+       for (auto i = 0u; i < selVector.getSelSize(); i++) {
+           auto nodeID = keyVector->getValue<nodeID_t>(pos);
+           localState->maskSingleTable(nodeID.offset);  // Mark this node
+       }
+   }
+   ```
+
+2. **Scan with Filter** (`node_group.cpp`):
+   - During node table scan, the semi mask is applied:
+   ```cpp
+   bool enableSemiMask = state.source == TableScanSource::COMMITTED
+       && state.semiMask
+       && state.semiMask->isEnabled();
+   if (enableSemiMask) {
+       applySemiMaskFilter(state, numRowsToScan, state.outState->getSelVectorUnsafe());
+       // Only masked rows are kept in the selection vector
+   }
+   ```
+
+3. **The Filter Logic** (`node_group.cpp`):
+   ```cpp
+   void applySemiMaskFilter(const TableScanState& state, row_idx_t numRowsToScan,
+       SelectionVector& selVector) {
+       const auto& arr = state.semiMask->range(startNodeOffset, endNodeOffset);
+       // Keep only offsets that are in the semi mask
+       // All other rows are filtered out
+   }
+   ```
+
+4. **How SelVector Reduces Disk I/O** (`column.cpp`):
+
+   The key insight is that the SelVector is used at the **column scan level** to skip reading unnecessary data from disk:
+
+   ```cpp
+   // From column.cpp - scanSegment function
+   if (!resultVector->state || resultVector->state->getSelVector().isUnfiltered()) {
+       // Unfiltered: read all values
+       columnReadWriter->readCompressedValuesToVector(...);
+   } else {
+       // Filtered: only read values at positions in selVector
+       columnReadWriter->readCompressedValuesToVector(...,
+           Filterer{resultVector->state->getSelVector(), offsetInVector});
+   }
+   ```
+
+   The `Filterer` class (lines 229-249 in `column.cpp`) uses the SelVector to determine which rows to read:
+   ```cpp
+   struct Filterer {
+       bool operator()(offset_t startIdx, offset_t endIdx) {
+           // Only return true if there's a selVector position in this range
+           return (posInSelVector < selVector.getSelSize() &&
+                   isInRange(selVector[posInSelVector] - offsetInVector, startIdx, endIdx));
+       }
+   };
+   ```
+
+   This means:
+   - **Without SelVector**: Each column segment reads ALL compressed values, then filters
+   - **With SelVector**: Column segments are analyzed, and only segments containing valid positions are decompressed/read
+
+   The compression layer (e.g., RLE, bit-packing) can skip entire blocks when no positions in the SelVector fall within that block.
+
+### I/O Reduction Example
+Consider a query finding all friends of a specific user:
+- **Without semi mask**: Scan entire Person node table (millions of rows)
+- **With semi mask**:
+  1. First scan the `knows` relationship table to find all friend node IDs
+  2. Build a semi mask with those node IDs
+  3. When scanning the Person table, only load chunks containing masked nodes
+
+This can reduce I/O by orders of magnitude when the join result is much smaller than the table.
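The saving can be estimated with a small sketch that maps masked offsets to the node groups that contain them, using the `global_offset = node_group_idx * NODE_GROUP_SIZE + offset_in_group` relation. The helper name and the group size in the usage note are hypothetical. The sketch works at whole-group granularity only, whereas the column scan described in this document skips at a finer level.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: which node groups contain at least one masked offset?
// Groups that do not appear in the result hold no relevant rows.
std::set<uint64_t> nodeGroupsWithMaskedNodes(const std::vector<uint64_t>& maskedOffsets,
    uint64_t nodeGroupSize) {
    assert(nodeGroupSize > 0);
    std::set<uint64_t> groups;
    for (uint64_t offset : maskedOffsets) {
        // Integer division inverts the global-offset formula.
        groups.insert(offset / nodeGroupSize);
    }
    return groups;
}
```

With an assumed group size of 1000, the masked offsets `{5, 17, 2500, 2999}` fall into groups 0 and 2 only, so group 1 holds nothing the query needs.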
+
+## Integration in scan_node_table.cpp
+
+The scan operator picks up the semi mask at line 137:
+```cpp
+void ScanNodeTable::initCurrentTable(ExecutionContext* context) {
+    // ...
+    scanState->semiMask = sharedStates[currentTableIdx]->getSemiMask();
+}
+```
+
+And exposes the masks to the hash join (lines 93-100):
+```cpp
+table_id_map_t ScanNodeTable::getSemiMasks() const {
+    for (auto i = 0u; i < sharedStates.size(); ++i) {
+        result.insert({tableInfos[i].table->getTableID(), sharedStates[i]->getSemiMask()});
+    }
+    return result;
+}
+```
+
+## Committed vs. Uncommitted Data
+
+### Local Tables and Semi Masks
+
+Ladybug distinguishes between two sources of data when scanning node tables:
+
+1. **Committed data** (`TableScanSource::COMMITTED`): Data that has been persisted to disk
+2. **Uncommitted data** (`TableScanSource::UNCOMMITTED`): In-memory data from the current transaction (local storage)
+
+#### How Local Tables Work
+
+During a write transaction, new or modified nodes are first stored in the **local storage** (in-memory):
+- Located in `LocalNodeTable`
+- Organized into local node groups
+- Eventually flushed to disk on commit
+
+From `scan_node_table.cpp` (lines 64-70):
+```cpp
+if (transaction->isWriteTransaction()) {
+    if (const auto localTable =
+            transaction->getLocalStorage()->getLocalTable(this->table->getTableID())) {
+        auto& localNodeTable = localTable->cast<LocalNodeTable>();
+        this->numUnCommittedNodeGroups = localNodeTable.getNumNodeGroups();
+    }
+}
+```
+
+The scanner processes committed node groups first, then uncommitted ones:
+```cpp
+if (currentCommittedGroupIdx < numCommittedNodeGroups) {
+    nodeScanState.nodeGroupIdx = currentCommittedGroupIdx++;
+    nodeScanState.source = TableScanSource::COMMITTED;
+    // ... scan from disk
+}
+if (currentUnCommittedGroupIdx < numUnCommittedNodeGroups) {
+    nodeScanState.nodeGroupIdx = currentUnCommittedGroupIdx++;
+    nodeScanState.source = TableScanSource::UNCOMMITTED;
+    // ... scan from local storage
+}
+```
+
+#### Semi Masks and Disk Block Skipping
+
+**Critical insight**: Semi masks are ONLY applied when scanning COMMITTED (disk) data:
+
+From `node_group.cpp` (lines 238-239 and 256-257):
+```cpp
+bool enableSemiMask =
+    state.source == TableScanSource::COMMITTED && state.semiMask && state.semiMask->isEnabled();
+```
+
+This is an important design decision:
+- **Disk blocks**: Can skip reading entire blocks using semi masks (huge I/O savings)
+- **Local storage**: Never filtered by the semi mask (in-memory, so typically small)
+
+#### Why Semi Masks Don't Apply to Local Storage
+
+The SelVector is used extensively in local storage operations, but not for semi-mask filtering:
+
+1. **Local table scanning uses SelVector for different purposes** (`local_rel_table.cpp`):
+   ```cpp
+   // Setting up which rows to scan from local storage
+   localScanState.rowIdxVector->state->getSelVectorUnsafe().setSelSize(numToScan);
+
+   // For intersection operations in local storage
+   scanChunk.state->getSelVectorUnsafe().setSelSize(intersectRows.size());
+   ```
+
+2. **Local storage is accessed via lookup operations**:
+   - Local storage (`LocalNodeTable`, `LocalRelTable`) uses direct lookups rather than filtered scans
+   - The data is already in memory, so selective reading provides less benefit
+   - Operations like inserts/deletes use the SelVector to identify specific rows to modify
+
+3. **No semi mask application in local path**:
+   - From `node_table.cpp` lines 271-276, when `source == TableScanSource::UNCOMMITTED`, the local NodeGroup is retrieved directly
+   - The semi mask check in `node_group.cpp` explicitly requires `COMMITTED` source
+   - This means the semi-mask filtering path is never triggered for local data
+
+4. **Local storage lookup pattern** (`local_rel_table.cpp` lines 234-235):
+   ```cpp
+   [[maybe_unused]] auto lookupRes =
+       localNodeGroup->lookupMultiple(transaction, localScanState);
+   ```
+   This uses direct lookups rather than filtered scans.
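The committed/uncommitted distinction can be summarized in a small sketch. `ToyScanSource` and `selectRows` are hypothetical names, and a plain `std::vector<bool>` stands in for the mask. The only thing taken from the source is the shape of the `enableSemiMask` condition: a mask must be present and the data must come from the committed source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of TableScanSource, for illustration only.
enum class ToyScanSource { COMMITTED, UNCOMMITTED };

// Return the row offsets in [start, start + numRows) that a scan would keep.
// The mask is consulted only for committed data; a null mask means "no filter".
std::vector<uint64_t> selectRows(ToyScanSource source, const std::vector<bool>* mask,
    uint64_t start, uint64_t numRows) {
    assert(start + numRows >= start);  // guard against overflow
    const bool enableSemiMask = source == ToyScanSource::COMMITTED && mask != nullptr;
    std::vector<uint64_t> selected;
    for (uint64_t offset = start; offset < start + numRows; ++offset) {
        if (enableSemiMask) {
            // Offsets beyond the end of the mask are treated as unmasked.
            if (offset >= mask->size() || !(*mask)[offset]) {
                continue;
            }
        }
        selected.push_back(offset);
    }
    return selected;
}
```

Given a mask that marks offsets 1 and 3, a committed scan over four rows keeps only those two, while an uncommitted scan over the same range keeps all four.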
+
+#### Why This Design?
+
+1. **Disk I/O is the bottleneck**: Disk reads are orders of magnitude slower than memory access, so optimizing disk reads provides the biggest benefit
+
+2. **Local tables are typically small**: In a well-designed workload, uncommitted data is a small fraction of the table
+
+3. **Simplified consistency**: Semi masks are built from join results which themselves come from various sources; applying them consistently to local storage would require additional complexity
+
+4. **Post-commit optimization**: After the transaction commits, uncommitted data becomes committed and future queries can benefit from semi mask filtering
+
+#### Practical Impact
+
+When executing a hash join in a write transaction:
+1. **Build side** may include both committed and uncommitted nodes
+2. **Probe side** benefits from semi mask filtering when scanning committed (disk) data
+3. Uncommitted local data is read without semi-mask filtering (but is typically small)
+
+This design maximizes I/O savings where they matter most while keeping the implementation straightforward.
+
+## Summary
+
+Semi masks provide a crucial optimization for graph database queries involving joins:
+
+1. **Concept**: Bitmap tracking which nodes are relevant to the query
+2. **Storage**: Roaring bitmaps for memory efficiency
+3. **Hash Join Integration**: Build side passes information to probe side (or vice versa)
+4. **SelVector Conversion**: Semi mask is converted to a SelVector in `applySemiMaskFilter()` which specifies which row positions are valid
+5. **I/O Reduction**: Column scan uses the SelVector via a `Filterer` to skip reading disk blocks that contain no relevant data
+6. **Local Tables**: Semi masks only apply to committed (disk) data; local storage uses direct lookups and is never mask-filtered (but is typically small)
+
+This is especially impactful in graph workloads where pattern matching often involves finding small subsets of large node tables.