Store tags in compressed segments instead of in the metadata Delta Lake #293

Merged
CGodiksen merged 70 commits into main from dev/tags-in-data
Mar 6, 2025

@CGodiksen
Collaborator

This PR implements #197 by storing tags in the compressed segments instead of in the metadata Delta Lake and removing univariate ids from the entire code base.

The ingestion process has been updated to no longer look up and save the tag hash using the table metadata manager when inserting data points. Instead, we calculate a tag hash in the uncompressed data manager that is only used internally for managing uncompressed multivariate time series during ingestion. This tag hash is not used beyond compression since the compressed segments are stored per table.
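
As a rough illustration of the internal tag hash described above, the following is a minimal sketch and not the code in this PR. `calculate_tag_hash`, `UncompressedBuffers`, and the choice of `DefaultHasher` are assumptions made for the example; the point is only that the hash groups data points of the same multivariate time series in memory and never needs to be persisted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash the tag values that identify one multivariate
/// time series. The hash is only meant to be used in memory during ingestion.
fn calculate_tag_hash(tag_values: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    // Hashing the slice includes its length and a terminator per string, so
    // ["ab", "c"] and ["a", "bc"] are hashed as different inputs.
    tag_values.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical stand-in for the buffers of an uncompressed data manager:
/// one buffer of (timestamp, field values) per tag hash.
#[derive(Default)]
struct UncompressedBuffers {
    buffers: HashMap<u64, Vec<(i64, Vec<f32>)>>,
}

impl UncompressedBuffers {
    /// Append a data point to the buffer of the time series with these tags.
    fn insert(&mut self, tag_values: &[String], timestamp: i64, values: Vec<f32>) {
        let tag_hash = calculate_tag_hash(tag_values);
        self.buffers
            .entry(tag_hash)
            .or_default()
            .push((timestamp, values));
    }

    /// Number of distinct time series currently buffered.
    fn series_count(&self) -> usize {
        self.buffers.len()
    }
}
```

In this sketch, no metadata lookup is involved when a data point arrives: the hash is derived from the tag values themselves and discarded once the buffer has been compressed.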

Compression has been updated to no longer store the univariate id in the compressed segments and instead store the tag columns. The compressed data manager has also been updated so we now use the tag columns for sorting instead of the univariate id when saving data to disk.
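
The change in sort key can be sketched as follows. This is an illustrative example under stated assumptions, not the compressed data manager's actual code: the `CompressedSegment` struct and `sort_segments_by_tags` are hypothetical, and using the start time as a tie-breaker is an assumption of the sketch.

```rust
/// Hypothetical, simplified compressed segment that carries its tag values
/// directly instead of a univariate id.
#[derive(Debug, Clone, PartialEq)]
struct CompressedSegment {
    tags: Vec<String>,
    start_time: i64,
    end_time: i64,
}

/// Sort segments by their tag values and then, as an assumed tie-breaker,
/// by start time, so segments of the same time series end up adjacent.
fn sort_segments_by_tags(segments: &mut [CompressedSegment]) {
    segments.sort_by(|a, b| {
        a.tags
            .cmp(&b.tags)
            .then(a.start_time.cmp(&b.start_time))
    });
}
```

Comparing `Vec<String>` values is lexicographic across the tag columns, so the first tag column dominates the order and later columns only break ties.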

The query process has also been updated accordingly. ParquetExec now uses a schema that is based on the specific model table, GridExec reconstructs data points that include the tag columns in each row, and SortedJoinExec now reconstructs the query result in the requested return order using the tag columns in the first field column instead of using the metadata Delta Lake.
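
To make the reconstruction step concrete, here is a minimal sketch of turning one segment into rows that each carry the tag values. It is not GridExec itself: `ConstantSegment`, `DataPointRow`, and `reconstruct_rows` are hypothetical names, and the segment is assumed to represent a single constant value at a fixed sampling interval purely to keep the example short.

```rust
/// Hypothetical reconstructed row: a data point plus the tag values of the
/// segment it came from.
#[derive(Debug, Clone, PartialEq)]
struct DataPointRow {
    timestamp: i64,
    value: f32,
    tags: Vec<String>,
}

/// Hypothetical segment that represents a constant value sampled at a fixed
/// interval between `start_time` and `end_time` (both inclusive).
struct ConstantSegment {
    tags: Vec<String>,
    start_time: i64,
    end_time: i64,
    /// Must be greater than zero, since `step_by` panics on a step of zero.
    sampling_interval: usize,
    value: f32,
}

/// Expand a segment into one row per timestamp and copy the segment's tag
/// values into every row, so no metadata lookup is needed afterwards.
fn reconstruct_rows(segment: &ConstantSegment) -> Vec<DataPointRow> {
    (segment.start_time..=segment.end_time)
        .step_by(segment.sampling_interval)
        .map(|timestamp| DataPointRow {
            timestamp,
            value: segment.value,
            tags: segment.tags.clone(),
        })
        .collect()
}
```

Because every row already contains its tags, a later join step can restore the requested column order from the rows alone.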

Note that the objective of this PR is to implement the base functionality for tags in data, meaning further optimizations are required. #292 has been created to look into further optimizations.

@CGodiksen CGodiksen self-assigned this Feb 27, 2025
@CGodiksen CGodiksen linked an issue Feb 27, 2025 that may be closed by this pull request
@CGodiksen CGodiksen requested a review from chrthomsen February 27, 2025 22:32
Comment thread crates/modelardb_server/src/storage/data_sinks.rs Outdated
Comment thread crates/modelardb_server/src/storage/uncompressed_data_manager.rs Outdated
Comment thread crates/modelardb_server/src/storage/uncompressed_data_manager.rs Outdated
Comment thread crates/modelardb_storage/src/query/sorted_join_exec.rs Outdated
Comment thread crates/modelardb_compression/src/compression.rs Outdated
Comment thread crates/modelardb_compression/src/compression.rs
Comment thread crates/modelardb_compression/src/types.rs Outdated
Comment thread crates/modelardb_compression/src/types.rs Outdated
Comment thread crates/modelardb_compression/src/types.rs
Comment thread crates/modelardb_storage/src/query/model_table.rs
Comment thread crates/modelardb_storage/src/query/model_table.rs
Comment thread crates/modelardb_storage/src/query/model_table.rs
Comment thread crates/modelardb_storage/src/query/sorted_join_exec.rs
Comment thread crates/modelardb_storage/src/query/sorted_join_exec.rs
@CGodiksen CGodiksen requested a review from skejserjensen March 5, 2025 17:06
@CGodiksen CGodiksen merged commit c2ee88d into main Mar 6, 2025
4 checks passed
@CGodiksen CGodiksen deleted the dev/tags-in-data branch March 6, 2025 09:00

Development

Successfully merging this pull request may close these issues.

Evaluate removing uid by storing tags in segments

3 participants