Store tags in compressed segments instead of in the metadata Delta Lake by CGodiksen · Pull Request #293 · ModelarData/ModelarDB-RS

CGodiksen · 2025-02-27T21:18:55Z

This PR implements #197 by storing tags in the compressed segments instead of in the metadata Delta Lake and removing univariate ids from the entire code base.

The ingestion process has been updated to no longer look up and save the tag hash using the table metadata manager when inserting data points. Instead, the uncompressed data manager calculates a tag hash that is only used internally to manage uncompressed multivariate time series during ingestion. This tag hash is not used beyond compression since the compressed segments are stored per table.
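To illustrate the idea of an internal-only tag hash, here is a minimal sketch using only the Rust standard library. The names `calculate_tag_hash` and `insert_data_point`, the choice of `DefaultHasher`, and the in-memory buffer layout are assumptions made for illustration and are not taken from the ModelarDB-RS code base.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: derive a key from the table name and the tag values.
/// The key only groups uncompressed data points in memory and is never
/// written to the compressed segments.
fn calculate_tag_hash(table_name: &str, tag_values: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    table_name.hash(&mut hasher);
    // Hashing the slice also hashes its length and each string in order,
    // so ["ab", "c"] and ["a", "bc"] produce different input to the hasher.
    tag_values.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical ingestion step: append a (timestamp, value) pair to the
/// buffer of the time series identified by its tag hash.
fn insert_data_point(
    buffers: &mut HashMap<u64, Vec<(i64, f32)>>,
    table_name: &str,
    tag_values: &[String],
    timestamp: i64,
    value: f32,
) -> u64 {
    let tag_hash = calculate_tag_hash(table_name, tag_values);
    buffers.entry(tag_hash).or_default().push((timestamp, value));
    tag_hash
}
```

In this sketch, two data points with the same table name and tag values land in the same buffer, while a different tag value starts a new buffer. No lookup in a metadata store is needed to find the buffer.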

Compression has been updated to no longer store the univariate id in the compressed segments and instead store the tag columns. The compressed data manager has also been updated so we now use the tag columns for sorting instead of the univariate id when saving data to disk.
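The change in sort key can be sketched with plain structs. This is a simplified illustration, not the project's implementation: the actual code operates on Apache Arrow record batches, and the `SegmentRow` type, the function name, and the exact sort order (tag values, then field column index, then start time) are assumptions.

```rust
/// Hypothetical, simplified stand-in for one row of compressed segment data.
#[derive(Debug, Clone, PartialEq)]
struct SegmentRow {
    /// Values of the tag columns, in schema order.
    tag_values: Vec<String>,
    /// Index of the field column the segment belongs to.
    field_column: u16,
    /// Timestamp of the first data point in the segment.
    start_time: i64,
}

/// Sort segments by their tag values instead of by a univariate id.
/// Ties are broken by field column and then by start time (assumed order).
fn sort_segments_by_tags(segments: &mut [SegmentRow]) {
    segments.sort_by(|a, b| {
        a.tag_values
            .cmp(&b.tag_values)
            .then(a.field_column.cmp(&b.field_column))
            .then(a.start_time.cmp(&b.start_time))
    });
}
```

Sorting on the tag values keeps segments from the same time series adjacent in a file, which is what the univariate id previously achieved.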

The query process has also been updated accordingly. ParquetExec now uses a schema that is based on the specific model table, GridExec reconstructs data points that include the tag columns in each row, and SortedJoinExec now reconstructs the query result in the requested return order using the tag columns in the first field column instead of using the metadata Delta Lake.
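As a rough sketch of reconstructing data points that carry their tag values, consider a segment represented by a single constant value over a regular time interval. The types `ConstantSegment` and `DataPoint` and the function `reconstruct_data_points` are hypothetical, and the constant-value model is chosen only to keep the example short; the real GridExec works on Apache Arrow arrays and supports the project's own model types.

```rust
/// Hypothetical reconstructed row: timestamp, value, and the tag values
/// copied from the segment the row was reconstructed from.
#[derive(Debug, Clone, PartialEq)]
struct DataPoint {
    timestamp: i64,
    value: f32,
    tag_values: Vec<String>,
}

/// Hypothetical segment that represents every value in
/// [start_time, end_time] with one constant.
struct ConstantSegment {
    start_time: i64,
    end_time: i64,
    sampling_interval: i64,
    value: f32,
    tag_values: Vec<String>,
}

/// Expand a segment into one row per timestamp. Each row includes the tag
/// values, so no separate metadata lookup is needed to label the rows.
fn reconstruct_data_points(segment: &ConstantSegment) -> Vec<DataPoint> {
    // A non-positive interval cannot form a grid, so return no rows.
    if segment.sampling_interval <= 0 {
        return Vec::new();
    }
    (segment.start_time..=segment.end_time)
        .step_by(segment.sampling_interval as usize)
        .map(|timestamp| DataPoint {
            timestamp,
            value: segment.value,
            tag_values: segment.tag_values.clone(),
        })
        .collect()
}
```

Because every reconstructed row already contains its tags, a later join step can in principle take the tag columns from one field column's rows rather than resolving them through a separate mapping.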

It should be noted that the objective of this PR is to implement the base functionality for tags in data, meaning that further optimizations are required. #292 has been created to look into further optimizations.

CGodiksen added 30 commits February 27, 2025 00:22

Fixed outdated documentation

879e0cb

Add table name to file path for spilled buffers

cc89194

Use the table name in the spilled buffer file path when initializing

b00ebd6

Remove model_table_hash_table_name from metadata Delta Lake

a78eae8

Remove limitation of 1024 on number of field columns

c5baba5

No longer save tag metadata when inserting data points

999576d

Use a new function to calculate tag hash outside table metadata manager

6794ffc

Remove methods to lookup and save tag hash metadata

cd5e9bc

Remove tag cache from table metadata manager

c4d0c90

Remove mapping_from_hash_to_tags()

92831e3

Remove model_table_tags table from metadata Delta Lake

13e3ad3

Remove method to truncate table metadata

ec961f6

Remove separate schema for uncompressed data

e7e1aed

Include tag values in uncompressed data buffer data

be01035

Add a test method to get uncompressed data for a model table

9ada71e

Add method to get column arrays from model table metadata

91723dc

Use method to get column arrays instead of doing it manually

af7b440

Fix tests after changes to uncompressed data buffers

84c08a1

Pass tag values and field column index to try_compress() instead of u…

d2e6f1d

…nivariate ID

Remove UNCOMPRESSED_SCHEMA

44dd8b5

Remove univariate_ids from macros

5b95831

Remove methods to convert univariate ids between int64 and uint64

2bf2739

Remove DISK schemas

0040f10

Add compressed schema to model table metadata

a7207d3

Update compression to use tag values instead of univariate id

68c9bdb

Fix calls to try_compress() in tests

f125d0e

Use compressed schema with tag column in test util function

a1d3e1a

Use model table compressed schema in compressed data buffer

ff30a73

Sort compressed segment files by tag columns instead of univariate id

930a109

Use compressed schema with tag columns when creating model tables in …

e066424

…delta lake

CGodiksen added 6 commits February 27, 2025 00:22

No longer use tag_column_indices when checking for tag columns in pro…

da3c9d3

…jection

Use tag columns in data points in sorted_join()

27a1cfd

Reformat and fixed doc and clippy issues

62e04dc

Fix bug causing INSERT INTO to fail due to schema mismatch

d4cb178

Reformat with Rustfmt

e74acb2

Change QuerySchema to GridSchema to match schema name

950a9a5

CGodiksen self-assigned this Feb 27, 2025

CGodiksen linked an issue Feb 27, 2025 that may be closed by this pull request

Evaluate removing uid by storing tags in segments #197

Closed

Fix cargo doc issue

c777e3d

CGodiksen requested a review from chrthomsen February 27, 2025 22:32

chrthomsen approved these changes Feb 28, 2025

Update based on comments from @chrthomsen

118c6ce

CGodiksen requested a review from skejserjensen February 28, 2025 17:05

skejserjensen requested changes Mar 2, 2025

CGodiksen added 2 commits March 5, 2025 12:22

Change order of arguments in try_compress()

548b0ae

Update method for calculating tag hash

3d30ede

This was referenced Mar 5, 2025

Research if ingestion can be optimized by sorting and partitioning the ingested multivariate data by tag values #297

Open

Research if hash maps can be more widely used to increase readability without sacrificing performance #298

Open

Add limitation on number of model table fields back

0328922

CGodiksen mentioned this pull request Mar 5, 2025

Research issues with data order throughout the system #299

Open

CGodiksen added 2 commits March 5, 2025 17:24

Rename Apache Arrow DataFusion to Apache DataFusion

26afb72

Use Arc<Schema> instead of SchemaRef

a9eaeb8

This was referenced Mar 5, 2025

Optimize usage of tag columns in data during ingestion and querying #292

Open

Optimize selection of fallback field column in query engine #300

Open

Update based on comments from @skejserjensen

6bd5d89

CGodiksen requested a review from skejserjensen March 5, 2025 17:06

Merge branch 'main' into dev/tags-in-data

1998f6e

skejserjensen approved these changes Mar 6, 2025

CGodiksen merged commit c2ee88d into main Mar 6, 2025
4 checks passed

CGodiksen deleted the dev/tags-in-data branch March 6, 2025 09:00

CGodiksen merged 70 commits into main from dev/tags-in-data
