Skip to content

Normalize schema fingerprint for column permutations#17044

Merged
abhishekagarwal87 merged 7 commits intoapache:masterfrom
findingrish:first-seen-column-order-in-schema
Sep 18, 2024
Merged

Normalize schema fingerprint for column permutations#17044
abhishekagarwal87 merged 7 commits intoapache:masterfrom
findingrish:first-seen-column-order-in-schema

Conversation

@findingrish
Copy link
Copy Markdown
Contributor

@findingrish findingrish commented Sep 12, 2024

Parent issue: #14989

It is possible for the order of columns to vary across segments especially during realtime ingestion.
Since, the schema fingerprint is sensitive to column order this leads to creation of a large number of segment schema in the metadata database for essentially the same set of columns.

This is wasteful, this patch fixes this problem by computing schema fingerprint on lexicographically sorted columns. This would result in creation of a single schema in the metadata database with the first observed column order for a given signature.

Release notes

In the CentralizedDatasourceSchema feature, different permutations of the same column order do not result in distinct schemas in the database.

Copy link
Copy Markdown
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise.

// thus avoiding schema explosion in the metadata database
// Note that this signature is not persisted anywhere, it is only used for fingerprint computation
final RowSignature sortedSignature = getLexicographicallySortedSignature(schemaPayload.getRowSignature());
final SchemaPayload updatedPayload = new SchemaPayload(sortedSignature, schemaPayload.getAggregatorFactories());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention a note about the aggregator factories as well that they are column order independent since they are backed by a map.

@abhishekagarwal87 abhishekagarwal87 merged commit 43d790f into apache:master Sep 18, 2024
@adarshsanjeev adarshsanjeev added this to the 32.0.0 milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants