Skip to content

fix nested column handling of null and "null"#13714

Merged
clintropolis merged 7 commits intoapache:masterfrom
clintropolis:nested-column-null-fix
Feb 1, 2023
Merged

fix nested column handling of null and "null"#13714
clintropolis merged 7 commits intoapache:masterfrom
clintropolis:nested-column-null-fix

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Jan 27, 2023

Description

Fixes an issue with nested columns that can occur when both actual null and the string "null" are present in any nested path that results in the null values incorrectly becoming associated with the "null" values.

The bug was caused by a usage of String.valueOf in StringFieldColumnWriter that was not checking for null values when writing out the column. This still worked by dumb luck because the fastutils 2Int maps that were backing the globalId lookup when writing out segments had a default value of 0, which happens to be null global id, so even though "null" wasn't present in the global dictionary it ended up with the correct id. However, if "null" was present, null would incorrectly be written out as the "null" global id and associated to that value instead.

As a safety measure, I've changed the 2int maps to have a default value of -1, and check that the globalid is in range before writing it to the column to ensure mistakes like this don't happen in the future, which caught a pretty serious regression introduced by #13653, caused by the simplified dictionary merging iterator that would result in segment dictionaries getting completely mangled if the same value was in more than 1 segment's dictionary. Luckily this didn't make it into any releases.

The added test data in NestedDataColumnSupplierTest would fail prior to this PR.

Unrelated, this PR also moves some column selector tests that had nothing to do with scan queries out of NestedDataScanQueryTest into their own file.

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Comment on lines +139 to +145
if (value == null) {
localId = localDictionary.add(0);
} else {
final int globalId = lookupGlobalId(value);
Preconditions.checkArgument(globalId >= 0, "Value [%s] is not present in global dictionary", value);
localId = localDictionary.add(globalId);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not okay to do this check inside of lookupGlobalId?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the shared function, lookupGlobalId has different implementations depending on the type of writer so would need to duplicate there

@clintropolis clintropolis merged commit ec1e6ac into apache:master Feb 1, 2023
@clintropolis clintropolis deleted the nested-column-null-fix branch February 1, 2023 04:59
abhagraw pushed a commit to abhagraw/druid that referenced this pull request Feb 8, 2023
* fix nested column handling of null and "null"
* fix issue merging nested column value dictionaries that could incorrect lose dictionary values
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants