Skip to content

Materialize scan results correctly when columns are not present in the segments#16619

Merged
LakshSingla merged 7 commits intoapache:masterfrom
LakshSingla:missing-col-frames
Jun 23, 2024
Merged

Materialize scan results correctly when columns are not present in the segments#16619
LakshSingla merged 7 commits intoapache:masterfrom
LakshSingla:missing-col-frames

Conversation

@LakshSingla
Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla commented Jun 17, 2024

Description

The query engine is unable to estimate the correct size in bytes of the subquery results when the scan query has columns which are missing from the segments. This is because the ScanQueryEngine receives all the columns of the scan query, and populates the row signature with null type if its unable to find the column in the segment.

This PR modifies the materializing logic to materialize the results of the columns whose types are known, and check that the columns whose types are unknown always have null values. This is helpful because:
a. If the type is unknown and the column contains all null values, we don't need to materialize the results
b. If the type is unknown and the column contains non-null values in any row, we are running into the case of missing types, and we should throw an error.

Release note

Fixes a bug causing maxSubqueryBytes to not work when segments have missing columns.


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


@MethodSource("constructorFeeder")
@ParameterizedTest(name = "{0}")
public void testSubqueryOnDataSourceWithMissingColumnsInSegments(String testName, Map<String, Object> queryContext)

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'testName' is never used.
break;
}

for (Integer columnNumber : nullTypedColumns) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I wonder why use a fastutil IntList - if it gets iterated with a foreach ; plain get?
this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to use FastUtil IntList as such. I just thought it might be faster to create than an arraylist.

this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once

It is getting evaluated once here right? Unless I misinterpreted your comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just a note; this loop is validating one row; but to access that it has to do a function call currentRows.get(currentRowIndex) ; which became part of the loop body - moving it into a method could make it clear that it works on a row - and it will naturally remove the currentRows.get(currentRowIndex) as that's the row :)

populateCursor();
boolean firstRowWritten = false;
// While calling populateCursor() repeatedly, currentRowSignature might change. Therefore we store the signature
// While calling populateCursor() repeatedly, currentRowSignature might change. Therefore, we store the signature
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

....what if the signature changes - is that a problem? shouldn't that be an Exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are two cursors, CursorA with RowSignatureA and CursorB with RowSignatureB and the cursor is at the last row of CursorA, populate call will return false, i.e. the two cursors cannot be batched together, and set currentRowSignature to the RowSignatureB (i.e. prepare the variables for the next write). We still want to return the old frame with the old signature therefore we need to preserve the signature with which we have written the frame.
Per your previous suggestion, frameWriterFactory.signature() would be sufficient and cleaner, and I will use that instead.

Copy link
Copy Markdown
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - left some minor notes

break;
}

for (Integer columnNumber : nullTypedColumns) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just a note; this loop is validating one row; but to access that it has to do a function call currentRows.get(currentRowIndex) ; which became part of the loop body - moving it into a method could make it clear that it works on a row - and it will naturally remove the currentRows.get(currentRowIndex) as that's the row :)

}

firstRowWritten = true;
// Check that the columns with the null types are actually null before advancing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: isn't this comment misplaced? (note: this detail is not necessary - but it could live as an apidoc of the validateRow if that would be around)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up the code

@LakshSingla
Copy link
Copy Markdown
Contributor Author

Thanks for the review! @kgyrtkirk

@LakshSingla LakshSingla merged commit 00c9643 into apache:master Jun 23, 2024
@LakshSingla LakshSingla deleted the missing-col-frames branch June 23, 2024 17:45
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants