Materialize scan results correctly when columns are not present in the segments #16619
    break;
    }

    for (Integer columnNumber : nullTypedColumns) {
note: I wonder why this uses a fastutil IntList if it gets iterated with a for-each loop; plain get?
this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once
No reason to use FastUtil IntList as such. I just thought it might be faster to create than an ArrayList.
> this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once
It is getting evaluated once here, right? Unless I misinterpreted your comment.
This was just a note. The loop validates one row, but to access that row it has to call currentRows.get(currentRowIndex), which became part of the loop body. Moving the loop into a method would make it clear that it works on a single row, and it would naturally remove the currentRows.get(currentRowIndex) call, since the row is what gets passed in :)
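A minimal sketch of what such a helper might look like. The names `RowValidation` and `validateRow`, the `Object[]` row representation, and the use of `IllegalStateException` are assumptions made for illustration, not the code from this PR.

```java
import java.util.List;

// Hypothetical helper; names, row type and exception type are assumptions.
final class RowValidation
{
  private RowValidation() {}

  /**
   * Checks that every column whose type is unknown (null-typed) holds null in the given row.
   * The caller looks the row up once and passes it in, so the loop body never
   * touches the row collection.
   */
  static void validateRow(Object[] row, List<Integer> nullTypedColumns)
  {
    for (int columnNumber : nullTypedColumns) {
      if (row[columnNumber] != null) {
        throw new IllegalStateException(
            "Column " + columnNumber + " has no known type but contains a non-null value"
        );
      }
    }
  }
}
```

Assuming `currentRows` is a `List<Object[]>`, the call site would reduce to `RowValidation.validateRow(currentRows.get(currentRowIndex), nullTypedColumns);`.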
      populateCursor();
      boolean firstRowWritten = false;
    - // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore we store the signature
    + // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore, we store the signature
...what if the signature changes? Is that a problem? Shouldn't that be an exception?
If there are two cursors, CursorA with RowSignatureA and CursorB with RowSignatureB, and the cursor is at the last row of CursorA, the populateCursor() call will return false (i.e. the two cursors cannot be batched together) and set currentRowSignature to RowSignatureB (i.e. prepare the variables for the next write). We still want to return the old frame with the old signature, so we need to preserve the signature with which the frame was written.
Per your previous suggestion, frameWriterFactory.signature() would be sufficient and cleaner, and I will use that instead.
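The situation described above can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: `BatchingSketch`, `Cursor`, `Frame` and `nextFrame` are hypothetical names, and a signature is modelled as a plain list of column names. It only illustrates why the signature has to be captured before `populateCursor()` is called again.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins, not Druid classes: a signature is a list of column names.
final class BatchingSketch
{
  static final class Cursor
  {
    final List<String> signature;
    final List<Object[]> rows;

    Cursor(List<String> signature, List<Object[]> rows)
    {
      this.signature = signature;
      this.rows = rows;
    }
  }

  static final class Frame
  {
    final List<String> signature;
    final List<Object[]> rows;

    Frame(List<String> signature, List<Object[]> rows)
    {
      this.signature = signature;
      this.rows = rows;
    }
  }

  private final Iterator<Cursor> cursors;
  private List<String> currentRowSignature;
  private Iterator<Object[]> currentRows;

  BatchingSketch(List<Cursor> cursors)
  {
    this.cursors = cursors.iterator();
  }

  /**
   * Moves to the next cursor. Returns true only if that cursor exists and has the
   * same signature as the previous one, i.e. it can be batched into the current frame.
   */
  private boolean populateCursor()
  {
    if (!cursors.hasNext()) {
      currentRows = null;
      return false;
    }
    Cursor next = cursors.next();
    boolean batchable = next.signature.equals(currentRowSignature);
    currentRowSignature = next.signature; // already prepared for the NEXT frame
    currentRows = next.rows.iterator();
    return batchable;
  }

  /** Returns the next frame, or null once all cursors are consumed. */
  Frame nextFrame()
  {
    if (currentRows == null) {
      if (!cursors.hasNext()) {
        return null;
      }
      populateCursor();
    }
    // Capture the signature before populateCursor() can overwrite it.
    List<String> writtenSignature = currentRowSignature;
    List<Object[]> out = new ArrayList<>();
    do {
      while (currentRows.hasNext()) {
        out.add(currentRows.next());
      }
    } while (populateCursor());
    return new Frame(writtenSignature, out);
  }
}
```

Returning `currentRowSignature` instead of `writtenSignature` at the end of `nextFrame` would label the frame with the signature of the following cursor, which is the problem the stored signature avoids.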
kgyrtkirk left a comment
looks good - left some minor notes
    }

    firstRowWritten = true;
    // Check that the columns with the null types are actually null before advancing
note: isn't this comment misplaced? (note: this detail is not necessary - but it could live as an apidoc of the validateRow if that would be around)
Cleaned up the code
Thanks for the review! @kgyrtkirk
Description
The query engine is unable to estimate the correct size in bytes of the subquery results when the scan query has columns that are missing from the segments. This is because the ScanQueryEngine receives all the columns of the scan query and populates the row signature with a null type if it's unable to find the column in the segment.
This PR modifies the materializing logic to materialize the results of the columns whose types are known, and to check that the columns whose types are unknown always have null values. This is helpful because:
a. If the type is unknown and the column contains all null values, we don't need to materialize the results.
b. If the type is unknown and the column contains non-null values in any row, we are running into the case of missing types, and we should throw an error.
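The two cases above can be illustrated with a small sketch. It is not the code from this PR: `MaterializeSketch` and `materialize` are hypothetical names, a column type is modelled as a nullable `String` where `null` means "type unknown", and dropping the null-typed columns from the output is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: columnTypes.get(i) == null means the type of column i is unknown.
final class MaterializeSketch
{
  private MaterializeSketch() {}

  /**
   * Keeps the values of columns whose type is known and verifies that
   * columns of unknown type contain only null.
   */
  static Object[] materialize(Object[] row, List<String> columnTypes)
  {
    List<Object> kept = new ArrayList<>();
    for (int i = 0; i < row.length; i++) {
      if (columnTypes.get(i) != null) {
        // Type known: the value is materialized.
        kept.add(row[i]);
      } else if (row[i] != null) {
        // Case b: type unknown but a value is present, so the type is genuinely missing.
        throw new IllegalStateException(
            "Column " + i + " has an unknown type but contains a non-null value"
        );
      }
      // Case a: type unknown and value null, nothing to materialize.
    }
    return kept.toArray();
  }
}
```

With column types `["STRING", null, "LONG"]`, a row `["a", null, 7L]` would be materialized from its first and third values only, while a row with any non-null second value would be rejected.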
Release note
Fixes a bug causing maxSubqueryBytes to not work when segments have missing columns.