Skip to content

Add window-focused tests from Drill#13773

Merged
gianm merged 13 commits intoapache:masterfrom
imply-cheddar:window-tests
Jul 6, 2023
Merged

Add window-focused tests from Drill#13773
gianm merged 13 commits intoapache:masterfrom
imply-cheddar:window-tests

Conversation

@imply-cheddar
Copy link
Copy Markdown
Contributor

This commit borrows some test definitions from Drill's test suite and tries to use them to flesh out the full validation of window function capbilities.

In order to be able to run these tests, we also add the ability to run a Scan operation against segments, which also meant an implementation of RowsAndColumns for frames.

Initially, in trying to add these tests, I also started trying to
fix the problems that arose. One of which was being able to scan
data from segments for use in queries. This is necessary for these
tests because the Drill tests are generally not grouping on things
first and, instead, are essentially just resolving to scan operators.

After resolving that issue, I ran into another set of bugs specifically
associated with Calcite query planning, where Calcite did not give me
a logical plan that mapped correctly to the semantics of the query.
As I dove into that, I realized that it was a bigger ball of yarn and
this commit was already starting to sprawl out in scope, so I changed
strategy and am instead introducing the test framework and fixes
that have been made so far, will introduce the full set of 2000
files for the tests in a subsequent commit, and just focus this commit
on the code changes required to get everything in place.

Table of Contents (or, what to expect when reviewing this):

  1. Changes in parquet-extension, these are some code changes to add a main (ParquetToJson) that can be used to convert parquet files to new-line delimited Json. This is just to have a utility for developers to use if we ever need to add a new dataset that is defined by parquet and is not a Main intended for a general audience
  2. There is a change to the FrameColumnReader to have it be able to read out a RowsAndColumns column. This hopefully also provides a relatively straight-forward path for using Frame columns in cases where direct reads from locations makes more sense than the ColumnSelector/DimensionSelector routes that have been previously employed
  3. There's a DecoratableRowsAndColumn semantic interface added that takes on "decorations" of a RAC and tries to lazily execute them. This is leveraged in making the ability to read the segment work. Note that the capabilities for reading segments have only the minimum implemented to make these tests run and are not implemented and wired up to be able to actually execute in a distributed environment.
  4. There are only a few test cases checked in inside of this PR because, when I initially tried to include all of them, the PR was 2600 files. I didn't want the 2600 files to take away from the code that is deserving of review, so I decided to only check in a few of the files in this PR. Once this is merged, I will do a new PR with all of the other files, which will essentially just be creating new files in the resources directory without any actual code changes. That should make it easy to merge in the 2600 extra files for tests.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly reviewed the Parquet changes (which I have no comment on since they look ok), the frame stuff, & the jackson stuff. I skimmed through the the new tests and the operator changes, so if my comments there look surface-level, that's why 🙂.

This commit borrows some test definitions from Drill's test suite
and tries to use them to flesh out the full validation of window
function capbilities.

In order to be able to run these tests, we also add the ability to
run a Scan operation against segments, which also meant an
implementation of RowsAndColumns for frames.
final ColumnType type = columnAccessor.getType();
if (type.getType() == ValueType.COMPLEX) {
final ComplexMetricSerde serdeForType = ComplexMetrics.getSerdeForType(type.getComplexTypeName());
if (serdeForType != null && serdeForType.getObjectStrategy() != null) {

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [ComplexMetricSerde.getObjectStrategy](1) should be avoided because it has been deprecated.
if (type.getType() == ValueType.COMPLEX) {
final ComplexMetricSerde serdeForType = ComplexMetrics.getSerdeForType(type.getComplexTypeName());
if (serdeForType != null && serdeForType.getObjectStrategy() != null) {
return serdeForType.getObjectStrategy().getClazz();

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [ComplexMetricSerde.getObjectStrategy](1) should be avoided because it has been deprecated.
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the adjusted changes are pretty targeted. I have one question on the SQL planning side that doesn't affect my approval, I am just wanting to know.

}

final DataSource myDataSource;
if (dataSource instanceof TableDataSource) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the underlying datasource is a join or unnest? Do we need special handling or is that taken care of by the WindowOperatorQuery itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is insufficient for those cases. I expect that the Drill tests will end up covering those cases. I.e. that's still WIP. In the "simplest" solution, however, this should perhaps be inverted to just check if it's already a QueryDataSource and if it's not, to make it one. That would cover all of the cases and handle things as a scan.

That said, the whole planning a scan query thing is a short-term hack/work-around anyway, I expect that in the end this will just plan the "operator query" all the way down to the segment (which is what happens with this workaround as well, it just happens in a really... odd... way)

@gianm gianm merged commit 5fc122a into apache:master Jul 6, 2023
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
This commit borrows some test definitions from Drill's test suite
and tries to use them to flesh out the full validation of window
function capbilities.

In order to be able to run these tests, we also add the ability to
run a Scan operation against segments, which also meant an
implementation of RowsAndColumns for frames.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants