Skip to content

Conversation

@cloud-fan
Copy link
Contributor

Bring back #25955

What changes were proposed in this pull request?

This adds a new rule, V2ScanRelationPushDown, to push filters and projections in to a new DataSourceV2ScanRelation in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, earlyScanPushDownRules and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, PruneFileSourcePartitions, is moved into the early pushdown rule set.

This also moves pushdown helper methods from DataSourceV2Strategy into a util class.

Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This updates the implementation of stats from DataSourceV2Relation so tests will fail if stats are accessed before early pushdown for v2 relations.

rdblue and others added 2 commits October 31, 2019 16:21
### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes apache#25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
@cloud-fan
Copy link
Contributor Author

We have 2 OrcFilterSuite, one for hadoop 2 and one for hadoop 3. #25955 only updates one OrcFilterSuite and thus breaks hadoop 3 build. The last commit fixes this problem.

cc @dongjoon-hyun @rdblue

@cloud-fan cloud-fan changed the title [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown [SPARK-29277][SQL][[test-hadoop3.2]] Add early DSv2 filter and projection pushdown Oct 31, 2019
@cloud-fan cloud-fan changed the title [SPARK-29277][SQL][[test-hadoop3.2]] Add early DSv2 filter and projection pushdown [SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown Oct 31, 2019
@SparkQA
Copy link

SparkQA commented Oct 31, 2019

Test build #113011 has finished for PR 26341 at commit d2c9abe.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

@viirya @HyukjinKwon seems it's CRAN issue again? I see this Spark R failure in many other PRs.

@viirya
Copy link
Member

viirya commented Oct 31, 2019

@cloud-fan seems so, let me looking at this.

@dongjoon-hyun
Copy link
Member

Thank you for making this back swiftly, @cloud-fan !

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

Since all test passed with hadoop-3.2 and we know this CRAN issue, I'll merge this back.

* checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : 
  dims [product 24] do not match the length of object [0]

Thank you, @cloud-fan . Merged to master.

@viirya
Copy link
Member

viirya commented Oct 31, 2019

Contacted CRAN sysadmin and expect fix soon.

@dongjoon-hyun
Copy link
Member

Thank you always, @viirya ! 😄

@rdblue
Copy link
Contributor

rdblue commented Oct 31, 2019

Can someone explain what happened here?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 31, 2019

Sure, @rdblue .

What we needed was the second commit of this PR. We switches some codes (Hive 1.2.1 and Hive 2.3.6) based on the Hadoop profile. So, we need to update both codes.

@rdblue
Copy link
Contributor

rdblue commented Oct 31, 2019

Okay, so the PR needed to be updated to pass tests in a different profile?

Is this something we could have caught before merging the initial PR? Or is there something that we can do to avoid problems like this in the future?

@dongjoon-hyun
Copy link
Member

Yes, right. Technically, we need to trigger twice with the default and with [test-hadoop-3.2] when we change Hive and ORC related stuff.

@dongjoon-hyun
Copy link
Member

For now, our PRBuilder doesn't test for both ones. So, until now, we need to trigger manually with the title string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants