Skip to content

Conversation

@weixiuli
Copy link
Contributor

@weixiuli weixiuli commented Jun 16, 2021

What changes were proposed in this pull request?

Set the list of read columns in the task configuration to reduce reading of ORC data.

Why are the changes needed?

Now, the ORC reader will read all columns of the ORC table when the task configuration does not set the list of read columns . Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data.

Does this PR introduce any user-facing change?

No

How was this patch tested?

exist unittests

@github-actions github-actions bot added the SQL label Jun 16, 2021
@dongjoon-hyun
Copy link
Member

ok to test

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @weixiuli .
Merged to master for Apache Spark 3.2.0.

@weixiuli
Copy link
Contributor Author

Thank you @dongjoon-hyun for the quick review and comments.

@zhengruifeng
Copy link
Contributor

@weixiuli Great Catch!

@dongjoon-hyun maybe we need to backport it to 3.0 & 3.1?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 17, 2021

For me, this is a performance improvement, @zhengruifeng .

@cloud-fan
Copy link
Contributor

Does this mean we never do column pruning for ORC before this PR? And shall we update the result of DataSourceReadBenchmark?

@zhengruifeng
Copy link
Contributor

@cloud-fan We are migrating from 2.4.7 to 3.0.2, and observed a significant regression in some cases due to this issue.

image

@weixiuli
Copy link
Contributor Author

Does this mean we never do column pruning for ORC before this PR? And shall we update the result of DataSourceReadBenchmark?

@cloud-fan Yes, i will check the DataSourceReadBenchmark and update it.

@cloud-fan
Copy link
Contributor

We are migrating from 2.4.7 to 3.0.2, and observed a significant regression in some cases due to this issue.

I think this is a serious perf regression we should backport. @dongjoon-hyun what do you think?

@dongjoon-hyun
Copy link
Member

Got it. In that case, I'm okay for backporting, @cloud-fan .

I'll backport this.

dongjoon-hyun pushed a commit that referenced this pull request Jun 24, 2021
…tion to reduce reading of ORC data

### What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.
### Why are the changes needed?
Now, the ORC reader will read all columns of the ORC table when the task configuration does not set the list of read columns . Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
exist unittests

Closes #32923 from weixiuli/SPARK-35783.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 947c7ea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jun 24, 2021
…tion to reduce reading of ORC data

### What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.
### Why are the changes needed?
Now, the ORC reader will read all columns of the ORC table when the task configuration does not set the list of read columns . Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
exist unittests

Closes #32923 from weixiuli/SPARK-35783.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 947c7ea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…tion to reduce reading of ORC data

### What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.
### Why are the changes needed?
Now, the ORC reader will read all columns of the ORC table when the task configuration does not set the list of read columns . Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
exist unittests

Closes apache#32923 from weixiuli/SPARK-35783.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 947c7ea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@sleep1661
Copy link
Contributor

@cloud-fan We are migrating from 2.4.7 to 3.0.2, and observed a significant regression in some cases due to this issue.

image

@cloud-fan We are migrating from 2.4.7 to 3.0.2, and observed a significant regression in some cases due to this issue.

image
@zhengruifeng I had run some sql test on 3.1.2 (without this PR), but I didn't reproduce the issue. The result the ORC reader does not read all columns. My sql was like this: select count(id) from dev.test1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants