[SPARK-35783][SQL] Set the list of read columns in the task configuration to reduce reading of ORC data. #32923
ok to test |
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @weixiuli .
Merged to master for Apache Spark 3.2.0.
|
Thank you @dongjoon-hyun for the quick review and comments. |
|
@weixiuli Great Catch! @dongjoon-hyun maybe we need to backport it to 3.0 & 3.1? |
|
For me, this is a performance improvement, @zhengruifeng . |
|
Does this mean we never do column pruning for ORC before this PR? And shall we update the result of |
|
@cloud-fan We are migrating from 2.4.7 to 3.0.2, and observed a significant regression in some cases due to this issue. |
@cloud-fan Yes, i will check the |
I think this is a serious perf regression we should backport. @dongjoon-hyun what do you think? |
|
Got it. In that case, I'm okay for backporting, @cloud-fan . I'll backport this. |
…tion to reduce reading of ORC data

### What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.

### Why are the changes needed?
Now, the ORC reader will read all columns of the ORC table when the task configuration does not set the list of read columns. Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
exist unittests

Closes #32923 from weixiuli/SPARK-35783.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 947c7ea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|

What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.
Why are the changes needed?
Currently, the ORC reader reads every column of an ORC table when the task configuration does not specify the list of columns to read. Setting that list in the task configuration lets the reader skip the columns a query does not need, which reduces the amount of ORC data read.
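To make the idea concrete, here is a minimal sketch in plain Python, not the code from this PR. It assumes the task configuration is a simple string-to-string mapping, that the column list is passed as comma-separated top-level column indexes under ORC's `orc.include.columns` key, and that a requested column missing from the file is marked with `-1`. The helper names `requested_column_ids` and `set_read_columns` are hypothetical.

```python
from typing import Dict, List, Sequence

# Assumption: the ORC reader picks up the read-column list from this key.
INCLUDE_COLUMNS_KEY = "orc.include.columns"


def requested_column_ids(file_fields: Sequence[str],
                         required_fields: Sequence[str]) -> List[int]:
    """Map each required field to its top-level index in the file schema.

    A field that does not exist in the file is marked with -1 (an
    illustrative convention for this sketch).
    """
    positions = {name: idx for idx, name in enumerate(file_fields)}
    return [positions.get(name, -1) for name in required_fields]


def set_read_columns(task_conf: Dict[str, str],
                     column_ids: Sequence[int]) -> Dict[str, str]:
    """Return a copy of task_conf with the read-column list set.

    Missing columns (-1) are dropped and the remaining ids are sorted
    before being joined into a comma-separated string.
    """
    present = sorted(i for i in column_ids if i != -1)
    updated = dict(task_conf)  # leave the caller's configuration untouched
    updated[INCLUDE_COLUMNS_KEY] = ",".join(str(i) for i in present)
    return updated


# Hypothetical table with four columns; the query needs only two of them,
# plus one column that this particular file does not contain.
ids = requested_column_ids(["id", "name", "price", "ts"],
                           ["price", "id", "discount"])
conf = set_read_columns({}, ids)
print(conf[INCLUDE_COLUMNS_KEY])  # prints "0,2"
```

Without the key set, a reader in this model has no pruning information and falls back to reading all four columns, which is the behavior the description above refers to.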
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests.