[SPARK-31365][SQL] Enable nested predicate pushdown per data sources #28366
SQLConf.scala:

```diff
@@ -2063,16 +2063,17 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

-  val NESTED_PREDICATE_PUSHDOWN_ENABLED =
-    buildConf("spark.sql.optimizer.nestedPredicatePushdown.enabled")
-      .internal()
-      .doc("When true, Spark tries to push down predicates for nested columns and or names " +
-        "containing `dots` to data sources. Currently, Parquet implements both optimizations " +
-        "while ORC only supports predicates for names containing `dots`. The other data sources" +
-        "don't support this feature yet.")
+  val NESTED_PREDICATE_PUSHDOWN_FILE_SOURCE_LIST =
+    buildConf("spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources")
+      .internal()
+      .doc("A comma-separated list of data source short names or fully qualified data source " +
+        "implementation class names for which Spark tries to push down predicates for nested " +
+        "columns and/or names containing `dots` to data sources. Currently, Parquet implements " +
+        "both optimizations while ORC only supports predicates for names containing `dots`. The " +
+        "other data sources don't support this feature yet. So the default value is 'parquet,orc'.")
       .version("3.0.0")
-      .booleanConf
-      .createWithDefault(true)
+      .stringConf
```
Member:
We need
Member (Author):
We compare this list with
Currently I think it is safer to assume custom data sources don't support this feature. I also think that if a custom data source wants to support it, it is better to adopt data source v2. We don't have a common API for v1 data sources that tells whether a source supports nested predicate pushdown. If we really want to allow custom v1 data sources to have that, we can consider adding a common v1 API for this purpose. But, again, it seems to me that we should encourage adopting v2 instead of adding new things to v1.
Member:
Yea, +1 on your thought.
Contributor:
Looks fine. @dbtsai, are you good with it? Do you have use cases that need nested predicate pushdown for non-file sources?
Member:
In v1, we don't have any use case for supporting it in custom data sources. I'm good with it.
```diff
+      .createWithDefault("parquet,orc")

   val SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED =
     buildConf("spark.sql.optimizer.serializer.nestedSchemaPruning.enabled")
```
```diff
@@ -3098,8 +3099,6 @@ class SQLConf extends Serializable with Logging {

   def nestedSchemaPruningEnabled: Boolean = getConf(NESTED_SCHEMA_PRUNING_ENABLED)

-  def nestedPredicatePushdownEnabled: Boolean = getConf(NESTED_PREDICATE_PUSHDOWN_ENABLED)
-
   def serializerNestedSchemaPruningEnabled: Boolean =
     getConf(SERIALIZER_NESTED_SCHEMA_PRUNING_ENABLED)
```
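For illustration, a hedged sketch of how the new list-valued config would be set. The config name and the `parquet,orc` default come from the diff above; the custom class name `com.example.MyFileFormat` is hypothetical, and adding a source to the list only helps if that source can actually evaluate the translated filters:

```scala
// Default: only the built-in Parquet and ORC file sources are trusted to
// handle predicates on nested columns and on names containing dots.
spark.conf.set(
  "spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
  "parquet,orc")

// Hypothetical: opt a custom v1 file source into nested predicate pushdown
// by appending its short name or fully qualified class name.
spark.conf.set(
  "spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
  "parquet,orc,com.example.MyFileFormat")
```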
FileSourceStrategy.scala:

```diff
@@ -178,8 +178,11 @@ object FileSourceStrategy extends Strategy with Logging {
       // Partition keys are not available in the statistics of the files.
       val dataFilters =
         normalizedFiltersWithoutSubqueries.filter(_.references.intersect(partitionSet).isEmpty)
-      logInfo(s"Pushed Filters: " +
-        s"${dataFilters.flatMap(DataSourceStrategy.translateFilter).mkString(",")}")
+      val supportNestedPredicatePushdown =
+        DataSourceUtils.supportNestedPredicatePushdown(fsRelation)
+      val pushedFilters = dataFilters
+        .flatMap(DataSourceStrategy.translateFilter(_, supportNestedPredicatePushdown))
+      logInfo(s"Pushed Filters: ${pushedFilters.mkString(",")}")
```
Member:
Is it possible to have it propagated back, so that when a user does
Member (Author):
In
```diff

       // Predicates with both partition keys and attributes need to be evaluated after the scan.
       val afterScanFilters = filterSet -- partitionKeyFilters.filter(_.references.nonEmpty)
```
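For reference, a rough sketch of what a list-based `DataSourceUtils.supportNestedPredicatePushdown` check could look like; this is an approximation for illustration, not necessarily the exact code in this PR:

```scala
import java.util.Locale
import org.apache.spark.sql.execution.datasources.HadoopFsRelation
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister}

// Approximate sketch: a file-based v1 relation supports nested predicate
// pushdown only if its file format appears in the configured source list,
// matched by short name (e.g. "parquet") or fully qualified class name.
def supportNestedPredicatePushdown(relation: BaseRelation): Boolean =
  relation match {
    case fs: HadoopFsRelation =>
      val supported = SQLConf.get
        .getConf(SQLConf.NESTED_PREDICATE_PUSHDOWN_FILE_SOURCE_LIST)
        .toLowerCase(Locale.ROOT)
        .split(",").map(_.trim).toSet
      val name = fs.fileFormat match {
        case r: DataSourceRegister => r.shortName().toLowerCase(Locale.ROOT)
        case other => other.getClass.getCanonicalName.toLowerCase(Locale.ROOT)
      }
      supported.contains(name)
    case _ => false // other v1 relations are assumed not to support it
  }
```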
It seems we decided to make this configuration effective only for DSv1, which seems okay because only DSv1 has compatibility issues. But shall we at least explicitly mention that this configuration is only effective with DSv1 (or make it effective for DSv2 as well)? Otherwise it is going to be confusing to both end users and developers.
I think the DSv2 API already supports nested column capabilities such as pushdown and pruning, so we only need to deal with DSv1 compatibility issues here; precisely, the file sources.
I will create a simple follow-up to refine the doc of this configuration on this point. Thanks.
Thanks!
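Following up on the DSv2 point above, a minimal sketch of why no allow-list is needed on the v2 side: a v2 source negotiates pushdown explicitly through the `SupportsPushDownFilters` mix-in, deciding itself which filters (including ones on nested fields) it can evaluate. The class name `ExampleScanBuilder` is illustrative:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

// Illustrative DSv2 ScanBuilder: the source itself decides which filters
// it can evaluate, and returns the rest to Spark as post-scan filters.
class ExampleScanBuilder extends ScanBuilder with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // A real source would partition `filters` into supported/unsupported;
    // here we pretend everything is supported, for brevity.
    pushed = filters
    Array.empty
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = ??? // scan construction omitted in this sketch
}
```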