[SPARK-29665][SQL] refine the TableProvider interface #26297
Conversation
Test build #112859 has finished for PR 26297 at commit
Test build #112860 has finished for PR 26297 at commit
public static Transform bucket(int numBuckets, String... columns) {
  return LogicalExpressions.bucket(numBuckets,
      JavaConverters.asScalaBuffer(Arrays.asList(columns)).toSeq());
  NamedReference[] references = Arrays.stream(columns)
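For context, one plausible way the stream-based version could be completed is sketched below. This is a hedged sketch, not necessarily the code merged in this PR; it assumes `Expressions.column` and a `LogicalExpressions.bucket` overload that accepts a `NamedReference[]`.

```java
import java.util.Arrays;

import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.LogicalExpressions;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.expressions.Transform;

class BucketTransformSketch {
  // Map each column name to a NamedReference and build the bucket transform
  // from the references, instead of converting the names to a Scala Seq.
  static Transform bucket(int numBuckets, String... columns) {
    NamedReference[] references = Arrays.stream(columns)
        .map(Expressions::column)
        .toArray(NamedReference[]::new);
    return LogicalExpressions.bucket(numBuckets, references);
  }
}
```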
I'll send another for these related changes.
// For file source, it's expensive to infer schema/partition at each write. Here we pass
// the schema of input query and the user-specified partitioning to `getTable`. If the
// query schema is not compatible with the existing data, the write can still succeed but
// following reads would fail.
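For illustration, here is a hedged sketch of the call pattern that comment describes: skip inference on the write path and hand the query schema plus user-specified partitioning straight to `getTable`. The helper name and the exact `getTable` parameter types are assumptions based on the shape described in this PR, not quoted from it.

```java
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical helper, illustrative names only.
class WritePathSketch {
  static Table resolveTableForWrite(
      TableProvider provider,
      StructType querySchema,       // schema of the DataFrame being written
      Transform[] userPartitioning, // partitioning from the writer, possibly empty
      CaseInsensitiveStringMap options) {
    // No inferSchema/inferPartitioning call here: re-running inference on every
    // write is expensive for file sources, as the comment above notes.
    return provider.getTable(querySchema, userPartitioning, options.asCaseSensitiveMap());
  }
}
```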
We can introduce a general API for it, but I don't think it's worth it. It only applies to file source + DataFrameWriter, not file source tables or DataFrameWriterV2.
Test build #112945 has finished for PR 26297 at commit
Test build #112947 has finished for PR 26297 at commit
  StructType(fields)
}

override def inferPartitioning(
This is the only one I see that uses schema, and that is just to create the file index.
 * @param options The options that can identify a table, e.g. file path, Kafka topic name, etc.
 *                It's an immutable case-insensitive string-to-string map.
 */
Transform[] inferPartitioning(StructType schema, CaseInsensitiveStringMap options);
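For orientation, a rough sketch of the overall provider shape being discussed: two explicit inference methods plus a single `getTable` that receives everything. The method grouping follows the PR description; the exact parameter and return types here are assumptions.

```java
import java.util.Map;

import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Sketch only; not the exact interface from this PR.
interface TableProviderSketch {
  // Infer the table schema from the options (e.g. by reading file footers).
  StructType inferSchema(CaseInsensitiveStringMap options);

  // Infer the partitioning, given the (user-specified or inferred) schema.
  Transform[] inferPartitioning(StructType schema, CaseInsensitiveStringMap options);

  // Build the table once everything is resolved: schema, partitioning, properties.
  Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties);
}
```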
I don't think the schema is actually needed. Partitioning and schemas are mostly orthogonal. If anything, you could argue that identity partitions should be in the schema and that inferSchema could accept the result of inferPartitioning.
Also, none of the implementations actually use it besides the one that uses it to create a file index. It seems to me like this is more of a convenience for that implementation than something that is generally needed. Can we remove it from the API?
We can remove the schema parameter and make the API more flexible, but I'm not sure we need such flexibility.
As I mentioned in the PR description, Spark only supports:
- inferring both schema and partitioning.
- specifying the schema and inferring partitioning.
- specifying both schema and partitioning.
It seems very weird if we allow users to specify partitioning and infer the schema. Since partitioning depends on the schema (e.g. you can't pick a non-existent column as a partition column), I think it generally makes sense to have the schema parameter in inferPartitioning.
We've also had bugs in the past where inference picks a different data type than the one you want, so if a user provides a schema, it's safer to use the data types from the provided schema for the partition columns.
> It seems very weird if we allow users to specify partitioning and infer schema.
This isn't what I'm suggesting. We can set up rules for when schema and partition inference are called that restrict to just those 3 cases.
What I'm suggesting is that schema inference and partition inference are independent, so we don't need to pass a schema into inferPartitioning. The schema isn't actually used by file sources, and file sources are why we are making these changes.
> you can't pick a non-existing column as partition column
There's no reason why this must be the case. Another partition column could be added to the schema. Data files don't usually store partition columns, so the schema is usually the union of all file schemas plus whatever is inferred for the partition schema. That means schema depends on partitioning, not the other way around.
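As a concrete illustration of that point (made-up column names): the data files carry only the file columns, while the partition column lives in directory names such as `date=2019-10-29`, so the table schema ends up as the file schema plus the inferred partition schema.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sketch: build "union of file schemas + inferred partition schema".
class PartitionSchemaUnionSketch {
  static StructType tableSchema() {
    StructType fileSchema = new StructType()
        .add("id", DataTypes.LongType)       // stored in the data files
        .add("value", DataTypes.StringType); // stored in the data files
    StructType partitionSchema = new StructType()
        .add("date", DataTypes.DateType);    // inferred from directory names only
    StructType result = fileSchema;
    for (StructField f : partitionSchema.fields()) {
      result = result.add(f);
    }
    return result; // (id, value, date)
  }
}
```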
> We've also had bugs in the past where the inference picks a different data type than what you want.
I see what you mean here, but I think it's better to reconcile the differences in Spark instead of in the source. If the source infers that a partition is a string, but the user supplies a schema with an integer type, then all the source would do is throw an exception. Spark can do that once the partitioning is passed back, couldn't it?
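A hedged sketch of what that Spark-side reconciliation could look like, with illustrative names: once the inferred partitioning (and its column types) comes back from the source, Spark itself can compare it against the user-provided schema and raise one consistent error.

```java
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sketch only: compare inferred partition column types with the user-provided
// schema; neither the method name nor its placement in Spark comes from this PR.
class PartitionTypeReconciliationSketch {
  static void validate(StructType userSchema, StructType inferredPartitionSchema) {
    for (StructField inferred : inferredPartitionSchema.fields()) {
      for (StructField user : userSchema.fields()) {
        if (user.name().equalsIgnoreCase(inferred.name())
            && !user.dataType().sameType(inferred.dataType())) {
          throw new IllegalArgumentException("Partition column '" + user.name()
              + "' is " + user.dataType().simpleString() + " in the user schema"
              + " but was inferred as " + inferred.dataType().simpleString());
        }
      }
    }
  }
}
```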
@cloud-fan, I'll take a look at this one next. Is it possible to make these changes in multiple PRs? 57 files is a lot to go through for a single review.
This changes the API, so all the implementations need to be updated. Not sure how to make it smaller.
Test build #114540 has finished for PR 26297 at commit
Test build #114542 has finished for PR 26297 at commit
Can we move forward with this PR? The only argument I see is: why have the schema parameter in `inferPartitioning`?

First, we can guarantee that

Second, if the schema is specified and partitioning needs to be inferred, we can save the time of inferring the partition column data types, by simply picking the column data types from the specified schema. This is needed for the file source.

If we can't reach a consensus, I'd propose to keep the existing API style, and simply add a new method
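To make the second point concrete, here is a hedged sketch of how a file-source-like `inferPartitioning(schema, options)` could use the supplied schema: only the partition column names are discovered from the directory layout, while their data types are taken from the given schema, so no type inference is needed when the user already provided one. The helper names below are placeholders, not Spark internals.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Sketch only; a real file source would build a file index here.
class SchemaAwarePartitionInferenceSketch {
  Transform[] inferPartitioning(StructType schema, CaseInsensitiveStringMap options) {
    List<String> names = discoverPartitionColumnNames(options);
    StructType partitionSchema = new StructType();
    for (String name : names) {
      DataType type;
      try {
        type = schema.apply(name).dataType(); // pick the user-specified type
      } catch (IllegalArgumentException missingFromSchema) {
        type = DataTypes.StringType;          // placeholder for real value-based inference
      }
      partitionSchema = partitionSchema.add(name, type);
    }
    // partitionSchema (with its types) would feed the file index; the API
    // itself only returns identity transforms over the discovered columns.
    return names.stream().map(Expressions::identity).toArray(Transform[]::new);
  }

  private List<String> discoverPartitionColumnNames(CaseInsensitiveStringMap options) {
    // Placeholder: a real implementation scans directories like `date=2019-10-29`.
    return new ArrayList<>();
  }
}
```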
I think this needs to be split into multiple commits to avoid problems. I think it can be done by adding the interface changes, along with a stub implementation on the file source base classes.
It's hard to write a stub implementation. The API change is too big. Previously, file source v2 implemented schema/partition inference in `FileTable`.

I've opened #26750 to give up the refactor and only do the necessary change to accept user-specified partitioning in
What changes were proposed in this pull request?
Instead of having several overloads of the `getTable` method in `TableProvider`, it's better to have 2 explicit methods: `inferSchema` and `inferPartitioning`, plus a single `getTable` method that takes everything: schema, partitioning and properties.

Spark supports: 1) inferring both schema and partitioning, 2) specifying the schema and inferring partitioning, 3) specifying both schema and partitioning. So the `inferPartitioning` method takes `schema` as input, as the schema must be known before inferring partitioning.

A lot of changes are made to file source v2, to move the schema inference from `FileTable` to `FileDataSourceV2`.
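As a rough illustration (not part of this PR), the three supported combinations map to the public reader API roughly as follows; the format name, path and schema are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

class InferenceModesSketch {
  static void show(SparkSession spark, StructType userSchema) {
    // 1) infer both schema and partitioning
    Dataset<Row> inferred = spark.read().format("parquet").load("/tmp/data");
    // 2) user-specified schema, inferred partitioning
    Dataset<Row> withSchema = spark.read().schema(userSchema).format("parquet").load("/tmp/data");
    // 3) user-specified schema and partitioning usually comes from a table
    //    definition (CREATE TABLE ... PARTITIONED BY ...) rather than this API.
  }
}
```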
Why are the changes needed?
This is inspired by the discussion in #25651 (comment)
It's better to let the APIs have explicit meanings.
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing tests.