-
Notifications
You must be signed in to change notification settings - Fork 116
Support DataSourceV2 sources #321
Conversation
581d02c to
2f5269b
Compare
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, as there's no test, could you open the next PR including IceBergIntegrationTest based on this PR? so that we could check the code works as expected.
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
cb9295d to
5ce3bf8
Compare
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me!
Could you add a simple test case for "delete" hybrid scan in IcebergIntegrationTest?
Since Hybrid Scan test refactoring is on the way (#274), it's difficult to add Iceberg Hybrid Scan tests until the refactoring change is merged.
@imback82 Could you have a look at this? Thanks!
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
|
@andrei-ionescu Yep, I checked it; the request is adding a test case for Hybrid Scan with deleted files. You can test it by:
Please use |
5ce3bf8 to
09cf7e0
Compare
|
@sezruby I integrated this second round of feedback and rebased the PR on top of today's master. In regards to the "Hybrid Scan with deleted files" I have some questions:
There is a complexity added to it: I'll add the test on the |
|
@andrei-ionescu 2 => Yes, we need to check the exact plan transformation of Hybrid Scan + Datasource v2, but let's just check if the plan is transformed or not for now. Few things I'd like to check:
And could you share the result of
Yep I'll check the Thanks for the work 👍 👍 |
|
@sezruby The Asking for a test for If I try to add a test for it first I need to implement the source based on the
|
|
Sorry for the confusion, I meant Hybrid Scan + Iceberg. Please add the tests in #320 :) We need to make sure that this change works as expected before merging it. |
09cf7e0 to
2638de3
Compare
|
@sezruby I added the requested tests in https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3 similar to the BTW, it did show up a place where I missed adding the pattern matching on |
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
e65c924 to
c74b9f5
Compare
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 Could you review this change?
Tests are in #320 https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3R61
Thanks!
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick review. I will do a more thorough review this week. Since this is touching the core parts, I want to make sure this is reviewed thoroughly.
@apoorvedave1 / @pirz Can you please take a look?
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
| // Extract partition keys, if original data is partitioned. | ||
| val partitionSchemas = df.queryExecution.optimizedPlan.collect { | ||
| case LogicalRelation(HadoopFsRelation(_, pSchema, _, _, _, _), _, _, _) => pSchema | ||
| case DataSourceV2Relation(_, _, _, _, uSchema) => uSchema.getOrElse(StructType(Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think we should have DataSourceV2Relation specific code here. Can we move this to the source provider API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both LogicalRelation and DataSourceV2Relation are on the same level. Both directly extend LeafNode. If LogicalRelation is present here I would say that DataSourceV2Relation should also be here, as in this PR we open up to DataSourceV2 Spark API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't get your argument on the same level. Why can't we introduce partitionSchema to the source provider? I think we missed moving this into source provider since default/delta have the same implementation; we can have the different implementation (matching FileIndex) for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Delta is not built on top of DataSourceV2 Spark API thus it's not the same implementation.
- Both
LogicalRelationandDataSourceV2Relationare first "child" fromLeafNode, both directly extendLeafNode.
LeafNode
// \\
// \\
// \\
LogicalRelation DataSourceV2Relation
- This is the PR addressing support for DataSourceV2 which is Spark not Iceberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that rules shouldn't directly work with LogicalRelation or DataSourceV2Relation. I think we can abstract that out. Source provider can choose which relation it supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I will create a PR to your branch this week.
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
| baseOutput, | ||
| _, | ||
| _) => | ||
| val (filesDeleted, filesAppended) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrei-ionescu Could you also add some comments for "guides" in this file. I see that many lines are being moved / refactored without the logic being changed. It would help reviewers if you add something like "this part has moved to line blah, refactored to "foo", etc. Thanks.
3616107 to
37100fa
Compare
|
@andrei-ionescu we are planning to do the next release at the end of February. I marked the milestone for this PR accordingly. |
|
@imback82 Thanks! Is there anything else that I have to do on my side? |
|
Not at the moment. I will finish reviewing this PR this week. Thanks! |
37100fa to
469d40a
Compare
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrei-ionescu I went thru this PR once more. I see lots of code that depends on LogicalRelation and DataSourceV2Relation in rules. I think we can solve this by introducing one level of abstraction (so that rules don't need to worry about handling specific relations). I will create a PR to your branch in coming days.
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/interfaces.scala
Outdated
Show resolved
Hide resolved
| // Extract partition keys, if original data is partitioned. | ||
| val partitionSchemas = df.queryExecution.optimizedPlan.collect { | ||
| case LogicalRelation(HadoopFsRelation(_, pSchema, _, _, _, _), _, _, _) => pSchema | ||
| case DataSourceV2Relation(_, _, _, _, uSchema) => uSchema.getOrElse(StructType(Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't get your argument on the same level. Why can't we introduce partitionSchema to the source provider? I think we missed moving this into source provider since default/delta have the same implementation; we can have the different implementation (matching FileIndex) for them.
| val relation = makeHadoopFsRelation(index, v2Relation) | ||
| val updatedOutput = | ||
| output.filter(attr => relation.schema.fieldNames.contains(attr.name)) | ||
| new LogicalRelation(relation, updatedOutput, None, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we lose anything by going from DataSourceV2Relation to LogicalRelation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to overwrite the plan with the relation that the index dataset uses. The index uses Parquet files which are not related to DataSourceV2 API in any way.
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
469d40a to
fb070fc
Compare
|
@andrei-ionescu Sorry for being late to create the PR. I just created a draft PR to your branch: andrei-ionescu#1. I haven't finished it, but you can see that duplicate logics to handle I tried to finish the PR on your branch, but I am introducing some changes to the existing code and need to revert some of your code in Are you OK with it? I will make you a co-author since you inspired the refactoring. And once the PR is done, you can convert this PR to to implement |
|
@imback82 I'm ok with it and thanks for your involvement. I would like to ask you to do it ASAP because I will need to update the implementation for supporting Iceberg too. |
|
Thanks, I'll have it by tomorrow (oof today) |
…m multiple sources
The new `IndexHadoopFsRelation` wrapper was not used after rebasing Added wrapper functions to maintain the closure context
fb070fc to
4dc8a94
Compare
| case _ => false | ||
| } | ||
| case v2: DataSourceV2Relation => | ||
| v2.options.exists(_.equals(IndexConstants.INDEX_RELATION_IDENTIFIER)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: where are you injecting this option to v2.options? Looks like this will always be false?
@andrei-ionescu would it be OK for you if we do the release at the end of March. It seems more realistic to do bi-monthly releases to pack more features. But if this is a blocker for you, we can do a minor release this week. Please let me know. Thanks! |
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for DataSourceV2.
The following changes are in this PR and each of them are separate commits:
LogicaPlaninstead ofLogicalRelation. This gives us the possibility to add support for other kinds of relations like Spark datas source version 2.DataSourceV2.Does this PR introduce any user-facing change?
The source interfaces has changed. Now instead of using
LogicalRelationas parameter type it now usesLogicalPlan. There has been added support forDataSourceV2sources.Detailed information can be found in the #318 proposal.
How was this patch tested?