-
Notifications
You must be signed in to change notification settings - Fork 116
Support Iceberg table format #320
Conversation
2b55d68 to
86c510a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test file for Iceberg? - for example https://github.com/microsoft/hyperspace/blob/master/src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala and incremental refresh test
https://github.com/microsoft/hyperspace/pull/301/files#diff-f32a70d0b9c560ff5d6a55595db0f12be911fef2ccd303ec24fe0799c7b31b0eR102
BTW to reduce PR size, could you split this PR into
- support DataSourceV2 (LogicalRelation -> LogicalPlan, ExtractIndexSupportedLogicalPlan(see below comment)) + unit tests for DataSourceV2 if possible
- based on 1), IcebergFileBasedSourceProvider + Iceberg tests
This will help us to understand your change better :)
You can keep this PR for reference and open 2 new PRs.
src/main/scala/com/microsoft/hyperspace/util/HyperspaceConf.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
ec30343 to
84f7597
Compare
|
@sezruby I'll keep this PR for Iceberg related commits:
I'll create another PR for the |
6a3cf87 to
c928b80
Compare
|
I'll rebase this PR as soon as the #321 gets merged. |
c928b80 to
8ec9e5f
Compare
2abb1e1 to
e71d6e4
Compare
src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala
Outdated
Show resolved
Hide resolved
| // The index should be applied for the updated version. | ||
| assert(isIndexUsed(query().queryExecution.optimizedPlan, "iceIndex", true)) | ||
|
|
||
| // Append data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than "append", you could remove 1~2 files from the source data.
To delete the source files easily, I used partitioned data in other test cases.
& in any case, we also need to test partitioned source data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With Iceberg (version 0.10), I can only remove entire files. If I try to remove just some rows from a file it will fail the delete action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep Hyperspace indexes also only support entire file delete, not row-level delete.
Could you add a test for a partitioned table of Iceberg & Hybrid Scan append?
It's similar to this test, but use partitioned df .
e71d6e4 to
a3210bd
Compare
|
@sezruby Here are the requested plans with Hybrid Scan enabled: Right after creationAfter deleting a fileAfter adding some more dataI think they look as expected. Just FYI... IcebergSource read plan w/o index: |
|
@andrei-ionescu |
a3210bd to
a037b97
Compare
|
@sezruby This is the join |
|
@andrei-ionescu could you share sparkPlan? |
|
@sezruby The Spark plan |
|
@andrei-ionescu Seems bucketSpec is not applied properly.
Could you investigate the cause? We could check the join plan with index, but no hybrid scan case first. |
a037b97 to
345c52d
Compare
|
@sezruby I found yet another place that I missed adding the Here are the |
345c52d to
14fbec5
Compare
|
@andrei-ionescu Thanks! There's still missing 'Sort' node. Could you check this? |
|
@sezruby I did compare with the Delta output I don't see any difference. Here is the If there is something missing then it is missing from Delta too. BTW there is a |
|
From Delta Lake hybrid scan test: I think there might be no duplicated bucket between appended data & original source data in your testcase. |
|
@sezruby, In regards to your question we do add duplicate data here: https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3R186-R192. I'm not sure about the output you did paste and where that comes from. Can you please provide the code of the test? Can you point me to the test that gives this output? I want to run the same test on Iceberg too and then debug and compare the differences. Thank you. |
|
This is the test:
Then there might be a problem in quick refresh? |
|
@sezruby The suggested test is a PR that has changes on the Hybrid Scan logic. I can try take that test and add it in my PR and print out the output for both Delta and Iceberg. But I will not add the changes added by that PR in the logic of the Hybrid Scan. |
|
@sezruby I did merge your hybridtest_refactoring branch that contains the #274 changes, into my local development and I did run all the tests ( I did try to replicate the Taking into account the following things:
I would suggest merging the #321 and #274 PRs and after that I can fruitfully work on bringing this Iceberg implementation on par with Delta one. Or, merge the #274 PR first and I'll rebase the my #321 and keep the tests to validate that nothing is broken by my @sezruby what do you think? |
|
@andrei-ionescu I'm okay with either way. BTW we need @imback82's review to merge the changes :) Thanks for the great work! |
|
Sorry for the delay. I will get to #274 soon. |
4fbd068 to
e68fc57
Compare
|
@andrei-ionescu |
0b54e91 to
5b009d4
Compare
5b009d4 to
0d0dec9
Compare
5de333e to
d9505fd
Compare
beee2a0 to
14b8f35
Compare
14b8f35 to
b3fc870
Compare
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for Iceberg.
The following changes are in this PR and each of them are separate commits:
Does this PR introduce any user-facing change?
No. The main changes to user-facing APIs are in the #321 PR. Detailed information can be found in the #318 proposal.
How was this patch tested?