-
Notifications
You must be signed in to change notification settings - Fork 116
Support Iceberg table format #358
Conversation
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor/nit comments, but generally looking good to me.
I don't know every detail of Iceberg, but since the changes are self-contained, I think this is good to go. @sezruby could you also take a look? Thanks.
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me! Could you also consider adding HybridScanForIcebergTest like HybridScanForDeltaLakeTest?
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergFileBasedSource.scala
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (pending minor/nit comments + @sezruby's comments), thanks @andrei-ionescu!
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergFileBasedSource.scala
Outdated
Show resolved
Hide resolved
Yes, but we need to verify the exact plan transformation and the result of query. |
|
@sezruby I did take the Delta Lake hybrid test and modified it for Iceberg. That is what is in the integration test from line 299 to 338. Is there other hybrid test for Delta Lake that I missed? |
src/main/scala/com/microsoft/hyperspace/index/sources/iceberg/IcebergRelation.scala
Outdated
Show resolved
Hide resolved
5a23a11 to
5e89eee
Compare
|
@sezruby I added HybridScanForIcebergTest.scala. Please have another look. |
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrei-ionescu Thanks! Seems it works as expected! 👍👍
For start & end snapshot id test, I'd just like to check the use case & how those configs work well (or not well) with Hyperspace.
src/test/scala/com/microsoft/hyperspace/index/HybridScanForIcebergTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/HybridScanForIcebergTest.scala
Outdated
Show resolved
Hide resolved
12347df to
a158175
Compare
Is there such check for Delta so that I could inspire from? |
|
@andrei-ionescu No, not sure Delta has the feature.
|
|
@sezruby I'll try to see it is possible but from my knowledge on Iceberg I don't think this kind of test is possible. The snapshot ids are used for time travel and to isolate data at read time. A snapshot is a sort of full image of the dataset's metadata -- is not a delta from the previous version. This is a bit different than Delta. The parameters are extracted from the Delta has the same parameter under And in regards to time travel, I don't know that we have any functionality that allows us to link a version of the table to a version of the index yet. I would suggest to have a separate PR with time travel for Iceberg at the right time. |
|
@andrei-ionescu I guess Hybrid scan will work w/o any link of version info - it utilizes the list of source files from DataFrame. For time travel query optimization, to pick a proper version of a candidate index (if it has multiple versions from refreshes), |
a158175 to
768564e
Compare
|
@sezruby: I refactored it a bit and it is no longer needed to have knowledge of |
|
LGTM thanks @andrei-ionescu! |
|
Let me take a final look since there were some changes since my last LGTM. |
| * | ||
| * File paths should be the same format as "input_file_name()" of the given relation type. | ||
| * For [[IcebergRelation]], each file path should be in this format: | ||
| * `file:/path/to/file` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this based on the implementation below?
|
@andrei-ionescu Can you do a follow up PR to address #358 (comment)? Thanks! |
|
It has been a fantastic collaboration so far and I have personally learned a lot in the process! Thank you @andrei-ionescu! 🙂 |
|
Thanks for the work @andrei-ionescu! Could you also simply update the user document if possible? |
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for Iceberg.
The following changes are in this PR and each of them are separate commits:
Does this PR introduce any user-facing change?
No. The main changes to user-facing APIs are in the #321 PR. Detailed information can be found in the #318 proposal.
How was this patch tested?