-
Notifications
You must be signed in to change notification settings - Fork 116
Conversation
build.sbt
Outdated
| if (scalaVersion.value == scala211 || sparkVersion.value == "2.4.2") { | ||
| "io.delta" %% "delta-core" % "0.6.1" | ||
| } else { | ||
| "io.delta" %% "delta-core" % "0.7.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do "provided" since few cloud providers come with delta JARs built in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, let's not worry about 0.7.0 since it is for Spark 3.0
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
| .filesForScan(Seq(), Seq(), false) | ||
| .files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use matchingFiles or not?
Can you add comment why Seq() is fine for partition/data filters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matchingFiles seems to be used when there are additional data filters or partition filters. So I think we don't need to use it for now.
And I added partitionFilter.
| dataSchema.json, | ||
| fileFormatName, | ||
| opts) | ||
| case LogicalRelation( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is it possible that the location is TahoeBatchFileIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for version range and used in merge/update command internally. I think it would be better not to allow index creation with it to avoid complexity. WDYT? @imback82
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
Outdated
Show resolved
Hide resolved
|
Btw, I am working on the extension model so that you can plug in new source formats. I will have the PR up soon so that we can test it using Delta first. |
apoorvedave1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the approach looks fine to me. Thanks @sezruby
src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala
Show resolved
Hide resolved
|
This PR will be delivered by #265. |
|
Closed by #265 |
What is the context for this pull request?
What changes were proposed in this pull request?
Support Delta lake relation for "immutable" dataset.
This PR covers index creation/refresh(full mode)/application using FilterIndexRule and JoinIndexRule.
As we can rely on Delta Lake for data change and management, the version info and path of Delta Lake relation can be used in FileBasedSignatureProvider - instead of hash computation of all source files:
Does this PR introduce any user-facing change?
Yes, a user can create/refresh/apply indexes on Delta Lake relation.
How was this patch tested?
Unit test