-
Notifications
You must be signed in to change notification settings - Fork 116
Check available index versions for Delta Lake time travel query #389
Conversation
| } | ||
| } | ||
|
|
||
| override def getIndex(indexName: String, logVersion: Int): Option[IndexLogEntry] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: moved to above
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogManager.scala
Outdated
Show resolved
Hide resolved
| def getLatestStableLog(): Option[LogEntry] | ||
|
|
||
| /** Returns Active index log versions */ | ||
| def getActiveIndexVersions(): Seq[Int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that we are using "id" to refer the "version". Should we change it to "id"? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think version is more general term to represent it, but the code base is id.. might be good to be consistent.
But in the plan, we print it as LogVersion: ?
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (one minor comment), thanks @sezruby!
src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala
Outdated
Show resolved
Hide resolved
|
|
||
| val deltaDf = spark.read.format("delta").load(dataPath) | ||
| hyperspace.createIndex(deltaDf, IndexConfig("deltaIndex", Seq("clicks"), Seq("Query"))) | ||
| withSQLConf(IndexConstants.INDEX_LINEAGE_ENABLED -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this required now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I added the condition above:
https://github.com/microsoft/hyperspace/pull/389/files/b938556191c3dcceab0ec784fa26defc5b02804b#diff-ccc638dd1aab8f054c6d13cdfafeea9c1634646e90dd999b491e4e43a18d9efbR188
It's possible to allow append-only hybrid scan if possible, but the condition in closestIndex will become more complicated. So just limit the case for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so before this PR, append-only hybrid scan was supported? I am curious which change in this PR triggered this new requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be supported. The index was not applied because of the conditions in getHybridScanCandidate if the version has deleted files (compared to the given relation) - before this PR.
I found that I was writing a test for directory check change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Could you add a comment to code why we are setting the lineage enabled to true (and same for line 107)?
| } | ||
| // TODO: Currently assume all versions of index data exist. | ||
| // Need to check and remove candidate indexes. | ||
| // See https://github.com/microsoft/hyperspace/issues/387 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be closed now, or no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could close the issue now as index data existence is a different issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Outdated
Show resolved
Hide resolved
|
|
||
| val deltaDf = spark.read.format("delta").load(dataPath) | ||
| hyperspace.createIndex(deltaDf, IndexConfig("deltaIndex", Seq("clicks"), Seq("Query"))) | ||
| withSQLConf(IndexConstants.INDEX_LINEAGE_ENABLED -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Could you add a comment to code why we are setting the lineage enabled to true (and same for line 107)?
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala
Outdated
Show resolved
Hide resolved
…ltaLakeRelation.scala
What is the context for this pull request?
Follow up PR - #272 (comment)
Fixes #387
What changes were proposed in this pull request?
Check the existence of index data directories for each index log version before using them in Delta Lake time travel query.
To do so, added below APIs:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test