Check available index versions for Delta Lake time travel query #389

sezruby · 2021-03-23T21:43:04Z

What is the context for this pull request?

Tracking Issue: n/a
Parent Issue: n/a
Dependencies: n/a

Fixes #387

What changes were proposed in this pull request?

Check that the index data directory exists for each index log version before using that version in a Delta Lake time travel query.
To do so, this PR adds the following APIs:

// in IndexLogManager
/** Returns index log versions whose state is in the given states */
def getIndexVersions(states: Seq[String]): Seq[Int]

// in IndexManager
/**
 * Get index log version ids of the given index that match any of the given states.
 *
 * @param indexName Name of the index.
 * @param states List of index states of interest.
 * @return Index log versions.
 */
def getIndexVersions(indexName: String, states: Seq[String]): Seq[Int]
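
For illustration only, here is a minimal sketch of how a state-based version filter like this could behave. `LogEntry`, `InMemoryIndexLogManager`, the state strings, and the ascending result order are all assumptions made for the example; they are not the classes or guarantees of this PR.

```scala
// Hypothetical, simplified stand-in for an index log; not the real Hyperspace classes.
object IndexVersionsSketch {
  // Minimal log entry: a log version id and the state recorded for it.
  final case class LogEntry(id: Int, state: String)

  // In-memory log manager holding one entry per log version.
  final class InMemoryIndexLogManager(entries: Seq[LogEntry]) {
    /** Returns index log versions whose state is in the given states (ascending order assumed). */
    def getIndexVersions(states: Seq[String]): Seq[Int] =
      entries.filter(e => states.contains(e.state)).map(_.id).sorted
  }

  // Sample log: two versions are in the "ACTIVE" state, two are not.
  val sampleLog = new InMemoryIndexLogManager(
    Seq(
      LogEntry(0, "CREATING"),
      LogEntry(1, "ACTIVE"),
      LogEntry(2, "REFRESHING"),
      LogEntry(3, "ACTIVE")))
}
```

Calling `IndexVersionsSketch.sampleLog.getIndexVersions(Seq("ACTIVE"))` on the sample log keeps only the versions recorded as active. Passing several states keeps a version if it matches any of them.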

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

sezruby · 2021-03-23T21:43:42Z

src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala

    }
  }

-  override def getIndex(indexName: String, logVersion: Int): Option[IndexLogEntry] = {


Note: moved above.

src/main/scala/com/microsoft/hyperspace/Hyperspace.scala

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala

src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogManager.scala

imback82 · 2021-03-29T17:52:05Z

src/main/scala/com/microsoft/hyperspace/index/IndexLogManager.scala

  def getLatestStableLog(): Option[LogEntry]

+  /** Returns Active index log versions */
+  def getActiveIndexVersions(): Seq[Int]


I noticed that we use "id" to refer to the "version". Should we change it to "id"? WDYT?

I think "version" is the more general term for it, but the code base uses "id", so it might be good to be consistent.
But in the plan, we print it as LogVersion: ?

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala

imback82

LGTM (one minor comment), thanks @sezruby!

src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala

imback82 · 2021-04-02T05:12:05Z

src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala


      val deltaDf = spark.read.format("delta").load(dataPath)
-      hyperspace.createIndex(deltaDf, IndexConfig("deltaIndex", Seq("clicks"), Seq("Query")))
+      withSQLConf(IndexConstants.INDEX_LINEAGE_ENABLED -> "true") {


why is this required now?

Since I added the condition above:
https://github.com/microsoft/hyperspace/pull/389/files/b938556191c3dcceab0ec784fa26defc5b02804b#diff-ccc638dd1aab8f054c6d13cdfafeea9c1634646e90dd999b491e4e43a18d9efbR188

It would be possible to allow append-only hybrid scan as well, but the condition in closestIndex would become more complicated, so this just limits the case for now.

Hmm, so before this PR, append-only hybrid scan was supported? I am curious which change in this PR triggered this new requirement.

It might have been supported. Before this PR, the index was not applied if the version had deleted files (compared to the given relation), because of the conditions in getHybridScanCandidate.

I found this while writing a test for the directory check change.
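
To make the trade-off in this thread concrete, here is a rough sketch of why the direction of the nearest index version matters. It is not the `closestIndex` implementation from this PR. `Built`, `closest`, and `canHandleDeletes` are hypothetical names, and the sketch assumes a table whose history only ever appends files, so an index built from a later Delta version contains files that look "deleted" relative to the older snapshot being queried.

```scala
// Hypothetical sketch; not the closestIndex logic in DeltaLakeRelation.
object ClosestIndexSketch {
  // Pairs an index log version with the Delta table version it was built from.
  final case class Built(indexVersion: Int, deltaVersion: Long)

  /**
   * Picks the candidate whose source Delta version is nearest to the queried one.
   * When deletes cannot be handled, only candidates built at or before the
   * queried version are eligible, so the query only has to add appended files.
   */
  def closest(candidates: Seq[Built], queried: Long, canHandleDeletes: Boolean): Option[Built] = {
    val eligible =
      if (canHandleDeletes) candidates
      else candidates.filter(_.deltaVersion <= queried)
    if (eligible.isEmpty) None
    else Some(eligible.minBy(b => math.abs(b.deltaVersion - queried)))
  }
}
```

Under these assumptions, allowing the append-only case means carrying an extra eligibility rule like the filter above, which is the kind of added complexity the comment refers to.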

I see. Could you add a comment to the code explaining why we set lineage enabled to true (and the same for line 107)?

imback82 · 2021-04-02T15:52:24Z

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala

    }
-    // TODO: Currently assume all versions of index data exist.
-    //  Need to check and remove candidate indexes.
-    //  See https://github.com/microsoft/hyperspace/issues/387


Can this be closed now, or no?

I think we could close the issue now as index data existence is a different issue.
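
As a rough illustration of the kind of existence check the removed TODO asked for, the sketch below filters candidate versions down to those whose data directory is present. `availableVersions` and the `dataDirOf` mapping are hypothetical, and the `v__=N` directory naming in the usage note is an assumption about the on-disk layout, not something stated in this thread.

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch; not the code added to DeltaLakeRelation in this PR.
object AvailableVersionsSketch {
  /**
   * Keeps only the versions whose index data directory exists.
   * `dataDirOf` maps a version to the directory expected to hold its data.
   */
  def availableVersions(versions: Seq[Int], dataDirOf: Int => Path): Seq[Int] =
    versions.filter(v => Files.isDirectory(dataDirOf(v)))
}
```

For example, `AvailableVersionsSketch.availableVersions(Seq(1, 2, 3), v => root.resolve(s"v__=$v"))` drops any version whose directory under `root` has been removed, for instance by a vacuum.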

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala

imback82 · 2021-04-02T15:57:42Z

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala

Support index log version in index statistics API

b186b67

sezruby requested a review from imback82 March 23, 2021 21:43

sezruby commented Mar 23, 2021

sezruby self-assigned this Mar 23, 2021

fix test

a85d22b

imback82 reviewed Mar 28, 2021

src/main/scala/com/microsoft/hyperspace/Hyperspace.scala (outdated, resolved)

add availableIndexVersions

9cc09b9

sezruby changed the title ~~Support index log version in index statistics API~~ Check available index versions for Delta Lake time travel query Mar 29, 2021

sezruby force-pushed the getindexpublic branch from 5a4d774 to 9cc09b9 Compare March 29, 2021 16:59

sezruby commented Mar 29, 2021

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala (resolved)

sezruby mentioned this pull request Mar 29, 2021

Option to clean up log files in vacuumIndex #397

Open

imback82 reviewed Mar 29, 2021

sezruby added 5 commits March 29, 2021 20:11

review commit

8a6ba11

review commit

e51d5b1

revert

35522eb

misc

c8871c8

Merge remote-tracking branch 'upstream/master' into getindexpublic

b938556

imback82 reviewed Apr 2, 2021

review commit

a7ad6bd

imback82 reviewed Apr 2, 2021

sezruby mentioned this pull request Apr 4, 2021

Support time travel query optimization with append-only Hybrid Scan #408

Open

sezruby added 3 commits April 4, 2021 10:01

review commit

2e6547d

misc

5ad4011

misc2

1839d98

imback82 previously approved these changes Apr 5, 2021

src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeRelation.scala (outdated, resolved)

Update src/main/scala/com/microsoft/hyperspace/index/sources/delta/De…

76276ae

…ltaLakeRelation.scala

imback82 dismissed stale reviews from themself via 76276ae April 5, 2021 05:07

imback82 approved these changes Apr 5, 2021

Merge branch 'master' into getindexpublic

5adbec8

imback82 added the enhancement New feature or request label Apr 5, 2021

imback82 added this to the February/March 2021 (v0.5.0) milestone Apr 5, 2021

imback82 merged commit edafcfe into microsoft:master Apr 5, 2021

sezruby deleted the getindexpublic branch April 30, 2021 03:27
