This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Collaborator

@sezruby sezruby commented Mar 23, 2021

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

Follow up PR - #272 (comment)

Fixes #387

What changes were proposed in this pull request?

Check that the index data directory exists for each index log version before using that version in a Delta Lake time travel query.
To do so, this PR adds the following APIs:

// in IndexLogManager
/** Returns index log versions whose state is in the given states */
def getIndexVersions(states: Seq[String]): Seq[Int]

// in IndexManager
/**
 * Get index log version ids of the given index that match any of the given states.
 *
 * @param indexName Name of the index.
 * @param states List of index states of interest.
 * @return Index log versions.
 */
def getIndexVersions(indexName: String, states: Seq[String]): Seq[Int]
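
As a rough illustration of the idea described above (filter log versions by state, then keep only those whose data directory still exists), here is a minimal, self-contained sketch. It is not the code in this PR: LogEntryStub, IndexVersionSketch, availableVersions and the dirOf parameter are hypothetical names, and the state strings are only examples.

```scala
import java.nio.file.{Files, Path}

// Hypothetical, simplified stand-in for an index log entry:
// a log version id plus the state recorded for that version.
final case class LogEntryStub(id: Int, state: String)

object IndexVersionSketch {

  /** Returns the ids of log entries whose state is one of `states`, ascending. */
  def getIndexVersions(log: Seq[LogEntryStub], states: Seq[String]): Seq[Int] =
    log.filter(entry => states.contains(entry.state)).map(_.id).sorted

  /**
   * Keeps only the versions whose data directory exists on disk.
   * `dirOf` maps a log version to its data directory; the real mapping is
   * not shown in this conversation, so the caller supplies it (assumption).
   */
  def availableVersions(versions: Seq[Int], dirOf: Int => Path): Seq[Int] =
    versions.filter(version => Files.isDirectory(dirOf(version)))
}
```

A caller would first collect the versions in the states of interest and then pass them through availableVersions, so that a version whose directory has been deleted is never offered as a time travel candidate.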

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

@sezruby sezruby requested a review from imback82 March 23, 2021 21:43
}
}

override def getIndex(indexName: String, logVersion: Int): Option[IndexLogEntry] = {
Collaborator Author

note: moved to above

@sezruby sezruby self-assigned this Mar 23, 2021
@sezruby sezruby changed the title from "Support index log version in index statistics API" to "Check available index versions for Delta Lake time travel query" Mar 29, 2021
def getLatestStableLog(): Option[LogEntry]

/** Returns Active index log versions */
def getActiveIndexVersions(): Seq[Int]
Contributor

I noticed that we are using "id" to refer to the "version". Should we change it to "id"? WDYT?

Collaborator Author

I think "version" is the more general term for it, but the code base uses "id", so it might be good to be consistent.
But in the plan, we print it as LogVersion: ?

Contributor

@imback82 imback82 left a comment

LGTM (one minor comment), thanks @sezruby!


val deltaDf = spark.read.format("delta").load(dataPath)
hyperspace.createIndex(deltaDf, IndexConfig("deltaIndex", Seq("clicks"), Seq("Query")))
withSQLConf(IndexConstants.INDEX_LINEAGE_ENABLED -> "true") {
Contributor

why is this required now?

Collaborator Author

Since I added the condition above:
https://github.com/microsoft/hyperspace/pull/389/files/b938556191c3dcceab0ec784fa26defc5b02804b#diff-ccc638dd1aab8f054c6d13cdfafeea9c1634646e90dd999b491e4e43a18d9efbR188

It would be possible to allow append-only hybrid scan, but the condition in closestIndex would become more complicated, so I am just limiting the case for now.
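
For readers unfamiliar with the selection step mentioned here, the following is a deliberately simplified sketch of choosing an index version for a time travel query. It is not the closestIndex implementation from this PR: VersionPair, ClosestIndexSketch and closestIndexVersion are hypothetical names, and the policy (take the newest available index version built from a Delta version at or before the queried one) is an assumption chosen for illustration only.

```scala
object ClosestIndexSketch {

  // Hypothetical record: index log version `indexVersion` was built from
  // Delta table version `deltaVersion`.
  final case class VersionPair(indexVersion: Int, deltaVersion: Long)

  /**
   * Picks an index log version for a query against Delta version `queried`.
   * Only versions in `available` (for example, those whose data directory
   * exists) are considered. Assumed policy: prefer the index built from the
   * newest Delta version that is not newer than the queried one.
   */
  def closestIndexVersion(
      history: Seq[VersionPair],
      available: Set[Int],
      queried: Long): Option[Int] = {
    val candidates = history.filter { pair =>
      available.contains(pair.indexVersion) && pair.deltaVersion <= queried
    }
    if (candidates.isEmpty) None
    else Some(candidates.maxBy(_.deltaVersion).indexVersion)
  }
}
```

Restricting candidates to the available set is the part that relates to this PR; supporting further cases, such as an append-only hybrid scan, would add more branches to the filter.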

Contributor

Hmm, so before this PR, append-only hybrid scan was supported? I am curious which change in this PR triggered this new requirement.

Collaborator Author

It might have been supported. Before this PR, the index was not applied if the version had deleted files (compared to the given relation), because of the conditions in getHybridScanCandidate.

I found this while writing a test for the directory check change.

Contributor

I see. Could you add a comment in the code explaining why we are setting lineage enabled to true (and the same for line 107)?

}
// TODO: Currently assume all versions of index data exist.
// Need to check and remove candidate indexes.
// See https://github.com/microsoft/hyperspace/issues/387
Contributor

Can this be closed now, or no?

Collaborator Author

I think we could close the issue now as index data existence is a different issue.

Contributor

sgtm


imback82 previously approved these changes Apr 5, 2021
@imback82 imback82 dismissed stale reviews from themself via 76276ae April 5, 2021 05:07
@imback82 imback82 added the enhancement New feature or request label Apr 5, 2021
@imback82 imback82 added this to the February/March 2021 (v0.5.0) milestone Apr 5, 2021
@imback82 imback82 merged commit edafcfe into microsoft:master Apr 5, 2021
@sezruby sezruby deleted the getindexpublic branch April 30, 2021 03:27
Development

Successfully merging this pull request may close these issues.

Check removed index log versions of time travel query index candidates

2 participants