This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation


@sezruby sezruby commented Nov 25, 2020

What is the context for this pull request?

Fixes #270

What changes were proposed in this pull request?

This PR improves Hybrid Scan for a time travel query of Delta Lake table by using an old version of index which is closest to the given time travel delta version.

This PR includes:

  • Keep the Delta Lake table version at each index create/refresh in IndexLogEntry metadata.
    • Stored in CoveringIndex.properties.properties (to avoid scanning all index log entry files).
      • key: DELTA_VERSION_HISTORY_PROPERTY
      • value: comma-separated "INDEX_VERSION_ID:DELTA_VERSION_ID" pairs
        • e.g. 1:1,3:5,5:7,7:10,9:15 for refreshes at delta versions 1, 5, 7, 10, 15
  • Find and apply an older version of the index in getCandidateIndexes.
    • New closestIndexVersion API in the source provider.
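The lookup described above can be sketched as follows. This is an illustrative sketch, not the actual Hyperspace implementation; VersionHistorySketch, parse, and closestIndexVersion are assumed names.

```scala
// Sketch: parse the DELTA_VERSION_HISTORY_PROPERTY value and find the index
// version whose recorded delta version is closest to (at or below) the
// requested time-travel delta version. Names are illustrative.
object VersionHistorySketch {
  // e.g. "1:1,3:5,5:7,7:10,9:15" => Seq((1, 1), (3, 5), (5, 7), (7, 10), (9, 15))
  def parse(history: String): Seq[(Int, Int)] =
    history.split(",").toSeq.map { pair =>
      val Array(indexVer, deltaVer) = pair.trim.split(":")
      (indexVer.toInt, deltaVer.toInt)
    }

  // Pick the index log version for the closest delta version at or below
  // the requested one; None if the index was created only after it.
  def closestIndexVersion(history: String, deltaVersion: Int): Option[Int] =
    parse(history)
      .filter { case (_, d) => d <= deltaVersion }
      .sortBy(_._2)
      .lastOption
      .map(_._1)
}
```

For example, a time-travel query against delta version 8 would resolve to index log version 5 (recorded at delta version 7).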

Does this PR introduce any user-facing change?

Yes, an old version of index can be applied for a time travel query on Delta Lake.

How was this patch tested?

Unit tests

with Action {
final override def logEntry: LogEntry = getIndexLogEntry(spark, df, indexConfig, indexDataPath)
final override def logEntry: LogEntry =
getIndexLogEntry(spark, df, indexConfig, indexDataPath, super[Action].endId)
Collaborator Author
Note: we need to get the index version id, but it's not available in CreateActionBase.getIndexLogEntry.
So I added an endId function and the with Action mixin.

@sezruby sezruby changed the title [WIP] Add Delta Lake version history for efficient Hybrid Scan [WIP] Add Delta Lake version history for efficient time travel query Nov 25, 2020
@sezruby sezruby force-pushed the source_extension_timetravel branch from d843425 to 6e8b2d6 Compare November 25, 2020 09:12
@sezruby sezruby self-assigned this Nov 25, 2020
@sezruby sezruby added this to the November 2020 milestone Nov 25, 2020
@sezruby sezruby added the advanced issue and enhancement labels Nov 25, 2020
@sezruby sezruby changed the title [WIP] Add Delta Lake version history for efficient time travel query Add Delta Lake version history to IndexLogEntry for efficient time travel query Nov 25, 2020
@sezruby sezruby force-pushed the source_extension_timetravel branch 2 times, most recently from 3dc848c to 66e736f Compare November 26, 2020 10:47
@sezruby sezruby force-pushed the source_extension_timetravel branch from 66e736f to 862bb98 Compare December 9, 2020 06:42
@sezruby
Collaborator Author

sezruby commented Dec 10, 2020

@pirz @apoorvedave1 Could you review the change and comment on the approach of keeping the version history in properties? Thanks!

// Get the timestamp string for time travel query using "timestampAsOf" option.
val timestampStr = getSparkFormattedTimestamps(System.currentTimeMillis).head
// Sleep 1 second because the unit of commit timestamp is second.
Thread.sleep(1000)
Contributor
Avoid using sleep in unit tests, as it takes too much time, especially when this method is called multiple times.

To simulate time, keep a global time variable (currTime = System.currentTimeMillis), add 1000 to it whenever you want to move time forward, and pass the result to the getSparkFormattedTimestamps function.
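The reviewer's suggestion could look roughly like this; SimulatedClock and tick are hypothetical names, and wiring the result into getSparkFormattedTimestamps is left out:

```scala
// Sketch of a simulated clock so each version update gets a distinct
// second-granularity timestamp without calling Thread.sleep.
class SimulatedClock(start: Long = System.currentTimeMillis) {
  private var currTime: Long = start

  // Advance the clock by one second and return the new time;
  // the caller would pass this to getSparkFormattedTimestamps.
  def tick(): Long = {
    currTime += 1000
    currTime
  }

  def now: Long = currTime
}
```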

Collaborator Author

I added this because Delta Lake internally records the time at second granularity. So I forced a different time to be recorded for each version update, so that I could test the query with a timestamp properly.

Contributor

I don't think so. I ran your test and used DeltaTable.forPath(spark, dataPath).history().show(10, false) to view the timestamp (2021-02-02 14:49:19.077). It uses milliseconds, and each op should have a readVersion?

Contributor

You can use the timestamp in delta table to store read versions ?

Collaborator Author

Seems Azure test framework needs sleep for some reason :D.. f19614b
I might have added this after debugging / investigation.

Contributor

I added this because Delta Lake internally records the time at second granularity. So I forced a different time to be recorded for each version update, so that I could test the query with a timestamp properly.

Maybe add this comment to the code? Reading the existing comment, I didn't know why we had to sleep.

*/
trait Action extends HyperspaceEventLogging with Logging with ActiveSparkSession {
protected val baseId: Int = logManager.getLatestId().getOrElse(-1)
protected def endId: Int = baseId + 2
Contributor

Sorry, why +2? Can you please explain?

Collaborator Author

It's because we write the log twice: at begin() using baseId + 1 and at end() using baseId + 2.
Since begin() has an in-progress state (e.g. creating), Hyperspace always reads the log file written in end().
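A minimal sketch of the two-phase id scheme described above (LogIdSketch and the function names are illustrative, not the actual Action trait):

```scala
// Sketch: begin() writes an in-progress log entry at baseId + 1 and end()
// writes the final entry at baseId + 2; readers always resolve the "end" id.
object LogIdSketch {
  def beginId(baseId: Int): Int = baseId + 1 // in-progress entry (e.g. CREATING)
  def endId(baseId: Int): Int = baseId + 2   // final entry Hyperspace reads back
}
```

Starting from no log (latest id -1), the first operation writes ids 0 and 1, the next writes 2 and 3, and so on.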

Contributor

nit: Can we rename these to firstLogEntryId and lastLogEntryId, and add a comment for them?
Also, should we assert that firstLogEntryId is always -1 or an odd number?

Collaborator Author

baseId is not firstLogEntryId, and that would require some changes.
If an operation failed in op(), then firstLogEntryId could be an odd number. (Though failures are not handled properly now, e.g. #248.)


Collaborator Author
This is required because LogEntry.id is not available when creating the IndexLogEntry.

Contributor

Can we add a final modifier?

@imback82
Contributor

imback82 commented Feb 2, 2021

@sezruby Could you fix the conflicts?

Unit tests (test will be added)

This can be removed now?

}

/**
* Returns index version related properties.


nit: That looks like something for @return below?

Collaborator Author

Yes, but this line is consistent with src/main/scala/com/microsoft/hyperspace/index/sources/interfaces.scala, and other functions/source providers have comments written the same way.
Let's keep this for consistency for now. Thanks!

@imback82
Contributor

@sezruby Could you resolve the conflicts? Let's try to get this in next (sorry for the delay).

@apoorvedave1 Can you also help with the review when you get a chance? Thanks.

@sezruby sezruby force-pushed the source_extension_timetravel branch from ded45b6 to 7615f92 Compare March 2, 2021 03:33
@sezruby sezruby requested review from imback82 and removed request for pirz March 19, 2021 02:09
Contributor

@imback82 imback82 left a comment

Almost there

* @param logVersion Index log version to retrieve.
* @return IndexLogEntry if the index of the given log version exists, otherwise None.
*/
def getIndexLogEntry(
Contributor

We have def getIndexes(states: Seq[String] = Seq()): Seq[IndexLogEntry]. Should we make the naming consistent? Also, we should expose an API to return log versions: #272 (comment).

Contributor

If you want, you can create a separate PR that just updates IndexManager APIs, but I am also fine if you do it in this PR.

Collaborator Author

I will do this in a follow-up PR.

Contributor

sounds good!


@sezruby sezruby force-pushed the source_extension_timetravel branch from 661f634 to 04b3b53 Compare March 22, 2021 00:04

Contributor

@imback82 imback82 left a comment

LGTM (a few minor comments), thanks @sezruby!

.getOrElse(DeltaLakeConstants.DELTA_VERSION_HISTORY_PROPERTY, "")

// The value is comma separated versions - <index log version>:<delta table version>.
// e.g. "1:2,3:5,5:9"
Contributor

// Versions are processed in a reverse order to keep the higher index log version in case different index
// log versions refer to the same delta lake version.
// For example, "1:1, 2:2, 3:2" will become Seq((1, 1), (3, 2)).

, if I understand this correctly?
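The reverse-order dedup the suggested comment describes could be sketched like this; DedupSketch and dedupVersions are illustrative names, not the actual implementation:

```scala
// Sketch: when several index log versions map to the same delta version,
// keep only the highest index log version for that delta version.
// e.g. "1:1,2:2,3:2" => Seq((1, 1), (3, 2))
object DedupSketch {
  def dedupVersions(history: String): Seq[(Int, Int)] = {
    val pairs = history.split(",").toSeq.map { p =>
      val Array(indexVer, deltaVer) = p.trim.split(":")
      (indexVer.toInt, deltaVer.toInt)
    }
    // Walk in reverse so the first pair kept per delta version carries the
    // highest index log version, then restore ascending order.
    pairs.reverse
      .foldLeft(Seq.empty[(Int, Int)]) { (acc, p) =>
        if (acc.exists(_._2 == p._2)) acc else acc :+ p
      }
      .reverse
  }
}
```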


Labels

advanced issue: This is the tag for advanced issues which involve major design changes or introduction
enhancement: New feature or request


Development

Successfully merging this pull request may close these issues.

Improvement of Time Travel Query on Delta Lake

4 participants