Support Delta lake relation #197

sezruby · 2020-10-12T02:17:40Z

What is the context for this pull request?

Tracking Issue: n/a
Parent Issue: Hyperspace for Delta Lake #148
Dependencies: Pluggable source provider #227

What changes were proposed in this pull request?

Support Delta Lake relations for "immutable" datasets.
This PR covers index creation, refresh (full mode), and index application through FilterIndexRule and JoinIndexRule.

Because Delta Lake already tracks data changes and manages the table, FileBasedSignatureProvider can use the version and path of the Delta Lake relation as the signature, instead of computing a hash over all source files:

...
      case LogicalRelation(
          HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
          _,
          _,
          _) =>
        fingerprint ++= location.allFiles.foldLeft("")(
          (accumulate: String, fileStatus: FileStatus) =>
            HashingUtils.md5Hex(accumulate + getFingerprint(fileStatus)))
      case LogicalRelation(
          HadoopFsRelation(location: TahoeLogFileIndex, _, _, _, _, _),
          _,
          _,
          _) =>
        fingerprint ++= location.tableVersion + location.path.toString
...
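
To make the difference between the two branches concrete, here is a minimal, self-contained sketch. It is not the Hyperspace implementation: `SourceFile` is a hypothetical stand-in for Hadoop's `FileStatus`, `md5Hex` stands in for `HashingUtils.md5Hex`, and the per-file fingerprint (length, modification time, path) is an assumption made for illustration.

```scala
import java.security.MessageDigest

object SignatureSketch {
  // Hypothetical stand-in for Hadoop's FileStatus; only the fields used here.
  final case class SourceFile(path: String, length: Long, modificationTime: Long)

  // Hex-encoded MD5 of a string (stand-in for HashingUtils.md5Hex).
  def md5Hex(s: String): String =
    MessageDigest
      .getInstance("MD5")
      .digest(s.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Assumed per-file fingerprint: length, modification time, and path.
  def fileFingerprint(f: SourceFile): String =
    s"${f.length}${f.modificationTime}${f.path}"

  // Generic file-based branch: chain an MD5 over every source file,
  // so the cost grows with the number of files.
  def fileBasedSignature(files: Seq[SourceFile]): String =
    files.foldLeft("")((acc, f) => md5Hex(acc + fileFingerprint(f)))

  // Delta branch: table version plus table path, with no per-file work.
  def deltaSignature(tableVersion: Long, tablePath: String): String =
    tableVersion.toString + tablePath
}
```

The first function has to visit every file, while the second relies on the table version and path to identify the snapshot, which is why it only suits data whose changes are tracked by Delta Lake.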

Does this PR introduce any user-facing change?

Yes, a user can create, refresh, and apply indexes on Delta Lake relations.

How was this patch tested?

Unit test

build.sbt

imback82 · 2020-10-22T04:10:34Z

+  if (scalaVersion.value == scala211 || sparkVersion.value == "2.4.2") {
+    "io.delta" %% "delta-core" % "0.6.1"
+  } else {
+    "io.delta" %% "delta-core" % "0.7.0"


I think we should use the "provided" scope, since a few cloud providers come with Delta JARs built in.

Also, let's not worry about 0.7.0, since it is for Spark 3.0.
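
A sketch of what that suggestion could look like in build.sbt, assuming a single Delta version (0.6.1, the one in the diff above) is kept. This is an illustration, not the change that was actually committed:

```scala
// build.sbt fragment (sketch): "provided" keeps delta-core on the compile and
// test classpaths but leaves it out of the runtime classpath, so the
// environment's own Delta JARs are used when the application runs.
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1" % "provided"
```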

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

imback82 · 2020-10-22T04:26:54Z

+          .filesForScan(Seq(), Seq(), false)
+          .files


Can we use matchingFiles here or not?

Can you add a comment explaining why Seq() is fine for the partition/data filters?

matchingFiles seems to be used when there are additional data filters or partition filters, so I think we don't need it for now.

I also added partitionFilter.

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

imback82 · 2020-10-22T04:36:18Z

          dataSchema.json,
          fileFormatName,
          opts)
+      case LogicalRelation(


Question: Is it possible that the location is TahoeBatchFileIndex?

TahoeBatchFileIndex is for a version range and is used internally by the merge/update commands. I think it would be better not to allow index creation with it, to avoid complexity. WDYT? @imback82

imback82 · 2020-10-24T01:07:15Z

Btw, I am working on the extension model so that you can plug in new source formats. I will have the PR up soon so that we can test it using Delta first.

apoorvedave1

The approach looks fine to me. Thanks, @sezruby.

sezruby · 2020-11-25T09:35:30Z

This PR will be delivered by #265.

sezruby · 2020-12-09T12:40:22Z

Closed by #265

sezruby self-assigned this Oct 12, 2020

sezruby added this to the 0.5.0 milestone Oct 12, 2020

imback82 modified the milestones: October 2020, November 2020 Oct 13, 2020

Support Delta lake relation

2615d95

sezruby force-pushed the delta branch from 6c0ff96 to 042c908 Compare October 19, 2020 04:43

minor fix

8892fc0

sezruby force-pushed the delta branch from 042c908 to 8892fc0 Compare October 19, 2020 04:45

sezruby requested review from imback82 and pirz October 19, 2020 04:45

build fix

eddd24b

sezruby requested review from AFFogarty and apoorvedave1 October 19, 2020 05:41

sezruby added the enhancement New feature or request label Oct 19, 2020

minor fix

7c53a72

This was referenced Oct 21, 2020

Support Hybrid Scan with Delta Lake relation #224

Closed

Hyperspace for Delta Lake #148

Open

sezruby force-pushed the delta branch from 5bebe4f to f188591 Compare October 21, 2020 10:30

handle timestampAsOf

439ea5f

sezruby force-pushed the delta branch from f188591 to 439ea5f Compare October 21, 2020 10:31

AFFogarty reviewed Oct 21, 2020

imback82 modified the milestones: November 2020, October 2020 Oct 22, 2020

imback82 reviewed Oct 22, 2020

sezruby added 2 commits October 22, 2020 15:33

review commit

fc0e994

fix build.sbt

9f2e71f

rapoth modified the milestones: October 2020, November 2020 Oct 29, 2020

sezruby added 3 commits November 6, 2020 11:42

Merge remote-tracking branch 'upstream/master' into delta

5c2305a

build fix

20f266e

comment fix

4fb1507

apoorvedave1 reviewed Nov 17, 2020

pirz reviewed Nov 18, 2020

imback82 mentioned this pull request Nov 23, 2020

Pluggable source provider #227

Merged

This was referenced Nov 23, 2020

Support Delta Lake file-based source provider #265

Merged

Add Delta Lake version history to IndexLogEntry for efficient time travel query #272

Merged

sezruby marked this pull request as draft November 25, 2020 09:33

sezruby closed this Dec 9, 2020

sezruby deleted the delta branch April 30, 2021 03:25
