Support Delta Lake file-based source provider #265
Conversation
Btw, can you check the build failure?
Force-pushed 7bd4ee1 to e043245
#227 has been merged. Thanks!
Force-pushed 06b3034 to e12ef65
  .format(latestRelation.fileFormat)
  .options(latestRelation.options)
  .load(latestRelation.rootPaths: _*)
if (latestRelation.rootPaths.size == 1) {
Delta Lake only allows one path in load()
Yea, we can do something like:

val df = spark.read
  .schema(dataSchema)
  .format(latestRelation.fileFormat)
  .options(latestRelation.options)
// Due to the difference in how the "path" option is set: https://github.com/apache/spark/blob/ef1441b56c5cab02335d8d2e4ff95cf7e9c9b9ca/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L197,
// load() with a single parameter needs to be handled differently.
if (latestRelation.rootPaths.size == 1) {
  df.load(latestRelation.rootPaths.head)
} else {
  df.load(latestRelation.rootPaths: _*)
}
Btw, what happens if the delta lake implementation adds the "path" option in latestRelation.options?
https://github.com/microsoft/hyperspace/pull/265/files#diff-f7ecc68d1799e9c2c916973786081d5f35f312785537ccd2eed61bce41a10786R72
"path" key will be removed in createRelation.
import spark.implicits._
val dataPathColumn = "_data_path"
val lineageDF = fileIdTracker.getFileToIdMap.toSeq
val isDeltaLakeSource = DeltaLakeRuleUtils.isDeltaLakeSource(df.queryExecution.optimizedPlan)
Can we move this to the source provider somehow? Basically, we don't want source-specific implementations outside the provider.
Question: why does Delta not require the replace?
Delta's input_file_name() function returns a "file:/" prefix, not "file:///". I think we need to add an assert for this; the join result will be empty if they don't match.
I see. How about we normalize the output of input_file_name by removing "file:/" or "file:///"? (and the same for the getFileToIdMap keys)
Replacing the filename-to-id map is cheaper than normalizing all file paths of all rows. I'll add an API for this.
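For reference, the prefix mismatch could be handled with a small normalizer like the sketch below; `normalizeFileUri` is a hypothetical helper, not an existing Hyperspace API. Applying it once to the map keys (rather than to every row) matches the approach above:

```scala
object PathNormalizer {
  // Delta's input_file_name() yields URIs like "file:/tmp/t/part-0.parquet",
  // while FileStatus-derived keys may look like "file:///tmp/t/part-0.parquet".
  // Strip either scheme prefix so both sides of the lineage join agree.
  def normalizeFileUri(uri: String): String =
    if (uri.startsWith("file:///")) uri.stripPrefix("file://")
    else if (uri.startsWith("file:/")) uri.stripPrefix("file:")
    else uri
}
```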
src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala (outdated; comments resolved)
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala (outdated; comments resolved)
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, In, Literal, Not}
import org.apache.spark.sql.catalyst.optimizer.OptimizeIn
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.delta.files.TahoeLogFileIndex
imback82
left a comment
Generally, the approach/integration looks good! I will do a detailed review this week.
@pirz @apoorvedave1 Could you take a look as well?
LGTM, thanks @sezruby
Note to reviewers: more hybrid scan + delta lake test cases - #274
Is the result comparison tested in #274?
Yes, checkAnswer is used to compare the results in #274.
pirz
left a comment
LGTM, Thanks @sezruby
  useBucketSpec: Boolean): LogicalPlan = {
  val isParquetSourceFormat = index.relations.head.fileFormat.equals("parquet")
  val fileFormat = index.relations.head.fileFormat
  val isParquetSourceFormat = fileFormat.equals("parquet") || fileFormat.equals("delta")
Hmm, does this mean adding a new source provider is not enough?
Can we introduce hasParquetAsSourceFormat to the provider API and record this info in the metadata?
You can do this as a separate PR if that is preferred. Please create an issue in that case.
I think this case is too specific to create an API on the source provider; also, it refers to the fileFormat string in the index metadata, not the relation.
So it's better to create the function in IndexLogEntry or some utils class if needed. WDYT? @imback82
You should be able to plug in a source provider defined externally without changing the Hyperspace codebase. For example, say I have a source format "blah" that uses parquet internally; how can I plug it in without modifying Hyperspace? One easy way to think about it is whether you could implement the Delta source provider outside Hyperspace.
I think this case is too specific to create an API to source provider; and it refers fileformat string in index metadata, not relation.
You can do this in the create path and record it in the metadata.
Got your point - it should be possible to add a new source provider externally.
Let me handle this with a new PR & issue. Thanks!
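The provider hook discussed in this thread might look like the sketch below; all names are illustrative, not the final Hyperspace API:

```scala
// Each source provider reports whether its data files are parquet under the
// hood; the answer can be recorded in the index metadata at creation time,
// instead of string-matching "parquet"/"delta" in RuleUtils.
trait FileBasedSourceProvider {
  def hasParquetAsSourceFormat: Boolean
}

class DefaultParquetSource extends FileBasedSourceProvider {
  override def hasParquetAsSourceFormat: Boolean = true
}

class DeltaLakeSource extends FileBasedSourceProvider {
  // Delta Lake stores its data as parquet files, so parquet-based
  // handling (e.g. bucketed reads) still applies.
  override def hasParquetAsSourceFormat: Boolean = true
}
```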
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala (outdated; comments resolved)
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeFileBasedSource.scala (outdated; comments resolved)
src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala (outdated; comments resolved)
imback82
left a comment
LGTM except for one pending comment about introducing hasParquetAsSourceFormat.
What is the context for this pull request?
Refactor #197 + #224 (except for hybrid scan test refactoring) based on #227.
What changes were proposed in this pull request?
This PR introduces DeltaLakeFileBasedSource to support indexes on Delta Lake sources.
Does this PR introduce any user-facing change?
Yes, users can create/refresh indexes on Delta Lake tables and also utilize Hybrid Scan.
How was this patch tested?
Unit tests