Support incremental refresh index with hive-partition columns #281
Conversation
// Check if partition columns are correctly stored in index contents.
val indexDf2 = spark.read.parquet(s"$systemPath/indexName").where("Date = '2019-10-03'")
assert(appendData.size == indexDf2.count())
Note to reviewers: this section of the test will fail without the code changes. For the appended data (Date = 2019-10-03), indexDf2.count() returns 0 because Date is null for all of the appended rows.
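For readers less familiar with Spark's partition discovery, here is a minimal standalone sketch (not code from this PR or the Hyperspace test suite; the paths, column names, and data are made up) of why the partition column comes back empty when only the appended partition directory is read without "basePath":

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object BasePathSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("basePath-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical hive-partitioned layout: /tmp/sampleData/Date=2019-10-03/part-*.parquet
  Seq(("a", 1), ("b", 2)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-03"))
    .write.partitionBy("Date").mode("overwrite").parquet("/tmp/sampleData")

  // Reading only the appended partition directory: partition discovery starts at the
  // leaf directory, so "Date" never appears in the schema (and an index built from
  // this DataFrame would store it as null/missing).
  spark.read.parquet("/tmp/sampleData/Date=2019-10-03").printSchema()

  // With "basePath" pointing at the dataset root, Spark discovers "Date" as a
  // partition column even though only one partition directory is listed.
  spark.read
    .option("basePath", "/tmp/sampleData")
    .parquet("/tmp/sampleData/Date=2019-10-03")
    .printSchema()

  spark.stop()
}
```

The same mechanism explains the reviewer note above: without the code changes, incremental refresh reads only the appended files, so the rebuilt index rows carry no value for Date.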
@apoorvedave1 Can you fix the conflicts please?
# Conflicts:
#   src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
#   src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
@imback82, thanks, done.
# Conflicts:
#   src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
@imback82 Could you please take a look when you get a chance?
sezruby
left a comment
Except for one comment (#281 (comment)), LGTM. Thanks @apoorvedave1!
Also, please handle the Delta Lake issue (#295) ASAP, since lowercase options can cause a bug.
Thank you. I will see if I get the bandwidth, but please feel free to start a PR if you get a chance.
Can you update this? (I think we are setting …)
Thanks @imback82, yeah, that was the original implementation; it changed during the review. I have updated the description.
imback82
left a comment
LGTM, thanks @apoorvedave1!
@sezruby Can you also finish reviewing this PR?
@imback82 one comment: #281 (comment)
Ah OK. I missed that comment since it was resolved. @apoorvedave1 Can you update it with …
Done, thanks @imback82!
What is the context for this pull request?
What changes were proposed in this pull request?
Set "basePath" option during creation and refresh of indexes on a hive-partitioned data source. This will ensure that all calls to
refreshIndexwill pick this basePath while identifying newly appended data files for incremental refresh.When 'basePath' is known, spark will identify the partition columns automatically and properly pick them to store in the index.
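As a hedged end-to-end sketch of the scenario this description targets: the snippet below uses the Hyperspace public API (Hyperspace, IndexConfig, refreshIndex), but the paths, column names, data, and the "incremental" refresh-mode argument are illustrative assumptions rather than code taken from this PR.

```scala
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object IncrementalRefreshSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("refresh-sketch").getOrCreate()
  import spark.implicits._

  val dataPath = "/tmp/sampleData" // hypothetical hive-partitioned dataset root

  // Initial partition: Date=2019-10-02.
  Seq(("a", 1)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-02"))
    .write.partitionBy("Date").mode("overwrite").parquet(dataPath)

  val hs = new Hyperspace(spark)
  val df = spark.read.parquet(dataPath)
  hs.createIndex(df, IndexConfig("indexName", indexedColumns = Seq("name"), includedColumns = Seq("Date")))

  // Append a new partition: Date=2019-10-03.
  Seq(("b", 2)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-03"))
    .write.partitionBy("Date").mode("append").parquet(dataPath)

  // With this change, the index entries built for the appended files keep
  // Date = "2019-10-03" instead of null, because the source sets "basePath"
  // when reading only the newly appended files.
  hs.refreshIndex("indexName", "incremental") // refresh-mode overload: assumed available

  spark.stop()
}
```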
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.