This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@apoorvedave1

What is the context for this pull request?

What changes were proposed in this pull request?

Support for globbing patterns in data sources when creating/maintaining/using indexes.

Does this PR introduce any user-facing change?

Users will now be able to create and maintain indexes on DataFrames created with globbing patterns, e.g.:

  val path = "tmp/1/*"
  val path2 = "tmp/2/*"
  val df = spark.read
    .option("spark.hyperspace.source.globbingPattern", s"$path,$path2") // Only required the first time the index is created.
    .parquet(path, path2)

  val hs = new Hyperspace(spark)
  hs.createIndex(df, ...)

Note: for multiple paths, as shown in the example above, concatenate the paths with commas (,).
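The create-then-refresh flow implied above can be sketched as follows. This is a sketch only: the index name and columns are hypothetical, and it assumes the public Hyperspace `createIndex`/`refreshIndex` API.

```scala
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

// Read with globbing patterns and record them via the source option.
val df = spark.read
  .option("spark.hyperspace.source.globbingPattern", "tmp/1/*,tmp/2/*")
  .parquet("tmp/1/*", "tmp/2/*")

val hs = new Hyperspace(spark)

// "myIndex" and the column names are hypothetical.
hs.createIndex(df, IndexConfig("myIndex", indexedColumns = Seq("id"), includedColumns = Seq("name")))

// Later, after new files land under tmp/1/ or tmp/2/, the index can be
// brought up to date without re-specifying the globbing pattern:
hs.refreshIndex("myIndex")
```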

How was this patch tested?

Unit tests.

@apoorvedave1 apoorvedave1 requested review from imback82, pirz and sezruby and removed request for sezruby November 23, 2020 20:06
@apoorvedave1 apoorvedave1 self-assigned this Nov 23, 2020
@apoorvedave1 apoorvedave1 marked this pull request as ready for review November 24, 2020 00:49
@apoorvedave1 apoorvedave1 requested a review from sezruby December 2, 2020 19:50
@rapoth rapoth modified the milestones: November 2020, December 2020 Dec 3, 2020

@sezruby sezruby left a comment


Minor comments, but LGTM!


@imback82 imback82 left a comment


LGTM (except for pending comments), thanks @apoorvedave1!

Could you also do a follow-up PR to update the docs (config + behavior)?


test("Verify index maintenance (create, refresh) works with globbing patterns.") {

@sezruby sezruby Dec 3, 2020


Can you add create & refresh index tests for multiple * patterns? e.g. table/*/m=1/*, table/*/*, etc.


@apoorvedave1 apoorvedave1 Dec 3, 2020


Thanks @sezruby, I added a test for multiple levels of /*/*.

For a specialized case such as /*/m=1/*, I think it would be overkill: if spark.read.parquet("glob/*/m=1/*") works, then this will also work, as long as we pass the right value to .parquet(<path>). WDYT?

Collaborator


Do you mean that additional changes are required, or that it's not necessary because it will already work?

And could you add some tests for the other glob patterns? For example:

* (matches zero or more characters)
? (matches a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)

I think we should check if the index with these patterns will be refreshed correctly.
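For reference, Spark resolves path patterns like these through Hadoop's glob support, so the syntax listed above can be sanity-checked directly with hadoop-common's GlobFilter. A minimal sketch, assuming hadoop-common is on the classpath (the patterns and paths shown are illustrative only):

```scala
import org.apache.hadoop.fs.{GlobFilter, Path}

// GlobFilter matches a pattern against the final component of a path.
val alternation = new GlobFilter("m={1,3}")
alternation.accept(new Path("data/m=1")) // matches: "1" is in the alternation
alternation.accept(new Path("data/m=2")) // does not match

val charClass = new GlobFilter("part-[0-9]?")
charClass.accept(new Path("part-00"))    // matches: digit class plus single-char wildcard
```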

Contributor Author


> Do you mean that additional changes are required, or that it's not necessary because it will already work?
>
> And could you add some tests for the other glob patterns? For example:
>
> * (matches zero or more characters)
> ? (matches a single character)
> [ab] (character class)
> [^ab] (negated character class)
> [a-b] (character range)
> {a,b} (alternation)
> \c (escape character)
>
> I think we should check if the index with these patterns will be refreshed correctly.

I think it's overkill to add a test for every possible pattern, since we are not adding any regex/wildcard-handling logic at all.
cc @imback82, @pirz, @rapoth for opinions?

Contributor Author


I mean: if spark.read.parquet("any-wildcard-pattern") works and we make sure we always pass the same argument, then it will always work, right?


@sezruby sezruby Dec 4, 2020


Yeah, but we only have tests using *; we could add one test in which all of those patterns are used.
I'm just worried about handling special characters in JSON and in the refresh index action; I don't want to test regex behavior itself.

Contributor


@sezruby if you only want a test for handling special characters, we can just add it to JsonUtilsTests. I think we can address this in a follow-up PR.

Contributor Author


Thanks, I have filed issue #284 to track the JSON conversion tests.

Contributor


Cool, I am merging this to master. Let's do a follow-up PR for this discussion. Thanks!

@apoorvedave1

> LGTM (except for pending comments), thanks @apoorvedave1!
>
> Could you also do a follow-up PR to update the docs (config + behavior)?

Thanks @imback82. Issue #283 tracks the documentation; I will start the PR soon.


imback82 commented Jan 5, 2021

@apoorvedave1 Can you do a follow-up PR to add a test for the table case? I think you can create a table like the following:

CREATE TABLE t(k int) USING json OPTIONS ('spark.hyperspace.source.globbingPattern', 'globbing path')


imback82 commented Jan 5, 2021

Can we also add the following case? Basically, we can inject this option after the table is created:

CREATE TABLE t(k int) USING json
ALTER TABLE t SET SERDEPROPERTIES ('spark.hyperspace.source.globbingPattern' = 'globbing pattern')
REFRESH TABLE t

@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
@imback82 imback82 added the enhancement New feature or request label Jan 29, 2021
@imback82

@apoorvedave1 Could you take a look at the requested follow-ups?
