This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@apoorvedave1

What is the context for this pull request?

What changes were proposed in this pull request?

Support for globbing patterns in data sources when creating/maintaining/using indexes.

Does this PR introduce any user-facing change?

Users will now be able to create and maintain indexes on DataFrames created with globbing patterns, e.g.:

  val path = "tmp/1/*"
  val path2 = "tmp/2/*"
  val df = spark.read
    .option("spark.hyperspace.source.globbingPattern", s"$path,$path2") // Only required the first time the index is created.
    .parquet(path, path2)

  val hs = new Hyperspace(spark)
  hs.createIndex(df, ...)

Note: for multiple paths, as shown in the example above, concatenate the paths with commas (,).
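The create-then-refresh flow implied above can be sketched as follows. This is a sketch only: the index name and columns are hypothetical, and it assumes the public Hyperspace `createIndex`/`refreshIndex` API.

```scala
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

// Read with globbing patterns and record them via the source option.
val df = spark.read
  .option("spark.hyperspace.source.globbingPattern", "tmp/1/*,tmp/2/*")
  .parquet("tmp/1/*", "tmp/2/*")

val hs = new Hyperspace(spark)

// "myIndex" and the column names are hypothetical.
hs.createIndex(df, IndexConfig("myIndex", indexedColumns = Seq("id"), includedColumns = Seq("name")))

// Later, after new files land under tmp/1/ or tmp/2/, the index can be
// brought up to date without re-specifying the globbing pattern:
hs.refreshIndex("myIndex")
```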

How was this patch tested?

Unit tests.

@apoorvedave1 apoorvedave1 requested review from imback82, pirz and sezruby and removed request for sezruby November 23, 2020 20:06
@apoorvedave1 apoorvedave1 self-assigned this Nov 23, 2020
@apoorvedave1 apoorvedave1 marked this pull request as ready for review November 24, 2020 00:49
@apoorvedave1 apoorvedave1 requested a review from sezruby December 2, 2020 19:50
@rapoth rapoth modified the milestones: November 2020, December 2020 Dec 3, 2020

@sezruby sezruby left a comment


Minor comments, but LGTM!


@imback82 imback82 left a comment


LGTM (except for pending comments), thanks @apoorvedave1!

Could you also do a follow-up PR to update the docs (config + behavior)?


test("Verify index maintenance (create, refresh) works with globbing patterns.") {

@sezruby sezruby Dec 3, 2020


Can you add create & refresh index tests for multiple * patterns? e.g. table/*/m=1/*, table/*/*, etc.


@apoorvedave1 apoorvedave1 Dec 3, 2020


Thanks @sezruby, I added a test for multiple levels of /*/*.

For a specialized case such as /*/m=1/*, I think it would be overkill: if spark.read.parquet("glob/*/m=1/*") works, then this will also work, as long as we pass the right value to .parquet(<path>). WDYT?

Collaborator


Do you mean that additional changes are required, or that it's not necessary because it will already work?

And could you add some tests for the other glob patterns? For example:

* (matches zero or more characters)
? (matches a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)

I think we should check if the index with these patterns will be refreshed correctly.
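For reference, Spark resolves path patterns like these through Hadoop's glob support, so the syntax listed above can be sanity-checked directly with hadoop-common's GlobFilter. A minimal sketch, assuming hadoop-common is on the classpath (the patterns and paths shown are illustrative only):

```scala
import org.apache.hadoop.fs.{GlobFilter, Path}

// GlobFilter matches a pattern against the final component of a path.
val alternation = new GlobFilter("m={1,3}")
alternation.accept(new Path("data/m=1")) // matches: "1" is in the alternation
alternation.accept(new Path("data/m=2")) // does not match

val charClass = new GlobFilter("part-[0-9]?")
charClass.accept(new Path("part-00"))    // matches: digit class plus single-char wildcard
```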

Contributor Author


> Do you mean that additional changes are required, or that it's not necessary because it will already work?
>
> And could you add some tests for the other glob patterns? For example:
>
> * (matches zero or more characters)
> ? (matches a single character)
> [ab] (character class)
> [^ab] (negated character class)
> [a-b] (character range)
> {a,b} (alternation)
> \c (escape character)
>
> I think we should check if the index with these patterns will be refreshed correctly.

I think it's overkill to add a test for every possible pattern, since we are not adding any regex/wildcard-handling logic at all.
cc @imback82, @pirz, @rapoth for opinions?

Contributor Author


I mean: if spark.read.parquet("any-wildcard-pattern") works and we make sure we always pass the same argument, then it will always work, right?


@sezruby sezruby Dec 4, 2020


Yeah, but we only have tests using *; we could add one test in which all of those patterns are used.
I'm just worried about handling special characters in JSON and in the refresh index action; I don't want to test regex behavior itself.

Contributor


@sezruby if you only want a test for handling special characters, we can just add it to JsonUtilsTests. I think we can address this in a follow-up PR.

Contributor Author


Thanks, I have filed issue #284 to track the JSON conversion tests.

Contributor


Cool, I am merging this to master. Let's do a follow-up PR for this discussion. Thanks!

@apoorvedave1

> LGTM (except for pending comments), thanks @apoorvedave1!
>
> Could you also do a follow-up PR to update the docs (config + behavior)?

Thanks @imback82. Issue #283 tracks the documentation; I will start the PR soon.


imback82 commented Jan 5, 2021

@apoorvedave1 Can you do a follow-up PR to add a test for the table case? I think you can create a table like the following:

CREATE TABLE t(k int) USING json OPTIONS ('spark.hyperspace.source.globbingPattern', 'globbing path')


imback82 commented Jan 5, 2021

Can we also add the following case? Basically, we can inject this option after the table is created:

CREATE TABLE t(k int) USING json
ALTER TABLE t SET SERDEPROPERTIES ('spark.hyperspace.source.globbingPattern' = 'globbing pattern')
REFRESH TABLE t

@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
@imback82 imback82 added the enhancement New feature or request label Jan 29, 2021
@imback82

@apoorvedave1 Could you take a look at the requested follow-ups?
