-
Notifications
You must be signed in to change notification settings - Fork 116
Support globbing patterns in dataframes for creation/maintenance/usage of indexes. #269
Conversation
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, but LGTM!
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (except for pending comments), thanks @apoorvedave1!
Could you also do a follow up PR to update doc (config + behavior)?
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| test("Verify index maintenance (create, refresh) works with globbing patterns.") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add create & refresh index tests for multiple * patterns ? e.g. table/*/m=1/*, table/*/* ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sezruby , i added a test for multiple levels of /*/*.
For specialized case e.g. /*/m=1/*, i think it would be an overkill, because I mean if
spark.read.parquet("glob/*/m=1/*") works then this will also work right, as long as we pass in the right value in .parquet(<path>)? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean - is it required additional changes? or not necessary as it will work?
And could you add some tests for the other glob patterns? like
* (match 0 or more character)
? (match single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)
I think we should check if the index with these patterns will be refreshed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean - is it required additional changes? or not necessary as it will work?
And could you add some tests for the other glob patterns? like
* (match 0 or more character) ? (match single character) [ab] (character class) [^ab] (negated character class) [a-b] (character range) {a,b} (alternation) \c (escape character)I think we should check if the index with these patterns will be refreshed correctly.
I think it's overkill to add test for every regex possible since we are not adding any regex/wildcard handling logic at all.
cc @imback82 , @pirz , @rapoth for opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean if spark.read.parquet("any-wildcard-pattern") works and we make sure that we are passing the same argument always, then it will work always right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but we only have the tests using * - we can add a test using all of those patterns are used.
I just worried about handling special characters in json & refresh index action, and not wanted to test regex behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sezruby if you only want a test handling special characters, we can just add it to JsonUtilsTests? I think we can address as a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have added this #284 issue for tracking json conversion tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I am merging this to master. Let's do a follow up PR for this discussion. Thanks!
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTests.scala
Outdated
Show resolved
Hide resolved
Thanks @imback82 , This issue for tracking documentation. I will start the pr soon |
|
@apoorvedave1 Can you do a follow-up PR to add a test for the table case? I think you can create a table like the following: |
|
Can we also add the following case? Basically, we can inject this option after the table is created: |
|
@apoorvedave1 Could you take a look at the follow ups asked? |
What is the context for this pull request?
What changes were proposed in this pull request?
Support for globbing patterns in data sources when creating/maintaining/using indexes.
Does this PR introduce any user-facing change?
Users will now be able to create and maintain indexes on dataframes created with globbing patterns e.g.
How was this patch tested?
unit tests