Support incremental refresh index with hive-partition columns #281
Conversation
// Check if partition columns are correctly stored in index contents.
val indexDf2 = spark.read.parquet(s"$systemPath/indexName").where("Date = '2019-10-03'")
assert(appendData.size == indexDf2.count())
Note to reviewers: this section of the test will fail without the code changes. For the appended data (Date = 2019-10-03), indexDf2.count() returns 0 because Date is null for all of the appended rows.
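For readers less familiar with Spark's partition discovery, here is a minimal standalone sketch (not code from this PR or the Hyperspace test suite; the paths, column names, and data are made up) of why the partition column comes back empty when only the appended partition directory is read without "basePath":

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object BasePathSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("basePath-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical hive-partitioned layout: /tmp/sampleData/Date=2019-10-03/part-*.parquet
  Seq(("a", 1), ("b", 2)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-03"))
    .write.partitionBy("Date").mode("overwrite").parquet("/tmp/sampleData")

  // Reading only the appended partition directory: partition discovery starts at the
  // leaf directory, so "Date" never appears in the schema (and an index built from
  // this DataFrame would store it as null/missing).
  spark.read.parquet("/tmp/sampleData/Date=2019-10-03").printSchema()

  // With "basePath" pointing at the dataset root, Spark discovers "Date" as a
  // partition column even though only one partition directory is listed.
  spark.read
    .option("basePath", "/tmp/sampleData")
    .parquet("/tmp/sampleData/Date=2019-10-03")
    .printSchema()

  spark.stop()
}
```

The same mechanism explains the reviewer note above: without the code changes, incremental refresh reads only the appended files, so the rebuilt index rows carry no value for Date.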
@apoorvedave1 Can you fix the conflicts please?
# Conflicts:
#   src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
#   src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
@imback82, thanks, done.
# Conflicts:
#   src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
@imback82 Could you please take a look when you get a chance?
sezruby
left a comment
Except for one comment (#281 (comment)), LGTM. Thanks @apoorvedave1!
Also, please handle the Delta Lake issue (#295) ASAP, since lowercase options can cause a bug.
Thank you. I will see if I get the bandwidth, but please feel free to start a PR if you get a chance.
Can you update this? (I think we are setting …)
Thanks @imback82, yeah, that was the original implementation; it changed during the review. I have updated the description.
imback82
left a comment
LGTM, thanks @apoorvedave1!
@sezruby Can you also finish reviewing this PR?
@imback82 one comment: #281 (comment)
Ah OK. I missed that comment since it was resolved. @apoorvedave1 Can you update it with …
Done, thanks @imback82!
What is the context for this pull request?
What changes were proposed in this pull request?
Set "basePath" option during creation and refresh of indexes on a hive-partitioned data source. This will ensure that all calls to
refreshIndexwill pick this basePath while identifying newly appended data files for incremental refresh.When 'basePath' is known, spark will identify the partition columns automatically and properly pick them to store in the index.
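As a hedged end-to-end sketch of the scenario this description targets: the snippet below uses the Hyperspace public API (Hyperspace, IndexConfig, refreshIndex), but the paths, column names, data, and the "incremental" refresh-mode argument are illustrative assumptions rather than code taken from this PR.

```scala
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object IncrementalRefreshSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("refresh-sketch").getOrCreate()
  import spark.implicits._

  val dataPath = "/tmp/sampleData" // hypothetical hive-partitioned dataset root

  // Initial partition: Date=2019-10-02.
  Seq(("a", 1)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-02"))
    .write.partitionBy("Date").mode("overwrite").parquet(dataPath)

  val hs = new Hyperspace(spark)
  val df = spark.read.parquet(dataPath)
  hs.createIndex(df, IndexConfig("indexName", indexedColumns = Seq("name"), includedColumns = Seq("Date")))

  // Append a new partition: Date=2019-10-03.
  Seq(("b", 2)).toDF("name", "value")
    .withColumn("Date", lit("2019-10-03"))
    .write.partitionBy("Date").mode("append").parquet(dataPath)

  // With this change, the index entries built for the appended files keep
  // Date = "2019-10-03" instead of null, because the source sets "basePath"
  // when reading only the newly appended files.
  hs.refreshIndex("indexName", "incremental") // refresh-mode overload: assumed available

  spark.stop()
}
```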
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.