Modify logical plan to merge newly appended files and index data #165
Conversation
Can you add an example of how
imback82
left a comment
I did one round of review (but not tests yet). It looks pretty cool!
// Remove sort order because we cannot guarantee the ordering of source files
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
...
object ExtractTopLevelPlanForShuffle {
Can we define this outside this function with proper comment, etc.?
It's located here because of index.indexedColumns. I tried to define it outside and pass the column names, but I couldn't find a way.
Anyway, it seems this extractor is not required for now. #165 (comment)
 * @param indexPlan replaced plan with index
 * @return complementIndexPlan integrated plan of indexPlan and complementPlan
 */
private def getComplementIndexPlan(
This seems to be the "return" value of getHybridScanIndexPlan, so why is this called getComplementPlan? What is it complementing? From what I can understand, the use of "complement" doesn't seem appropriate. I don't have any good ideas, but I wanted to understand the semantic meaning here.
I used "complement" as this additional plan (for appended files) complements the index plan.
getCompleteIndexPlan / getComplementaryIndexPlan / .. 🤔
Any good suggestion would be welcome. :)
I saw your other renames. How about complementIndexScanWithDataScan? (since you are already using a verb like transform, I thought we could use complement as a word)
I refactored getComplementPlan into two functions: transformPlanWithAppendedFiles and shufflePlanWithIndexSpec :)
rapoth
left a comment
Minor opinionated comments but please feel free to ignore. Looking great, thanks!
@pirz @apoorvedave1 Could you do a review for this PR? Thanks!

LGTM 👍 thanks @sezruby
    _) =>
  val curFileSet = location.allFiles
    .map(f => FileInfo(f.getPath.toString, f.getLen, f.getModificationTime))
  filesAppended =
I think filesAppended (and deleted) should be calculated before this function gets called, and this function should be called only if it's eligible to do hybrid scan. We don't have to address this now, but I will bring this up in a later PR.
Good point - I'll handle this with #171 or another PR.
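The pre-computation suggested above could be sketched as a plain file-set diff. This is a hypothetical model (the FileInfo case class and diffSourceFiles helper here are illustrative stand-ins, not the actual Hyperspace implementation):

```scala
// Illustrative model only: diff the relation's current files against the
// files recorded in the index to find appended and deleted files up front.
case class FileInfo(path: String, size: Long, modifiedTime: Long)

def diffSourceFiles(
    currentFiles: Seq[FileInfo],
    indexedFiles: Seq[FileInfo]): (Seq[FileInfo], Seq[FileInfo]) = {
  val indexed = indexedFiles.toSet
  val current = currentFiles.toSet
  // Appended: present now but unknown to the index. Deleted: indexed but gone.
  val filesAppended = currentFiles.filterNot(indexed.contains)
  val filesDeleted = indexedFiles.filterNot(current.contains)
  (filesAppended, filesDeleted)
}
```

A rule could then check filesDeleted.isEmpty (append-only) before attempting the hybrid scan transformation, instead of recomputing the diff inside the transformation itself.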
// shuffle the appended data in the same way to correctly merge with bucketed index data.
...
// Clear sortColumnNames as BucketUnion does not keep the sort order within a bucket.
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
nit: we can use Nil here. (more consistent in this file)
Maybe we should just use numBuckets in BucketUnion*? Are we using any other fields from BucketSpec?
Also, can you add an assert in BucketUnionExec.outputPartitioning that the number of partitions is same as the number of buckets? I think that's our assumption right?
assert(children.head.outputPartitioning.asInstanceOf[HashPartitioning].numPartitions == bucketSpec.numBuckets)
Btw, do we need to check if the bucketed columns are the partitioning expressions (somewhere in BucketUnion*)?
Maybe we should just use numBuckets in BucketUnion*? Are we using any other fields from BucketSpec?
Yes currently only numBuckets is used in BucketUnion*. But bucket column & sort column info are also shown in plan, which might help to analyze. And maybe we could use it later in optimization.
Btw, do we need to check if the bucketed columns are the partitioning expressions (somewhere in BucketUnion*)?
I tried to find a way to check partitioning expressions in BucketUnion* but there's no such API for it.
Yes, currently only numBuckets is used in BucketUnion*. But bucket column & sort column info are also shown in the plan, which might help to analyze. And maybe we could use it later in optimization.
The confusion I have is that the following sounds like it has an effect, whereas it doesn't really do anything:
// Clear sortColumnNames as BucketUnion does not keep the sort order within a bucket.
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
Also, as far as I know, bucket spec is meant to be used for the scan node. So it may be more confusing if we print out it in the BucketUnion. We can take this up as a separate PR, but I think we need to address this.
Ok I revised the comment a bit. But I think the information might help to understand the behavior? Does this look confusing?
+- *(5) Project [o_orderkey#61L, o_orderdate#65, o_shippriority#68]
: +- *(5) SortMergeJoin [c_custkey#5L], [o_custkey#62L], Inner
: :- *(1) Project [c_custkey#5L]
: : +- *(1) Filter ((isnotnull(c_mktsegment#11) && (c_mktsegment#11 = BUILDING)) && isnotnull(c_custkey#5L))
: : +- *(1) FileScan parquet [c_custkey#5L,c_mktsegment#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/indexes/index..., PartitionFilters: [], PushedFilters: [IsNotNull(c_mktsegment), EqualTo(c_mktsegment,BUILDING), IsNotNull(c_custkey)], ReadSchema: struct<c_custkey:bigint,c_mktsegment:string>, SelectedBucketsCount: 200 out of 200
: +- *(4) Sort [o_custkey#62L ASC NULLS FIRST], false, 0
: +- BucketUnion 200 buckets, bucket columns: [o_custkey] <===============
: :- *(2) Project [o_orderkey#61L, o_custkey#62L, o_orderdate#65, o_shippriority#68]
: : +- *(2) Filter (((isnotnull(o_orderdate#65) && (o_orderdate#65 < 1995-03-15)) && isnotnull(o_custkey#62L)) && isnotnull(o_orderkey#61L))
: : +- *(2) FileScan parquet [o_custkey#62L,o_orderkey#61L,o_orderdate#65,o_shippriority#68] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/indexes/index..., PartitionFilters: [], PushedFilters: [IsNotNull(o_orderdate), LessThan(o_orderdate,1995-03-15), IsNotNull(o_custkey), IsNotNull(o_orde..., ReadSchema: struct<o_custkey:bigint,o_orderkey:bigint,o_orderdate:string,o_shippriority:int>, SelectedBucketsCount: 200 out of 200
: +- Exchange hashpartitioning(o_custkey#62L, 200)
: +- *(3) Project [o_orderkey#61L, o_custkey#62L, o_orderdate#65, o_shippriority#68]
: +- *(3) Filter (((isnotnull(o_orderdate#65) && (o_orderdate#65 < 1995-03-15)) && isnotnull(o_custkey#62L)) && isnotnull(o_orderkey#61L))
: +- *(3) FileScan parquet [o_custkey#62L,o_orderkey#61L,o_orderdate#65,o_shippriority#68] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/data/tpch-par..., PartitionFilters: [], PushedFilters: [IsNotNull(o_orderdate), LessThan(o_orderdate,1995-03-15), IsNotNull(o_custkey), IsNotNull(o_orde..., ReadSchema: struct<o_custkey:bigint,o_orderkey:bigint,o_orderdate:string,o_shippriority:int>
I think it may be better if we print out the output partitioning, similar to Exchange since we are unioning plans that have the same hash partitioning.
But the new comment sounds better; let's take this up separately.
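The invariant discussed in this thread (each child of the bucket-preserving union must be hash-partitioned into exactly numBuckets partitions so that buckets line up one-to-one) can be modeled outside Spark. This is a simplified sketch; HashPartitioning here is a stand-in for Spark's class, and validBucketUnion is a hypothetical helper, not code from the PR:

```scala
// Stand-in for Spark's HashPartitioning, just enough to express the check.
case class HashPartitioning(expressions: Seq[String], numPartitions: Int)

// The check suggested for BucketUnionExec.outputPartitioning: every child
// must produce exactly numBuckets partitions, otherwise a zip of buckets
// across children would be meaningless.
def validBucketUnion(
    childPartitionings: Seq[HashPartitioning],
    numBuckets: Int): Boolean =
  childPartitionings.forall(_.numPartitions == numBuckets)
```

In the real operator this would likely be an assert inside outputPartitioning, as suggested in the review comment above.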
LGTM, thanks @sezruby
// in their output; Case 1 won't be shown in use cases. The implementation is kept
// for future use cases.
I couldn't comment on #165 (comment), but is this now being tested? If it's not covered, then I wouldn't add the case and would just throw if this case is hit.
This behavior is covered by the test below: test("Verify the location of injected shuffle for Hybrid Scan.")
}

test("Verify the location of injected shuffle for Hybrid Scan.") {
  val dataPath = systemPath.toString + "/hbtable"
Can you check if you can utilize withTempDir? I just committed.
imback82
left a comment
Looks good to me! (pending minor comments)
// Make sure there is no shuffle.
execPlan.foreach(p => assert(!p.isInstanceOf[ShuffleExchangeExec]))

checkAnswer(baseQuery, filter)
Test looks much better! 👍
withTempDir { tempDir =>
  val dataPath = tempDir + "/hbtable"
You can just do withTempDir { dataPath => ... , right? (No need for "/hbtable")
"/hbtable" is required because tempDir is a directory; write.parquet(dataPath) will fail otherwise.
Ah, it fails since the directory is already created. You need withTempPath (which has the same implementation as withTempDir except that it deletes the directory created). I pushed the changes.
Btw, parquet() works on a directory (if you do overwrite).
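The difference between the two helpers can be sketched as follows. This is an assumed implementation of the withTempPath pattern described above, not the project's actual test utility:

```scala
import java.io.File
import java.nio.file.Files

// Like withTempDir, but deletes the freshly created directory first so the
// body receives a path that does not exist yet; this lets write.parquet(path)
// succeed without needing overwrite mode.
def withTempPath(f: File => Unit): Unit = {
  val path = Files.createTempDirectory("hyperspace-test").toFile
  path.delete() // hand the body a non-existent path
  try f(path)
  finally {
    // Recursively clean up anything the body created under the path.
    def deleteRecursively(file: File): Unit = {
      Option(file.listFiles).foreach(_.foreach(deleteRecursively))
      file.delete()
    }
    deleteRecursively(path)
  }
}
```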
LGTM, thanks @sezruby! 🚀🚀🚀
What changes were proposed in this pull request?
This PR allows users to use the hybrid scan for an append-only dataset.
In order to support Hybrid Scan for an append-only dataset, we need to merge the newly appended files and index data properly. Currently we have the following cases (for a relation based on InMemoryFileIndex):
- transformPlanToUseIndex returns the transformed plan to utilize the given index.
- transformPlanToUsePureIndex: if hybridScanEnabled=false, it replaces the source location with the index data location, same as before.
- transformPlanToUseHybridIndexDataScan: if hybridScanEnabled=true, it first creates the plan with the index location similar to transformPlanToUsePureIndex.

Why are the changes needed?
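The cases above can be sketched as a single dispatch. All types here (Plan, Scan, BucketUnion) are simplified stand-ins for the real logical-plan nodes, so this only illustrates the control flow, not the actual Hyperspace implementation:

```scala
// Simplified stand-ins for logical plan nodes (not actual Spark/Hyperspace types).
sealed trait Plan
case class Scan(location: String) extends Plan
case class BucketUnion(indexPlan: Plan, appendedPlan: Plan) extends Plan

def transformPlanToUseIndex(
    source: Scan,
    indexLocation: String,
    appendedFiles: Seq[String],
    hybridScanEnabled: Boolean): Plan =
  if (!hybridScanEnabled || appendedFiles.isEmpty) {
    // transformPlanToUsePureIndex: replace the source location with index data.
    Scan(indexLocation)
  } else {
    // transformPlanToUseHybridIndexDataScan: merge index data with a scan of
    // only the newly appended files.
    BucketUnion(Scan(indexLocation), Scan(appendedFiles.mkString(",")))
  }
```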
To support Hybrid Scan for append-only dataset. (#150)
Does this PR introduce any user-facing change?
Yes. If a user turns on Hybrid Scan (spark.hyperspace.index.hybridscan.enabled=true), outdated indexes whose dataset has newly appended files can be candidates for both FilterIndexRule & JoinIndexRule. The query plan is modified in the optimizer to support this, and the changed plan can be checked with the hs.explain() API.
FilterRule - Case 1
FilterRule - Case 2
JoinRule - BroadcastHashJoin
JoinRule - Sort Merge Join
How was this patch tested?
Unit test & TPCH validation