Modify logical plan to merge newly appended files and index data #165
Conversation
Can you add an example of how
imback82
left a comment
I did one round of review (but not tests yet). It looks pretty cool!
// Remove sort order because we cannot guarantee the ordering of source files
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
...
object ExtractTopLevelPlanForShuffle {
Can we define this outside this function with proper comment, etc.?
It's located here because of index.indexedColumns. I tried to define it outside and pass the column names, but I couldn't find a way.
Anyway, it seems this extractor is not required for now. #165 (comment)
 * @param indexPlan replaced plan with index
 * @return complementIndexPlan integrated plan of indexPlan and complementPlan
 */
private def getComplementIndexPlan(
This seems to be the "return" value of getHybridScanIndexPlan, so why is this called getComplementPlan? What is it complementing? From what I can understand, the use of "complement" doesn't seem appropriate. I don't have any good ideas, but I wanted to understand the semantic meaning here.
I used "complement" as this additional plan (for appended files) complements the index plan.
getCompleteIndexPlan / getComplementaryIndexPlan / .. 🤔
Any good suggestion would be welcome. :)
I saw your other renames. How about complementIndexScanWithDataScan? (since you are already using a verb like transform, I thought we could use complement as a word)
I refactored getComplementPlan into two functions: transformPlanWithAppendedFiles and shufflePlanWithIndexSpec :)
rapoth
left a comment
Minor opinionated comments but please feel free to ignore. Looking great, thanks!
@pirz @apoorvedave1 Could you do a review for this PR? Thanks!

LGTM 👍 thanks @sezruby
    _) =>
  val curFileSet = location.allFiles
    .map(f => FileInfo(f.getPath.toString, f.getLen, f.getModificationTime))
  filesAppended =
I think filesAppended (and deleted) should be calculated before this function gets called, and this function should be called only if it's eligible to do hybrid scan. We don't have to address this now, but I will bring this up in a later PR.
Good point - I'll handle this with #171 or another PR.
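The pre-computation suggested above could be sketched as a plain file-set diff. This is a hypothetical model (the FileInfo case class and diffSourceFiles helper here are illustrative stand-ins, not the actual Hyperspace implementation):

```scala
// Illustrative model only: diff the relation's current files against the
// files recorded in the index to find appended and deleted files up front.
case class FileInfo(path: String, size: Long, modifiedTime: Long)

def diffSourceFiles(
    currentFiles: Seq[FileInfo],
    indexedFiles: Seq[FileInfo]): (Seq[FileInfo], Seq[FileInfo]) = {
  val indexed = indexedFiles.toSet
  val current = currentFiles.toSet
  // Appended: present now but unknown to the index. Deleted: indexed but gone.
  val filesAppended = currentFiles.filterNot(indexed.contains)
  val filesDeleted = indexedFiles.filterNot(current.contains)
  (filesAppended, filesDeleted)
}
```

A rule could then check filesDeleted.isEmpty (append-only) before attempting the hybrid scan transformation, instead of recomputing the diff inside the transformation itself.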
// shuffle the appended data in the same way to correctly merge with bucketed index data.
...
// Clear sortColumnNames as BucketUnion does not keep the sort order within a bucket.
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
nit: we can use Nil here. (more consistent in this file)
Maybe we should just use numBuckets in BucketUnion*? Are we using any other fields from BucketSpec?
Also, can you add an assert in BucketUnionExec.outputPartitioning that the number of partitions is same as the number of buckets? I think that's our assumption right?
assert(children.head.outputPartitioning.asInstanceOf[HashPartitioning].numPartitions == bucketSpec.numBuckets)
Btw, do we need to check if the bucketed columns are the partitioning expressions (somewhere in BucketUnion*)?
Maybe we should just use numBuckets in BucketUnion*? Are we using any other fields from BucketSpec?
Yes currently only numBuckets is used in BucketUnion*. But bucket column & sort column info are also shown in plan, which might help to analyze. And maybe we could use it later in optimization.
Btw, do we need to check if the bucketed columns are the partitioning expressions (somewhere in BucketUnion*)?
I tried to find a way to check partitioning expressions in BucketUnion* but there's no such API for it.
Yes, currently only numBuckets is used in BucketUnion*. But bucket column & sort column info are also shown in the plan, which might help to analyze. And maybe we could use it later in optimization.
The confusion I have is that the following sounds like it has an effect, whereas it doesn't really do anything:
// Clear sortColumnNames as BucketUnion does not keep the sort order within a bucket.
val bucketSpec = index.bucketSpec.copy(sortColumnNames = Seq())
Also, as far as I know, bucket spec is meant to be used for the scan node. So it may be more confusing if we print out it in the BucketUnion. We can take this up as a separate PR, but I think we need to address this.
Ok I revised the comment a bit. But I think the information might help to understand the behavior? Does this look confusing?
+- *(5) Project [o_orderkey#61L, o_orderdate#65, o_shippriority#68]
: +- *(5) SortMergeJoin [c_custkey#5L], [o_custkey#62L], Inner
: :- *(1) Project [c_custkey#5L]
: : +- *(1) Filter ((isnotnull(c_mktsegment#11) && (c_mktsegment#11 = BUILDING)) && isnotnull(c_custkey#5L))
: : +- *(1) FileScan parquet [c_custkey#5L,c_mktsegment#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/indexes/index..., PartitionFilters: [], PushedFilters: [IsNotNull(c_mktsegment), EqualTo(c_mktsegment,BUILDING), IsNotNull(c_custkey)], ReadSchema: struct<c_custkey:bigint,c_mktsegment:string>, SelectedBucketsCount: 200 out of 200
: +- *(4) Sort [o_custkey#62L ASC NULLS FIRST], false, 0
: +- BucketUnion 200 buckets, bucket columns: [o_custkey] <===============
: :- *(2) Project [o_orderkey#61L, o_custkey#62L, o_orderdate#65, o_shippriority#68]
: : +- *(2) Filter (((isnotnull(o_orderdate#65) && (o_orderdate#65 < 1995-03-15)) && isnotnull(o_custkey#62L)) && isnotnull(o_orderkey#61L))
: : +- *(2) FileScan parquet [o_custkey#62L,o_orderkey#61L,o_orderdate#65,o_shippriority#68] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/indexes/index..., PartitionFilters: [], PushedFilters: [IsNotNull(o_orderdate), LessThan(o_orderdate,1995-03-15), IsNotNull(o_custkey), IsNotNull(o_orde..., ReadSchema: struct<o_custkey:bigint,o_orderkey:bigint,o_orderdate:string,o_shippriority:int>, SelectedBucketsCount: 200 out of 200
: +- Exchange hashpartitioning(o_custkey#62L, 200)
: +- *(3) Project [o_orderkey#61L, o_custkey#62L, o_orderdate#65, o_shippriority#68]
: +- *(3) Filter (((isnotnull(o_orderdate#65) && (o_orderdate#65 < 1995-03-15)) && isnotnull(o_custkey#62L)) && isnotnull(o_orderkey#61L))
: +- *(3) FileScan parquet [o_custkey#62L,o_orderkey#61L,o_orderdate#65,o_shippriority#68] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://eunjin-hyperspace-test@taruntestdiag.blob.core.windows.net/data/tpch-par..., PartitionFilters: [], PushedFilters: [IsNotNull(o_orderdate), LessThan(o_orderdate,1995-03-15), IsNotNull(o_custkey), IsNotNull(o_orde..., ReadSchema: struct<o_custkey:bigint,o_orderkey:bigint,o_orderdate:string,o_shippriority:int>
I think it may be better if we print out the output partitioning, similar to Exchange since we are unioning plans that have the same hash partitioning.
But the new comment sounds better; let's take this up separately.
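The invariant discussed in this thread (each child of the bucket-preserving union must be hash-partitioned into exactly numBuckets partitions so that buckets line up one-to-one) can be modeled outside Spark. This is a simplified sketch; HashPartitioning here is a stand-in for Spark's class, and validBucketUnion is a hypothetical helper, not code from the PR:

```scala
// Stand-in for Spark's HashPartitioning, just enough to express the check.
case class HashPartitioning(expressions: Seq[String], numPartitions: Int)

// The check suggested for BucketUnionExec.outputPartitioning: every child
// must produce exactly numBuckets partitions, otherwise a zip of buckets
// across children would be meaningless.
def validBucketUnion(
    childPartitionings: Seq[HashPartitioning],
    numBuckets: Int): Boolean =
  childPartitionings.forall(_.numPartitions == numBuckets)
```

In the real operator this would likely be an assert inside outputPartitioning, as suggested in the review comment above.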
LGTM, thanks @sezruby
// in their output; Case 1 won't be shown in use cases. The implementation is kept
// for future use cases.
I couldn't comment on #165 (comment), but is this now being tested? If it's not covered, then I wouldn't add the case and would just throw if this case is hit.
This behavior is covered by the test below: test("Verify the location of injected shuffle for Hybrid Scan.")
}

test("Verify the location of injected shuffle for Hybrid Scan.") {
  val dataPath = systemPath.toString + "/hbtable"
Can you check if you can utilize withTempDir? I just committed.
imback82
left a comment
Looks good to me! (pending minor comments)
// Make sure there is no shuffle.
execPlan.foreach(p => assert(!p.isInstanceOf[ShuffleExchangeExec]))

checkAnswer(baseQuery, filter)
Test looks much better! 👍
withTempDir { tempDir =>
  val dataPath = tempDir + "/hbtable"
You can just do withTempDir { dataPath => ... , right? (No need for "/hbtable")
"/hbtable" is required because tempDir is a directory; write.parquet(dataPath) will fail otherwise.
Ah, it fails since the directory is already created. You need withTempPath (which has the same implementation as withTempDir except that it deletes the directory created). I pushed the changes.
Btw, parquet() works on a directory (if you do overwrite).
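The difference between the two helpers can be sketched as follows. This is an assumed implementation of the withTempPath pattern described above, not the project's actual test utility:

```scala
import java.io.File
import java.nio.file.Files

// Like withTempDir, but deletes the freshly created directory first so the
// body receives a path that does not exist yet; this lets write.parquet(path)
// succeed without needing overwrite mode.
def withTempPath(f: File => Unit): Unit = {
  val path = Files.createTempDirectory("hyperspace-test").toFile
  path.delete() // hand the body a non-existent path
  try f(path)
  finally {
    // Recursively clean up anything the body created under the path.
    def deleteRecursively(file: File): Unit = {
      Option(file.listFiles).foreach(_.foreach(deleteRecursively))
      file.delete()
    }
    deleteRecursively(path)
  }
}
```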
LGTM, thanks @sezruby! 🚀🚀🚀
What changes were proposed in this pull request?
This PR allows users to use the hybrid scan for an append-only dataset.
In order to support Hybrid Scan for an append-only dataset, we need to merge the newly appended files and index data properly. Currently we have the following cases (for a relation based on InMemoryFileIndex):
- transformPlanToUseIndex returns the transformed plan to utilize the given index.
- transformPlanToUsePureIndex: if hybridScanEnabled=false, it replaces the source location with the index data location, same as before.
- transformPlanToUseHybridIndexDataScan: if hybridScanEnabled=true, it first creates the plan with the index location similar to transformPlanToUsePureIndex.

Why are the changes needed?
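The cases above can be sketched as a single dispatch. All types here (Plan, Scan, BucketUnion) are simplified stand-ins for the real logical-plan nodes, so this only illustrates the control flow, not the actual Hyperspace implementation:

```scala
// Simplified stand-ins for logical plan nodes (not actual Spark/Hyperspace types).
sealed trait Plan
case class Scan(location: String) extends Plan
case class BucketUnion(indexPlan: Plan, appendedPlan: Plan) extends Plan

def transformPlanToUseIndex(
    source: Scan,
    indexLocation: String,
    appendedFiles: Seq[String],
    hybridScanEnabled: Boolean): Plan =
  if (!hybridScanEnabled || appendedFiles.isEmpty) {
    // transformPlanToUsePureIndex: replace the source location with index data.
    Scan(indexLocation)
  } else {
    // transformPlanToUseHybridIndexDataScan: merge index data with a scan of
    // only the newly appended files.
    BucketUnion(Scan(indexLocation), Scan(appendedFiles.mkString(",")))
  }
```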
To support Hybrid Scan for append-only dataset. (#150)
Does this PR introduce any user-facing change?
Yes. If a user turns on Hybrid Scan (spark.hyperspace.index.hybridscan.enabled=true), outdated indexes whose dataset has newly appended files can be candidates for both FilterIndexRule & JoinIndexRule. The query plan is modified in the optimizer to support this, and the changed plan can be checked with the hs.explain() API.
FilterRule - Case 1
FilterRule - Case 2
JoinRule - BroadcastHashJoin
JoinRule - Sort Merge Join
How was this patch tested?
Unit test & TPCH validation