Support Iceberg table format #320

andrei-ionescu · 2021-01-09T14:55:39Z

What is the context for this pull request?

Tracking Issue: [FEATURE REQUEST]: Add support for Iceberg table format #306
Proposal: [PROPOSAL]: Support Iceberg table format #318
Dependencies: Support DataSourceV2 sources #321
Fixes: [FEATURE REQUEST]: Add support for Iceberg table format #306
Fixes: [PROPOSAL]: Support Iceberg table format #318

What changes were proposed in this pull request?

This PR adds support for Iceberg.

The following changes are in this PR, each as a separate commit:

Add Iceberg source.
Add support for incremental refresh. This is based on the "Support incremental refresh for Delta Lake" PR (#301) from @sezruby.
Add integration test.

Does this PR introduce any user-facing change?

No. The main changes to user-facing APIs are in the #321 PR. Detailed information can be found in the #318 proposal.

How was this patch tested?

Integration test added for the new functionality
Locally & Databricks Runtime tests

Local build

sbt publishLocal

Run Spark shell with Hyperspace and Iceberg libraries loaded

$ spark-shell \
--driver-memory 4g \
--packages "com.microsoft.hyperspace:hyperspace-core_2.11:0.4.0-SNAPSHOT,org.apache.iceberg:iceberg-spark-runtime:0.10.0" \
--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006 -XX:+UseG1GC -Dlog4j.debug=true"

Paste the following code

import org.apache.spark.sql._
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._
import scala.collection.JavaConverters._
import org.apache.iceberg.PartitionSpec
import org.apache.iceberg.TableProperties
import org.apache.iceberg.spark._
import org.apache.iceberg.hadoop._

val hs = new Hyperspace(spark)

// create Iceberg table
val props = Map(TableProperties.WRITE_NEW_DATA_LOCATION -> "table3").asJava
val sourceDf = Seq((1, "name1"), (2, "name2")).toDF("id", "name")
val schema = SparkSchemaUtil.convert(sourceDf.schema)
val part = PartitionSpec.builderFor(schema).build()
val icebergTable = new HadoopTables().create(schema, part, props, "table3")
sourceDf.write.mode("overwrite").format("iceberg").save("./table3")

// read created table
val iceDf = spark.read.format("iceberg").load("./table3")

// create indexes
hs.createIndex(iceDf, IndexConfig("index_ice0", indexedColumns = Seq("id"), includedColumns = Seq("name")))
hs.createIndex(iceDf, IndexConfig("index_ice1", indexedColumns = Seq("name")))

// verify plans
val query = iceDf.filter(iceDf("id") === 1).select("name")
hs.explain(query, verbose = true)

sezruby

Could you add a test file for Iceberg (for example, like https://github.com/microsoft/hyperspace/blob/master/src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala) and an incremental refresh test (like
https://github.com/microsoft/hyperspace/pull/301/files#diff-f32a70d0b9c560ff5d6a55595db0f12be911fef2ccd303ec24fe0799c7b31b0eR102)?

BTW, to reduce the PR size, could you split this PR into:

1) Support for DataSourceV2 (LogicalRelation -> LogicalPlan, ExtractIndexSupportedLogicalPlan (see the comment below)) + unit tests for DataSourceV2 if possible
2) Based on 1), IcebergFileBasedSourceProvider + Iceberg tests

This will help us understand your change better :)
You can keep this PR for reference and open 2 new PRs.

src/main/scala/com/microsoft/hyperspace/util/HyperspaceConf.scala

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala

andrei-ionescu · 2021-01-11T22:21:16Z

@sezruby I'll keep this PR for Iceberg related commits:

Add Iceberg support
Add support for incremental refresh
Add Iceberg integration test

I'll create another PR for the DataSourceV2 changes.

andrei-ionescu · 2021-01-11T22:34:26Z

I'll rebase this PR as soon as #321 gets merged.

src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala

sezruby · 2021-01-14T10:40:37Z

src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala

+          // The index should be applied for the updated version.
+          assert(isIndexUsed(query().queryExecution.optimizedPlan, "iceIndex", true))
+
+          // Append data.


Besides "append", you could also remove one or two files from the source data.
To delete source files easily, I used partitioned data in other test cases.
In any case, we also need to test partitioned source data.

andrei-ionescu · 2021-01-14

With Iceberg (version 0.10), I can only remove entire files. If I try to remove just some rows from a file, the delete action fails.

sezruby · 2021-01-15

Yep, Hyperspace indexes also support only whole-file deletes, not row-level deletes.

Could you add a test for a partitioned Iceberg table and Hybrid Scan append?
It's similar to this test, but uses a partitioned df.

andrei-ionescu · 2021-01-14T21:23:00Z

@sezruby Here are the requested plans with Hybrid Scan enabled:

Right after creation

Project [query#215]
+- Filter (clicks#217 <= 2000)
   +- Relation[Query#215,clicks#217] parquet

After deleting a file

Project [query#237]
+- Filter (clicks#239 <= 2000)
   +- Project [Query#237, clicks#239]
      +- Filter NOT (_data_file_id#246L = 0)
         +- Relation[Query#237,clicks#239,_data_file_id#246L] parquet

After adding some more data

Union
:- Project [query#266]
:  +- Filter (clicks#268 <= 2000)
:     +- Project [Query#266, clicks#268]
:        +- Filter NOT (_data_file_id#275L = 0)
:           +- Relation[Query#266,clicks#268,_data_file_id#275L] parquet
+- Project [query#266]
   +- Filter (clicks#268 <= 2000)
      +- Relation[Query#266,clicks#268] parquet

I think they look as expected.
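To make the shape of the three plans above easier to read, here is a minimal toy model in plain Scala collections. It is only a sketch of what the plans express, not Hyperspace code: `IndexRow`, `SourceRow`, and `hybridScan` are hypothetical names, and the sample rows are made up. It assumes index rows carry the lineage column `_data_file_id`, that rows from deleted source files are filtered out by that id, and that appended source rows are scanned directly and unioned in.

```scala
// Toy model of the plans above; names and data are illustrative assumptions.
final case class IndexRow(query: String, clicks: Int, dataFileId: Long)
final case class SourceRow(query: String, clicks: Int)

object HybridScanModel {
  // Index side:    Project [query] <- Filter (clicks <= 2000)
  //                <- Filter NOT (_data_file_id = <deleted id>) <- index Relation
  // Appended side: Project [query] <- Filter (clicks <= 2000) <- source Relation
  // The two sides are combined with a Union.
  def hybridScan(
      index: Seq[IndexRow],
      deletedFileIds: Set[Long],
      appended: Seq[SourceRow]): Seq[String] = {
    val fromIndex = index
      .filterNot(row => deletedFileIds.contains(row.dataFileId))
      .filter(_.clicks <= 2000)
      .map(_.query)
    val fromAppended = appended.filter(_.clicks <= 2000).map(_.query)
    fromIndex ++ fromAppended
  }
}
```

With no deleted ids and no appended rows this reduces to the "right after creation" plan; passing a deleted id gives the second plan, and passing appended rows as well gives the Union.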

Just FYI...

IcebergSource read plan w/o index:

Project [query#28]
+- Filter (clicks#30 <= 2000)
   +- RelationV2 iceberg[Date#26, RGUID#27, Query#28, imprs#29, clicks#30] (Options: [path=/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-25d9e8bb-cc56-4cac-b1f5-e2a2...)

sezruby · 2021-01-15T00:24:00Z

@andrei-ionescu
Thanks! It looks good 👍
Could you also share the plans of the join query? (See join() in the quick refresh test.)
Also, please run collect on each filter/join query (with the transformed plan) and compare the results with those obtained without indexes.
(JFYI: after spark.disableHyperspace or after disabling Hybrid Scan, you need to redefine the query (e.g. val filter = ..) to generate a new plan.)
We'll add a Hybrid Scan test later to check the plan transformation and the result comparison :)
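A small helper can make the requested result comparison less error-prone, since collect gives no ordering guarantee and the indexed plan may return rows in a different order. The sketch below is an assumption about how one might do the check, not code from this PR: `ResultCheck.sameRows` and `filterQuery` are hypothetical names.

```scala
object ResultCheck {
  // Order-insensitive comparison: two result sets match when every distinct
  // row occurs the same number of times in both.
  def sameRows[T](left: Seq[T], right: Seq[T]): Boolean = {
    def counts(rows: Seq[T]): Map[T, Int] =
      rows.groupBy(identity).map { case (row, group) => row -> group.size }
    counts(left) == counts(right)
  }
}

// Intended use in spark-shell (not executed here). `filterQuery` is a
// hypothetical function that rebuilds the DataFrame on each call, so a new
// plan is generated after Hyperspace is toggled:
//   spark.enableHyperspace()
//   val withIndex = filterQuery().collect().toSeq
//   spark.disableHyperspace()
//   val withoutIndex = filterQuery().collect().toSeq
//   assert(ResultCheck.sameRows(withIndex, withoutIndex))
```

Comparing per-row counts rather than sets keeps duplicate rows significant, which matters when appended data repeats existing rows.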

andrei-ionescu · 2021-01-15T13:04:30Z

@sezruby This is the join

Project [c2#206, c4#218]
+- Join Inner, (c2#206 = c2#216)
   :- Union
   :  :- Project [c2#206]
   :  :  +- Filter isnotnull(c2#206)
   :  :     +- Project [c2#206, c4#208]
   :  :        +- Filter NOT (_data_file_id#423L = 0)
   :  :           +- Relation[c2#206,c4#208,_data_file_id#423L] parquet
   :  +- Project [c2#206]
   :     +- Filter isnotnull(c2#206)
   :        +- Relation[c2#206,c4#208] parquet
   +- Union
      :- Project [c2#216, c4#218]
      :  +- Filter isnotnull(c2#216)
      :     +- Project [c2#216, c4#218]
      :        +- Filter NOT (_data_file_id#424L = 0)
      :           +- Relation[c2#216,c4#218,_data_file_id#424L] parquet
      +- Project [c2#216, c4#218]
         +- Filter isnotnull(c2#216)
            +- Relation[c2#216,c4#218] parquet

sezruby · 2021-01-15T13:15:22Z

@andrei-ionescu could you share the sparkPlan?
Make sure to set 'spark.sql.autoBroadcastJoinThreshold' to -1 so we can see whether the shuffle is removed.

andrei-ionescu · 2021-01-15T14:37:01Z

@sezruby The Spark plan

Project [c2#206, c4#218]
+- SortMergeJoin [c2#206], [c2#216], Inner
   :- Union
   :  :- Project [c2#206]
   :  :  +- Filter (NOT (_data_file_id#687L = 0) && isnotnull(c2#206))
   :  :     +- FileScan parquet [c2#206,_data_file_id#687L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/icebergIntegrationTes..., PartitionFilters: [], PushedFilters: [Not(EqualTo(_data_file_id,0)), IsNotNull(c2)], ReadSchema: struct<c2:string,_data_file_id:bigint>
   :  +- Project [c2#206]
   :     +- Filter isnotnull(c2#206)
   :        +- FileScan parquet [c2#206] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-6ef5741e-dd25-49..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string>
   +- Union
      :- Project [c2#216, c4#218]
      :  +- Filter (NOT (_data_file_id#688L = 0) && isnotnull(c2#216))
      :     +- FileScan parquet [c2#216,c4#218,_data_file_id#688L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/icebergIntegrationTes..., PartitionFilters: [], PushedFilters: [Not(EqualTo(_data_file_id,0)), IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int,_data_file_id:bigint>
      +- Project [c2#216, c4#218]
         +- Filter isnotnull(c2#216)
            +- FileScan parquet [c2#216,c4#218] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-6ef5741e-dd25-49..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int>

sezruby · 2021-01-15T15:06:58Z

@andrei-ionescu Seems bucketSpec is not applied properly.

Union => BucketUnion
relation should have bucketing information - e.g. selected 200 of 200
Sort should exist between SortMergeJoin and BucketUnion.

Could you investigate the cause? We could first check the join plan with the index but without Hybrid Scan.
Thanks!
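The three expectations listed above can also be checked mechanically against a pasted plan. The following is a rough sketch that works on the plan's text only; `JoinPlanCheck` is a hypothetical helper, and it assumes the usual tree-string layout in which a child is indented deeper than its parent and each line starts with the operator name after the `:`, `+-` prefix characters.

```scala
object JoinPlanCheck {
  final case class Report(
      hasBucketUnion: Boolean,
      hasBucketedScan: Boolean,
      sortAboveBucketUnion: Boolean)

  // Depth of a node = column of the first letter on its tree-string line.
  private def depth(line: String): Int = line.indexWhere(_.isLetter)

  // Operator name, e.g. "Sort", "SortMergeJoin", "BucketUnion".
  private def name(line: String): String =
    line.drop(depth(line)).takeWhile(_.isLetterOrDigit)

  // Ancestors of line i, nearest first: walking upward, each line that is
  // shallower than the last ancestor found is the next ancestor.
  private def ancestors(lines: Vector[String], i: Int): List[String] =
    lines.take(i).foldRight((depth(lines(i)), List.empty[String])) {
      case (line, (d, acc)) =>
        if (depth(line) < d) (depth(line), acc :+ line) else (d, acc)
    }._2

  def check(planText: String): Report = {
    val lines = planText.split("\n").toVector.filter(_.exists(_.isLetter))
    val unions = lines.indices.filter(i => name(lines(i)) == "BucketUnion")
    // Every BucketUnion needs a Sort somewhere between it and SortMergeJoin.
    val sorted = unions.nonEmpty && unions.forall { i =>
      ancestors(lines, i).map(name).takeWhile(_ != "SortMergeJoin").contains("Sort")
    }
    Report(
      hasBucketUnion = unions.nonEmpty,
      hasBucketedScan = lines.exists(_.contains("SelectedBucketsCount")),
      sortAboveBucketUnion = sorted)
  }
}
```

Calling `JoinPlanCheck.check(plan.toString)` on a physical plan returns one flag per expectation; a text-based check like this is only as reliable as the plan's string format, so it is best treated as a debugging aid.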

andrei-ionescu · 2021-01-15T21:24:12Z

@sezruby I found yet another place where I had missed adding the DataSourceV2Relation pattern match.

Here are the optimizedPlan and the sparkPlan:

Project [c2#206, c4#218]
+- Join Inner, (c2#206 = c2#216)
   :- BucketUnion 200 buckets, bucket columns: [c2]
   :  :- Project [c2#206]
   :  :  +- Filter isnotnull(c2#206)
   :  :     +- Project [c2#206, c4#208]
   :  :        +- Filter NOT (_data_file_id#423L = 0)
   :  :           +- Relation[c2#206,c4#208,_data_file_id#423L] parquet
   :  +- RepartitionByExpression [c2#206], 200
   :     +- Project [c2#206]
   :        +- Filter isnotnull(c2#206)
   :           +- Relation[c2#206,c4#208] parquet
   +- BucketUnion 200 buckets, bucket columns: [c2]
      :- Project [c2#216, c4#218]
      :  +- Filter isnotnull(c2#216)
      :     +- Project [c2#216, c4#218]
      :        +- Filter NOT (_data_file_id#424L = 0)
      :           +- Relation[c2#216,c4#218,_data_file_id#424L] parquet
      +- RepartitionByExpression [c2#216], 200
         +- Project [c2#216, c4#218]
            +- Filter isnotnull(c2#216)
               +- Relation[c2#216,c4#218] parquet

Project [c2#206, c4#218]
+- SortMergeJoin [c2#206], [c2#216], Inner
   :- BucketUnion 200 buckets, bucket columns: [c2]
   :  :- Project [c2#206]
   :  :  +- Filter (NOT (_data_file_id#553L = 0) && isnotnull(c2#206))
   :  :     +- FileScan parquet [c2#206,_data_file_id#553L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/icebergIntegrationTes..., PartitionFilters: [], PushedFilters: [Not(EqualTo(_data_file_id,0)), IsNotNull(c2)], ReadSchema: struct<c2:string,_data_file_id:bigint>, SelectedBucketsCount: 200 out of 200
   :  +- Exchange hashpartitioning(c2#206, 200)
   :     +- Project [c2#206]
   :        +- Filter isnotnull(c2#206)
   :           +- FileScan parquet [c2#206] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-1c3acd51-9f16-42..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string>
   +- BucketUnion 200 buckets, bucket columns: [c2]
      :- Project [c2#216, c4#218]
      :  +- Filter (NOT (_data_file_id#554L = 0) && isnotnull(c2#216))
      :     +- FileScan parquet [c2#216,c4#218,_data_file_id#554L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/icebergIntegrationTes..., PartitionFilters: [], PushedFilters: [Not(EqualTo(_data_file_id,0)), IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int,_data_file_id:bigint>, SelectedBucketsCount: 200 out of 200
      +- Exchange hashpartitioning(c2#216, 200)
         +- Project [c2#216, c4#218]
            +- Filter isnotnull(c2#216)
               +- FileScan parquet [c2#216,c4#218] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-1c3acd51-9f16-42..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int>

sezruby · 2021-01-16T00:10:36Z

@andrei-ionescu Thanks! The 'Sort' node is still missing.

Could you check this?
Thanks a lot!

andrei-ionescu · 2021-01-17T22:05:16Z

@sezruby I compared with the Delta output and I don't see any difference. Here are the optimizedPlan and the sparkPlan outputs of the Delta test:

Project [c2#3663, c4#3675]
+- Join Inner, (c2#3663 = c2#3673)
   :- BucketUnion 200 buckets, bucket columns: [c2]
   :  :- Project [c2#3663]
   :  :  +- Filter isnotnull(c2#3663)
   :  :     +- Relation[c2#3663,c4#3665] parquet
   :  +- RepartitionByExpression [c2#3663], 200
   :     +- Project [c2#3663]
   :        +- Filter isnotnull(c2#3663)
   :           +- Relation[c2#3663,c4#3665] parquet
   +- BucketUnion 200 buckets, bucket columns: [c2]
      :- Project [c2#3673, c4#3675]
      :  +- Filter isnotnull(c2#3673)
      :     +- Relation[c2#3673,c4#3675] parquet
      +- RepartitionByExpression [c2#3673], 200
         +- Project [c2#3673, c4#3675]
            +- Filter isnotnull(c2#3673)
               +- Relation[c2#3673,c4#3675] parquet

Project [c2#3663, c4#3675]
+- SortMergeJoin [c2#3663], [c2#3673], Inner
   :- BucketUnion 200 buckets, bucket columns: [c2]
   :  :- Project [c2#3663]
   :  :  +- Filter isnotnull(c2#3663)
   :  :     +- FileScan parquet [c2#3663] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/deltaLakeIntegrationT..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string>, SelectedBucketsCount: 200 out of 200
   :  +- Exchange hashpartitioning(c2#3663, 200)
   :     +- Project [c2#3663]
   :        +- Filter isnotnull(c2#3663)
   :           +- FileScan parquet [c2#3663] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-c30f7f12-4a..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string>
   +- BucketUnion 200 buckets, bucket columns: [c2]
      :- Project [c2#3673, c4#3675]
      :  +- Filter isnotnull(c2#3673)
      :     +- FileScan parquet [c2#3673,c4#3675] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/aionescu/github/hyperspace/src/test/resources/deltaLakeIntegrationT..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int>, SelectedBucketsCount: 200 out of 200
      +- Exchange hashpartitioning(c2#3673, 200)
         +- Project [c2#3673, c4#3675]
            +- Filter isnotnull(c2#3673)
               +- FileScan parquet [c2#3673,c4#3675] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/dm/9mytk9kx49s4sf1b3f0cvcs80000gn/T/spark-c30f7f12-4a..., PartitionFilters: [], PushedFilters: [IsNotNull(c2)], ReadSchema: struct<c2:string,c4:int>

If there is something missing then it is missing from Delta too.

BTW there is a SortMergeJoin node in the sparkPlan.

sezruby · 2021-01-18T00:27:17Z

From Delta Lake hybrid scan test:

Project [clicks#1983, query#1981, Date#1989]
+- SortMergeJoin [clicks#1983], [clicks#1993], Inner
   :- Sort [clicks#1983 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(clicks#1983)
   :     +- BucketUnion 200 buckets, bucket columns: [clicks]
   :        :- Project [clicks#1983, query#1981]
   :        :  +- Filter ((isnotnull(clicks#1983) && (clicks#1983 >= 2000)) && (clicks#1983 <= 4000))
   :        :     +- FileScan parquet [clicks#1983,Query#1981] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/path/to/src/test/resources/hybridScanTest/index..., PartitionFilters: [], PushedFilters: [IsNotNull(clicks), GreaterThanOrEqual(clicks,2000), LessThanOrEqual(clicks,4000)], ReadSchema: struct<clicks:int,Query:string>, SelectedBucketsCount: 200 out of 200
   :        +- Exchange hashpartitioning(clicks#1983, 200)
   :           +- Project [clicks#1983, query#1981]
   :              +- Filter ((isnotnull(clicks#1983) && (clicks#1983 >= 2000)) && (clicks#1983 <= 4000))
   :                 +- FileScan parquet [Query#1981,clicks#1983,RGUID#1980] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/path/to/AppData/Local/Temp/spark-60a4469f-edc0-4286-9944-2769bbd..., PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(clicks), GreaterThanOrEqual(clicks,2000), LessThanOrEqual(clicks,4000)], ReadSchema: struct<Query:string,clicks:int>
   +- Sort [clicks#1993 ASC NULLS FIRST], false, 0
      +- Filter isnotnull(clicks#1993)
         +- BucketUnion 200 buckets, bucket columns: [clicks]
            :- Project [clicks#1993, Date#1989]
            :  +- Filter ((isnotnull(clicks#1993) && (clicks#1993 <= 4000)) && (clicks#1993 >= 2000))
            :     +- FileScan parquet [clicks#1993,Date#1989] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/C:/path/to/src/test/resources/hybridScanTest/index..., PartitionFilters: [], PushedFilters: [IsNotNull(clicks), LessThanOrEqual(clicks,4000), GreaterThanOrEqual(clicks,2000)], ReadSchema: struct<clicks:int,Date:string>, SelectedBucketsCount: 200 out of 200
            +- Exchange hashpartitioning(clicks#1993, 200)
               +- Project [clicks#1993, Date#1989]
                  +- Filter ((isnotnull(clicks#1993) && (clicks#1993 <= 4000)) && (clicks#1993 >= 2000))
                     +- FileScan parquet [Date#1989,clicks#1993,RGUID#1990] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/path/to/AppData/Local/Temp/spark-60a4469f-edc0-4286-9944-2769bbd..., PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(clicks), LessThanOrEqual(clicks,4000), GreaterThanOrEqual(clicks,2000)], ReadSchema: struct<Date:string,clicks:int>

I think there might be no duplicated bucket between appended data & original source data in your testcase.
Could you check it again by appending the same data? Thanks!

andrei-ionescu · 2021-01-18T10:32:46Z

@sezruby,

The results (optimizedPlan & sparkPlan) I pasted above both come from the "Verify JoinIndexRule utilizes indexes correctly after quick refresh when some file gets deleted and some appended to source data." test, in the Delta (DeltaLakeIntegrationTest.scala#L146) and Iceberg (IcebergIntegrationTest.scala#L162) suites respectively. They seem to produce the same output, so the test behaves the same way in both cases.

Regarding your question, we do add duplicate data here: https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3R186-R192.

I'm not sure where the output you pasted comes from. Can you please point me to the test that produces it, or provide its code? I want to run the same test on Iceberg too, then debug and compare the differences.

Thank you.

sezruby · 2021-01-18T10:53:17Z

@andrei-ionescu

This is the test:

Need to cherry-pick: Refactor Hybrid Scan test suites #274
TestName: HybridScanForDeltaLakeTest

// code in HybridScanSuite.scala
 test(
    "Append-only: join rule, appended data should be shuffled with indexed columns " +
      "and merged by BucketUnion")

Then there might be a problem in quick refresh?
Could you check Hybrid Scan first? In the join + quick refresh Iceberg test, you could test it like this:

// hyperspace.refreshIndex(indexConfig.indexName, REFRESH_MODE_QUICK)
// instead of refresh, enable hybrid scan:
withSQLConf(TestConfig.HybridScanEnabled: _*) {
  // build the join query here and check its plan
}

andrei-ionescu · 2021-01-18T10:56:19Z

@sezruby The suggested test belongs to a PR that changes the Hybrid Scan logic. I can try to take that test, add it to my PR, and print the output for both Delta and Iceberg, but I will not bring that PR's Hybrid Scan logic changes into mine.

andrei-ionescu · 2021-01-19T00:03:29Z

@sezruby I merged your hybridtest_refactoring branch, which contains the #274 changes, into my local development branch and ran all the tests (sbt +test); they passed. This means that the DataSourceV2 PR #321 does not change the current functionality of Hyperspace.

I tried to replicate HybridScanForDeltaLakeTest as an Iceberg test, HybridScanForIcebergTest, but there is a lot of work to be done because Iceberg has a different way of getting the appended and deleted files. I need to understand the test better and use it as a reference for Iceberg, but it will require more changes in some test-related areas.

Taking into account the following things:

No changes into the current implementation
No changes into the Hybrid Scan test refactor branch
The HybridScan for Iceberg is tightly linked to Iceberg implementation

I would suggest merging the #321 and #274 PRs; after that I can work on bringing this Iceberg implementation on par with the Delta one.

Or, merge the #274 PR first; I'll then rebase my #321 and keep the tests to validate that nothing is broken by my DataSourceV2 support addition.

@sezruby what do you think?

sezruby · 2021-01-19T01:26:40Z

@andrei-ionescu I'm okay with either way. BTW we need @imback82's review to merge the changes :)
Please understand any delay in our review... 🙏

Thanks for the great work!

imback82 · 2021-01-19T01:34:27Z

Sorry for the delay. I will get to #274 soon.

sezruby · 2021-01-24T14:23:14Z

@andrei-ionescu
It seems the "missing Sort node" is because this is the sparkPlan, not the executedPlan.
Sorry, I wasn't aware of the difference 😅 Could you check the executedPlan again? Thanks!
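For reference, the difference can be reproduced with plain Spark and no index at all: queryExecution.sparkPlan is the physical plan before Spark's preparation rules run, while executedPlan is the plan after them, and it is those rules (EnsureRequirements in particular) that insert the Sort and Exchange nodes a SortMergeJoin needs. The sketch below is a standalone illustration under that assumption; `SortNodeDemo` and the tiny DataFrames are made up for the example, and adaptive execution is switched off so that the final plan is visible on newer Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.{SortExec, SparkPlan}

object SortNodeDemo {
  // Counts the Sort operators in a physical plan.
  private def sortCount(plan: SparkPlan): Int =
    plan.collect { case s: SortExec => s }.size

  // Returns (Sort nodes in sparkPlan, Sort nodes in executedPlan).
  def sortCounts(spark: SparkSession): (Int, Int) = {
    import spark.implicits._
    // Force a sort-merge join instead of a broadcast join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    // Keep the executed plan a plain operator tree on newer Spark versions.
    spark.conf.set("spark.sql.adaptive.enabled", "false")

    val left = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "r")
    val qe = left.join(right, "id").queryExecution
    (sortCount(qe.sparkPlan), sortCount(qe.executedPlan))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sort-node-demo")
      .getOrCreate()
    try {
      val (planned, executed) = sortCounts(spark)
      println(s"Sort nodes in sparkPlan: $planned, in executedPlan: $executed")
    } finally {
      spark.stop()
    }
  }
}
```

The same counting helper can be pointed at the Iceberg join query from the test to confirm that Sort appears above BucketUnion only in the executedPlan.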

andrei-ionescu · 2021-02-15T16:25:25Z

Closing this PR because of the new work from @imback82 in PR #355. I created a new PR that contains only the Iceberg table format changes: #358.

andrei-ionescu force-pushed the iceberg branch 2 times, most recently from 2b55d68 to 86c510a Compare January 11, 2021 15:34

sezruby reviewed Jan 11, 2021

andrei-ionescu force-pushed the iceberg branch 2 times, most recently from ec30343 to 84f7597 Compare January 11, 2021 21:27

andrei-ionescu force-pushed the iceberg branch from 6a3cf87 to c928b80 Compare January 11, 2021 22:31

andrei-ionescu force-pushed the iceberg branch from c928b80 to 8ec9e5f Compare January 11, 2021 22:40

sezruby assigned andrei-ionescu Jan 12, 2021

andrei-ionescu force-pushed the iceberg branch 2 times, most recently from 2abb1e1 to e71d6e4 Compare January 13, 2021 16:09

andrei-ionescu mentioned this pull request Jan 14, 2021

Support DataSourceV2 sources #321

Closed

sezruby reviewed Jan 14, 2021

src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala

sezruby reviewed Jan 14, 2021

src/test/scala/com/microsoft/hyperspace/index/IcebergIntegrationTest.scala (outdated)

sezruby reviewed Jan 14, 2021

andrei-ionescu force-pushed the iceberg branch from e71d6e4 to a3210bd Compare January 14, 2021 21:05

andrei-ionescu force-pushed the iceberg branch from a3210bd to a037b97 Compare January 15, 2021 10:23

andrei-ionescu force-pushed the iceberg branch from a037b97 to 345c52d Compare January 15, 2021 21:22

andrei-ionescu force-pushed the iceberg branch from 345c52d to 14fbec5 Compare January 15, 2021 21:42

andrei-ionescu force-pushed the iceberg branch 3 times, most recently from 4fbd068 to e68fc57 Compare January 22, 2021 14:44

andrei-ionescu force-pushed the iceberg branch 5 times, most recently from 0b54e91 to 5b009d4 Compare February 1, 2021 18:12

andrei-ionescu force-pushed the iceberg branch from 5b009d4 to 0d0dec9 Compare February 2, 2021 18:23

andrei-ionescu mentioned this pull request Feb 5, 2021

Add config to use bucketed scan for filter indexes #329

Merged

andrei-ionescu force-pushed the iceberg branch 2 times, most recently from 5de333e to d9505fd Compare February 10, 2021 18:52

imback82 mentioned this pull request Feb 12, 2021

Introduce SourceRelation/FileBasedRelation traits to remove direct dependency on LogicalRelation from actions/rules #355

Merged

andrei-ionescu force-pushed the iceberg branch 2 times, most recently from beee2a0 to 14b8f35 Compare February 15, 2021 13:28

Add Iceberg support

b3fc870

andrei-ionescu force-pushed the iceberg branch from 14b8f35 to b3fc870 Compare February 15, 2021 15:05

andrei-ionescu mentioned this pull request Feb 15, 2021

Support Iceberg table format #358

Merged

andrei-ionescu closed this Feb 15, 2021

andrei-ionescu deleted the iceberg branch February 22, 2021 20:42
