
Conversation

@andrei-ionescu
Contributor

What is the context for this pull request?

What changes were proposed in this pull request?

This PR adds support for Iceberg.

The following changes are in this PR, and each of them is a separate commit:

Does this PR introduce any user-facing change?

No. The main changes to user-facing APIs are in the #321 PR. Detailed information can be found in the #318 proposal.

How was this patch tested?

  1. Integration test added for the new functionality
  2. Tests run locally and on the Databricks Runtime
  • Local build
sbt publishLocal
  • Run Spark shell with Hyperspace and Iceberg libraries loaded
$ spark-shell \
--driver-memory 4g \
--packages "com.microsoft.hyperspace:hyperspace-core_2.11:0.4.0-SNAPSHOT,org.apache.iceberg:iceberg-spark-runtime:0.10.0" \
--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006 -XX:+UseG1GC -Dlog4j.debug=true"
  • Paste the following code
import org.apache.spark.sql._
import spark.implicits._ // pre-imported in spark-shell; needed when running elsewhere
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._
import scala.collection.JavaConverters._
import org.apache.iceberg.PartitionSpec
import org.apache.iceberg.TableProperties
import org.apache.iceberg.spark._
import org.apache.iceberg.hadoop._

val hs = new Hyperspace(spark)

// create an unpartitioned Iceberg table at ./table3
val props = Map(TableProperties.WRITE_NEW_DATA_LOCATION -> "table3").asJava
val sourceDf = Seq((1, "name1"), (2, "name2")).toDF("id", "name")
val schema = SparkSchemaUtil.convert(sourceDf.schema)
val part = PartitionSpec.builderFor(schema).build() // unpartitioned spec
val icebergTable = new HadoopTables().create(schema, part, props, "table3")
sourceDf.write.mode("overwrite").format("iceberg").save("./table3")

// read created table
val iceDf = spark.read.format("iceberg").load("./table3")

// create indexes
hs.createIndex(iceDf, IndexConfig("index_ice0", indexedColumns = Seq("id"), includedColumns = Seq("name")))
hs.createIndex(iceDf, IndexConfig("index_ice1", indexedColumns = Seq("name")))

// verify plans
val query = iceDf.filter(iceDf("id") === 1).select("name")
hs.explain(query, verbose = true)

@andrei-ionescu
Contributor Author

andrei-ionescu commented Feb 15, 2021

@imback82 I closed PRs #321 and #320, which became obsolete after your refactoring work in #355 - thanks for it! Please review this PR that adds support for the Iceberg table format.

@imback82 imback82 added the enhancement New feature or request label Feb 16, 2021
@imback82 imback82 added this to the February 2021 (v0.5.0) milestone Feb 16, 2021
Contributor

@imback82 imback82 left a comment


A few minor/nit comments, but generally this is looking good to me.

I don't know every detail of Iceberg, but since the changes are self-contained, I think this is good to go. @sezruby could you also take a look? Thanks.

Collaborator

@sezruby sezruby left a comment


Generally looks good to me! Could you also consider adding a HybridScanForIcebergTest, like HybridScanForDeltaLakeTest?

@andrei-ionescu
Contributor Author

@sezruby The test for hybrid scan is in IcebergIntegrationTest, lines 299-338.

Contributor

@imback82 imback82 left a comment


LGTM (pending minor/nit comments + @sezruby's comments), thanks @andrei-ionescu!

@sezruby
Collaborator

sezruby commented Feb 17, 2021

@sezruby The test for hybrid scan is in IcebergIntegrationTest, lines 299-338.

Yes, but we need to verify the exact plan transformation and the result of the query.
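
A minimal sketch of the kind of check meant here, reusing iceDf and the index names from the example at the top of this thread (the actual test's assertions may differ):

import com.microsoft.hyperspace._

// Baseline rows with Hyperspace disabled.
spark.disableHyperspace()
val expected = iceDf.filter(iceDf("id") === 1).select("name").collect().toSet

// With Hyperspace enabled, the same query should be rewritten to read from the index ...
spark.enableHyperspace()
val optimized = iceDf.filter(iceDf("id") === 1).select("name")
assert(optimized.queryExecution.optimizedPlan.toString.contains("index_ice0"))

// ... and still return the same rows.
assert(optimized.collect().toSet == expected)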

@andrei-ionescu
Contributor Author

@sezruby I took the Delta Lake hybrid scan test and modified it for Iceberg; that is what is in the integration test, lines 299 to 338. Is there another hybrid scan test for Delta Lake that I missed?

@sezruby
Collaborator

sezruby commented Feb 17, 2021

@andrei-ionescu andrei-ionescu force-pushed the iceberg_support branch 3 times, most recently from 5a23a11 to 5e89eee on February 19, 2021 at 20:24
@andrei-ionescu
Contributor Author

@sezruby I added HybridScanForIcebergTest.scala. Please have another look.

Collaborator

@sezruby sezruby left a comment


@andrei-ionescu Thanks! It seems to work as expected! 👍👍
For the start & end snapshot id test, I'd just like to check the use case & see how those configs work (or don't) with Hyperspace.

@andrei-ionescu andrei-ionescu force-pushed the iceberg_support branch 2 times, most recently from 12347df to a158175 on February 19, 2021 at 21:22
@andrei-ionescu
Contributor Author

@sezruby

For the start & end snapshot id test, I'd just like to check the use case & see how those configs work (or don't) with Hyperspace.

Is there such a check for Delta that I could take inspiration from?

@sezruby
Collaborator

sezruby commented Feb 19, 2021

@andrei-ionescu No, I'm not sure Delta has that feature.
How about this? (A rough sketch follows the list below.)

  • build test data

    • snapshot 1) add a row with value 1 (append 1 file)
    • snapshot 2) add a row with value 2 (append 1 file)
    • snapshot 3) add a row with value 3 (append 1 file)
    • snapshot 4) add a row with value 4 (append 1 file)
    • snapshot 5) add a row with value 5 (append 1 file)
  • build index with df with start-snapshot-id 2, end-snapshot-id 4

    • query with df with start-snapshot-id 2, end-snapshot-id 4 w/o hybrid scan
    • query with df with start-snapshot-id 3, and check w/ hybrid scan
    • query with df with end-snapshot-id 3, and check w/ hybrid scan
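
A rough sketch of how that test data could be built, assuming the Iceberg Spark source's incremental read options start-snapshot-id/end-snapshot-id; the table path and variable names are illustrative, and the table is assumed to already exist (created as in the example at the top of this thread):

import scala.collection.JavaConverters._
import org.apache.iceberg.hadoop.HadoopTables

// Five single-file appends, producing five snapshots.
(1 to 5).foreach { i =>
  Seq((i, s"name$i")).toDF("id", "name")
    .repartition(1)
    .write.format("iceberg").mode("append").save("table_snap")
}

// Snapshot ids in commit order.
val snapshotIds = new HadoopTables().load("table_snap")
  .history().asScala.map(_.snapshotId())

// Incremental read: only the rows committed between the two snapshots.
val incDf = spark.read.format("iceberg")
  .option("start-snapshot-id", snapshotIds(1))
  .option("end-snapshot-id", snapshotIds(3))
  .load("table_snap")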

@andrei-ionescu
Contributor Author

andrei-ionescu commented Feb 19, 2021

@sezruby I'll try to see if it is possible, but from my knowledge of Iceberg I don't think this kind of test is possible. The snapshot ids are used for time travel and to isolate data at read time. A snapshot is a sort of full image of the dataset's metadata; it is not a delta from the previous version. This is a bit different from Delta.

The parameters are extracted from the IcebergSource, which knows how to properly access the Iceberg table by getting the latest version and retrieving the files attached to that version. The scope of those parameters is only to properly get the list of files.

Delta has the same parameter under versionAsOf. If there are tests that validate Delta's time travel functionality, I'll do my best to add them to Iceberg too, although the implementation is a bit different.

And in regards to time travel, I don't think we have any functionality yet that allows us to link a version of the table to a version of the index.

I would suggest having a separate PR for Iceberg time travel at the right time.

@sezruby
Collaborator

sezruby commented Feb 19, 2021

@andrei-ionescu I guess Delta Lake's versionAsOf is the same as Iceberg with only end-snapshot-id.
But a df with both start-snapshot-id and end-snapshot-id handles the "delta" dataset between two snapshots, right? (See the sketch below.)

Hybrid scan will work w/o any link to version info - it utilizes the list of source files from the DataFrame.

For time travel query optimization, to pick the proper version of a candidate index (if it has multiple versions from refreshes), I added version history info in #272 to link the Delta version and the index version.
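
For reference, a sketch of the two read modes being contrasted here, with option names per the Delta Lake and Iceberg Spark sources (the paths and the startId/endId values are illustrative):

// Delta Lake time travel: a full snapshot of the table as of one version.
val deltaDf = spark.read.format("delta")
  .option("versionAsOf", 3)
  .load("/tmp/delta-table")

// Iceberg incremental read: only the data committed between two snapshots
// (startId/endId would come from the table's snapshot history).
val iceIncDf = spark.read.format("iceberg")
  .option("start-snapshot-id", startId)
  .option("end-snapshot-id", endId)
  .load("table3")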

@andrei-ionescu
Contributor Author

andrei-ionescu commented Feb 20, 2021

@sezruby: I refactored it a bit, and it no longer needs knowledge of start-snapshot-id, end-snapshot-id, and other internal Iceberg properties. That task method was in part inspired by Iceberg code, but I decided to use the higher-level Iceberg API newScan().planFiles() to get all files.
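
A minimal sketch of what that higher-level call looks like (table location illustrative):

import scala.collection.JavaConverters._
import org.apache.iceberg.hadoop.HadoopTables

// Enumerate the data files of the table's current snapshot,
// without touching snapshot-id options or other internals.
val table = new HadoopTables().load("table3")
val filePaths = table.newScan().planFiles().iterator().asScala
  .map(_.file().path().toString)
  .toSeq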

@sezruby
Collaborator

sezruby commented Feb 20, 2021

LGTM thanks @andrei-ionescu!

@andrei-ionescu
Contributor Author

@rapoth, @imback82, @sezruby: The PR already has 2 LGTMs - thanks @imback82 & @sezruby. What's the process? What are the next steps to get it merged and complete the feature?

@imback82
Contributor

Let me take a final look, as there have been some changes since my last LGTM.

*
* File paths should be the same format as "input_file_name()" of the given relation type.
* For [[IcebergRelation]], each file path should be in this format:
* `file:/path/to/file`
Contributor


Can you update this based on the implementation below?
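
One quick way to check the exact format for a given relation is to inspect input_file_name() directly (iceDf as in the example at the top of this thread):

import org.apache.spark.sql.functions.input_file_name

// Show the file paths Spark reports for the Iceberg relation,
// to match the doc comment's format against.
iceDf.select(input_file_name()).distinct.show(false)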

@imback82 imback82 merged commit 29ebdde into microsoft:master Feb 22, 2021
@imback82
Contributor

@andrei-ionescu Can you do a follow-up PR to address #358 (comment)? Thanks!

@andrei-ionescu
Contributor Author

@imback82: I created #362 to address the doc comment.

@imback82, @sezruby, @rapoth: Thanks for the help with this PR & feature.

@rapoth
Contributor

rapoth commented Feb 22, 2021

It has been a fantastic collaboration so far and I have personally learned a lot in the process! Thank you @andrei-ionescu! 🙂

@sezruby
Collaborator

sezruby commented Feb 23, 2021


Labels

enhancement New feature or request


Development

[PROPOSAL]: Support Iceberg table format
[FEATURE REQUEST]: Add support for Iceberg table format
