Skip to content

Core, Spark 3.4: Write properties of PositionDeletesTable should respect ones of BaseTable#8428

Merged
szehon-ho merged 6 commits intoapache:masterfrom
roryqi:minor
Sep 19, 2023
Merged

Core, Spark 3.4: Write properties of PositionDeletesTable should respect ones of BaseTable#8428
szehon-ho merged 6 commits intoapache:masterfrom
roryqi:minor

Conversation

@roryqi
Copy link
Copy Markdown
Contributor

@roryqi roryqi commented Aug 30, 2023

What changes were proposed in this pull request?

Make write properties of PositionDeletesTable respect ones of BaseTable.

Why are the changes needed?

When we use PositionDeletesRewriteAction, we will use the properties of PositionDeletesTable, but the properties are empty before this pr. We will use default properties all the time. The write file format of PositionDeletesRewriteAction will be parquet although the table use orc as write format. It's unreasonable. More information you can see #8313 (comment)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@roryqi roryqi changed the title Spark 3.4: BaseMetadataTable should respect the base table properties [WIP] Spark 3.4: BaseMetadataTable should respect the base table properties Aug 30, 2023
@roryqi roryqi marked this pull request as draft August 30, 2023 14:55
@roryqi roryqi changed the title [WIP] Spark 3.4: BaseMetadataTable should respect the base table properties [WIP] Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Aug 30, 2023
@szehon-ho
Copy link
Copy Markdown
Member

I think this option works for me, @aokolnychyi for any concerns?

@roryqi roryqi marked this pull request as ready for review August 31, 2023 22:55
@roryqi roryqi changed the title [WIP] Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Aug 31, 2023
@roryqi roryqi changed the title Spark 3.4: The properties of BaseMetadataTable should respect the properties of BaseTable Spark3.4, Core: The properties of BaseMetadataTable should respect the properties of BaseTable Sep 1, 2023
@roryqi roryqi changed the title Spark3.4, Core: The properties of BaseMetadataTable should respect the properties of BaseTable Core: The properties of BaseMetadataTable should respect the properties of BaseTable Sep 1, 2023
@roryqi roryqi changed the title Core: The properties of BaseMetadataTable should respect the properties of BaseTable Core: Properties of BaseMetadataTable should respect properties of BaseTable Sep 1, 2023
@roryqi roryqi changed the title Core: Properties of BaseMetadataTable should respect properties of BaseTable Core, Spark 3.4: Properties of BaseMetadataTable should respect properties of BaseTable Sep 2, 2023
@szehon-ho
Copy link
Copy Markdown
Member

@jerqi sorry about this, I was re-thinking about it and not 100% sure it makes sense, as there are some table properties that look weird on all metadata tables.

What do you think about first trying to solve it in [SparkBinPackPositionDeletesRewriter] (using ds write options) (https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java) I think that's a bit less impact.

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 6, 2023

@jerqi sorry about this, I was re-thinking about it and not 100% sure it makes sense, as there are some table properties that look weird on all metadata tables.

What do you think about first trying to solve it in [SparkBinPackPositionDeletesRewriter] (using ds write options) (https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkBinPackPositionDeletesRewriter.java) I think that's a bit less impact.

  1. I can raise a this solution pr.

  2. And the properties of PositionDeletesTable is empty.
    There are many properties will be useful to write data.

write.delete.orc.compression-codec
write.delete.parquet.row-group-check-max-record-count
write.parquet.bloom-filter-enable
......

We need to set every write option for the data frame. It may be difficult for users to use.
What about my second solution? We only modify the properties of PositionDeletesTable.
Moreover, We can filter out the unreasonable table property. WDYT?

@szehon-ho
Copy link
Copy Markdown
Member

@jerqi ok that sounds good with me, let's go with that approach. Mostly write properties I assume?

Writing position deletes through position_deletes table is not exposed to user, as we dont support that outside rewrite_position_deletes. But I see your point that doing it inside the action makes the code harder as there's quite a few possible properties.

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 7, 2023

Mostly write properties

Maybe we should include read properties and commit properties, too, let me see what properties the rewrite_position_deletes used.

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 12, 2023

Mostly write properties

Maybe we should include read properties and commit properties, too, let me see what properties the rewrite_position_deletes used.

Yes, you're right. We need mostly write properties.

@roryqi roryqi changed the title Core, Spark 3.4: Properties of BaseMetadataTable should respect properties of BaseTable Core, Spark 3.4: Properties of PositionDeletesTable should respect properties of BaseTable Sep 12, 2023
@roryqi roryqi changed the title Core, Spark 3.4: Properties of PositionDeletesTable should respect properties of BaseTable Core, Spark 3.4: Write properties of PositionDeletesTable should respect properties of BaseTable Sep 12, 2023
// these properties should respect the ones of BaseTable.
return Collections.unmodifiableMap(
table().properties().entrySet().stream()
.filter(entry -> entry.getKey().startsWith("write."))
Copy link
Copy Markdown
Contributor Author

@roryqi roryqi Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that all the write properties are needed for our PositionDeletesRewriteAction. So I choose to match the key prefix here instead of copying some specific entries.

@roryqi roryqi changed the title Core, Spark 3.4: Write properties of PositionDeletesTable should respect properties of BaseTable Core, Spark 3.4: Write properties of PositionDeletesTable should respect ones of BaseTable Sep 12, 2023
@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 12, 2023

@szehon-ho Could you review this pr again if you have time?

@roryqi roryqi requested a review from szehon-ho September 18, 2023 01:53
@szehon-ho szehon-ho merged commit faab813 into apache:master Sep 19, 2023
@szehon-ho
Copy link
Copy Markdown
Member

Merged, thanks @jerqi . Can you please update the pr description to make it clearer what problem we are fixing? And do we need to make the fix for other Spark versions?

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 19, 2023

Merged, thanks @jerqi . Can you please update the pr description to make it clearer what problem we are fixing? And do we need to make the fix for other Spark versions?

  1. ok, I will update the pr description.
  2. ok, I will raise another pr to fix Spark 3.5.

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 19, 2023

@szehon-ho I have raised a new pr for Spark 3.5. https://github.com/apache/iceberg/pull/8584/files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants