@kevinjqliu kevinjqliu commented Jan 17, 2026

Which issue does this PR close?

What changes are included in this PR?

We made some upgrades to the Spark Dockerfile in pyiceberg (apache/iceberg-python#2540), which I think the Rust Dockerfile was originally copied from. This PR ports those changes over:

  • Use apache/spark as the base image (should be faster than downloading Spark from the Apache CDN)
  • Upgrade to Spark 4.0
  • Use Spark Connect for provisioning
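
For readers unfamiliar with Spark Connect, here is a minimal sketch of how a provisioning script might open a session against a Spark Connect server. It is not the script from this PR: the helper names `spark_connect_url` and `get_session` and the `localhost` host are assumptions made for illustration. The `sc://` scheme, the default port 15002, and `SparkSession.builder.remote(...)` are standard PySpark (3.4 or later).

```python
def spark_connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect URL; 15002 is the default Spark Connect port."""
    # Host is a placeholder; the real container hostname may differ.
    return f"sc://{host}:{port}"


def get_session(url: str):
    """Open a Spark Connect session (requires pyspark with the connect extras)."""
    # Imported lazily so the URL helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


# Example usage against a running Spark Connect server:
#   spark = get_session(spark_connect_url())
#   spark.sql("SELECT 1 AS ok").show()
```

With this approach the client only needs a thin PySpark install, and the heavy Spark runtime stays inside the container.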

Are these changes tested?

Yes

Comment on lines -21 to -24
# The configuration is important, otherwise we get many small
# parquet files with a single row. When a positional delete
# hits the Parquet file with one row, the parquet file gets
# dropped instead of having a merge-on-read delete file.
@kevinjqliu (author) commented:

.coalesce(1) below has the same effect
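
To illustrate why a single output partition matters here, the following is a toy model in plain Python, not Spark or Iceberg code. It only mirrors the behaviour described in the removed comment: deleting the sole row of a data file drops the file, while deleting one row of a larger file leaves a positional delete. The names `DataFile`, `write`, and `delete_row` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    rows: list
    deleted_positions: set = field(default_factory=set)


def write(rows, num_partitions):
    """Spread rows round-robin and emit one data file per non-empty partition."""
    parts = [rows[i::num_partitions] for i in range(num_partitions)]
    return [DataFile(p) for p in parts if p]


def delete_row(files, value):
    """Delete the first matching row; return (surviving_files, outcome)."""
    for f in files:
        if value in f.rows:
            if len(f.rows) == 1:
                # The file's only row is gone, so the whole file is dropped.
                return [x for x in files if x is not f], "file dropped"
            # Otherwise record the row position, as a positional delete would.
            f.deleted_positions.add(f.rows.index(value))
            return files, "positional delete"
    return files, "no match"
```

In this model, `write(rows, 1)` plays the role of `.coalesce(1)`: all rows land in one file, so a later delete produces a positional delete rather than removing the file.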

@liurenjie1024 liurenjie1024 left a comment

Thanks @kevinjqliu for this fix!

@liurenjie1024

We need to resolve conflicts first.

@kevinjqliu kevinjqliu merged commit 6270ce4 into apache:main Jan 19, 2026
17 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/fix-spark-docker branch January 19, 2026 02:51

Development

Successfully merging this pull request may close these issues.

ci broken: spark 3.6.7 404s
