java.lang.IllegalStateException: Already closed file for partition #508

@mykolasmith

Hi, I am trying to write a Spark DataFrame to Iceberg whose rows cross an hourly partition boundary (i.e. the DataFrame contains rows from more than one hour). I expected a separate file to be committed for each partition. Instead, I am receiving this error:

19/09/30 16:30:52 INFO CodecPool: Got brand-new compressor [.gz]
19/09/30 16:30:52 WARN Writer: Duplicate key: [436073] == [436073]
19/09/30 16:30:52 ERROR Utils: Aborting task
java.lang.IllegalStateException: Already closed file for partition: ec_event_time_hour=2019-09-30-17
	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:389)
	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:350)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$2(WriteToDataSourceV2Exec.scala:118)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:116)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.$anonfun$doExecute$2(WriteToDataSourceV2Exec.scala:67)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Any idea what could be happening here? Do I need to split the DataFrame by hour so that each resulting DataFrame only contains rows for a single partition?

EDIT: Running the iceberg-spark-runtime built from master.
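
To make the question above concrete, here is a minimal plain-Python sketch of one way such an error could arise. It is a toy model, not Iceberg's code: it assumes the writer keeps only one partition file open at a time, closes it when the partition key changes, and refuses to reopen a key it has already closed. The names `ClusteredWriter`, `hour_key`, and `write_all` are hypothetical, and the hour key format simply mirrors the `2019-09-30-17` value in the log.

```python
from datetime import datetime


class ClusteredWriter:
    """Toy model (an assumption, not Iceberg's implementation) of a writer
    that keeps a single partition file open at a time."""

    def __init__(self):
        self.current_key = None   # partition whose file is currently open
        self.completed = set()    # partitions whose files were already closed
        self.files = {}           # partition key -> rows written to its file

    def write(self, key, row):
        if key != self.current_key:
            # Switching partitions closes the file that was open.
            if self.current_key is not None:
                self.completed.add(self.current_key)
            # A closed partition cannot be reopened in this model.
            if key in self.completed:
                raise RuntimeError(f"Already closed file for partition: {key}")
            self.current_key = key
        self.files.setdefault(key, []).append(row)


def hour_key(ts):
    # Hourly partition value, formatted like the one in the log above.
    return ts.strftime("%Y-%m-%d-%H")


def write_all(rows):
    writer = ClusteredWriter()
    for ts, payload in rows:
        writer.write(hour_key(ts), payload)
    return writer.files


# Rows for hour 17 are interrupted by a row for hour 18.
rows = [
    (datetime(2019, 9, 30, 17, 5), "a"),
    (datetime(2019, 9, 30, 18, 1), "b"),
    (datetime(2019, 9, 30, 17, 59), "c"),
]

try:
    write_all(rows)
    interleaved_error = None
except RuntimeError as exc:
    interleaved_error = str(exc)

# Clustering the same rows by partition key lets one pass write both files.
clustered_files = write_all(sorted(rows, key=lambda r: hour_key(r[0])))
```

In this model the interleaved input fails with the same message as in the log, while the clustered input produces one file per hour without splitting the data into separate DataFrames. If the real writer behaves similarly, the Spark-side analogue would be to cluster each task's rows by the partition column before writing, for example with `DataFrame.sortWithinPartitions`. Whether that resolves this particular report is an assumption.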
