Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

Apache ORC 1.4.1 was released yesterday.

Along with ORC-233 (Allow `orc.include.columns` to be empty), this release contains several important fixes.
This PR updates the Apache ORC dependency to the latest version, 1.4.1.

How was this patch tested?

Passes the existing Jenkins tests.

@SparkQA

SparkQA commented Oct 17, 2017

Test build #82853 has finished for PR 19521 at commit 50ec007.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @gatorsmile and @cloud-fan .

This will remove the regression affecting the on-going ORC PRs.

Could you review this?

@cloud-fan
Contributor

Looks good: no new dependencies introduced, just an upgrade. cc @srowen to double-check. Thanks!

@dongjoon-hyun
Member Author

Thank you for review, @cloud-fan !

@gatorsmile
Member

Also LGTM

Regarding the test case you posted, does Parquet return null or empty string?

@gatorsmile
Member

We can save an empty DataFrame as an ORC table, but we are unable to read it back from the table.

      val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
      val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
      dfNoCols.write.format("orc").saveAsTable("t")
      spark.sql("select 1 from t").show()

This is not related to this upgrade, but you might be interested in this.

@dongjoon-hyun
Member Author

Thank you for review, @gatorsmile .

  1. The test case was added in #15898 (SPARK-18457, "ORC and other columnar formats using HiveShim read all columns when doing a simple count"). I guess Parquet returns null, but we should have explicit test cases. I will try to extend that test case for Parquet next time.
  2. Thanks for bringing that up. Yes, we can resolve that empty ORC file issue, SPARK-15474 (ORC data source fails to write and read back empty dataframe), with the new ORC source by creating an empty file with the correct schema instead of `struct<>`.

BTW, I've linked all related ORC issues into SPARK-20901 and am working on it. You can monitor ORC progress there.
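A minimal, Spark-free sketch of the idea in item 2 above (the `Field` and `FileFooter` types are hypothetical stand-ins, not Spark's or ORC's actual writer API): preserve the declared schema when writing zero rows, so a reader recovers `a: int` rather than an empty `struct<>`.

```scala
// Hypothetical stand-ins for a columnar file's footer metadata.
case class Field(name: String, dataType: String)
case class FileFooter(schema: Seq[Field], rowCount: Long)

// Buggy behavior (roughly SPARK-15474): an empty write drops the schema,
// so a subsequent read sees struct<> and fails to infer the columns.
def writeEmptyDroppingSchema(schema: Seq[Field]): FileFooter =
  FileFooter(Seq.empty, 0L)

// Fixed behavior: keep the declared schema even when there are zero rows.
def writeEmptyKeepingSchema(schema: Seq[Field]): FileFooter =
  FileFooter(schema, 0L)

val schema = Seq(Field("a", "int"))
assert(writeEmptyDroppingSchema(schema).schema.isEmpty)  // schema lost
assert(writeEmptyKeepingSchema(schema).schema == schema) // schema preserved
```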

@gatorsmile
Member

SPARK-15474 is zero row. The above case is zero column. Are they the same issues?

@dongjoon-hyun
Member Author

Oh, I confused it with what I've been watching these days.

For your example, Parquet doesn't support it either. We may create an issue for both Parquet and ORC on empty schemas.

scala> val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
scala> val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
scala> dfNoCols.write.format("parquet").saveAsTable("px")
17/10/18 05:46:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema {
}

@HyukjinKwon
Member

LGTM too BTW.

@HyukjinKwon
Member

The empty-schema path is probably related to this, IIRC (not double-checked):

      case oi: StructObjectInspector if oi.getAllStructFieldRefs.size() == 0 =>
        logInfo(
          s"ORC file $path has empty schema, it probably contains no rows. " +
            "Trying to read another ORC file to figure out the schema.")
        false
      case _ => true
    }

@dongjoon-hyun
Member Author

Thank you for review, @HyukjinKwon .

@gatorsmile
Member

cc @srowen @rxin

@rxin
Contributor

rxin commented Oct 18, 2017

LGTM

@dongjoon-hyun
Member Author

Thank you, @rxin !

@cloud-fan
Contributor

Thanks, merging to master!

@asfgit asfgit closed this in 6f1d0de Oct 19, 2017
@dongjoon-hyun
Member Author

Thank you all for review and merge!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22300 branch October 19, 2017 14:33