Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

Apache ORC 1.4.1 was released yesterday.

Along with ORC-233 (Allow `orc.include.columns` to be empty), this release contains several important fixes.
This PR updates the Apache ORC dependency to the latest version, 1.4.1.

How was this patch tested?

Passes the existing Jenkins tests.

@SparkQA

SparkQA commented Oct 17, 2017

Test build #82853 has finished for PR 19521 at commit 50ec007.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @gatorsmile and @cloud-fan .

This will remove the regression affecting the on-going ORC PRs.

Could you review this?

@cloud-fan
Contributor

Looks good: no new dependencies introduced, just an upgrade. cc @srowen to double-check. Thanks!

@dongjoon-hyun
Member Author

Thank you for review, @cloud-fan !

@gatorsmile
Member

Also LGTM

Regarding the test case you posted, does Parquet return null or empty string?

@gatorsmile
Member

We can save an empty DataFrame as an ORC table, but we are unable to read it back from the table.

      val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
      val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
      dfNoCols.write.format("orc").saveAsTable("t")
      spark.sql("select 1 from t").show()

This is not related to this upgrade, but you might be interested in this.

@dongjoon-hyun
Member Author

Thank you for review, @gatorsmile .

  1. The test case was added in #15898 (SPARK-18457, "ORC and other columnar formats using HiveShim read all columns when doing a simple count"). I guess Parquet returns null, but we should have explicit test cases. I will try to extend that test case for Parquet next time.
  2. Thanks for bringing that up. Yes, we can resolve that empty ORC file issue, SPARK-15474 (ORC data source fails to write and read back empty dataframe), with the new ORC source by creating an empty file with the correct schema instead of `struct<>`.

BTW, I've linked all related ORC issues into SPARK-20901 and am working on it. You can monitor ORC progress there.
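A minimal, Spark-free sketch of the idea in item 2 above (the `Field` and `FileFooter` types are hypothetical stand-ins, not Spark's or ORC's actual writer API): preserve the declared schema when writing zero rows, so a reader recovers `a: int` rather than an empty `struct<>`.

```scala
// Hypothetical stand-ins for a columnar file's footer metadata.
case class Field(name: String, dataType: String)
case class FileFooter(schema: Seq[Field], rowCount: Long)

// Buggy behavior (roughly SPARK-15474): an empty write drops the schema,
// so a subsequent read sees struct<> and fails to infer the columns.
def writeEmptyDroppingSchema(schema: Seq[Field]): FileFooter =
  FileFooter(Seq.empty, 0L)

// Fixed behavior: keep the declared schema even when there are zero rows.
def writeEmptyKeepingSchema(schema: Seq[Field]): FileFooter =
  FileFooter(schema, 0L)

val schema = Seq(Field("a", "int"))
assert(writeEmptyDroppingSchema(schema).schema.isEmpty)  // schema lost
assert(writeEmptyKeepingSchema(schema).schema == schema) // schema preserved
```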

@gatorsmile
Member

SPARK-15474 is zero row. The above case is zero column. Are they the same issues?

@dongjoon-hyun
Member Author

Oh, I confused it with what I've been watching these days.

For your example, Parquet doesn't support it either. We may create an issue for both Parquet and ORC on empty schemas.

scala> val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
scala> val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
scala> dfNoCols.write.format("parquet").saveAsTable("px")
17/10/18 05:46:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema {
}

@HyukjinKwon
Member

LGTM too BTW.

@HyukjinKwon
Member

The empty-schema path is probably related to this, IIRC (not double-checked):

      case oi: StructObjectInspector if oi.getAllStructFieldRefs.size() == 0 =>
        logInfo(
          s"ORC file $path has empty schema, it probably contains no rows. " +
            "Trying to read another ORC file to figure out the schema.")
        false
      case _ => true
    }

@dongjoon-hyun
Member Author

Thank you for review, @HyukjinKwon .

@gatorsmile
Member

cc @srowen @rxin

@rxin
Contributor

rxin commented Oct 18, 2017

LGTM

@dongjoon-hyun
Member Author

Thank you, @rxin !

@cloud-fan
Contributor

Thanks, merging to master!

@asfgit asfgit closed this in 6f1d0de Oct 19, 2017
@dongjoon-hyun
Member Author

Thank you all for review and merge!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22300 branch October 19, 2017 14:33