Conversation

@tejasapatil
Contributor

What changes were proposed in this pull request?

  • Timestamp hashing is done as per TimestampWritable.hashCode() in Hive (see the sketch after this list).
  • Interval hashing is done as per HiveIntervalDayTime.hashCode(). Note that there are inherent differences in how Hive and Spark store intervals under the hood, which limits the ability to stay completely in sync with Hive's hashing function. I have explained this in the method doc.
  • The Date type was already supported. This PR adds a test for it.
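
As a rough illustration of the first bullet, here is a minimal Scala sketch of the TimestampWritable.hashCode()-style computation over Spark's microsecond-precision timestamps. It is an illustration of the approach, not the exact patch code, and it glosses over negative-timestamp edge cases:

    // Hive's TimestampWritable.hashCode() packs seconds and nanoseconds into
    // a single long and folds it into an int. Spark stores timestamps as
    // microseconds since the epoch, so the sub-second part is scaled to nanos.
    def hiveHashTimestamp(micros: Long): Int = {
      val seconds = micros / 1000000L          // whole seconds since the epoch
      val nanos = (micros % 1000000L) * 1000L  // sub-second part, in nanoseconds
      var result = seconds
      result <<= 30                            // the nanosecond part fits in 30 bits
      result |= nanos
      ((result >>> 32) ^ result).toInt         // fold the long into an int
    }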

How was this patch tested?

Added unit tests

@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Feb 25, 2017

Test build #73459 has finished for PR 17062 at commit 332475c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil force-pushed the SPARK-17495_time_related_types branch from 332475c to e050a50 Compare February 26, 2017 00:05
@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Feb 26, 2017

Test build #73472 has finished for PR 17062 at commit e050a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@gatorsmile: can you please review this PR?

@tejasapatil (Contributor Author) commented Feb 27, 2017

Expected values were computed with Hive 1.2.1.

Here are the queries in Hive:

SELECT HASH( CAST( "2017-01-01" AS DATE) );
SELECT HASH( CAST( "0000-01-01" AS DATE) );
SELECT HASH( CAST( "9999-12-31" AS DATE) );
SELECT HASH( CAST( "1970-01-01" AS DATE) );
SELECT HASH( CAST( "1800-01-01" AS DATE) );

@tejasapatil (Contributor Author)

Spark does not allow creating a Date that does not fit its spec and throws an exception. Hive will not fail but falls back to null and returns 0 as the hash value.

@tejasapatil (Contributor Author) commented Feb 27, 2017

The corresponding Hive queries:

select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
select HASH(CAST("2017-02-24 10:56:29.111111" AS TIMESTAMP));
select HASH(CAST("0001-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("9999-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1970-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1800-01-01 03:12:45" AS TIMESTAMP));

Note that this is with the system time zone set to UTC (export TZ=/usr/share/zoneinfo/UTC). One of the tests below was run with the US/Pacific time zone:

select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));

@tejasapatil (Contributor Author)

Same as with Date: invalid timestamp values are not allowed in Spark and it will fail. Hive will not fail but falls back to null and returns 0 as the hash value.

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '1' DAY );

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '1' DAY + INTERVAL '15' HOUR );

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );

@tejasapatil
Contributor Author

Updated the comments with the corresponding Hive queries used to generate the expected outputs.

@gatorsmile
Member

Will review it tonight. Thanks!

@gatorsmile (Member)

Could you add more test cases?

    checkHiveHashForTimestampType("interval 0 day 0 hour 0 minute 0 second", 23273)
    checkHiveHashForTimestampType("interval 0 day 0 hour", 23273)
    checkHiveHashForTimestampType("interval -1 day", 3220036)

@tejasapatil (Contributor Author)

added

@gatorsmile
Member

It sounds like no test case covers nanoseconds for INTERVAL.

@gatorsmile (Member)

How does Hive deal with nanoseconds? Do we divide by 1000?

@tejasapatil (Contributor Author)

Spark's CalendarInterval has precision up to microseconds while Hive can have precision up to nanoseconds, so there is no way for us to support that in the hashing function. I have documented this in the PR.
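
To make the precision gap concrete, here is a hedged Scala sketch of the HiveIntervalDayTime.hashCode()-style arithmetic applied to a microsecond-precision interval. It mirrors commons-lang's HashCodeBuilder with its default 17/37 seeds and is an illustration, not the exact patch code:

    // Hive hashes (totalSeconds: Long, nanos: Int) via HashCodeBuilder(17, 37).
    // Spark intervals carry only microseconds, so the nanos term here is always
    // a multiple of 1000 -- sub-microsecond Hive intervals can never be matched.
    def hiveHashInterval(micros: Long): Int = {
      val totalSeconds = micros / 1000000L
      val nanos = ((micros % 1000000L) * 1000L).toInt
      var result = 17 * 37 + (totalSeconds ^ (totalSeconds >>> 32)).toInt // append(long)
      result = result * 37 + nanos                                        // append(int)
      result
    }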

Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 84, Column 12: Assignment conversion not possible from type "long" to type "int"
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004)
    at org.codehaus.janino.UnitCompiler.assignmentConversion(UnitCompiler.java:9885)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3187)

/* 084 */     childHash = org.apache.spark.sql.catalyst.expressions.HiveHashFunction.hashTimestamp(-8951713928982000L);
              value = (31 * value) + childHash;
@tejasapatil tejasapatil force-pushed the SPARK-17495_time_related_types branch from e050a50 to fd0330d Compare March 9, 2017 20:24
@tejasapatil (Contributor Author) left a comment

Updated the PR. Added more tests.



intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 10:56:29.11111111", 0))
}

test("hive-hash for CalendarInterval type") {
@tejasapatil (Contributor Author)

Hive queries for all the tests below. Outputs were generated by running against Hive 1.2.1:

// ----- MICROSEC -----
SELECT HASH(interval_day_time("0 0:0:0.000001") );
SELECT HASH(interval_day_time("-0 0:0:0.000001") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );
SELECT HASH(interval_day_time("0 0:0:0.000999") );
SELECT HASH(interval_day_time("-0 0:0:0.000999") );

// ----- MILLISEC -----
SELECT HASH(interval_day_time("0 0:0:0.001") );
SELECT HASH(interval_day_time("-0 0:0:0.001") );
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.999") );
SELECT HASH(interval_day_time("-0 0:0:0.999") );

// ----- SECOND -----
SELECT HASH( INTERVAL '1' SECOND);
SELECT HASH( INTERVAL '-1' SECOND);
SELECT HASH( INTERVAL '0' SECOND);
SELECT HASH( INTERVAL '2147483647' SECOND);
SELECT HASH( INTERVAL '-2147483648' SECOND);

// ----- MINUTE -----
SELECT HASH( INTERVAL '1' MINUTE);
SELECT HASH( INTERVAL '-1' MINUTE);
SELECT HASH( INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '2147483647' MINUTE);
SELECT HASH( INTERVAL '-2147483648' MINUTE);

// ----- HOUR -----
SELECT HASH( INTERVAL '1' HOUR);
SELECT HASH( INTERVAL '-1' HOUR);
SELECT HASH( INTERVAL '0' HOUR);
SELECT HASH( INTERVAL '2147483647' HOUR);
SELECT HASH( INTERVAL '-2147483648' HOUR);

// ----- DAY -----
SELECT HASH( INTERVAL '1' DAY);
SELECT HASH( INTERVAL '-1' DAY);
SELECT HASH( INTERVAL '0' DAY);
SELECT HASH( INTERVAL '106751991' DAY);
SELECT HASH( INTERVAL '-106751991' DAY);

// ----- MIX -----
SELECT HASH( INTERVAL '0' DAY );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE + INTERVAL '0' SECOND);
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );

SELECT HASH( INTERVAL '6' DAY + INTERVAL '15' HOUR );
SELECT HASH( INTERVAL '5' DAY + INTERVAL '4' HOUR + INTERVAL '8' MINUTE);
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );
SELECT HASH(interval_day_time("66 12:39:23.987") );
SELECT HASH(interval_day_time("66 12:39:23.987123") );

@SparkQA

SparkQA commented Mar 9, 2017

Test build #74278 has finished for PR 17062 at commit fd0330d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @gatorsmile


// Out of range for both Hive and Spark
// Hive throws an exception. Spark overflows and returns wrong output
// checkHiveHashForIntervalType("interval 9999999999 day", -4767228)
@gatorsmile (Member) commented Mar 12, 2017

Should we fix it before merging this PR?

@tejasapatil (Contributor Author)

In Spark SQL, the query fails with an exception (see below). For the test case, however, I am bypassing that by creating a raw interval object directly, which does not go through that check:

scala> hc.sql("SELECT interval 9999999999 day ").show
org.apache.spark.sql.catalyst.parser.ParseException:
Error parsing interval string: day 9999999999 outside range [-106751991, 106751991](line 1, pos 16)

== SQL ==
SELECT interval 9999999999 day
scala> df.select("INTERVAL 9999999999 day").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`INTERVAL 9999999999 day`' given input columns: [key, value];;
'Project ['INTERVAL 9999999999 day]
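
For context on the [-106751991, 106751991] bound in the parser error above: a day is 86400 * 10^6 microseconds, so this is simply the largest whole-day count whose microsecond value still fits in a signed 64-bit long:

    // Why the parser's bound is 106751991 days
    val microsPerDay = 86400L * 1000000L       // 86,400,000,000
    val maxDays = Long.MaxValue / microsPerDay // 106751991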


// Out of range for both Hive and Spark
// Hive throws an exception. Spark overflows and returns wrong output
// checkHiveHashForIntervalType("interval 9999999999 day", -4767228)
@gatorsmile (Member) commented Mar 12, 2017

The unit sounds incorrect. Same for the other cases.

@tejasapatil (Contributor Author)

fixed

@gatorsmile
Member

I did the same check. The results of Hive 2.0 exactly match the hard-coded values.

@gatorsmile
Member

@tejasapatil Thanks for your work! Could you add a comment in the hash function? The caller of hash needs to check the validity of input values.

LGTM pending test.

@tejasapatil
Contributor Author

@gatorsmile: Thanks for the review :) Added a method doc for hash() with the comment as suggested.

@SparkQA

SparkQA commented Mar 13, 2017

Test build #74414 has finished for PR 17062 at commit 332686a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2017

Test build #74416 has finished for PR 17062 at commit 8a5f200.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9456688 Mar 13, 2017
@tejasapatil tejasapatil deleted the SPARK-17495_time_related_types branch March 13, 2017 14:07