-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17495] [SQL] Support date, timestamp and interval types in Hive hash #17062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17495] [SQL] Support date, timestamp and interval types in Hive hash #17062
Conversation
|
ok to test |
|
Test build #73459 has finished for PR 17062 at commit
|
332475c to
e050a50
Compare
|
Jenkins retest this please |
|
Test build #73472 has finished for PR 17062 at commit
|
|
@gatorsmile : can you please review this PR ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected values computed over hive 1.2.1.
Here are the queries in Hive:
SELECT HASH( CAST( "2017-01-01" AS DATE) );
SELECT HASH( CAST( "0000-01-01" AS DATE) );
SELECT HASH( CAST( "9999-12-31" AS DATE) );
SELECT HASH( CAST( "1970-01-01" AS DATE) );
SELECT HASH( CAST( "1800-01-01" AS DATE) );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark does not allow creating Date which do not fit its spec and throws exception. Hive will not fail but fallback to null and return 0 as hash value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corresponding hive query.
select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
select HASH(CAST("2017-02-24 10:56:29.111111" AS TIMESTAMP));
select HASH(CAST("0001-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("9999-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1970-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1800-01-01 03:12:45" AS TIMESTAMP));
Note that this is with system's timezone set to UTC (export TZ=/usr/share/zoneinfo/UTC). One of the tests below was with US/Pacific timezone
select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as Date, invalid timestamp values are not allowed in Spark and it will fail. Hive will not fail but fallback to null and return 0 as hash value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT HASH ( INTERVAL '1' DAY );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT HASH ( INTERVAL '1' DAY + INTERVAL '15' HOUR );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );
|
Updated comments with the corresponding hive queries used to generate the expected outputs. |
|
Will review it tonight. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coud you add more test cases?
checkHiveHashForTimestampType("interval 0 day 0 hour 0 minute 0 second", 23273)
checkHiveHashForTimestampType("interval 0 day 0 hour", 23273)
checkHiveHashForTimestampType("interval -1 day", 3220036)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
It sounds like no test case covers nanosecond for INTERVAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does Hive deal with nanoseconds, if we divide it by 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark's CalendarInterval has precision upto microseconds while Hive can have precision upto nanoseconds. So, there is no way for us to support that in the hashing function. I have documented this in the PR.
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 84, Column 12: Assignment conversion not possible from type "long" to type "int"
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004)
at org.codehaus.janino.UnitCompiler.assignmentConversion(UnitCompiler.java:9885)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3187)
/* 084 */ childHash = org.apache.spark.sql.catalyst.expressions.HiveHashFunction.hashTimestamp(-8951713928982000L);value = (31 * value) + childHash;
e050a50 to
fd0330d
Compare
tejasapatil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR. Added more tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark's CalendarInterval has precision upto microseconds while Hive can have precision upto nanoseconds. So, there is no way for us to support that in the hashing function. I have documented this in the PR.
| intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 10:56:29.11111111", 0)) | ||
| } | ||
|
|
||
| test("hive-hash for CalendarInterval type") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive queries for all the tests below. Outputs are generated by running against Hive-1.2.1
// ----- MICROSEC -----
SELECT HASH(interval_day_time("0 0:0:0.000001") );
SELECT HASH(interval_day_time("-0 0:0:0.000001") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );
SELECT HASH(interval_day_time("0 0:0:0.000999") );
SELECT HASH(interval_day_time("-0 0:0:0.000999") );
// ----- MILLISEC -----
SELECT HASH(interval_day_time("0 0:0:0.001") );
SELECT HASH(interval_day_time("-0 0:0:0.001") );
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.999") );
SELECT HASH(interval_day_time("-0 0:0:0.999") );
// ----- SECOND -----
SELECT HASH( INTERVAL '1' SECOND);
SELECT HASH( INTERVAL '-1' SECOND);
SELECT HASH( INTERVAL '0' SECOND);
SELECT HASH( INTERVAL '2147483647' SECOND);
SELECT HASH( INTERVAL '-2147483648' SECOND);
// ----- MINUTE -----
SELECT HASH( INTERVAL '1' MINUTE);
SELECT HASH( INTERVAL '-1' MINUTE);
SELECT HASH( INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '2147483647' MINUTE);
SELECT HASH( INTERVAL '-2147483648' MINUTE);
// ----- HOUR -----
SELECT HASH( INTERVAL '1' HOUR);
SELECT HASH( INTERVAL '-1' HOUR);
SELECT HASH( INTERVAL '0' HOUR);
SELECT HASH( INTERVAL '2147483647' HOUR);
SELECT HASH( INTERVAL '-2147483648' HOUR);
// ----- DAY -----
SELECT HASH( INTERVAL '1' DAY);
SELECT HASH( INTERVAL '-1' DAY);
SELECT HASH( INTERVAL '0' DAY);
SELECT HASH( INTERVAL '106751991' DAY);
SELECT HASH( INTERVAL '-106751991' DAY);
// ----- MIX -----
SELECT HASH( INTERVAL '0' DAY );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE + INTERVAL '0' SECOND);
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );
SELECT HASH( INTERVAL '6' DAY + INTERVAL '15' HOUR );
SELECT HASH( INTERVAL '5' DAY + INTERVAL '4' HOUR + INTERVAL '8' MINUTE);
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );
SELECT HASH(interval_day_time("66 12:39:23.987") );
SELECT HASH(interval_day_time("66 12:39:23.987123") );
|
Test build #74278 has finished for PR 17062 at commit
|
|
cc @gatorsmile |
|
|
||
| // Out of range for both Hive and Spark | ||
| // Hive throws an exception. Spark overflows and returns wrong output | ||
| // checkHiveHashForIntervalType("interval 9999999999 day", -4767228) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fix it before merging this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of Spark SQL, the query with fails with exception (see below). However for the test case since I am by-passing and creating raw interval object which does not go through that check
scala> hc.sql("SELECT interval 9999999999 day ").show
org.apache.spark.sql.catalyst.parser.ParseException:
Error parsing interval string: day 9999999999 outside range [-106751991, 106751991](line 1, pos 16)
== SQL ==
SELECT interval 9999999999 day
scala> df.select("INTERVAL 9999999999 day").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`INTERVAL 9999999999 day`' given input columns: [key, value];;
'Project ['INTERVAL 9999999999 day]
|
|
||
| // Out of range for both Hive and Spark | ||
| // Hive throws an exception. Spark overflows and returns wrong output | ||
| // checkHiveHashForIntervalType("interval 9999999999 day", -4767228) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit sounds incorrect. The same to the other cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
I did the same check. The results of Hive 2.0 exactly match the hard-coded values. |
|
@tejasapatil Thanks for your work! Could you add a comment in the LGTM pending test. |
|
@gatorsmile : Thanks for the review :) Added method doc for |
|
Test build #74414 has finished for PR 17062 at commit
|
|
Test build #74416 has finished for PR 17062 at commit
|
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
How was this patch tested?
Added unit tests