Conversation

@tejasapatil
Contributor

What changes were proposed in this pull request?

  • Timestamp hashing is done as per TimestampWritable.hashCode() in Hive (see the sketch after this list).
  • Interval hashing is done as per HiveIntervalDayTime.hashCode(). Note that there are inherent differences in how Hive and Spark store intervals under the hood, which limits the ability to stay completely in sync with Hive's hashing function. I have explained this in the method doc.
  • The Date type was already supported. This PR adds a test for it.
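
As a rough illustration of the first bullet, here is a minimal Scala sketch of the TimestampWritable.hashCode()-style computation over Spark's microsecond-precision timestamps. It is an illustration of the approach, not the exact patch code, and it glosses over negative-timestamp edge cases:

    // Hive's TimestampWritable.hashCode() packs seconds and nanoseconds into
    // a single long and folds it into an int. Spark stores timestamps as
    // microseconds since the epoch, so the sub-second part is scaled to nanos.
    def hiveHashTimestamp(micros: Long): Int = {
      val seconds = micros / 1000000L          // whole seconds since the epoch
      val nanos = (micros % 1000000L) * 1000L  // sub-second part, in nanoseconds
      var result = seconds
      result <<= 30                            // the nanosecond part fits in 30 bits
      result |= nanos
      ((result >>> 32) ^ result).toInt         // fold the long into an int
    }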

How was this patch tested?

Added unit tests

@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Feb 25, 2017

Test build #73459 has finished for PR 17062 at commit 332475c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil force-pushed the SPARK-17495_time_related_types branch from 332475c to e050a50 Compare February 26, 2017 00:05
@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Feb 26, 2017

Test build #73472 has finished for PR 17062 at commit e050a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@gatorsmile: can you please review this PR?

@tejasapatil (Contributor Author) commented Feb 27, 2017

Expected values were computed with Hive 1.2.1.

Here are the queries in Hive:

SELECT HASH( CAST( "2017-01-01" AS DATE) );
SELECT HASH( CAST( "0000-01-01" AS DATE) );
SELECT HASH( CAST( "9999-12-31" AS DATE) );
SELECT HASH( CAST( "1970-01-01" AS DATE) );
SELECT HASH( CAST( "1800-01-01" AS DATE) );

@tejasapatil (Contributor Author)

Spark does not allow creating a Date that does not fit its spec and throws an exception. Hive will not fail but falls back to null and returns 0 as the hash value.

@tejasapatil (Contributor Author) commented Feb 27, 2017

The corresponding Hive queries:

select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
select HASH(CAST("2017-02-24 10:56:29.111111" AS TIMESTAMP));
select HASH(CAST("0001-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("9999-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1970-01-01 00:00:00" AS TIMESTAMP));
select HASH(CAST("1800-01-01 03:12:45" AS TIMESTAMP));

Note that this is with the system time zone set to UTC (export TZ=/usr/share/zoneinfo/UTC). One of the tests below was run with the US/Pacific time zone:

select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));

@tejasapatil (Contributor Author)

Same as with Date: invalid timestamp values are not allowed in Spark and it will fail. Hive will not fail but falls back to null and returns 0 as the hash value.

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '1' DAY );

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '1' DAY + INTERVAL '15' HOUR );

@tejasapatil (Contributor Author)

SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );

@tejasapatil
Contributor Author

Updated the comments with the corresponding Hive queries used to generate the expected outputs.

@gatorsmile
Member

Will review it tonight. Thanks!

@gatorsmile (Member)

Could you add more test cases?

    checkHiveHashForTimestampType("interval 0 day 0 hour 0 minute 0 second", 23273)
    checkHiveHashForTimestampType("interval 0 day 0 hour", 23273)
    checkHiveHashForTimestampType("interval -1 day", 3220036)

@tejasapatil (Contributor Author)

added

@gatorsmile
Member

It sounds like no test case covers nanoseconds for INTERVAL.

@gatorsmile (Member)

How does Hive deal with nanoseconds? Do we divide by 1000?

@tejasapatil (Contributor Author)

Spark's CalendarInterval has precision up to microseconds while Hive can have precision up to nanoseconds, so there is no way for us to support that in the hashing function. I have documented this in the PR.
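
To make the precision gap concrete, here is a hedged Scala sketch of the HiveIntervalDayTime.hashCode()-style arithmetic applied to a microsecond-precision interval. It mirrors commons-lang's HashCodeBuilder with its default 17/37 seeds and is an illustration, not the exact patch code:

    // Hive hashes (totalSeconds: Long, nanos: Int) via HashCodeBuilder(17, 37).
    // Spark intervals carry only microseconds, so the nanos term here is always
    // a multiple of 1000 -- sub-microsecond Hive intervals can never be matched.
    def hiveHashInterval(micros: Long): Int = {
      val totalSeconds = micros / 1000000L
      val nanos = ((micros % 1000000L) * 1000L).toInt
      var result = 17 * 37 + (totalSeconds ^ (totalSeconds >>> 32)).toInt // append(long)
      result = result * 37 + nanos                                        // append(int)
      result
    }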

Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 84, Column 12: Assignment conversion not possible from type "long" to type "int"
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004)
    at org.codehaus.janino.UnitCompiler.assignmentConversion(UnitCompiler.java:9885)
    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3187)

/* 084 */     childHash = org.apache.spark.sql.catalyst.expressions.HiveHashFunction.hashTimestamp(-8951713928982000L);
              value = (31 * value) + childHash;
@tejasapatil tejasapatil force-pushed the SPARK-17495_time_related_types branch from e050a50 to fd0330d Compare March 9, 2017 20:24
@tejasapatil (Contributor Author) left a comment

Updated the PR. Added more tests.



intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 10:56:29.11111111", 0))
}

test("hive-hash for CalendarInterval type") {
@tejasapatil (Contributor Author)

Hive queries for all the tests below. Outputs were generated by running against Hive 1.2.1:

// ----- MICROSEC -----
SELECT HASH(interval_day_time("0 0:0:0.000001") );
SELECT HASH(interval_day_time("-0 0:0:0.000001") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );
SELECT HASH(interval_day_time("0 0:0:0.000999") );
SELECT HASH(interval_day_time("-0 0:0:0.000999") );

// ----- MILLISEC -----
SELECT HASH(interval_day_time("0 0:0:0.001") );
SELECT HASH(interval_day_time("-0 0:0:0.001") );
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.999") );
SELECT HASH(interval_day_time("-0 0:0:0.999") );

// ----- SECOND -----
SELECT HASH( INTERVAL '1' SECOND);
SELECT HASH( INTERVAL '-1' SECOND);
SELECT HASH( INTERVAL '0' SECOND);
SELECT HASH( INTERVAL '2147483647' SECOND);
SELECT HASH( INTERVAL '-2147483648' SECOND);

// ----- MINUTE -----
SELECT HASH( INTERVAL '1' MINUTE);
SELECT HASH( INTERVAL '-1' MINUTE);
SELECT HASH( INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '2147483647' MINUTE);
SELECT HASH( INTERVAL '-2147483648' MINUTE);

// ----- HOUR -----
SELECT HASH( INTERVAL '1' HOUR);
SELECT HASH( INTERVAL '-1' HOUR);
SELECT HASH( INTERVAL '0' HOUR);
SELECT HASH( INTERVAL '2147483647' HOUR);
SELECT HASH( INTERVAL '-2147483648' HOUR);

// ----- DAY -----
SELECT HASH( INTERVAL '1' DAY);
SELECT HASH( INTERVAL '-1' DAY);
SELECT HASH( INTERVAL '0' DAY);
SELECT HASH( INTERVAL '106751991' DAY);
SELECT HASH( INTERVAL '-106751991' DAY);

// ----- MIX -----
SELECT HASH( INTERVAL '0' DAY );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR );
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE);
SELECT HASH( INTERVAL '0' DAY + INTERVAL '0' HOUR + INTERVAL '0' MINUTE + INTERVAL '0' SECOND);
SELECT HASH(interval_day_time("0 0:0:0.000") );
SELECT HASH(interval_day_time("0 0:0:0.000000") );

SELECT HASH( INTERVAL '6' DAY + INTERVAL '15' HOUR );
SELECT HASH( INTERVAL '5' DAY + INTERVAL '4' HOUR + INTERVAL '8' MINUTE);
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-1111113' MINUTE + INTERVAL '9898989' SECOND );
SELECT HASH(interval_day_time("66 12:39:23.987") );
SELECT HASH(interval_day_time("66 12:39:23.987123") );

@SparkQA

SparkQA commented Mar 9, 2017

Test build #74278 has finished for PR 17062 at commit fd0330d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @gatorsmile


// Out of range for both Hive and Spark
// Hive throws an exception. Spark overflows and returns wrong output
// checkHiveHashForIntervalType("interval 9999999999 day", -4767228)
@gatorsmile (Member) commented Mar 12, 2017

Should we fix it before merging this PR?

@tejasapatil (Contributor Author)

In Spark SQL, the query fails with an exception (see below). For the test case, however, I am bypassing that by creating a raw interval object directly, which does not go through that check:

scala> hc.sql("SELECT interval 9999999999 day ").show
org.apache.spark.sql.catalyst.parser.ParseException:
Error parsing interval string: day 9999999999 outside range [-106751991, 106751991](line 1, pos 16)

== SQL ==
SELECT interval 9999999999 day
scala> df.select("INTERVAL 9999999999 day").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`INTERVAL 9999999999 day`' given input columns: [key, value];;
'Project ['INTERVAL 9999999999 day]
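
For context on the [-106751991, 106751991] bound in the parser error above: a day is 86400 * 10^6 microseconds, so this is simply the largest whole-day count whose microsecond value still fits in a signed 64-bit long:

    // Why the parser's bound is 106751991 days
    val microsPerDay = 86400L * 1000000L       // 86,400,000,000
    val maxDays = Long.MaxValue / microsPerDay // 106751991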


// Out of range for both Hive and Spark
// Hive throws an exception. Spark overflows and returns wrong output
// checkHiveHashForIntervalType("interval 9999999999 day", -4767228)
@gatorsmile (Member) commented Mar 12, 2017

The unit sounds incorrect. Same for the other cases.

@tejasapatil (Contributor Author)

fixed

@gatorsmile
Member

I did the same check. The results of Hive 2.0 exactly match the hard-coded values.

@gatorsmile
Member

@tejasapatil Thanks for your work! Could you add a comment in the hash function? The caller of hash needs to check the validity of input values.

LGTM pending test.

@tejasapatil
Contributor Author

@gatorsmile: Thanks for the review :) Added a method doc for hash() with the comment as suggested.

@SparkQA

SparkQA commented Mar 13, 2017

Test build #74414 has finished for PR 17062 at commit 332686a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2017

Test build #74416 has finished for PR 17062 at commit 8a5f200.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 9456688 Mar 13, 2017
@tejasapatil tejasapatil deleted the SPARK-17495_time_related_types branch March 13, 2017 14:07