[SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly #28156

juliuszsompolski · 2020-04-08T18:12:47Z

What changes were proposed in this pull request?

CliSuite.runCliWithin was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr.
On the example of Single command with --database test:
In

    runCliWithin(2.minute)(
      "CREATE DATABASE hive_db_test;"
        -> "",
      "USE hive_test;"
        -> "",
      "CREATE TABLE hive_test(key INT, val STRING);"
        -> "",
      "SHOW TABLES;"
        -> "hive_test"
    )

It was looking for lines containing "", "", "" and then "hive_test".
However, the string "hive_test" was contained in "hive_test_db", and hence:

2020-04-08 17:53:12,752 INFO  CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO  CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO  CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO  CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO  CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO  CliSuite - Found all expected output.

Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table.

I make the output checking more robust to not match on unexpected output, by making it check the echo of query output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly.

Why are the changes needed?

org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before.

[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)

juliuszsompolski · 2020-04-08T18:17:43Z

cc @liancheng @gatorsmile

juliuszsompolski · 2020-04-08T18:17:54Z

@CJStuart

SparkQA · 2020-04-08T18:26:45Z

Test build #120981 has finished for PR 28156 at commit 13fd2be.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-08T18:33:30Z

Test build #120982 has finished for PR 28156 at commit b0b4f42.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2020-04-08T19:19:37Z

cc @wangyum too.

SparkQA · 2020-04-08T19:29:46Z

Test build #120983 has finished for PR 28156 at commit 4beb8d2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-08T19:41:51Z

Test build #120984 has finished for PR 28156 at commit 61f8566.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

CJStuart · 2020-04-09T08:19:20Z

sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala

    } finally {
-      process.destroy()
+      if (!process.waitFor(1, MINUTES)) {
+        log.warn("spark-sql did not exit. Killing process.")


If the SQL repl keeps waiting for the next input this will always happen.

We are closing the stdin of the process, so it will exit.

Nevermind, then

juliuszsompolski · 2020-04-09T08:20:18Z

@wangyum I pushed one more change: wait for the CLI process instead of killing it. Fail the test if it doesn't end gracefully. We are closing the stdin of the process, so it will exit.
Now in many tests we do e.g.:

      "DROP TABLE t1;"
        -> "",
      "DROP TABLE sourceTable;"
        -> ""

at the end.
The empty line match at the end may still match on anything for that last drop table, because it actually finishes running. Let's wait and give it a chance to finish.

juliuszsompolski · 2020-04-09T08:33:00Z

Hm. I still have some problems with stdout lines sometimes cut in half:

20/04/09 10:31:05.541 Thread-3 INFO CliSuite: 2020-04-09 10:31:05.54 - stdout> spark-sql> LOAD DATA LOCAL INPATH 'file:/home/julek/dev/spark3/sql/hive-thriftserver/target/scala-2.12/test-classes/data/files/small_kv.txt' OVERWRIT 
20/04/09 10:31:05.542 Thread-3 INFO CliSuite: 2020-04-09 10:31:05.542 - stdout> E INTO TABLE hive_test;

SparkQA · 2020-04-09T09:06:20Z

Test build #121023 has finished for PR 28156 at commit 21f06ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-09T09:17:19Z

Test build #121021 has finished for PR 28156 at commit 175432e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

juliuszsompolski · 2020-04-09T09:30:49Z

@wangyum I pushed another change to cut some longer lines in half, and to match for first 60 bytes of query echo lines. Running multiple times it was still sometimes flaky for me because some long lines would be cut by the output processor...
I believe that now it runs stable for me.

SparkQA · 2020-04-09T09:39:50Z

Test build #121024 has finished for PR 28156 at commit 48cf362.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-09T10:01:05Z

Test build #121025 has finished for PR 28156 at commit 9bbfec0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-04-12T20:28:52Z

Retest this please.

SparkQA · 2020-04-12T21:08:50Z

Test build #121152 has finished for PR 28156 at commit 9bbfec0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-04-15T06:51:01Z

I've observed high possibility of failing CliSuite on PR builder and this is great! Thanks for working on this.

Would CliSuite be no longer flaky without #28055? I'd like to see relationship between this and #28055 - if this patch works then probably #28055 may not be needed.

juliuszsompolski · 2020-04-16T12:03:21Z

@HeartSaVioR Thanks!
I'm not entirely sure of the interaction of #28055 with this...
Before the fix in this PR, the CLI would have often be forcefully killed before/while doing the cleanup after the test. If any, I would expect that having a separate metastore dir for every test would make it less flaky - in case the CLI was killed in the middle of some metastore operation.
On the other hand, maybe recreating a new metastore for every test just made it timeout more often, as it had to do more work. Or made it output more lines to stderr, which were matched against "" lines in output matcher, making the output matching more flaky?
Or these two might be orthogonal.

HeartSaVioR · 2020-04-19T22:22:43Z

Looks like two patches are orthogonal. Personally I feel safer if we try to isolate the env. (sure, including metastore) per test (not per suite), but yes it would bring longer test duration and possibly reach a timeout. If both patches could make sure the test suite becomes stabilized, then this patch is the thing we can merge first (as it brings less impact on test), and revisit other if the test suite becomes flaky again.

HeartSaVioR · 2020-04-19T22:25:30Z

This is a huge pain on existing PRs as it fails by high chance. Could some committer take a look again and deal with this? Thanks!

juliuszsompolski · 2020-04-23T15:49:38Z

@dongjoon-hyun @wangyum do you have extra comments, or would you think this could be merged?

HyukjinKwon · 2020-04-24T10:10:42Z

retest this please

SparkQA · 2020-04-24T11:14:12Z

Test build #121751 has finished for PR 28156 at commit 9bbfec0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-27T04:27:53Z

thanks, merging to master/3.0!

…uite doesn't match results correctly ### What changes were proposed in this pull request? `CliSuite.runCliWithin` was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr. On the example of `Single command with --database` test: In ``` runCliWithin(2.minute)( "CREATE DATABASE hive_db_test;" -> "", "USE hive_test;" -> "", "CREATE TABLE hive_test(key INT, val STRING);" -> "", "SHOW TABLES;" -> "hive_test" ) ``` It was looking for lines containing "", "", "" and then "hive_test". However, the string "hive_test" was contained in "hive_test_db", and hence: ``` 2020-04-08 17:53:12,752 INFO CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172 2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 0: "" 2020-04-08 17:53:12,765 INFO CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test; 2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 1: "" 2020-04-08 17:53:17,688 INFO CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted 2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 2: "" 2020-04-08 17:53:18,069 INFO CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds 2020-04-08 17:53:18,087 INFO CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test; 2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 3: "hive_test" 2020-04-08 17:53:21,742 INFO CliSuite - Found all expected output. ``` Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table. I make the output checking more robust to not match on unexpected output, by making it check the echo of query output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly. ### Why are the changes needed? org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before. ``` [info] CliSuite: [info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds) [info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds) [info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds) [info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds) [info] - Simple commands (16 seconds, 238 milliseconds) [info] - Single command with -e (9 seconds, 967 milliseconds) [info] - Single command with --database (21 seconds, 205 milliseconds) [info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds) [info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds) [info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds) [info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds) [info] - list jars (9 seconds, 520 milliseconds) [info] - list jar <jarfile> (9 seconds, 277 milliseconds) [info] - list files (9 seconds, 828 milliseconds) [info] - list file <filepath> (9 seconds, 646 milliseconds) [info] - apply hiveconf from cli command (9 seconds, 469 milliseconds) [info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds) [info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds) [info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds) [info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds) [info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds) [info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds) [info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds) [info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds) ``` Closes #28156 from juliuszsompolski/SPARK-31388. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 560bd54) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

juliuszsompolski added 2 commits April 8, 2020 20:00

fix CLISuite

f9a7c88

fix log message.

13fd2be

stray

b0b4f42

juliuszsompolski added 2 commits April 8, 2020 20:40

fix compilation

4beb8d2

fix for multiline queries

61f8566

dongjoon-hyun changed the title ~~[SPARK-31388] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly~~ [SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly Apr 8, 2020

dongjoon-hyun added SQL TESTS labels Apr 8, 2020

wangyum approved these changes Apr 9, 2020

View reviewed changes

juliuszsompolski added 2 commits April 9, 2020 10:14

wait for CLI to exit before killing

175432e

fail in the process didn't exit

21f06ad

CJStuart reviewed Apr 9, 2020

View reviewed changes

cut long lines

48cf362

match first 60 characters of each query echo line

9bbfec0

CJStuart approved these changes Apr 15, 2020

View reviewed changes

HeartSaVioR mentioned this pull request Apr 25, 2020

[SPARK-31554][SQL][TESTS] Retry flaky tests from CliSuite #28329

Closed

cloud-fan closed this in 560bd54 Apr 27, 2020

[SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly #28156

[SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly #28156

Uh oh!

Conversation

juliuszsompolski commented Apr 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

juliuszsompolski commented Apr 8, 2020

Uh oh!

juliuszsompolski commented Apr 8, 2020

Uh oh!

SparkQA commented Apr 8, 2020

Uh oh!

SparkQA commented Apr 8, 2020

Uh oh!

gatorsmile commented Apr 8, 2020

Uh oh!

SparkQA commented Apr 8, 2020

Uh oh!

SparkQA commented Apr 8, 2020

Uh oh!

CJStuart Apr 9, 2020

Choose a reason for hiding this comment

Uh oh!

CJStuart Apr 9, 2020

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented Apr 9, 2020

Uh oh!

juliuszsompolski commented Apr 9, 2020

Uh oh!

SparkQA commented Apr 9, 2020

Uh oh!

SparkQA commented Apr 9, 2020

Uh oh!

juliuszsompolski commented Apr 9, 2020

Uh oh!

SparkQA commented Apr 9, 2020

Uh oh!

SparkQA commented Apr 9, 2020

Uh oh!

dongjoon-hyun commented Apr 12, 2020

Uh oh!

SparkQA commented Apr 12, 2020

Uh oh!

HeartSaVioR commented Apr 15, 2020

Uh oh!

juliuszsompolski commented Apr 16, 2020

Uh oh!

HeartSaVioR commented Apr 19, 2020

Uh oh!

HeartSaVioR commented Apr 19, 2020

Uh oh!

juliuszsompolski commented Apr 23, 2020

Uh oh!

HyukjinKwon commented Apr 24, 2020

Uh oh!

SparkQA commented Apr 24, 2020

Uh oh!

cloud-fan commented Apr 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

juliuszsompolski commented Apr 8, 2020 •

edited

Loading