
Conversation

@yintengfei

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

petermaxlee and others added 30 commits August 19, 2016 18:14
## What changes were proposed in this pull request?
This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:

- indexing
- array creation
- size
- array_contains
- sort_array

## How was this patch tested?
The patch itself is about adding tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14708 from petermaxlee/SPARK-17149.

(cherry picked from commit a117afa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…rals

## What changes were proposed in this pull request?

Modifies error message for numeric literals to
Numeric literal <literal> does not fit in range [min, max] for type <T>

## How was this patch tested?

Fixed up the error messages for literals.sql in SQLQueryTestSuite and re-ran via sbt. Also fixed up the error messages in ExpressionParserSuite.

Author: Srinath Shankar <srinath@databricks.com>

Closes #14721 from srinathshankar/sc4296.

(cherry picked from commit ba1737c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?
This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables.

## How was this patch tested?
Added a test case in LogicalPlanToSQLSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14709 from petermaxlee/SPARK-17150.

(cherry picked from commit 45d40d9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ntics of MultiInstanceRelation

## What changes were proposed in this pull request?

Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of the object with unique expression IDs. The current `LogicalRelation.newInstance()` can cause failures when doing a self-join.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14682 from viirya/fix-localrelation.

(cherry picked from commit 31a0155)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… and allow multiple aggregates per column

## What changes were proposed in this pull request?
This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg functions. Even though the signature accepts a vararg of pairs, the underlying implementation turns the seq into a map, which is neither order-preserving nor able to hold multiple aggregates per column.

This change also lets users run multiple different aggregations on a single column, e.g.
```
agg("age" -> "max", "age" -> "count")
```
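
For context, a complete call would look like the following minimal sketch (it assumes a DataFrame `df` with `name` and `age` columns):

```scala
// Runs two different aggregations on the same column, in the order given
df.groupBy("name").agg("age" -> "max", "age" -> "count")
```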

## How was this patch tested?
Added a test case in DataFrameAggregateSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14697 from petermaxlee/SPARK-17124.

(cherry picked from commit 9560c8d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…doesn't exist in dependent module

## What changes were proposed in this pull request?

Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found.  E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime".  This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file.

## How was this patch tested?
Used spark-submit with the --packages option for a package published locally with no default conf, and for a package resolved from Maven Central.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.

(cherry picked from commit 9f37d4e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
## What changes were proposed in this pull request?

Ignore temp files generated by `check-cran.sh`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #14740 from mengxr/R-gitignore.

(cherry picked from commit ab71434)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…ings.

This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by CRAN-check.

One remaining warning concerns the doc for R `stats::glm` exported in SparkR. To mute that warning, we would have to also document all arguments of that non-SparkR function.

Some previous conversation is in #14558.

R unit tests and the `check-cran.sh` script (with no-test).

Author: Junyang Qian <junyangq@databricks.com>

Closes #14705 from junyangq/SPARK-16508-master.

(cherry picked from commit 01401e9)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
…ULL) OVER` correctly

## What changes were proposed in this pull request?

Currently, the `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that pass, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.

**Before**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```

**After**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
+----------------------------------------------------------------------------------------------+
|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----------------------------------------------------------------------------------------------+
|                                                                                             0|
+----------------------------------------------------------------------------------------------+
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14689 from dongjoon-hyun/SPARK-17098.

(cherry picked from commit 91c2397)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
## What changes were proposed in this pull request?

In 2.0, we changed the threshold for splitting expressions from 16K to 64K, which causes very bad performance on wide tables because the generated method can't be JIT-compiled by default (it exceeds the 8K bytecode limit).

This PR decreases it to 1K, based on benchmark results for a wide table with 400 columns of LongType.

It also fixes a bug around splitting expressions in whole-stage codegen (they should not be split there).
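
For reference, the wide-table scenario can be sketched roughly as follows (the column count and row count are assumptions for illustration, not the actual benchmark code):

```scala
// Build a DataFrame with 400 LongType columns and project over all of them,
// which previously produced generated methods too large to be JIT compiled.
import org.apache.spark.sql.functions.col

val wide = spark.range(1000).select((1 to 400).map(i => col("id").as(s"c$i")): _*)
wide.selectExpr((1 to 400).map(i => s"c$i + 1"): _*).count()
```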

## How was this patch tested?

Added benchmark suite.

Author: Davies Liu <davies@databricks.com>

Closes #14692 from davies/split_exprs.

(cherry picked from commit 8d35a6f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…PPORTED OPERATIONS]

Changes in the Spark Structured Streaming doc at this link:
https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations

Author: Jagadeesan <as2@us.ibm.com>

Closes #14715 from jagadeesanas2/SPARK-17085.

(cherry picked from commit bd96550)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?

This PR fixes the scheme of the local cache folder on Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.

## How was this patch tested?

Manual test in Windows 7.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14743 from junyangq/SPARKR-FixWindowsInstall.

(cherry picked from commit 209e1b3)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
## What changes were proposed in this pull request?

Collects the GC discussion in one section, and documents findings about the G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #14732 from srowen/SPARK-16320.

(cherry picked from commit 342278c)
Signed-off-by: Yin Huai <yhuai@databricks.com>
## What changes were proposed in this pull request?

This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14758 from shivaram/sparkr-maintainers.

(cherry picked from commit 6f3cd36)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
## What changes were proposed in this pull request?

The range operator previously didn't support SQL generation, which made it not possible to use in views.

## How was this patch tested?

Unit tests.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #14724 from ericl/spark-17162.

(cherry picked from commit 84770b5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
- replace ``` ` ``` in code doc with `\code{thing}`
- remove added `...` for drop(DataFrame)
- fix remaining CRAN check warnings

create doc with knitr

junyangq

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14734 from felixcheung/rdoccleanup.

(cherry picked from commit 71afeee)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
…in block manager replication

## What changes were proposed in this pull request?

This is a straightforward clone of JoshRosen's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be causing flaky tests, so I'm going to leave that for SPARK-17042.

## How was this patch tested?

End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch).

Author: Eric Liang <ekl@databricks.com>

Closes #14311 from ericl/spark-16550.

(cherry picked from commit 8e223ea)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh` . As this script is also used by Jenkins this means that we will get documentation checks on every PR going forward.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14759 from shivaram/sparkr-cran-jenkins.

(cherry picked from commit 920806a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
## What changes were proposed in this pull request?

This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows.
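
A small illustration of the order dependence (assuming a SparkSession `spark`): the element order of the collected array depends on how partitions happen to be processed, so the result is not deterministic across plans or runs.

```scala
import org.apache.spark.sql.functions.collect_list

// The order of elements in the resulting array is not guaranteed
val df = spark.range(100).repartition(4)
df.agg(collect_list("id")).show(false)
```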

## How was this patch tested?

Existing test cases should be enough.

Author: Cheng Lian <lian@databricks.com>

Closes #14749 from liancheng/spark-17182-non-deterministic-collect.

(cherry picked from commit 2cdd92a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

Update DESCRIPTION

## How was this patch tested?

Run install and CRAN tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14764 from felixcheung/rpackagedescription.

(cherry picked from commit d2b3d3e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Some JDBC drivers (for example PostgreSQL) do not use the underlying exception as the cause, but expose it through another API (getNextException), so it is not included in the error logging, making it hard to find the root cause, especially in batch mode.

This PR pulls out the next exception and adds it as the cause (if it's different) or as a suppressed exception (if there is already a different cause).
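
A minimal sketch of the idea (not the exact Spark change): surface `SQLException.getNextException` as the cause when none is set, or attach it as a suppressed exception otherwise.

```scala
import java.sql.SQLException

def withNextException(e: SQLException): SQLException = {
  val next = e.getNextException
  if ((next ne null) && (next ne e)) {
    // Prefer exposing it as the cause; fall back to a suppressed exception
    if (e.getCause == null) e.initCause(next) else e.addSuppressed(next)
  }
  e
}
```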

Can't reproduce this on the default JDBC driver, so did not add a regression test.

Author: Davies Liu <davies@databricks.com>

Closes #14722 from davies/keep_cause.

(cherry picked from commit 9afdfc9)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
…variables

## What changes were proposed in this pull request?

The PR removes a reference link in the doc for environment variables for common Windows folders. The CRAN check gave code 503 (service unavailable) on the original link.

## How was this patch tested?

Manual check.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14767 from junyangq/SPARKR-RemoveLink.

(cherry picked from commit 8fd63e8)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
…Mac in structured streaming programming guides

This PR fixes curly quotes (`“` and `”`) to standard quotes (`"`).

Curly quotes are an actual problem when users copy and paste the examples: the pasted code would not work.

This seems to happen only in `structured-streaming-programming-guide.md`.

Manually built.

This changes some examples so they are correctly rendered in Markdown, as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

(cherry picked from commit 5885599)
Signed-off-by: Sean Owen <sowen@cloudera.com>
When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standard, such as Postgres, allow only single quotes for denoting string literals (see http://stackoverflow.com/a/1992331/590203).
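
A hypothetical helper illustrating the convention (not the actual Spark code): wrap the literal in single quotes and escape any embedded single quote by doubling it.

```scala
def sqlStringLiteral(s: String): String = "'" + s.replace("'", "''") + "'"

sqlStringLiteral("it's")  // 'it''s'
```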

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14763 from JoshRosen/SPARK-17194.

(cherry picked from commit bf8ff83)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
…onCatalog.scala'

## What changes were proposed in this pull request?
This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.

## How was this patch tested?
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14769 from Sherry302/cleanComment.

(cherry picked from commit b9994ad)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?

Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes from Hive. However, most operations in Spark SQL can't handle index tables, e.g. create table, alter table, etc.

Logically, index tables should be invisible to end users, and Hive also generates special table names for index tables to keep users from accessing them directly. Hive has special SQL syntax to create/show/drop index tables.

On the Spark SQL side, although we can describe an index table directly, the result is unreadable; we should use the dedicated SQL syntax instead (e.g. `SHOW INDEX ON tbl`). Spark SQL can also read an index table directly, but the result is always empty. (Can Hive read index tables directly?)

This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't currently support indexes.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14752 from cloud-fan/minor2.

(cherry picked from commit 52fa45d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
As Spark 2.0.1 will be released soon (mentioned on the Spark dev mailing list), it's better to fix the code style errors, in addition to the critical bugs, before the release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Manual.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14768 from Sherry302/fixjavastyle.

(cherry picked from commit 673a80d)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…etizer when some quantiles are duplicated

## What changes were proposed in this pull request?

In cases when QuantileDiscretizer is called on a numeric array with duplicated elements, we now take the unique elements generated from approxQuantiles as the input for Bucketizer.
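
A rough sketch of the approach (the `dataset` DataFrame, column name, probabilities, and relative error are assumptions): deduplicate the quantiles before handing them to Bucketizer, which requires strictly increasing split points; this may yield fewer buckets than requested.

```scala
// Deduplicate the candidate split points produced by approxQuantile
val probabilities = (0 to 10).map(_ / 10.0).toArray
val splits = dataset.stat.approxQuantile("feature", probabilities, 0.001)
val distinctSplits = splits.distinct.sorted
```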

## How was this patch tested?

A unit test is added in QuantileDiscretizerSuite.

QuantileDiscretizer.fit will throw an exception when calling setSplits on a list of splits with duplicated elements. Bucketizer.setSplits should only accept a numeric vector of two or more unique cut points, although that may produce fewer buckets than requested.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14747 from VinceShieh/SPARK-17086.

(cherry picked from commit 92c0eaf)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?

The original doc of `show` put methods for multiple classes together but the text only talks about `SparkDataFrame`. This PR tries to fix this problem.

## How was this patch tested?

Manual test.

Author: Junyang Qian <junyangq@databricks.com>

Closes #14776 from junyangq/SPARK-FixShowDoc.

(cherry picked from commit d2932a0)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
… the same java used in the spark environment

## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.

(cherry picked from commit 0b3a4be)
Signed-off-by: Sean Owen <sowen@cloudera.com>
marmbrus and others added 12 commits October 5, 2016 16:48
… branch-2.0)

## What changes were proposed in this pull request?

Backport 988c714 to branch-2.0

## How was this patch tested?

Jenkins

Author: Michael Armbrust <michael@databricks.com>

Closes #15362 from zsxwing/SPARK-17643-2.0.
…treaming

## What changes were proposed in this pull request?
I was looking through API annotations to catch mislabeled APIs, and realized the DataStreamReader and DataStreamWriter classes are already annotated as Experimental; as a result there is no need to annotate each method within them.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #15373 from rxin/SPARK-17798.

(cherry picked from commit 79accf4)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
## What changes were proposed in this pull request?

When using an incompatible source for structured streaming, it may throw NoClassDefFoundError. It's better to just catch Throwable and report it to the user since the streaming thread is dying.

## How was this patch tested?

`test("NoClassDefFoundError from an incompatible source")`

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15352 from zsxwing/SPARK-17780.

(cherry picked from commit 9a48e60)
Signed-off-by: Michael Armbrust <michael@databricks.com>
[SPARK-17803: Docker integration tests don't run with "Docker for Mac"](https://issues.apache.org/jira/browse/SPARK-17803)

## What changes were proposed in this pull request?

This PR upgrades the [docker-client](https://mvnrepository.com/artifact/com.spotify/docker-client) dependency from [3.6.6](https://mvnrepository.com/artifact/com.spotify/docker-client/3.6.6) to [5.0.2](https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2) to enable _Docker for Mac_ users to run the `docker-integration-tests` out of the box.

The very latest docker-client version is [6.0.0](https://mvnrepository.com/artifact/com.spotify/docker-client/6.0.0) but that has one additional dependency and no usage yet.

## How was this patch tested?

The code change was tested on Mac OS X Yosemite with both _Docker Toolbox_ as well as _Docker for Mac_ and on Linux Ubuntu 14.04.

```
$ build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package

$ build/mvn -Pdocker-integration-tests -Pscala-2.11 -pl :spark-docker-integration-tests_2.11 clean compile test
```

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #15378 from ckadner/SPARK-17803_Docker_for_Mac.

(cherry picked from commit 49d11d4)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
…etic

## What changes were proposed in this pull request?

Currently, Spark raises `RuntimeException` when creating a view with timestamp with INTERVAL arithmetic like the following. The root cause is the arithmetic expression, `TimeAdd`, was transformed into `timeadd` function as a VIEW definition. This PR fixes the SQL definition of `TimeAdd` and `TimeSub` expressions.

```scala
scala> sql("CREATE TABLE dates (ts TIMESTAMP)")

scala> sql("CREATE VIEW view1 AS SELECT ts + INTERVAL 1 DAY FROM dates")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```

## How was this patch tested?

Pass Jenkins with a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15383 from dongjoon-hyun/SPARK-17750-BACK.
… general numeric label column types

## What changes were proposed in this pull request?

Before, we computed `instances` in LinearRegression in two spots, even though they did the same thing. One of them did not cast the label column to `DoubleType`. This patch consolidates the computation and always casts the label column to `DoubleType`.
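
A simplified sketch of the consolidated extraction (internal code; `dataset` is assumed, `labelCol`/`featuresCol` stand in for the corresponding Param values, and weight handling is omitted):

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Always cast the label column to DoubleType so any numeric label type is accepted
val instances = dataset
  .select(col(labelCol).cast(DoubleType), col(featuresCol))
  .rdd
  .map { case Row(label: Double, features: Vector) => Instance(label, 1.0, features) }
```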

## How was this patch tested?

Added a unit test to check all solvers. This test failed before this patch.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15364 from sethah/linreg_numeric_type.

(cherry picked from commit 3713bb1)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
… syntax

## What changes were proposed in this pull request?

This is a backport of SPARK-17612. This implements the `DESCRIBE table PARTITION` SQL syntax again. It was supported until Spark 1.6.2, but was dropped in 2.0.0.

**Spark 1.6.2**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res2: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+----------------------------------------------------------------+
|result                                                          |
+----------------------------------------------------------------+
|a                      string                                   |
|b                      int                                      |
|c                      string                                   |
|d                      string                                   |
|                                                                |
|# Partition Information                                         |
|# col_name             data_type               comment          |
|                                                                |
|c                      string                                   |
|d                      string                                   |
+----------------------------------------------------------------+
```

**Spark 2.0**
- **Before**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
org.apache.spark.sql.catalyst.parser.ParseException:
Unsupported SQL statement
```

- **After**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY (c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|a                      |string   |null   |
|b                      |int      |null   |
|c                      |string   |null   |
|d                      |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|c                      |string   |null   |
|d                      |string   |null   |
+-----------------------+---------+-------+

scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
|col_name                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |data_type|comment|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
|a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
|b                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |int      |null   |
|c                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
|d                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
|# Partition Information                                                                                                                                                                                                                                                                                                                                                                                                                                                            |         |       |
|# col_name                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |data_type|comment|
|c                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
|d                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |string   |null   |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |         |       |
|Detailed Partition Information CatalogPartition(
        Partition Values: [Us, 1]
        Storage(Location: file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1])
        Partition Parameters:{transient_lastDdlTime=1475001066})|         |       |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+

scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us', d=1)").show(100,false)
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|col_name                        |data_type                                                                              |comment|
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|a                               |string                                                                                 |null   |
|b                               |int                                                                                    |null   |
|c                               |string                                                                                 |null   |
|d                               |string                                                                                 |null   |
|# Partition Information         |                                                                                       |       |
|# col_name                      |data_type                                                                              |comment|
|c                               |string                                                                                 |null   |
|d                               |string                                                                                 |null   |
|                                |                                                                                       |       |
|# Detailed Partition Information|                                                                                       |       |
|Partition Value:                |[Us, 1]                                                                                |       |
|Database:                       |default                                                                                |       |
|Table:                          |partitioned_table                                                                      |       |
|Location:                       |file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1|       |
|Partition Parameters:           |                                                                                       |       |
|  transient_lastDdlTime         |1475001066                                                                             |       |
|                                |                                                                                       |       |
|# Storage Information           |                                                                                       |       |
|SerDe Library:                  |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                     |       |
|InputFormat:                    |org.apache.hadoop.mapred.TextInputFormat                                               |       |
|OutputFormat:                   |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                             |       |
|Compressed:                     |No                                                                                     |       |
|Storage Desc Parameters:        |                                                                                       |       |
|  serialization.format          |1                                                                                      |       |
+--------------------------------+---------------------------------------------------------------------------------------+-------+
```

## How was this patch tested?

Pass the Jenkins tests with a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15351 from dongjoon-hyun/SPARK-17612-BACK.
…of paths

## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases.

## How was this patch tested?
Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.

(cherry picked from commit bcaa799)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…inished

This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.
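
A hedged sketch of the pattern on the Jetty 9 API (the names and arguments here are illustrative, not the exact Spark change):

```scala
import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler

val server = new Server()
val connector = new ServerConnector(
  server,
  null,                                                  // executor: use the server's default
  new ScheduledExecutorScheduler("ui-scheduler", true),  // daemon = true, so it can't block JVM exit
  null,                                                  // byte buffer pool: default
  -1, -1,                                                // default acceptors/selectors
  new HttpConnectionFactory())
```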

(I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)

This also adds `sc.stop()` to the quick start guide example.

Existing tests; _pending_ at least manual verification of the fix.

Author: Sean Owen <sowen@cloudera.com>

Closes #15381 from srowen/SPARK-17707.

(cherry picked from commit cff5607)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…ing (branch 2.0)

## What changes were proposed in this pull request?

Backport 9293734 and b678e46 into branch 2.0.

The only difference is the Spark version in pom file.

## How was this patch tested?

Jenkins.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15367 from zsxwing/kafka-source-branch-2.0.
## What changes were proposed in this pull request?
This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure Kafka 0.10 tests are only triggered when it or one of its dependencies changes.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15355 from hvanhovell/SPARK-17782.
## What changes were proposed in this pull request?

In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part is not well tested and has a bug that could cause wrong results when there are at least three integral columns in the join key and the total length of the key exceeds 8 bytes.
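
A sketch of the kind of query affected (assuming a SparkSession `spark`): a join on three integral columns whose combined width exceeds 8 bytes, which exercises the Long-key rewrite.

```scala
// int + int + long = 16 bytes of key data, more than fits in a single Long
val left  = spark.range(1000).selectExpr("CAST(id AS INT) AS a", "CAST(id AS INT) AS b", "id AS c")
val right = spark.range(1000).selectExpr("CAST(id AS INT) AS a", "CAST(id AS INT) AS b", "id AS c")
left.join(right, Seq("a", "b", "c")).count()
```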

## How was this patch tested?

Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug.

Author: Davies Liu <davies@databricks.com>

Closes #15390 from davies/rewrite_key.

(cherry picked from commit 94b24b8)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@AmplabJenkins

Can one of the admins verify this patch?

@holdenk
Contributor

holdenk commented Oct 9, 2016

Can you close this pull request? If it's an attempt at backporting you can just make a new PR once you get it sorted out.

jiangxb1987 and others added 9 commits October 9, 2016 21:52
…names when name contains a backtick

## What changes were proposed in this pull request?

The `quotedString` methods in `TableIdentifier` and `FunctionIdentifier` produce an illegal (un-parseable) name when the name contains a backtick. For example:
```
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
val complexName = TableIdentifier("`weird`table`name", Some("`d`b`1"))
parseTableIdentifier(complexName.unquotedString) // Does not work
parseTableIdentifier(complexName.quotedString) // Does not work
parseExpression(complexName.unquotedString) // Does not work
parseExpression(complexName.quotedString) // Does not work
```
We should handle the backtick properly to make `quotedString` parseable.
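
A minimal sketch of the escaping idea (a hypothetical helper, not necessarily the exact fix): double any embedded backtick before wrapping the identifier part in backticks.

```scala
def quoteIdentifier(name: String): String = "`" + name.replace("`", "``") + "`"

quoteIdentifier("weird`table`name")  // `weird``table``name`
```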

## How was this patch tested?
Add new testcases in `TableIdentifierParserSuite` and `ExpressionParserSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15403 from jiangxb1987/backtick.

(cherry picked from commit 26fbca4)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
## What changes were proposed in this pull request?
Currently the number of partition files is limited to 10000 (%05d format). If there are more than 10000 part files, the logic breaks while recreating the RDD because it sorts the file names as strings. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
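
A small illustration of the string-sorting pitfall (the file name prefix is assumed): once part indices outgrow the padded width, lexicographic order no longer matches numeric order.

```scala
val names = Seq("part-99999", "part-100000")
names.sorted                                // List(part-100000, part-99999)  -- wrong order
names.sortBy(_.stripPrefix("part-").toInt)  // List(part-99999, part-100000)  -- numeric order
```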

## How was this patch tested?
I tested this patch by checkpointing an RDD, manually renaming the part files to the old format, and then trying to access the RDD; it was successfully recreated from the old format. I also verified loading a sample Parquet file, saving it in multiple formats (CSV, JSON, Text, Parquet, ORC), and reading them back successfully from the saved files. I couldn't launch the unit test from my local box, so I will wait for the Jenkins output.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15370 from dhruve/bug/SPARK-17417.

(cherry picked from commit 4bafaca)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
## What changes were proposed in this pull request?

The default buffer size is not big enough for randomly generated MapType.

## How was this patch tested?

Ran the tests 100 times; they never failed (they failed 8 times before the patch).

Author: Davies Liu <davies@databricks.com>

Closes #15395 from davies/flaky_map.

(cherry picked from commit d5ec4a3)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…StressSuite

## What changes were proposed in this pull request?

A follow-up PR for SPARK-17346 to fix the flaky `org.apache.spark.sql.kafka010.KafkaSourceStressSuite`.

Test log: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/

Looks like deleting the Kafka internal topic `__consumer_offsets` is flaky. This PR simply ignores internal topics.
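
A minimal sketch of the workaround (topic names are made up): skip Kafka internal topics such as `__consumer_offsets` instead of trying to delete them between test iterations.

```scala
val topics = Seq("stress-topic-1", "__consumer_offsets", "stress-topic-2")
val deletable = topics.filterNot(_.startsWith("__"))  // keeps only the stress-test topics
```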

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15384 from zsxwing/SPARK-17346-flaky-test.

(cherry picked from commit 75b9e35)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
…ssue in BlockStatusesAccumulator

## What changes were proposed in this pull request?
Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread-safe, along with a few more cleanups.
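
A short usage sketch of `CollectionAccumulator` (assuming a SparkContext `sc`); unlike an ad-hoc mutable buffer, it is safe to update from multiple task threads.

```scala
import org.apache.spark.util.CollectionAccumulator

val acc = new CollectionAccumulator[String]
sc.register(acc, "updatedBlockStatuses")
sc.parallelize(1 to 4).foreach(i => acc.add(s"block-$i"))
acc.value  // a java.util.List[String] containing the four entries
```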

## How was this patch tested?
Tested in master branch and cherry-picked.

Author: Ergin Seyfe <eseyfe@fb.com>

Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0.
A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number.

Author: Alexander Pivovarov <apivovarov@gmail.com>

Closes #15440 from apivovarov/patch-1.

(cherry picked from commit 299eb04)
Signed-off-by: Reynold Xin <rxin@databricks.com>
… is incorrect.

## What changes were proposed in this pull request?

In `programming-guide.md`, the URL which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2`, but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.

## How was this patch tested?
manual test.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #15439 from sarutak/SPARK-17880.

(cherry picked from commit b512f04)
Signed-off-by: Reynold Xin <rxin@databricks.com>
….id is bad

## What changes were proposed in this pull request?

Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.

## How was this patch tested?

I built jekyll doc and made sure it looked ok.

Author: cody koeninger <cody@koeninger.org>

Closes #15442 from koeninger/SPARK-17853.

(cherry picked from commit c264ef9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?
Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL

## How was this patch tested?
Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.

(cherry picked from commit 658c714)
Signed-off-by: Sean Owen <sowen@cloudera.com>
srowen added a commit to srowen/spark that referenced this pull request Oct 12, 2016
priyankagar and others added 2 commits October 12, 2016 10:14
…m empty string to interval type.

## What changes were proposed in this pull request?
This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, then the isNull variable is set to true.

Earlier, the expression Cast(Literal(), CalendarIntervalType) threw a NullPointerException for the reason mentioned above.
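
A hedged sketch of the behavior (Catalyst internals; the exact literal shape is an assumption): with the check in place, an unparseable interval string evaluates to null instead of throwing.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.CalendarIntervalType

// Returns null after the fix; previously threw a NullPointerException
Cast(Literal(""), CalendarIntervalType).eval()
```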

## How was this patch tested?
Added test case in CastSuite.scala

JIRA entry for details: https://issues.apache.org/jira/browse/SPARK-17884

Author: prigarg <prigarg@adobe.com>

Closes #15449 from priyankagargnitk/SPARK-17884.

(cherry picked from commit d5580eb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…han 2GB

## What changes were proposed in this pull request?
If the R data structure that is being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. The following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?
* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.

(cherry picked from commit 5cc503f)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
@vanzin
Contributor

vanzin commented Oct 12, 2016

@yintengfei please close this PR.

@asfgit asfgit closed this in eb69335 Oct 12, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#15303
Closes apache#15078
Closes apache#15080
Closes apache#15135
Closes apache#14565
Closes apache#12355
Closes apache#15404

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15451 from srowen/CloseStalePRs.