
Conversation


@xianbin xianbin commented Jul 2, 2018

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

srowen and others added 30 commits August 1, 2017 19:06
## What changes were proposed in this pull request?

Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355.

## How was this patch tested?

Manually built and viewed docs with jekyll

Author: Sean Owen <sowen@cloudera.com>

Closes #18793 from srowen/SPARK-21593.

(cherry picked from commit b1d59e6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…o classpath on windows

Jars added with the --packages option get put on the classpath with a "file:///" scheme. On Unix this is not a problem, since the scheme contains the Unix path separator, which separates the jar name from its location in the classpath. On Windows, however, the jar file does not get resolved from the classpath because of the scheme.

Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar

With this PR, we avoid adding the 'file://' scheme to the --packages jar file paths.

I have verified manually in Windows and Unix environments; with the change, the jars are added to the classpath as below:

Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
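
For illustration, a minimal sketch of the kind of path conversion involved (not the actual SparkSubmit code; the path below is made up):

```scala
import java.net.URI
import java.nio.file.Paths

// Illustrative only: turn a "file:" URI into a plain local path so that the
// platform's classpath separator is the only delimiter involved.
def toLocalPath(uriString: String): String =
  Paths.get(new URI(uriString)).toString

toLocalPath("file:///home/user/.ivy2/jars/example.jar")
// => /home/user/.ivy2/jars/example.jar on Unix
```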

Author: Devaraj K <devaraj@apache.org>

Closes #18708 from devaraj-kavali/SPARK-21339.

(cherry picked from commit 58da1a2)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?

This PR fixed a potential overflow issue in EventTimeStats.
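
A minimal sketch of the overflow-safe idea, assuming a running average is kept instead of a raw sum of Long timestamps (class and field names are illustrative, not the actual EventTimeStats code):

```scala
// Illustrative only: track max/min/avg/count and update the average
// incrementally so no Long sum of event times can overflow.
case class RunningEventTimeStats(max: Long, min: Long, avg: Double, count: Long) {
  def add(eventTime: Long): RunningEventTimeStats = {
    val newCount = count + 1
    RunningEventTimeStats(
      math.max(max, eventTime),
      math.min(min, eventTime),
      avg + (eventTime - avg) / newCount,   // incremental mean, no running sum
      newCount)
  }
}

val stats = Seq(1L, 2L, 3L).foldLeft(RunningEventTimeStats(Long.MinValue, Long.MaxValue, 0.0, 0L))(_ add _)
// stats.avg == 2.0
```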

## How was this patch tested?

The new unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18803 from zsxwing/avg.

(cherry picked from commit 7f63e85)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…t a key

## What changes were proposed in this pull request?

When the watermark column is not one of the `dropDuplicates` key columns, the query currently crashes. This PR fixes the issue.
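
A minimal usage sketch of the scenario being fixed (the `rate` source and column choice are illustrative; `spark` is assumed to be an existing SparkSession):

```scala
// Illustrative only: the watermark column is not part of the dedup key,
// which previously caused a crash.
val deduped = spark.readStream
  .format("rate")
  .load()                                   // has columns: timestamp, value
  .withWatermark("timestamp", "10 seconds")
  .dropDuplicates("value")                  // "timestamp" is not a key here
```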

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18822 from zsxwing/SPARK-21546.

(cherry picked from commit 0d26b3a)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
… registry

## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the Python command from another thread. This PR introduces a thread-safe pickled registry using thread-local storage, so that when a Python command is pickled (causing the broadcast variables to be pickled and added to the registry), each thread has its own view of the pickle registry from which to retrieve and clear the broadcast variables used.
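
A minimal Scala sketch of the thread-local-registry idea (the real change is in PySpark's Python code; names here are illustrative):

```scala
import scala.collection.mutable

// Illustrative only: each thread gets its own registry, so values registered
// while building one command cannot leak into a command built on another thread.
object PickledBroadcastRegistry {
  private val perThread = new ThreadLocal[mutable.Set[Long]] {
    override def initialValue(): mutable.Set[Long] = mutable.Set.empty[Long]
  }
  def add(broadcastId: Long): Unit = perThread.get() += broadcastId
  def drain(): Set[Long] = {
    val snapshot = perThread.get().toSet
    perThread.get().clear()
    snapshot
  }
}
```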

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #18823 from BryanCutler/branch-2.2.
## What changes were proposed in this pull request?

Add missing import and missing parentheses to invoke `SparkSession::text()`.

## How was this patch tested?

Built the code for this application and ran jekyll locally, per docs/README.md.

Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>

Closes #18795 from christiam/master.

(cherry picked from commit dd72b10)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…le with extreme values on the partition column

## What changes were proposed in this pull request?

An overflow of the difference of bounds on the partitioning column leads to no data being read. This
patch checks for this overflow.
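
A minimal sketch of the kind of check involved (illustrative, not the actual JDBCRelation code):

```scala
// Illustrative only: upperBound - lowerBound can overflow Long for extreme
// bounds, so detect the overflow instead of silently computing a bad stride.
def boundDifferenceOverflows(lowerBound: Long, upperBound: Long): Boolean =
  try { Math.subtractExact(upperBound, lowerBound); false }
  catch { case _: ArithmeticException => true }

boundDifferenceOverflows(Long.MinValue, Long.MaxValue)   // true
boundDifferenceOverflows(0L, 52L)                        // false
```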

## How was this patch tested?

New unit test.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #18800 from aray/SPARK-21330.

(cherry picked from commit 25826c7)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…ken as group-by ordinal

## What changes were proposed in this pull request?

```sql
create temporary view data as select * from values
  (1, 1),
  (1, 2),
  (2, 1),
  (2, 2),
  (3, 1),
  (3, 2)
  as data(a, b);
```

`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`

The cause of this failure:
If an aggregate expression is an integer literal, then after the group-by ordinal is replaced with this aggregate expression, the substituted group expression is still treated as an ordinal.

The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.

## How was this patch tested?
Added unit test case

Author: liuxian <liu.xian3@zte.com.cn>

Closes #18779 from 10110346/groupby.

(cherry picked from commit 894d5a4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

SQLContext.get(key, null), for a key that is not defined in the conf and doesn't have a default value defined, throws an NPE. It happens only when the conf entry has a value converter.

Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue).
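
A minimal, self-contained sketch of the guard (names and signature are illustrative, not the actual SQLConf code):

```scala
// Illustrative only: the value converter must not be invoked on a null default.
def getConfString(
    key: String,
    defaultValue: String,
    settings: Map[String, String],
    valueConverter: String => Any): String = {
  // Validate the default through the converter only when one was provided.
  if (defaultValue != null) valueConverter(defaultValue)
  settings.getOrElse(key, defaultValue)
}

getConfString("missing.key", null, Map.empty, _.toInt)  // returns null, no NPE
```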

## How was this patch tested?
Added unit test

Author: vinodkc <vinod.kc.in@gmail.com>

Closes #18852 from vinodkc/br_Fix_SPARK-21588.

(cherry picked from commit 1ba967b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…er.commitAndGet called

## What changes were proposed in this pull request?

We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet is called.
When `revertPartialWritesAndClose` is called, we decrease the written-record count in `ShuffleWriteMetrics`. However, we currently decrease it all the way to zero, which is wrong: we should only subtract the records written after the last `commitAndGet` call.
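
A minimal sketch of the accounting idea (illustrative, not the actual DiskBlockObjectWriter code):

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: count records written since the last commit, so a revert
// subtracts just the uncommitted portion from the shuffle write metrics.
class CountingWriter(recordsWrittenMetric: AtomicLong) {
  private var numRecordsWritten = 0L                        // since last commitAndGet
  def recordWritten(): Unit = { numRecordsWritten += 1; recordsWrittenMetric.incrementAndGet() }
  def commitAndGet(): Unit = { numRecordsWritten = 0L }     // committed records are final
  def revertPartialWritesAndClose(): Unit = {
    recordsWrittenMetric.addAndGet(-numRecordsWritten)      // undo only uncommitted records
    numRecordsWritten = 0L
  }
}
```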

## How was this patch tested?
Modified existing test.


Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #18830 from ConeyLiu/DiskBlockObjectWriter.

(cherry picked from commit 534a063)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
author: BoleynSu
closes #18836

```Scala
val df = Seq((1, 1)).toDF("i", "j")
df.createOrReplaceTempView("T")
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
    "cross join T t2 where t2.i = t1.i").explain(true)
}
```
The above code could cause the following exception:
```
SortMergeJoinExec should not take Cross as the JoinType
java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType
	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
```

Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue.

### How was this patch tested?
Modified the two existing test cases.

Author: Xiao Li <gatorsmile@gmail.com>
Author: Boleyn Su <boleyn.su@gmail.com>

Closes #18863 from gatorsmile/pr-18836.

(cherry picked from commit bbfd6b5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…sabled FS cache

This PR replaces #18623 to do some clean up.

Closes #18623

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Andrey Taptunov <taptunov@amazon.com>

Closes #18848 from zsxwing/review-pr18623.
## What changes were proposed in this pull request?

Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes.

## How was this patch tested?
new unit test, which was verified to fail before the fix

Author: Jose Torres <joseph-torres@databricks.com>

Closes #18840 from joseph-torres/SPARK-21565.

(cherry picked from commit cce25b3)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…arallel fetching parameters are not properly provided.

### What changes were proposed in this pull request?
```SQL
CREATE TABLE mytesttable1
USING org.apache.spark.sql.jdbc
  OPTIONS (
  url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
  dbtable 'mytesttable1',
  paritionColumn 'state_id',
  lowerBound '0',
  upperBound '52',
  numPartitions '53',
  fetchSize '10000'
)
```

The above option name `paritionColumn` is misspelled. That means users did not actually provide a value for `partitionColumn`. In such a case, users hit a confusing error.

```
AssertionError: assertion failed
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
```
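
A minimal sketch of a friendlier validation (illustrative, not the actual JdbcRelationProvider code):

```scala
// Illustrative only: report exactly which partitioning options are missing
// instead of failing with a bare assertion.
def checkPartitionOptions(options: Map[String, String]): Unit = {
  val required = Seq("partitionColumn", "lowerBound", "upperBound", "numPartitions")
  val provided = required.filter(options.contains)
  if (provided.nonEmpty && provided.size < required.size) {
    val missing = required.filterNot(options.contains).mkString(", ")
    throw new IllegalArgumentException(
      s"When reading a JDBC table in parallel, all of ${required.mkString(", ")} " +
      s"must be specified. Missing: $missing")
  }
}
```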

### How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #18864 from gatorsmile/jdbcPartCol.

(cherry picked from commit baf5cac)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
If we create a type alias for a type that works with Dataset, the type alias itself doesn't work with Dataset.

A reproducible case looks like:

    object C {
      type TwoInt = (Int, Int)
      def tupleTypeAlias: TwoInt = (1, 1)
    }

    Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))

It throws an exception like:

    type T1 is not a class
    scala.ScalaReflectionException: type T1 is not a class
      at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
      ...

This patch dealiases the type in many places in `ScalaReflection` to fix it.

Added test case.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18813 from viirya/SPARK-21567.

(cherry picked from commit ee13041)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…xecutor Process

The Executors tab on the Spark UI shows a task as completed when the executor process running that task is killed using the kill command.
Added the ExecutorLostFailure case, which was previously missing; without it, the default case was executed and the task was marked as completed. The new case covers all situations where the executor's connection to the Spark driver is lost, e.g. because the executor process was killed or the network connection dropped.

## How was this patch tested?
Manually tested the fix by observing the UI change before and after.
Before:
<img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png">
After:
<img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png">


Author: pgandhi <pgandhi@yahoo-inc.com>
Author: pgandhi999 <parthkgandhi9@gmail.com>

Closes #18707 from pgandhi999/master.

(cherry picked from commit f016f5c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…strong wolfe line search

## What changes were proposed in this pull request?

Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search
scalanlp/breeze#651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.

(cherry picked from commit b35660d)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
… return value

Same PR as #18799, but for branch-2.2. The main discussion is in the other PR.
--------

When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query not to generate a batch metadata file, this behavior will hide it, allow the query to keep running, and eventually delete the metadata logs, making it hard to debug.

This PR ensures that places calling HDFSMetadataLog.get always check the return value.
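
A minimal sketch of the stricter calling convention (illustrative, not a specific call site):

```scala
// Illustrative only: when a batch is expected to exist, a missing metadata
// file should fail the query instead of being silently ignored.
def getBatchOrFail[T](get: Long => Option[T], batchId: Long): T =
  get(batchId).getOrElse {
    throw new IllegalStateException(s"Metadata for batch $batchId is missing")
  }
```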

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18890 from tdas/SPARK-21596-2.2.
…should call masterTracker.stop() in MapOutputTrackerSuite

Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>

## What changes were proposed in this pull request?
After the unit tests end, masterTracker.stop() should be called to free resources.

## How was this patch tested?
Ran the unit tests.

Author: 10087686 <wang.jiaochun@zte.com.cn>

Closes #18867 from wangjiaochun/mapout.

(cherry picked from commit 6426adf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption.

## How was this patch tested?
Removed the test case.

Author: Reynold Xin <rxin@databricks.com>

Closes #18912 from rxin/remove-getTableOption.

(cherry picked from commit 584c7f1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…nalAppendOnlyUnsafeRowArray

## What changes were proposed in this pull request?

[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre #16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers).

Currently, both the switch from in-memory storage to `UnsafeExternalSorter` and the spilling of `UnsafeExternalSorter` to disk in `ExternalAppendOnlyUnsafeRowArray` are controlled by a single threshold. This PR separates them to allow more granular control.

## How was this patch tested?

Added unit tests

Author: Tejas Patil <tejasp@fb.com>

Closes #18843 from tejasapatil/SPARK-21595.

(cherry picked from commit 9443999)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
…ons and adding jars

## What changes were proposed in this pull request?

Fix the race condition when serializing TaskDescriptions and adding jars by keeping the set of jars and files for a TaskSet constant across the lifetime of the TaskSet. Otherwise TaskDescription serialization can produce an invalid result when new files/jars are added concurrently while the TaskDescription is being serialized.

## How was this patch tested?

Additional unit test ensures jars/files contained in the TaskDescription remain constant throughout the lifetime of the TaskSet.

Author: Andrew Ash <andrew@andrewash.com>

Closes #18913 from ash211/SPARK-21563.

(cherry picked from commit 6847e93)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…pshot files

## What changes were proposed in this pull request?

Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file and then rename it to the target file.
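
A minimal sketch of the write-then-rename pattern (paths are illustrative, not the actual state store code):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Illustrative only: write the snapshot to a temp file first and rename it,
// so readers never observe a partially written snapshot file.
val dir    = Files.createDirectories(Paths.get("/tmp/state-store-example"))
val target = dir.resolve("1.snapshot")
val temp   = dir.resolve("1.snapshot.tmp")
Files.write(temp, "snapshot-bytes".getBytes("UTF-8"))
Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE)
```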

## How was this patch tested?

Jenkins.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18928 from zsxwing/SPARK-21696.

(cherry picked from commit 282f00b)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
… successfully removed

## What changes were proposed in this pull request?

We put the staging path into the deleteOnExit cache of `FileSystem` in case the path can't be successfully removed. But when we do successfully remove the path, we don't remove it from the cache. We should, to avoid the cache growing continuously.

## How was this patch tested?

Added a test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18934 from viirya/SPARK-21721.

(cherry picked from commit 4c3cf1c)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon

(Maybe the usage should be forbidden when writing, in a major version change?).

Manual test, that loading and writing LibSVM files work fine, both with and without the numFeatures option.


Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>

Closes #18872 from ProtD/master.

(cherry picked from commit 8321c14)
Signed-off-by: Sean Owen <sowen@cloudera.com>
…executors when tasks still to run

## What changes were proposed in this pull request?

Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go when they were idle but still really needed. I have seen this issue when the scheduler was waiting for node locality, which takes longer than the default idle timeout. In these jobs the number of executors drops to a very small number (fewer than 10) while there are still around 80,000 tasks to run.
We should not allow executors to idle-timeout if they are still needed according to the number of tasks remaining to run.

## How was this patch tested?

Tested by manually adding executors to the `executorIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value.

Code used

In  `ExecutorAllocationManager.start()`

```
    start_time = clock.getTimeMillis()
```

In `ExecutorAllocationManager.schedule()`
```
    val executorIdsToBeRemoved = ArrayBuffer[String]()
    if ( now > start_time + 1000 * 60 * 2) {
      logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
      start_time +=  1000 * 60 * 100
      var counter = 0
      for (x <- executorIds) {
        counter += 1
        if (counter == 2) {
          counter = 0
          executorIdsToBeRemoved += x
        }
      }
    }
```

Author: John Lee <jlee2@yahoo-inc.com>

Closes #18874 from yoonlee95/SPARK-21656.

(cherry picked from commit adf005d)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
…hema in table properties

backport #18907 to branch 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18963 from cloud-fan/backport.
…it is called statically to convert something into TimestampType

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739

This issue is caused by introducing TimeZoneAwareExpression.
When the **Cast** expression converts something into TimestampType, it should be resolved by setting `timezoneId`. In general, it is resolved in the LogicalPlan phase.

However, there are still some places that use Cast expression statically to convert datatypes without setting `timezoneId`. In such cases,  `NoSuchElementException: None.get` will be thrown for TimestampType.

This PR is proposed to fix the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).

## How was this patch tested?

unit test

Author: donnyzone <wellfengzhu@gmail.com>

Closes #18960 from DonnyZone/spark-21739.

(cherry picked from commit 310454b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…SurvivalRegression

## What changes were proposed in this pull request?

The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol.
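
A minimal sketch of the fix's shape (a hand-rolled appendColumn is used here for self-containment; it is not the actual SchemaUtils/AFTSurvivalRegression code):

```scala
import org.apache.spark.sql.types._

// Illustrative only: each append returns a *new* schema, so the result must be
// carried forward instead of being discarded.
def appendColumn(schema: StructType, name: String, dataType: DataType): StructType =
  StructType(schema.fields :+ StructField(name, dataType, nullable = false))

val base            = StructType(Seq(StructField("features", StringType)))
val withPrediction  = appendColumn(base, "prediction", DoubleType)            // keep the result
val withProbability = appendColumn(withPrediction, "probability", DoubleType) // and thread it through
```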

## How was this patch tested?

Manually.


Author: Cédric Pelvet <cedric.pelvet@gmail.com>

Closes #18980 from sharp-pixel/master.

(cherry picked from commit 73e04ec)
Signed-off-by: Sean Owen <sowen@cloudera.com>
BounkongK and others added 25 commits May 1, 2018 08:28
## What changes were proposed in this pull request?
Shell-escape the name passed to spark-submit and change how conf attributes are shell-escaped.

## How was this patch tested?
This has been tested manually with Hive-on-Spark on Mesos, and with the use case described in the issue: the SparkPi application with a custom name that contains illegal shell characters.

With this PR, Hive-on-Spark on Mesos works like a charm with Hive 3.0.0-SNAPSHOT.

I state that this contribution is my original work and that I license the work to the project under the project’s open source license

Author: Bounkong Khamphousone <bounkong.khamphousone@ebiznext.com>

Closes #21014 from tiboun/fix/SPARK-23941.

(cherry picked from commit 6782359)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
A fetch failure leads to multiple task sets that are active for a given
stage. While there is only one "active" version of the task set, the
earlier attempts can still have running tasks, which can complete
successfully. So a task completion needs to update every task set,
so that each one knows the partition is completed. That way the final active
task set does not try to submit another task for the same partition,
and it knows when it is completed and when it should be
marked as a "zombie".

Added a regression test.

Author: Imran Rashid <irashid@cloudera.com>

Closes #21131 from squito/SPARK-23433.

(cherry picked from commit 94641fe)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
… should verify the downloaded file

## What changes were proposed in this pull request?

This is a backport of #21210 because `branch-2.2` also faces the same failures.

Although [SPARK-22654](https://issues.apache.org/jira/browse/SPARK-22654) made `HiveExternalCatalogVersionsSuite` download from Apache mirrors three times, it has been flaky because it didn't verify the downloaded file. Some Apache mirrors terminate the download abnormally, and the *corrupted* file produces the following errors.

```
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
22:46:32.700 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, thread names: Keep-Alive-Timer =====

*** RUN ABORTED ***
  java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.0"): error=2, No such file or directory
```

This has been reported weirdly in two ways. For example, the above case is reported as Case 2 `no failures`.

- Case 1. [Test Result (1 failure / +1)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389/)
- Case 2. [Test Result (no failures)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/)

This PR aims to make `HiveExternalCatalogVersionsSuite` more robust by verifying the downloaded `tgz` file by extracting and checking the existence of `bin/spark-submit`. If it turns out that the file is empty or corrupted, `HiveExternalCatalogVersionsSuite` will do retry logic like the download failure.

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21232 from dongjoon-hyun/SPARK-23489-2.
…rectly

## What changes were proposed in this pull request?

Accumulators of Spark 1.x may no longer work with Spark 2.x. This is because `LegacyAccumulatorWrapper.isZero` may return the wrong answer if `AccumulableParam` doesn't define equals/hashCode.

This PR fixes this by using reference equality check in `LegacyAccumulatorWrapper.isZero`.
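
A minimal sketch of the reference-equality idea (illustrative, not the actual LegacyAccumulatorWrapper code):

```scala
// Illustrative only: compare against the original zero value by reference, so
// AccumulableParam types without equals/hashCode still report isZero correctly.
class LegacyWrapper[T <: AnyRef](zeroValue: T) {
  private var current: T = zeroValue
  def isZero: Boolean = current eq zeroValue   // reference equality, no equals() needed
  def add(v: T): Unit = { current = v }
}

val wrapper = new LegacyWrapper[java.lang.Long](java.lang.Long.valueOf(0L))
assert(wrapper.isZero)
```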

## How was this patch tested?

a new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21229 from cloud-fan/accumulator.

(cherry picked from commit 4d5de4d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (py4j/py4j#272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18546 from dongjoon-hyun/SPARK-21278.

(cherry picked from commit c8d0aba)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit cc613b5)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 323dc3a)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 628c7b5)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 16cd9ac)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?

Backport part of the commit that addresses the lintr issue.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #21325 from felixcheung/rlintfix22.
…daction.

The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.

(cherry picked from commit ed7ba7d)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21365 from vanzin/SPARK-23850-2.2.
…rong

LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. Then we may have a malformed UnsafeRow when querying this map, whose actual data is smaller than its declared size, and the data is corrupted.
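
A minimal sketch of the corrected growth rule (illustrative, not the actual LongToUnsafeRowMap code):

```scala
// Illustrative only: grow to at least what the incoming record needs, never
// just blindly to twice the old size.
def newPageSize(oldSize: Long, usedBytes: Long, recordSize: Long): Long =
  math.max(oldSize * 2, usedBytes + recordSize)
```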

Author: sychen <sychen@ctrip.com>

Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size.

(cherry picked from commit 8883401)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

SPARK-17874 introduced a new configuration to set the port that SSL services bind to. We missed updating the scaladoc and the `toString` method, though. This PR adds it in the missing places.

## How was this patch tested?

checked the `toString` output in the logs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21429 from mgaido91/minor_ssl.

(cherry picked from commit fd315f5)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
As discussed separately, this avoids the possibility of XSS on certain request param keys.

CC vanzin

Author: Sean Owen <srowen@gmail.com>

Closes #21464 from srowen/XSS2.

(cherry picked from commit 698b9a0)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…veExternalCatalogVersionsSuite

Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs.

running UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21540 from mgaido91/SPARK-24531.

(cherry picked from commit 2824f14)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
Currently, `spark.ui.filters` are not applied to the handlers added after binding the server. This means that every page which is added after starting the UI will not have the filters configured on it. This can allow unauthorized access to the pages.

The PR adds the filters also to the handlers added after the UI starts.

manual tests (without the patch, starting the thriftserver with `--conf spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter --conf spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"` you can access `http://localhost:4040/sqlserver`; with the patch, 401 is the response as for the other pages).

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21523 from mgaido91/SPARK-24506.

(cherry picked from commit f53818d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Apply the suggestion on the bug to fix source links. Tested with
the 2.3.1 release docs.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21521 from vanzin/SPARK-23732.

(cherry picked from commit dc22465)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…ator.

When an output stage is retried, it's possible that tasks from the previous
attempt are still running. In that case, there would be a new task for the
same partition in the new attempt, and the coordinator would allow both
tasks to commit their output since it did not keep track of stage attempts.

The change adds more information to the stage state tracked by the coordinator,
so that only one task is allowed to commit the output in the above case.
The stage state in the coordinator is also maintained across stage retries,
so that a stray speculative task from a previous stage attempt is not allowed
to commit.

This also removes some code added in SPARK-18113 that allowed for duplicate
commit requests; with the RPC code used in Spark 2, that situation cannot
happen, so there is no need to handle it.
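
A minimal sketch of the bookkeeping idea (illustrative, not the actual OutputCommitCoordinator code):

```scala
import scala.collection.mutable

// Illustrative only: authorization is keyed by stage *and* checked against the
// stage attempt, so a stray task from an earlier attempt cannot also commit.
final case class CommitterIdent(stageAttempt: Int, taskAttempt: Int)

class CommitCoordinatorSketch {
  // (stageId, partitionId) -> task currently allowed to commit
  private val authorized = mutable.Map.empty[(Int, Int), CommitterIdent]

  def canCommit(stageId: Int, stageAttempt: Int, partition: Int, taskAttempt: Int): Boolean =
    synchronized {
      val key = (stageId, partition)
      authorized.get(key) match {
        case Some(existing) => existing == CommitterIdent(stageAttempt, taskAttempt)
        case None =>
          authorized(key) = CommitterIdent(stageAttempt, taskAttempt)
          true
      }
    }
}
```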

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21577 from vanzin/SPARK-24552.

(cherry picked from commit c8e909c)
Signed-off-by: Thomas Graves <tgraves@apache.org>
Added stageAttemptId to TaskContext and modified the corresponding constructors.

Added a new test in TaskContextSuite, two cases are tested:
1. Normal case without failure
2. Exception case with resubmitted stages

Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897)

Author: Xianjin YE <advancedxy@gmail.com>

Closes #20082 from advancedxy/SPARK-22897.

Conflicts:
	project/MimaExcludes.scala


Author: Xianjin YE <advancedxy@gmail.com>
Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com>

Closes #21609 from tgravescs/SPARK-22897.
…er for writes .

This passes a unique attempt id to the Hadoop APIs, because attempt
number is reused when stages are retried. When attempt numbers are
reused, sources that track data by partition id and attempt number
may incorrectly clean up data because the same attempt number can
be both committed and aborted.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21616 from vanzin/SPARK-24552-2.2.
findTightestCommonTypeOfTwo has been renamed to findTightestCommonType


Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>

Closes #21597 from Fokko/fd-typo.

(cherry picked from commit 6a97e8e)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon

@xianbin looks mistakenly opened. Mind closing this please?

@srowen srowen mentioned this pull request Jul 3, 2018
@asfgit asfgit closed this in 5bf95f2 Jul 4, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#20932
Closes apache#17843
Closes apache#13477
Closes apache#14291
Closes apache#20919
Closes apache#17907
Closes apache#18766
Closes apache#20809
Closes apache#8849
Closes apache#21076
Closes apache#21507
Closes apache#21336
Closes apache#21681
Closes apache#21691

Author: Sean Owen <srowen@gmail.com>

Closes apache#21708 from srowen/CloseStalePRs.