From ba1097686041e6d57183e7979814dff0ff7ffa5a Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 16 May 2017 10:20:28 +0200
Subject: [PATCH 1/7] Migration guide 2.1->2.2

---
 docs/ml-guide.md            | 25 +++++++------------------
 docs/ml-migration-guides.md | 29 +++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 971761961b965..d1e51aefebde9 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -72,35 +72,24 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.
 
-## From 2.0 to 2.1
+## From 2.1 to 2.2
 
 ### Breaking changes
-
-**Deprecated methods removed**
 
-* `setLabelCol` in `feature.ChiSqSelectorModel`
-* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
-* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
-* `model` in `regression.LinearRegressionSummary`
-* `validateParams` in `PipelineStage`
-* `validateParams` in `Evaluator`
+There are no breaking changes.
 
 ### Deprecations and changes of behavior
 
 **Deprecations**
 
-* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
- Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+There are no deprecations.
 
 **Changes of behavior**
 
-* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
- Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
-* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
- `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
-* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
- `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
-
+* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
+ Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
+ **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
+
 ## Previous Spark versions
 
 Earlier migration guides are archived [on this page](ml-migration-guides.html).
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
index 58c3747ea6387..687d7c8930362 100644
--- a/docs/ml-migration-guides.md
+++ b/docs/ml-migration-guides.md
@@ -7,6 +7,35 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
 
+## From 2.0 to 2.1
+
+### Breaking changes
+
+**Deprecated methods removed**
+
+* `setLabelCol` in `feature.ChiSqSelectorModel`
+* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
+* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
+* `model` in `regression.LinearRegressionSummary`
+* `validateParams` in `PipelineStage`
+* `validateParams` in `Evaluator`
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
+ Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+
+**Changes of behavior**
+
+* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
+ Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
+* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
+ `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
+* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
+ `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
+
 ## From 1.6 to 2.0
 
 ### Breaking changes

From 5a3d87b4a58b1c3db6ce49fe3a6aa9caf8ed9b42 Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 16 May 2017 10:34:56 +0200
Subject: [PATCH 2/7] Bump expected parity release number

---
 docs/ml-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index d1e51aefebde9..d088f4fe6558b 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -26,7 +26,7 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
 * MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
 * MLlib will not add new features to the RDD-based API.
 * In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
-* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
+* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
 * The RDD-based API is expected to be removed in Spark 3.0.
 
 *Why is MLlib switching to the DataFrame-based API?*

From ac2e50d185479030420d0fc973fc12caee5bc3ea Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 16 May 2017 12:22:52 +0200
Subject: [PATCH 3/7] Update migration guide

---
 docs/ml-guide.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index d088f4fe6558b..6b406c9a0906f 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -89,7 +89,9 @@ There are no deprecations.
 * [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
  Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
  **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
-
+* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
+ Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
+
 ## Previous Spark versions
 
 Earlier migration guides are archived [on this page](ml-migration-guides.html).

From d122dd86e6139b970324749761bf553765d18eb9 Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Thu, 18 May 2017 09:14:28 +0200
Subject: [PATCH 4/7] Add string indexer fix to behavior changes section

---
 docs/ml-guide.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 6b406c9a0906f..013d2362f04fa 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -91,6 +91,9 @@ There are no deprecations.
  **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
 * [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
  Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
+* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
+ `StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
+ would always be thrown regardless of the setting of the `handleInvalid` parameter.
 
 ## Previous Spark versions
 

From 85b0f6e4bb77206b924a9d858c434b526d998f8a Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Thu, 18 May 2017 09:55:44 +0200
Subject: [PATCH 5/7] Add release highlights section

---
 docs/ml-guide.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 013d2362f04fa..19c36c019b522 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -66,6 +66,28 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
     watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
 
+# Highlights in 2.2
+
+The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
+release of Spark:
+
+* `ALS` methods for _top-k_ recommendations for all users or items, matching the functionality
+ in `mllib` ([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)). Performance
+ was also improved for both `ml` and `mllib`
+ ([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
+ [SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
+* `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
+ ([SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635) and
+ [SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
+* `GLM` now supports the full `Tweedie` family
+ ([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
+* `Imputer` feature transformer to impute missing values in a dataset
+ ([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
+* `LinearSVC` for linear Support Vector Machine classification
+ ([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
+* Logistic regression now supports constraints on the coefficients during training
+ ([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
+
 # Migration guide
 
 MLlib is under active development.
From e27d9e46c2ee654e5e29259bdf4b653cb55df6ce Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Thu, 18 May 2017 09:59:48 +0200
Subject: [PATCH 6/7] Add FPGrowth to highlights list

---
 docs/ml-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 19c36c019b522..0c167fdf28ae5 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -79,6 +79,8 @@ release of Spark:
 * `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
  ([SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635) and
  [SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
+* `FPGrowth` algorithm for frequent pattern mining
+ ([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
 * `GLM` now supports the full `Tweedie` family
  ([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
 * `Imputer` feature transformer to impute missing values in a dataset

From fb9fb5b60b8cf29f2e11368ae78a6f9de048f43a Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Fri, 19 May 2017 20:28:47 +0200
Subject: [PATCH 7/7] Correct JIRA link for correlation

---
 docs/ml-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 0c167fdf28ae5..362e883e55e83 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -77,7 +77,7 @@ release of Spark:
  ([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
  [SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
 * `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
- ([SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635) and
+ ([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
  [SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
 * `FPGrowth` algorithm for frequent pattern mining
  ([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))