@mgaido91
Contributor

@mgaido91 mgaido91 commented Oct 17, 2018

What changes were proposed in this pull request?

This PR proposes to deprecate the computeCost method on BisectingKMeans in favor of using ClusteringEvaluator to evaluate the clustering.

How was this patch tested?

NA

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97495 has finished for PR 22756 at commit 4761989.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

cc @holdenk @srowen

Member

@srowen srowen left a comment

I see, this is a little more than just deprecating a few methods. Is this basically the same change as for KMeans?

        return [c.toArray() for c in self._call_java("clusterCenters")]

    @since("2.0.0")
    def computeCost(self, dataset):

Member

Hm, can you actually remove this, vs just deprecate it?

Contributor Author

sorry, this was not intended, I am fixing this.

Contributor Author

@mgaido91 mgaido91 left a comment

yes, this is the same change that was made there, i.e., deprecate computeCost but offer the cost on the training dataset in the summary, so that users can still get it.

In addition, I also edited the examples so that they don't include deprecated methods.
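
For readers unfamiliar with the pattern, here is a rough pure-Python sketch of the idea, not the actual Spark code. It assumes the cost is the sum of squared distances from each point to its nearest cluster center; `ToyClusteringModel`, `ToySummary`, and the helper functions are hypothetical names made up for the illustration.

```python
import warnings


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


class ToySummary:
    """Hypothetical stand-in for a training summary that stores the cost."""

    def __init__(self, training_cost):
        self.trainingCost = training_cost


class ToyClusteringModel:
    """Hypothetical model; NOT the real BisectingKMeansModel."""

    def __init__(self, centers, training_points):
        self.centers = centers
        # The cost on the training data is computed once and kept in the summary.
        self.summary = ToySummary(cost(training_points, centers))

    def computeCost(self, points):
        # Deprecated entry point: it still works, but warns callers to migrate.
        warnings.warn(
            "computeCost is deprecated; use an evaluator or summary.trainingCost",
            DeprecationWarning,
            stacklevel=2,
        )
        return cost(points, self.centers)


training = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
model = ToyClusteringModel(
    centers=[(0.0, 1.0), (10.0, 0.0)], training_points=training
)
print(model.summary.trainingCost)
```

In this sketch, existing callers of `computeCost` keep working and get a `DeprecationWarning`, while new code reads the training cost from the summary instead.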


@SparkQA

SparkQA commented Oct 17, 2018

Test build #97501 has finished for PR 22756 at commit ed235f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

     * instead. You can also get the cost on the training dataset in the summary.
     */
    @Since("2.0.0")
    @deprecated("This method is deprecated and will be removed in 3.0.0. Use ClusteringEvaluator " +

Member

@dongjoon-hyun dongjoon-hyun Oct 17, 2018

I'm wondering if this PR is a blocker for Spark 2.4. According to the JIRA description (Improvement/Minor), we cannot remove this in 3.0.0 because we cannot announce deprecations before 3.0.0. So, this PR looks invalid to me.

cc @cloud-fan since he is a release manager.

Contributor

It looks reasonable to me to deprecate it in 2.4 so that we can remove it in 3.0, if this is the last one. Then we can have a consistent ML API in 3.0 after removing these deprecated APIs.

Member

@dongjoon-hyun dongjoon-hyun Oct 18, 2018

Thank you for the decision, @cloud-fan! So, this is one of the ML API auditing tasks.

Contributor Author

yes this is the last one.

> Then we can have a consistent ML API in 3.0 after removing these deprecated APIs.

Yes, that's my goal in targeting this for 2.4.

Thanks.

    ClusteringEvaluator evaluator = new ClusteringEvaluator();

    double silhouette = evaluator.evaluate(predictions);
    System.out.println("Silhouette with squared euclidean distance = " + silhouette);

Member

@dongjoon-hyun dongjoon-hyun Oct 18, 2018

@mgaido91, if we are going to update all the ML examples for the deprecation, we had better change the following, too. Could you also check whether there are other instances?

    # Evaluate clustering.
    cost = model.computeCost(dataset)

Contributor Author

Thanks @dongjoon-hyun, I'll do that.

@dongjoon-hyun
Member

@mgaido91, if you don't mind, could you split this PR into two PRs? One adding only the deprecation annotation, and the other adding the new API and updating all the examples.

@mgaido91
Contributor Author

@dongjoon-hyun sure, thanks. I'll update asap. Thanks.

@SparkQA

SparkQA commented Oct 18, 2018

Test build #97525 has finished for PR 22756 at commit d5fddb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@WeichenXu123
Contributor

LGTM. thanks!

@dongjoon-hyun
Member

Merged to master/branch-2.4.

asfgit pushed a commit that referenced this pull request Oct 18, 2018
## What changes were proposed in this pull request?

The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering.

## How was this patch tested?

NA

Closes #22756 from mgaido91/SPARK-25758.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c296254)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Thank you, @mgaido91 and all!

@asfgit asfgit closed this in c296254 Oct 18, 2018
@holdenk
Contributor

holdenk commented Oct 19, 2018

I'm seeing this linked from #22764 and I'm wondering if we need to revert this. If the information is not actually available where we tell folks it is, I think we need to revert this, especially since we are in the middle of the release process. Or raise SPARK-25765 to a release blocker.

Have I misunderstood the situation here?

@dongjoon-hyun
Member

I also understand today's situation and agree with @holdenk's thought about SPARK-25765 as a blocker. Ping @cloud-fan since you are the release manager. How can we proceed with SPARK-25765?

Maybe this is due to "Preparing Spark release v2.4.0-rc4", which happened two hours ago. We are in the middle of an unstable situation.

Also, cc @gatorsmile .

@gatorsmile
Member

cc @mengxr WDYT? It does not sound like a blocker to me.

@mengxr
Contributor

mengxr commented Oct 19, 2018

We have to revert this PR in branch-2.4. It is not a blocker and we shouldn't merge it to branch-2.4 this late in this already delayed release.

@gatorsmile
Member

Let me revert it. Thanks!

@gatorsmile
Member

Done

@cloud-fan
Contributor

Shall we revert it from master as well? At the very least, we need to update the message `This method is deprecated and will be removed in 3.0.0.`

@mgaido91
Contributor Author

yes, I agree: if we are not going to deprecate it in 2.4, we also need to revert it on master, because of @cloud-fan's comment.

This would mean we won't have consistency with KMeans though, which is not that good IMHO.
Thanks.

@cloud-fan
Contributor

reverted from master. Let's move the discussion to #22764
