-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans #22756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #97495 has finished for PR 22756 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, this is a little more than just deprecating a few methods. Is this basically the same change as for KMeans?
| return [c.toArray() for c in self._call_java("clusterCenters")] | ||
|
|
||
| @since("2.0.0") | ||
| def computeCost(self, dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, can you actually remove this, vs just deprecate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this was not intended, I am fixing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is the same change which was done there, ie. deprecate the computeCost, but offer the cost on the training dataset in the summary, in order for the users to be able to still get it.
In addition, I also edited the examples so that they don't include deprecated methods.
| return [c.toArray() for c in self._call_java("clusterCenters")] | ||
|
|
||
| @since("2.0.0") | ||
| def computeCost(self, dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this was not intended, I am fixing this.
|
Test build #97501 has finished for PR 22756 at commit
|
| * instead. You can also get the cost on the training dataset in the summary. | ||
| */ | ||
| @Since("2.0.0") | ||
| @deprecated("This method is deprecated and will be removed in 3.0.0. Use ClusteringEvaluator " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this PR is a blocker for Spark 2.4. According to JIRA desciption (Improvement/Minor), we cannot remove this at 3.0.0 because we cannot announce deprecations before 3.0.0. So, this PR looks invalid to me.
cc @cloud-fan since he is a release manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks reasonable to me to deprecate it in 2.4 so that we can remove it in 3.0, if this is the last one. Then we can have a consistent ML API in 3.0 after removing these deprecated APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the decision, @cloud-fan ! So, this is one of the task ML API auditing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is the last one.
Then we can have a consistent ML API in 3.0 after removing these deprecated APIs.
Yes, that's my goal in targeting this for 2.4.
Thanks.
| ClusteringEvaluator evaluator = new ClusteringEvaluator(); | ||
|
|
||
| double silhouette = evaluator.evaluate(predictions); | ||
| System.out.println("Silhouette with squared euclidean distance = " + silhouette); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 .
If we are going to change all ml examples for deprecation, we had better change the following, too. And, could you check if we had another instances?
# Evaluate clustering.
cost = model.computeCost(dataset)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks I'll do @dongjoon-hyun
|
@mgaido91 . If you don't mind, could you split this PR into two PRs? One is adding |
|
@dongjoon-hyun sure, thanks. I'll update asap. Thanks. |
|
Test build #97525 has finished for PR 22756 at commit
|
|
LGTM |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
LGTM. thanks! |
|
Merged to master/branch-2.4. |
## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes #22756 from mgaido91/SPARK-25758. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c296254) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
|
Thank you, @mgaido91 and all! |
|
I'm seeing this linked from #22764 and I'm wondering if we need to revert this. If the information is not actually available where we tell folks it is I think we need to revert this especially since we are in the middle of the release process. Or raise SPARK-25765 to blocker release blocker. Have I misunderstood the situation here? |
|
I also understand today's situation and agree with @holdenk 's thought about SPARK-25765 as a blocker. Ping @cloud-fan since you are a release manager. How can we proceed SPARK-25765? Maybe, this is due to Also, cc @gatorsmile . |
|
cc @mengxr WDYT? It does not sound a blocker to me. |
|
We have to revert this PR in branch-2.4. It is not a blocker and we shouldn't merge it to branch-2.4 this late in this already delayed release. |
|
Let me revert it. Thanks! |
|
Done |
|
shall we revert it from master as well? At least we need to update the message |
|
yes, I agree, if we are not going to deprecate it in 2.4, we need to revert also on master because of @cloud-fan's comment. This would mean we won't have coherency with |
|
reverted from master. Let's move the discussion to #22764 |
## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes apache#22756 from mgaido91/SPARK-25758. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
The PR proposes to deprecate the
computeCostmethod onBisectingKMeansin favor of the adoption ofClusteringEvaluatorin order to evaluate the clustering.How was this patch tested?
NA