[SPARK-29143][PYTHON][ML] Pyspark feature models support column setters/getters #25908
Conversation
Test build #111246 has finished for PR 25908 at commit
python/pyspark/ml/feature.py (Outdated)
- class LSHParams(Params):
+ class LSHParams(JavaParams, HasInputCol, HasOutputCol):
Do we need to extend JavaParams here and in other places?
No. I will remove.
>>> model = maScaler.fit(df)
>>> model.getOutputCol()
'scaled'
>>> model.transform(df).show()
What about making sure that the setters really work?
python/pyspark/ml/feature.py (Outdated)
    return self._call_java("numDocs")

class ImputerParams(JavaParams, HasInputCols, HasOutputCols):
Here, by extending HasOutputCols, we do not need to add the outputCols variable by hand.
Yes, I didn't have outputCols as a member variable.
>>> mmScaler = MinMaxScaler(inputCol="a", outputCol="scaled")
>>> model = mmScaler.fit(df)
>>> model.setOutputCol("scaledOutput")
MinMaxScaler...
Why not MinMaxScalerModel...?
It is MinMaxScaler_2cca32620254; the model keeps the estimator's uid.
Test build #111304 has finished for PR 25908 at commit
srowen left a comment:
Pretty much same comments as #25859 ; does this change any API or just refactor?
Similar to the other PR, this PR adds setters/getters to feature models. I prefer to follow it up in the following PRs.

Test build #111437 has finished for PR 25908 at commit
python/pyspark/ml/feature.py (Outdated)
  @inherit_doc
- class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
+ class MinHashLSH(JavaEstimator, _LSHParams, HasInputCol, HasOutputCol, HasSeed,
This place seems a bit different from the Scala side:
MinHashLSH(override val uid: String) extends LSH[MinHashLSHModel] with HasSeed
Will change.
  if not name.endswith('Model') and not name.endswith('Params') \
          and issubclass(cls, JavaParams) and not inspect.isabstract(cls) \
-         and not name.startswith('Java'):
+         and not name.startswith('Java') and name != 'LSH':
Why need to filter out LSH here?
The original code doesn't have a Python class LSH:
class LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
    """
    Mixin for Locality Sensitive Hashing (LSH).
    """

    def setNumHashTables(self, value):
        """
        Sets the value of :py:attr:`numHashTables`.
        """
        return self._set(numHashTables=value)
I added this class so I have a place for setNumHashTables. I didn't give LSH an __init__ because the Scala LSH is an abstract class with no constructor. Since there is no __init__, self._java_obj is never set, and test_param throws an exception for LSH:
Traceback (most recent call last):
File "/Users/hgao/spark092119/spark/python/pyspark/ml/tests/test_param.py", line 358, in test_java_params
check_params(self, cls(), check_params_exist=False)
File "/Users/hgao/spark092119/spark/python/pyspark/testing/mlutils.py", line 40, in check_params
java_stage = py_stage._to_java()
File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 222, in _to_java
self._transfer_params_to_java()
File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 145, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._defaultParamMap[param])
File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 131, in _make_java_param_pair
java_param = self._java_obj.getParam(param.name)
AttributeError: 'NoneType' object has no attribute 'getParam'
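The failure above can be reproduced outside Spark with a minimal sketch (the class names below are illustrative stand-ins, not the real pyspark classes): any class in the wrapper hierarchy that never runs an `__init__` leaves `_java_obj` at its class-level default of `None`, so calling `getParam` on it raises exactly this `AttributeError`.

```python
class JavaWrapperSketch:
    # Stand-in for pyspark.ml.wrapper.JavaWrapper: _java_obj defaults
    # to None at class level and is only assigned inside an __init__.
    _java_obj = None


class LSHSketch(JavaWrapperSketch):
    # Like the Python LSH mixin above: it defines no __init__, so
    # _java_obj is never set on instances.
    pass


stage = LSHSketch()
msg = None
try:
    stage._java_obj.getParam("numHashTables")
except AttributeError as err:
    msg = str(err)
print(msg)  # 'NoneType' object has no attribute 'getParam'
```

This is why test_param has to skip LSH: there is no Java object to transfer params to.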
Test build #111493 has finished for PR 25908 at commit
@srowen @zhengruifeng @huaxingao Could you explain the rationale behind 425959e? A leading underscore doesn't really have the same semantics as Scala package private (or a private modifier in general). However, when used in a context which has a direct impact on the user-facing API, it is quite annoying, as it requires manual segregation from truly internal components. Even if the goal is simple API parity, it makes sense to take such things into account. Otherwise the whole thing could just be code generated. Just saying...
If the general question is "why hide things?", it's just the usual software design idea: only expose what you intend to support as a user or developer API. I'm not making a specific judgment about whether these should be exposed or not; I just don't know yet. If the question is "can we make this more extensible?", I also don't have a strong opinion and am not against it, but I think that can be considered separately. There are also going to be a few follow-on PRs to this one to further remove some setters.
@zero323 I am neutral on this leading underscore. Since I am following Bryan's convention in ...

Not at all @srowen, it is a specific question. These mixins are used only to augment the public API, and as such directly affect users' code (for example method resolution order), so it is a bit fuzzy why we mark them as internal. On the one hand their usage is limited to specific ... In contrast to SPARK-29212, I am indifferent here and just trying to get a better feeling for how the API evolves and what drives certain decisions.
@huaxingao I understand leading-underscore usage and I've seen Bryan's work, but for now it was an exception in ...
Test build #111754 has finished for PR 25908 at commit
python/pyspark/ml/feature.py (Outdated)
    return self.getOrDefault(self.numHashTables)

class LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
I guess this should be _LSH as well, shouldn't it? The Scala counterpart is private[ml].
python/pyspark/ml/feature.py (Outdated)
- class LSHModel(JavaModel):
+ class LSHModel(JavaModel, _LSHParams):
Ditto. Maybe it should be _LSHModel?
    """
    Params for :py:class:`BucketedRandomProjectionLSH` and
    :py:class:`BucketedRandomProjectionLSHModel`.
    .. versionadded:: 3.0.0
There should be an empty line between the description and `.. versionadded:: ...` (or any directive):

    """
    Params for :py:class:`BucketedRandomProjectionLSH` and
    :py:class:`BucketedRandomProjectionLSHModel`.
    .. versionadded:: 3.0.0

Otherwise such elements are not recognized as directives and are incorrectly appended to the body.
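A corrected version per this comment would look like the following sketch (the class body is trimmed to just the docstring). The blank line before the directive is what lets Sphinx parse `versionadded` as a directive rather than fold it into the description paragraph.

```python
class _BucketedRandomProjectionLSHParams:
    """
    Params for :py:class:`BucketedRandomProjectionLSH` and
    :py:class:`BucketedRandomProjectionLSHModel`.

    .. versionadded:: 3.0.0
    """


doc = _BucketedRandomProjectionLSHParams.__doc__
# The description and the directive are now separated by a blank line.
print("\n\n" in doc and ".. versionadded:: 3.0.0" in doc)  # True
```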
class _IDFParams(HasInputCol, HasOutputCol):
    """
    Params for :py:class:`IDF` and :py:class:`IDFModel`.
    .. versionadded:: 3.0.0
Ditto.
class _ImputerParams(HasInputCols, HasOutputCols):
    """
    Params for :py:class:`Imputer` and :py:class:`ImputerModel`.
    .. versionadded:: 3.0.0
Ditto.
class _MaxAbsScalerParams(HasInputCol, HasOutputCol):
    """
    Params for :py:class:`MaxAbsScaler` and :py:class:`MaxAbsScalerModel`.
    .. versionadded:: 3.0.0
Ditto.
class _MinMaxScalerParams(HasInputCol, HasOutputCol):
    """
    Params for :py:class:`MinMaxScaler` and :py:class:`MinMaxScalerModel`.
    .. versionadded:: 3.0.0
Ditto.
class _OneHotEncoderParams(HasInputCols, HasOutputCols, HasHandleInvalid):
    """
    Params for :py:class:`OneHotEncoder` and :py:class:`OneHotEncoderModel`.
    .. versionadded:: 3.0.0
Ditto.
class _RobustScalerParams(HasInputCol, HasOutputCol):
    """
    Params for :py:class:`RobustScaler` and :py:class:`RobustScalerModel`.
    .. versionadded:: 3.0.0
Ditto.
class _StandardScalerParams(HasInputCol, HasOutputCol):
    """
    Params for :py:class:`StandardScaler` and :py:class:`StandardScalerModel`.
    .. versionadded:: 3.0.0
Ditto. And the remaining ones as well.
Test build #111767 has finished for PR 25908 at commit

Test build #111768 has finished for PR 25908 at commit
Merged to master.

Thanks all for your help!
What changes were proposed in this pull request?
Add column setters/getters support in PySpark feature models.
Why are the changes needed?
Keep parity between PySpark and Scala.
Does this PR introduce any user-facing change?
Yes. After the change, PySpark feature models have column setters/getters support.
How was this patch tested?
Added some doctests.
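The setter/getter pattern this PR adds can be sketched without a running Spark session (all class names below are simplified stand-ins, not pyspark's actual implementation): a shared `_*Params`-style mixin gives both the estimator and the model the same column setters and getters, backed by a param map.

```python
class Params:
    """Minimal stand-in for pyspark.ml.param.Params: a param map
    written by _set and read by getOrDefault."""

    def __init__(self):
        self._paramMap = {}
        self._defaultParamMap = {}

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self  # returning self is what yields 'MinMaxScaler...' in doctests

    def getOrDefault(self, name):
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


class HasOutputCol(Params):
    """Mixin shared by estimator and model, in the spirit of the
    _*Params classes this PR introduces."""

    def setOutputCol(self, value):
        return self._set(outputCol=value)

    def getOutputCol(self):
        return self.getOrDefault("outputCol")


class MaxAbsScalerModelSketch(HasOutputCol):
    # A model gains column setters/getters just by inheriting the mixin.
    pass


model = MaxAbsScalerModelSketch()
model.setOutputCol("scaled")
print(model.getOutputCol())  # scaled
```

Because the mixin is shared, the estimator and the fitted model stay in sync automatically when a param is added, which is the parity with Scala that this PR is after.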