@huaxingao
Contributor

What changes were proposed in this pull request?

Add column setter/getter support to PySpark feature models.

Why are the changes needed?

Keep parity between PySpark and Scala.

Does this PR introduce any user-facing change?

Yes.
After the change, PySpark feature models support column setters/getters.

How was this patch tested?

Added doctests.

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111246 has finished for PR 25908 at commit 2ced6ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LSHParams(JavaParams, HasInputCol, HasOutputCol):
  • class LSH(JavaEstimator, LSHParams, JavaMLReadable, JavaMLWritable):
  • class LSHModel(JavaModel, LSHParams):
  • class BucketedRandomProjectionLSHParams(JavaParams):
  • class BucketedRandomProjectionLSH(LSH, BucketedRandomProjectionLSHParams,
  • class BucketedRandomProjectionLSHModel(LSHModel, BucketedRandomProjectionLSHParams, JavaMLReadable,
  • class ImputerParams(JavaParams, HasInputCols, HasOutputCols):
  • class Imputer(JavaEstimator, ImputerParams, JavaMLReadable, JavaMLWritable):
  • class ImputerModel(JavaModel, ImputerParams, JavaMLReadable, JavaMLWritable):
  • class MaxAbsScalerParams(JavaParams, HasInputCol, HasOutputCol):
  • class MaxAbsScaler(JavaEstimator, MaxAbsScalerParams, JavaMLReadable, JavaMLWritable):
  • class MaxAbsScalerModel(JavaModel, MaxAbsScalerParams, JavaMLReadable, JavaMLWritable):
  • class MinMaxScalerParams(JavaParams, HasInputCol, HasOutputCol):
  • class MinMaxScaler(JavaEstimator, MinMaxScalerParams, JavaMLReadable, JavaMLWritable):
  • class MinMaxScalerModel(JavaModel, MinMaxScalerParams, JavaMLReadable, JavaMLWritable):
  • class OneHotEncoderParams(JavaParams, HasInputCols, HasOutputCols, HasHandleInvalid):
  • class OneHotEncoder(JavaEstimator, OneHotEncoderParams, JavaMLReadable, JavaMLWritable):
  • class OneHotEncoderModel(JavaModel, OneHotEncoderParams, JavaMLReadable, JavaMLWritable):
  • class RobustScalerParams(JavaParams, HasInputCol, HasOutputCol):
  • class RobustScaler(JavaEstimator, RobustScalerParams, JavaMLReadable, JavaMLWritable):
  • class RobustScalerModel(JavaModel, RobustScalerParams, JavaMLReadable, JavaMLWritable):
  • class StandardScalerParams(JavaParams, HasInputCol, HasOutputCol):
  • class StandardScaler(JavaEstimator, StandardScalerParams, JavaMLReadable, JavaMLWritable):
  • class StandardScalerModel(JavaModel, StandardScalerParams, JavaMLReadable, JavaMLWritable):
  • class VectorIndexerParams(JavaParams, HasInputCol, HasOutputCol, HasHandleInvalid):
  • class VectorIndexer(JavaEstimator, VectorIndexerParams, JavaMLReadable, JavaMLWritable):
  • class VectorIndexerModel(JavaModel, VectorIndexerParams, JavaMLReadable, JavaMLWritable):
  • class Word2VecParams(JavaParams, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
  • class Word2Vec(JavaEstimator, Word2VecParams, JavaMLReadable, JavaMLWritable):
  • class Word2VecModel(JavaModel, Word2VecParams, JavaMLReadable, JavaMLWritable):
  • class PCAParams(JavaParams, HasInputCol, HasOutputCol):
  • class PCA(JavaEstimator, PCAParams, JavaMLReadable, JavaMLWritable):
  • class PCAModel(JavaModel, PCAParams, JavaMLReadable, JavaMLWritable):
  • class RFormulaParams(JavaParams, HasFeaturesCol, HasLabelCol, HasHandleInvalid):
  • class RFormula(JavaEstimator, RFormulaParams, JavaMLReadable, JavaMLWritable):
  • class RFormulaModel(JavaModel, RFormulaParams, JavaMLReadable, JavaMLWritable):
  • class ChiSqSelectorParams(JavaParams, HasFeaturesCol, HasOutputCol, HasLabelCol):
  • class ChiSqSelector(JavaEstimator, ChiSqSelectorParams, JavaMLReadable, JavaMLWritable):
  • class ChiSqSelectorModel(JavaModel, ChiSqSelectorParams, JavaMLReadable, JavaMLWritable):



-class LSHParams(Params):
+class LSHParams(JavaParams, HasInputCol, HasOutputCol):
Contributor

Do we need to extend JavaParams here and in other places?

Contributor Author

No. I will remove.

>>> model = maScaler.fit(df)
>>> model.getOutputCol()
'scaled'
>>> model.transform(df).show()
Contributor

What about making sure that the setters really work?

return self._call_java("numDocs")


class ImputerParams(JavaParams, HasInputCols, HasOutputCols):
Contributor

By extending HasOutputCols here, we do not need to add the outputCols variable by hand.

Contributor Author

Yes, I didn't add outputCols as a member variable.
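
The point about inheriting `outputCols` from a shared mixin can be sketched in plain Python. This is only an illustration of the pattern, not the pyspark implementation: the classes below are simplified stand-ins that reuse the names from the diff, and the `_params` dictionary is an assumption made for the example.

```python
class HasOutputCols:
    """Stand-in for a shared param mixin (not the real pyspark class)."""

    def __init__(self):
        self._params = {}

    def getOutputCols(self):
        return self._params.get("outputCols")


class _ImputerParams(HasOutputCols):
    """Params shared by the estimator and its model; nothing redeclared here."""


class ImputerModel(_ImputerParams):
    def setOutputCols(self, value):
        self._params["outputCols"] = list(value)
        return self


class Imputer(_ImputerParams):
    def setOutputCols(self, value):
        self._params["outputCols"] = list(value)
        return self

    def fit(self):
        # The fitted model starts from a copy of the estimator's params.
        model = ImputerModel()
        model._params = dict(self._params)
        return model


estimator = Imputer().setOutputCols(["out_a", "out_b"])
model = estimator.fit()
inherited = model.getOutputCols()   # getter comes from the shared mixin
model.setOutputCols(["x", "y"])     # setter on the model, as added by this PR
print(inherited, model.getOutputCols())
```

Because the getter lives in the mixin, the estimator and the model expose it without either class declaring `outputCols` itself.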

>>> mmScaler = MinMaxScaler(inputCol="a", outputCol="scaled")
>>> model = mmScaler.fit(df)
>>> model.setOutputCol("scaledOutput")
MinMaxScaler...
Contributor

Why not MinMaxScalerModel...?

Contributor Author

The actual output is MinMaxScaler_2cca32620254.
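
Why the doctest expects `MinMaxScaler...` rather than `MinMaxScalerModel...` can be checked with the standard library alone. The sketch below assumes the doctests run with the `ELLIPSIS` option enabled and uses the output string quoted in this thread; it does not involve Spark at all.

```python
import doctest

checker = doctest.OutputChecker()

# Output reported in the thread for model.setOutputCol("scaledOutput").
got = "MinMaxScaler_2cca32620254\n"

# With ELLIPSIS, "..." in the expected text matches any substring.
matches_estimator_prefix = checker.check_output(
    "MinMaxScaler...\n", got, doctest.ELLIPSIS
)
matches_model_prefix = checker.check_output(
    "MinMaxScalerModel...\n", got, doctest.ELLIPSIS
)

print(matches_estimator_prefix, matches_model_prefix)
```

The first check passes and the second fails, since the printed identifier starts with the estimator's name rather than the model's.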

@huaxingao
Contributor Author

_CountVectorizerParams and _StringIndexerParams have a leading underscore. Shall we follow the same convention? @zhengruifeng

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111304 has finished for PR 25908 at commit 74de7a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LSHParams(HasInputCol, HasOutputCol):
  • class BucketedRandomProjectionLSHParams():
  • class IDFParams(HasInputCol, HasOutputCol):
  • class IDF(JavaEstimator, IDFParams, JavaMLReadable, JavaMLWritable):
  • class IDFModel(JavaModel, IDFParams, JavaMLReadable, JavaMLWritable):
  • class ImputerParams(HasInputCols, HasOutputCols):
  • class MaxAbsScalerParams(HasInputCol, HasOutputCol):
  • class MinMaxScalerParams(HasInputCol, HasOutputCol):
  • class OneHotEncoderParams(HasInputCols, HasOutputCols, HasHandleInvalid):
  • class RobustScalerParams(HasInputCol, HasOutputCol):
  • class StandardScalerParams(HasInputCol, HasOutputCol):
  • class VectorIndexerParams(HasInputCol, HasOutputCol, HasHandleInvalid):
  • class Word2VecParams(HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
  • class PCAParams(HasInputCol, HasOutputCol):
  • class RFormulaParams(HasFeaturesCol, HasLabelCol, HasHandleInvalid):
  • class ChiSqSelectorParams(HasFeaturesCol, HasOutputCol, HasLabelCol):

Member

@srowen srowen left a comment

Pretty much the same comments as #25859; does this change any API, or is it just a refactor?

@huaxingao
Contributor Author

Similar to the other PR, this PR adds setters/getters to feature models.

@zhengruifeng
Contributor

zhengruifeng commented Sep 26, 2019

@huaxingao

> Shall we follow the same convention?

I prefer to follow it in the following PRs.

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111437 has finished for PR 25908 at commit 425959e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _LSHParams(HasInputCol, HasOutputCol):
  • class LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
  • class LSHModel(JavaModel, _LSHParams):
  • class _BucketedRandomProjectionLSHParams():
  • class BucketedRandomProjectionLSH(LSH, _BucketedRandomProjectionLSHParams,
  • class BucketedRandomProjectionLSHModel(LSHModel, _BucketedRandomProjectionLSHParams, JavaMLReadable,
  • class _IDFParams(HasInputCol, HasOutputCol):
  • class IDF(JavaEstimator, _IDFParams, JavaMLReadable, JavaMLWritable):
  • class IDFModel(JavaModel, _IDFParams, JavaMLReadable, JavaMLWritable):
  • class _ImputerParams(HasInputCols, HasOutputCols):
  • class Imputer(JavaEstimator, _ImputerParams, JavaMLReadable, JavaMLWritable):
  • class ImputerModel(JavaModel, _ImputerParams, JavaMLReadable, JavaMLWritable):
  • class _MaxAbsScalerParams(HasInputCol, HasOutputCol):
  • class MaxAbsScaler(JavaEstimator, _MaxAbsScalerParams, JavaMLReadable, JavaMLWritable):
  • class MaxAbsScalerModel(JavaModel, _MaxAbsScalerParams, JavaMLReadable, JavaMLWritable):
  • class MinHashLSH(JavaEstimator, _LSHParams, HasInputCol, HasOutputCol, HasSeed,
  • class _MinMaxScalerParams(HasInputCol, HasOutputCol):
  • class MinMaxScaler(JavaEstimator, _MinMaxScalerParams, JavaMLReadable, JavaMLWritable):
  • class MinMaxScalerModel(JavaModel, _MinMaxScalerParams, JavaMLReadable, JavaMLWritable):
  • class _OneHotEncoderParams(HasInputCols, HasOutputCols, HasHandleInvalid):
  • class OneHotEncoder(JavaEstimator, _OneHotEncoderParams, JavaMLReadable, JavaMLWritable):
  • class OneHotEncoderModel(JavaModel, _OneHotEncoderParams, JavaMLReadable, JavaMLWritable):
  • class _RobustScalerParams(HasInputCol, HasOutputCol):
  • class RobustScaler(JavaEstimator, _RobustScalerParams, JavaMLReadable, JavaMLWritable):
  • class RobustScalerModel(JavaModel, _RobustScalerParams, JavaMLReadable, JavaMLWritable):
  • class _StandardScalerParams(HasInputCol, HasOutputCol):
  • class StandardScaler(JavaEstimator, _StandardScalerParams, JavaMLReadable, JavaMLWritable):
  • class StandardScalerModel(JavaModel, _StandardScalerParams, JavaMLReadable, JavaMLWritable):
  • class _VectorIndexerParams(HasInputCol, HasOutputCol, HasHandleInvalid):
  • class VectorIndexer(JavaEstimator, _VectorIndexerParams, JavaMLReadable, JavaMLWritable):
  • class VectorIndexerModel(JavaModel, _VectorIndexerParams, JavaMLReadable, JavaMLWritable):
  • class _Word2VecParams(HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
  • class Word2Vec(JavaEstimator, _Word2VecParams, JavaMLReadable, JavaMLWritable):
  • class Word2VecModel(JavaModel, _Word2VecParams, JavaMLReadable, JavaMLWritable):
  • class _PCAParams(HasInputCol, HasOutputCol):
  • class PCA(JavaEstimator, _PCAParams, JavaMLReadable, JavaMLWritable):
  • class PCAModel(JavaModel, _PCAParams, JavaMLReadable, JavaMLWritable):
  • class _RFormulaParams(HasFeaturesCol, HasLabelCol, HasHandleInvalid):
  • class RFormula(JavaEstimator, _RFormulaParams, JavaMLReadable, JavaMLWritable):
  • class RFormulaModel(JavaModel, _RFormulaParams, JavaMLReadable, JavaMLWritable):
  • class _ChiSqSelectorParams(HasFeaturesCol, HasOutputCol, HasLabelCol):
  • class ChiSqSelector(JavaEstimator, _ChiSqSelectorParams, JavaMLReadable, JavaMLWritable):
  • class ChiSqSelectorModel(JavaModel, _ChiSqSelectorParams, JavaMLReadable, JavaMLWritable):


@inherit_doc
-class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
+class MinHashLSH(JavaEstimator, _LSHParams, HasInputCol, HasOutputCol, HasSeed,
Contributor

This seems a bit different from the Scala side:
MinHashLSH(override val uid: String) extends LSH[MinHashLSHModel] with HasSeed

Contributor Author

Will change.

if not name.endswith('Model') and not name.endswith('Params') \
and issubclass(cls, JavaParams) and not inspect.isabstract(cls) \
-and not name.startswith('Java'):
+and not name.startswith('Java') and name != 'LSH':
Contributor

Why do we need to filter out LSH here?

Contributor Author

The original code doesn't have a Python class LSH:

class LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
    """
    Mixin for Locality Sensitive Hashing (LSH).
    """

    def setNumHashTables(self, value):
        """
        Sets the value of :py:attr:`numHashTables`.
        """
        return self._set(numHashTables=value)

I added this class so that setNumHashTables has a place to live. LSH has no __init__ because the Scala LSH is an abstract class and has no constructor. Without an __init__, self._java_obj is never set, so test_param throws an exception for LSH:

Traceback (most recent call last):
  File "/Users/hgao/spark092119/spark/python/pyspark/ml/tests/test_param.py", line 358, in test_java_params
    check_params(self, cls(), check_params_exist=False)
  File "/Users/hgao/spark092119/spark/python/pyspark/testing/mlutils.py", line 40, in check_params
    java_stage = py_stage._to_java()
  File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 222, in _to_java
    self._transfer_params_to_java()
  File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 145, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._defaultParamMap[param])
  File "/Users/hgao/spark092119/spark/python/pyspark/ml/wrapper.py", line 131, in _make_java_param_pair
    java_param = self._java_obj.getParam(param.name)
AttributeError: 'NoneType' object has no attribute 'getParam'
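
The failure mode and the name-based filter can be reproduced without Spark. The following is a minimal sketch under stated assumptions: `JavaWrapperSketch`, `FakeJavaObj`, and `instantiable` are hypothetical names invented for this illustration, and only the `name != 'LSH'` idea is taken from the diff.

```python
class JavaWrapperSketch:
    """Hypothetical stand-in for a Python wrapper around a JVM object."""

    # Concrete subclasses are expected to set this in __init__.
    _java_obj = None

    def _to_java(self):
        # Same kind of call that fails in the traceback above.
        return self._java_obj.getParam("numHashTables")


class LSH(JavaWrapperSketch):
    """Abstract helper: provides shared setters but defines no __init__."""


class FakeJavaObj:
    """Placeholder for the JVM-side object (assumption for this sketch)."""

    def getParam(self, name):
        return name


class MinHashLSH(LSH):
    def __init__(self):
        self._java_obj = FakeJavaObj()


def instantiable(classes):
    # Sketch of the test filter: skip the abstract helper by name.
    return [cls for cls in classes if cls.__name__ != "LSH"]


try:
    LSH()._to_java()
    error = None
except AttributeError as exc:
    error = str(exc)

checked = [cls()._to_java() for cls in instantiable([LSH, MinHashLSH])]
print(error)
print(checked)
```

Instantiating the abstract helper directly hits the `None` attribute lookup, while the filtered list only exercises the concrete class.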

@SparkQA

SparkQA commented Sep 27, 2019

Test build #111493 has finished for PR 25908 at commit 50e4467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MinHashLSH(LSH, HasInputCol, HasOutputCol, HasSeed, JavaMLReadable, JavaMLWritable):

@zero323
Member

zero323 commented Sep 28, 2019

@srowen @zhengruifeng @huaxingao Could you explain the rationale behind 425959e? A leading underscore doesn't really have the same semantics as Scala package-private (or a private modifier in general).

However, when used in a context that has a direct impact on the user-facing API, it is quite annoying, as it requires manual segregation from truly internal components. Even if the goal is simple API parity, it makes sense to take such things into account. Otherwise the whole thing could just be code generated. Just saying...

@srowen
Member

srowen commented Sep 28, 2019

If the general question is, why hide things? just the usual software design idea, to only expose what you intend to support as a user or developer API. I'm not making a specific judgment about whether these should be exposed or not, just don't know yet.

If the question is, can we make this more extensible, I also don't have a strong opinion but am not against it, but think that can be considered separately.

There are also going to be a few follow-on PRs to this one to further remove some setters.

@huaxingao
Contributor Author

@zero323
If your question is specific to the leading underscore I just added, there are two reasons:

  • To stay consistent with the previous implementation in _CountVectorizerParams and _StringIndexerParams.
  • Since Python doesn't really have a private modifier, the closest equivalent seems to me to be the leading underscore specified in the PEP 8 convention: https://www.python.org/dev/peps/pep-0008/
    _single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose names start with an underscore.

I am neutral on this leading underscore. Since I am following Bryan's convention in _CountVectorizerParams, I would like to check with him to see if there are any other reasons for it. Thanks in advance @BryanCutler

@zero323
Member

zero323 commented Sep 28, 2019

> If the general question is, why hide things? just the usual software design idea, to only expose what you intend to support as a user or developer API. I'm not making a specific judgment about whether these should be exposed or not, just don't know yet.

Not at all @srowen. It is a specific question. These mixins are used only to augment the public API, and as such directly affect users' code (for example, method resolution order), so it is a bit unclear why they are marked as internal.

On the one hand, their usage is limited to a specific Estimator / Model pair, so there is really little value in exposing them to the end user. On the other hand, the limited scope suggests little potential for change, especially change that doesn't affect the public API, hence there seems to be little harm in exposing them (the problem is similar to SPARK-7146).

In contrast to SPARK-29212, I am indifferent here and am just trying to get a better feeling for how the API evolves and what drives certain decisions.
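
The remark about method resolution order can be made concrete with a few stub classes. This is an illustrative sketch only: the classes are empty stand-ins that mirror the shape `PCAModel(JavaModel, _PCAParams)` from the class lists above, not the pyspark code, and the method bodies are placeholders.

```python
class HasInputCol:
    def getInputCol(self):
        return "input"


class _PCAParams(HasInputCol):
    """Underscore-prefixed mixin that still contributes public methods."""

    def getK(self):
        return 2


class JavaModel:
    pass


class PCAModel(JavaModel, _PCAParams):
    pass


# The "internal" mixin is part of the public class's MRO...
mro_names = [cls.__name__ for cls in PCAModel.__mro__]

# ...and its methods are visible on the public class.
public_methods = sorted(n for n in dir(PCAModel) if not n.startswith("_"))

print(mro_names)
print(public_methods)
```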

@zero323
Member

zero323 commented Sep 28, 2019

@huaxingao, I understand leading underscore usage and I've seen Bryan's work, but until now it was an exception in pyspark.ml. We have a bunch of fairly isolated *Params classes, but nothing indicates these are considered internal. Hence I was curious about the sudden change.

@SparkQA

SparkQA commented Oct 3, 2019

Test build #111754 has finished for PR 25908 at commit bc25799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return self.getOrDefault(self.numHashTables)


class LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
Member

I guess this should be _LSH as well, shouldn't it? Scala counterpart is private[ml].



-class LSHModel(JavaModel):
+class LSHModel(JavaModel, _LSHParams):
Member

Ditto. Maybe it should be _LSHModel?

"""
Params for :py:class:`BucketedRandomProjectionLSH` and
:py:class:`BucketedRandomProjectionLSHModel`.
.. versionadded:: 3.0.0
Member

There should be an empty line between the description and .. versionadded:: ... (or any directive)

    """
    Params for :py:class:`BucketedRandomProjectionLSH` and
    :py:class:`BucketedRandomProjectionLSHModel`.


    .. versionadded:: 3.0.0

Otherwise such elements are not recognized as directives and incorrectly appended to the body.
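
A quick way to spot the pattern described above is to scan a docstring for directive lines that directly follow text. The helper below is a hypothetical sketch written for this illustration (it is not part of Spark or Sphinx) and only checks the blank-line rule mentioned in the comment.

```python
import inspect


def directives_missing_blank_line(docstring):
    """Return directive lines that directly follow a non-blank line."""
    lines = inspect.cleandoc(docstring).splitlines()
    return [
        line
        for prev, line in zip(lines, lines[1:])
        if line.startswith(".. ") and prev.strip()
    ]


bad = """
    Params for :py:class:`IDF` and :py:class:`IDFModel`.
    .. versionadded:: 3.0.0
    """

good = """
    Params for :py:class:`IDF` and :py:class:`IDFModel`.

    .. versionadded:: 3.0.0
    """

print(directives_missing_blank_line(bad))
print(directives_missing_blank_line(good))
```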

class _IDFParams(HasInputCol, HasOutputCol):
"""
Params for :py:class:`IDF` and :py:class:`IDFModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _ImputerParams(HasInputCols, HasOutputCols):
"""
Params for :py:class:`Imputer` and :py:class:`ImputerModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _MaxAbsScalerParams(HasInputCol, HasOutputCol):
"""
Params for :py:class:`MaxAbsScaler` and :py:class:`MaxAbsScalerModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _MinMaxScalerParams(HasInputCol, HasOutputCol):
"""
Params for :py:class:`MinMaxScaler` and :py:class:`MinMaxScalerModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _OneHotEncoderParams(HasInputCols, HasOutputCols, HasHandleInvalid):
"""
Params for :py:class:`OneHotEncoder` and :py:class:`OneHotEncoderModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _RobustScalerParams(HasInputCol, HasOutputCol):
"""
Params for :py:class:`RobustScaler` and :py:class:`RobustScalerModel`.
.. versionadded:: 3.0.0
Member

Ditto.

class _StandardScalerParams(HasInputCol, HasOutputCol):
"""
Params for :py:class:`StandardScaler` and :py:class:`StandardScalerModel`.
.. versionadded:: 3.0.0
Member

@zero323 zero323 Oct 3, 2019

Ditto. And the remaining ones as well.

@SparkQA

SparkQA commented Oct 4, 2019

Test build #111767 has finished for PR 25908 at commit c305a43.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _LSH(JavaEstimator, _LSHParams, JavaMLReadable, JavaMLWritable):
  • class _LSHModel(JavaModel, _LSHParams):
  • class BucketedRandomProjectionLSH(_LSH, _BucketedRandomProjectionLSHParams,
  • class BucketedRandomProjectionLSHModel(_LSHModel, _BucketedRandomProjectionLSHParams, JavaMLReadable,
  • class MinHashLSH(_LSH, HasInputCol, HasOutputCol, HasSeed, JavaMLReadable, JavaMLWritable):
  • class MinHashLSHModel(_LSHModel, JavaMLReadable, JavaMLWritable):

@SparkQA

SparkQA commented Oct 4, 2019

Test build #111768 has finished for PR 25908 at commit 64fca95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BucketedRandomProjectionLSHModel(_LSHModel, _BucketedRandomProjectionLSHParams,

@srowen srowen closed this in 2399134 Oct 7, 2019
@srowen
Member

srowen commented Oct 7, 2019

Merged to master

@huaxingao
Contributor Author

Thanks all for your help!

@huaxingao huaxingao deleted the spark-29143 branch October 7, 2019 16:54