[air preprocessor] Add limit to OHE. #24893

Merged

richardliaw merged 11 commits into ray-project:master from xwjiang2010:OHE

May 24, 2022

Contributor

xwjiang2010 commented May 17, 2022 • edited

Why are these changes needed?

This is useful when a user wants to limit how many categories per column are meaningful enough to be one-hot encoded (OHE). The long tail of remaining values can be discarded.
This is a portion of https://github.com/ray-project/ray/pull/24638/files. It also implements the feedback from @Yard1 about combining the two cases: finding unique values and finding the values with the top-K frequencies.
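One way the two cases could be combined is sketched below. This is not the code from this PR: `fit_categories` is a hypothetical helper, and the sketch works on a plain list instead of a Ray dataset. It relies on `collections.Counter.most_common`, which returns every value when its argument is `None`, so "all unique values" becomes the special case of "top K" with no limit.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional


def fit_categories(
    values: Iterable[Hashable], limit: Optional[int] = None
) -> Dict[Hashable, int]:
    """Map each kept category to its one-hot column index.

    Hypothetical helper: with limit=None every unique value is kept,
    otherwise only the `limit` most frequent values are kept.
    """
    counts = Counter(values)
    # most_common(None) returns all values, so one code path covers both cases.
    kept = sorted(value for value, _ in counts.most_common(limit))
    return {value: index for index, value in enumerate(kept)}


data = ["a", "b", "a", "c", "a", "b"]
all_categories = fit_categories(data)
top_two = fit_categories(data, limit=2)
```

Here `all_categories` keeps `a`, `b`, and `c`, while `top_two` keeps only `a` and `b`, the two most frequent values.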

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

xwjiang2010 added 2 commits

May 17, 2022 15:28


          Add limit to OHE.

069b206

fix

d4c15e6

xwjiang2010 assigned amogkam and Yard1

xwjiang2010 mentioned this pull request

[air example] train a Keras model on tabular data and serve it. #24898

Merged

6 tasks

richardliaw reviewed

python/ray/ml/preprocessors/encoder.py Outdated

xwjiang2010 and others added 2 commits

May 17, 2022 20:41


          address comments.

7adf245


          Merge branch 'ray-project:master' into OHE

5b04401

Contributor Author

xwjiang2010 commented May 18, 2022

The lint test failure is unrelated; it is a glitch from updating to Ray 3.0.0.

Yard1 reviewed


Member

Yard1 left a comment

Hey @xwjiang2010, thanks for this! I think it would be good not to just discard the values, but instead to group all infrequent values into a single column. This is what sklearn does, and it lets you preserve the information (with all 0s then being used for unseen values during prediction). Once we add the drop parameter, we will be able to discard one column anyway. Let me know what you think.
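The grouping suggested here can be sketched as follows. The helper names `fit_top_k` and `encode_with_infrequent` are hypothetical, and the sketch uses plain Python lists; it is neither Ray's nor scikit-learn's implementation. The output has one column per frequent category plus one trailing "infrequent" column, and a value never seen during fitting encodes to all zeros.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Set


def fit_top_k(values: Sequence[Hashable], k: int) -> List[Hashable]:
    """Return the k most frequent values (hypothetical helper)."""
    return [value for value, _ in Counter(values).most_common(k)]


def encode_with_infrequent(
    value: Hashable, frequent: List[Hashable], seen: Set[Hashable]
) -> List[int]:
    """One column per frequent category, plus a trailing infrequent column."""
    row = [1 if value == category else 0 for category in frequent]
    # Seen during fit but not frequent enough: set the shared extra column.
    is_infrequent = value in seen and value not in frequent
    return row + [1 if is_infrequent else 0]


train = ["red", "red", "blue", "blue", "green", "teal"]
frequent = fit_top_k(train, k=2)
seen = set(train)

red_row = encode_with_infrequent("red", frequent, seen)
green_row = encode_with_infrequent("green", frequent, seen)
purple_row = encode_with_infrequent("purple", frequent, seen)
```

With this layout `green` (seen but rare) and `purple` (never seen) get different encodings, which is the information that is lost when the long tail is simply dropped.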

python/ray/ml/preprocessors/encoder.py Outdated

python/ray/ml/preprocessors/encoder.py Outdated

python/ray/ml/preprocessors/encoder.py Outdated

richardliaw added this to the Ray AIR milestone

xwjiang2010 and others added 5 commits

May 18, 2022 09:40


          Update python/ray/ml/preprocessors/encoder.py

964e85b

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>


          Update python/ray/ml/preprocessors/encoder.py

4424ba1

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>


          Update python/ray/ml/preprocessors/encoder.py

84da043

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>


          lint

73d7174


          lint

14f6d02

xwjiang2010 mentioned this pull request

[air] Have a default column for not frequent enough categories for OHE #25096

Closed

richardliaw reviewed


python/ray/ml/preprocessors/encoder.py

Comment on lines +192 to +193

    for column in limit:
        if column not in columns:

Contributor

richardliaw May 23, 2022

nit:

if any(column not in columns for column in limit)

Contributor Author

xwjiang2010 May 23, 2022

This is not easily converted to a comprehension, since we need to print out the related error message for the offending column.
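The trade-off in this exchange can be illustrated with a small sketch. Both function names and both error messages are hypothetical and do not reproduce the actual code in encoder.py. The loop can name the first offending column directly; a compact alternative that still reports column names has to collect the missing ones first, because a bare `any(...)` only yields a boolean.

```python
from typing import Dict, List


def validate_limit_loop(columns: List[str], limit: Dict[str, int]) -> None:
    """Loop form: raises as soon as one column in `limit` is unknown."""
    for column in limit:
        if column not in columns:
            # Hypothetical message; the real wording is not shown in this thread.
            raise ValueError(
                f"Limit set for column {column!r}, which is not in {columns}."
            )


def validate_limit_collect(columns: List[str], limit: Dict[str, int]) -> None:
    """Comprehension form: gathers every unknown column before raising."""
    missing = [column for column in limit if column not in columns]
    if missing:
        raise ValueError(f"Limit set for unknown columns: {missing}.")


def error_message(validator, columns, limit) -> str:
    """Run a validator and return its error text, or '' if it passes."""
    try:
        validator(columns, limit)
    except ValueError as error:
        return str(error)
    return ""


loop_message = error_message(validate_limit_loop, ["a", "b"], {"a": 2, "z": 1})
collect_message = error_message(validate_limit_collect, ["a", "b"], {"y": 1, "z": 1})
```

The collecting variant reports all bad columns at once, at the cost of building an intermediate list.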

richardliaw approved these changes


richardliaw reviewed

python/ray/ml/preprocessors/encoder.py

richardliaw reviewed

python/ray/ml/preprocessors/encoder.py

Contributor

richardliaw commented May 23, 2022

Discussed with Xiaowei. It seems scikit-learn has a more complex API for this (via drop). OK to move forward, but I reached out to @thomasjpfan for more guidance.

xwjiang2010 added 2 commits

May 23, 2022 15:50


          Update example.

e74729b


          style

f9368b2

richardliaw merged commit 8703d5e into ray-project:master

thomasjpfan reviewed


python/ray/ml/preprocessors/encoder.py

Comment on lines +85 to +86

		categorical variables. The less frequent ones will result in all
		the encoded column values being 0. This is a dict of column to

Contributor

thomasjpfan May 24, 2022

scikit-learn adds a new column to represent the infrequent categories. This PR essentially drops the infrequent categories. For scikit-learn, all zeros are usually used to represent "unknown categories" when handle_unknown="ignore". Unknown categories are categories seen in the test set, but not in the training set.

In ray's implementation, what happens for unknown categories?
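The concern behind this question can be made concrete with a short sketch. It assumes that the transform emits 0 in every column for any value outside the fitted set. The docstring quoted above confirms this only for infrequent values; applying it to unknown values is an assumption made here for illustration. `fit_limited` and `one_hot` are hypothetical helpers, not Ray APIs.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def fit_limited(values: Sequence[Hashable], limit: int) -> List[Hashable]:
    """Keep only the `limit` most frequent values (hypothetical helper)."""
    return sorted(value for value, _ in Counter(values).most_common(limit))


def one_hot(value: Hashable, categories: List[Hashable]) -> List[int]:
    """All zeros for anything outside `categories` (assumed behavior)."""
    return [1 if value == category else 0 for category in categories]


train = ["cat", "cat", "dog", "dog", "emu"]
categories = fit_limited(train, limit=2)

infrequent_row = one_hot("emu", categories)  # seen in training, but dropped
unknown_row = one_hot("yak", categories)  # never seen in training
```

Under that assumption both rows are all zeros, so a downstream model cannot tell a rare training value from a value it has never seen.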

xwjiang2010 deleted the OHE branch

July 26, 2023 19:49
