Skip to content

[air preprocessor] Add limit to OHE.#24893

Merged
richardliaw merged 11 commits intoray-project:masterfrom
xwjiang2010:OHE
May 24, 2022
Merged

[air preprocessor] Add limit to OHE.#24893
richardliaw merged 11 commits intoray-project:masterfrom
xwjiang2010:OHE

Conversation

@xwjiang2010
Copy link
Contributor

@xwjiang2010 xwjiang2010 commented May 17, 2022

Why are these changes needed?

This is useful when user wants to limit how many columns are meaningful to be encoded by OHE. The rest long tail can be discarded.
This is a portion coming from https://github.com/ray-project/ray/pull/24638/files. Also implemented the feedback from @Yard1 about combining the two cases: find unique values and find values with top K freq.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010
Copy link
Contributor Author

lint test unrelated - some glitch from updating to ray 3.0.0

Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @xwjiang2010 thanks for this! I think it would be good to actually not just discard the values, but instead group all infrequent values into a single column. This is what sklearn is doing, and it allows you to preserve the information (with all 0s being then used for unseen values during prediction). Once we add the drop parameter, we will be able to discard one column anyway. Let me know what you think.

@richardliaw richardliaw added this to the Ray AIR milestone May 18, 2022
xwjiang2010 and others added 5 commits May 18, 2022 09:40
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Comment on lines +192 to +193
for column in limit:
if column not in columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

if any(column not in columns for column in limit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not easily converted to list comprehension as we need to print out related error msg.

@richardliaw
Copy link
Contributor

Discussed with Xiaowei. Seems like the scikit-learn API has a more complex API for this (via drop). OK to move forward, but reached out to @thomasjpfan for more guidance.

@richardliaw richardliaw merged commit 8703d5e into ray-project:master May 24, 2022
Comment on lines +85 to +86
categorical variables. The less frequent ones will result in all
the encoded column values being 0. This is a dict of column to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scikit-learn adds a new column to represent the the infrequent categories. This PR essentially drops the infrequent category. For scikit-learn, all zeros are usually used to represent "unknown categories", when handle_unknown="ignore". Unknown categories are categories seen in the test set, but not in the training set.

In ray's implementation, what happens for unknown categories?

@xwjiang2010 xwjiang2010 deleted the OHE branch July 26, 2023 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants