[air preprocessor] Add limit to OHE. #24893
The lint test failure is unrelated; it is some glitch from updating to Ray 3.0.0.
Yard1
left a comment
Hey @xwjiang2010, thanks for this! I think it would be good not to just discard the values, but instead to group all infrequent values into a single column. This is what sklearn does, and it preserves the information (all 0s are then reserved for values unseen until prediction). Once we add the drop parameter, we will be able to discard one column anyway. Let me know what you think.
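A minimal sketch of the grouping idea, in plain Python rather than Ray or scikit-learn code. The function names, the `limit` argument, and the sample data are hypothetical and only illustrate the layout: one slot per frequent value, one trailing slot for everything infrequent, and all zeros for values never seen while fitting.

```python
from collections import Counter


def fit_categories(values, limit):
    """Return the `limit` most frequent values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(limit)]


def encode_with_infrequent_bucket(value, frequent, seen):
    """One-hot encode `value` with one extra trailing slot for infrequent values.

    Values that were never seen during fitting encode to all zeros.
    """
    row = [0] * (len(frequent) + 1)
    if value in frequent:
        row[frequent.index(value)] = 1
    elif value in seen:
        row[-1] = 1  # seen during fitting, but outside the top `limit`
    return row


# Hypothetical training column: "a" and "b" are frequent, "c" and "d" are rare.
train = ["a", "a", "a", "b", "b", "c", "d"]
frequent = fit_categories(train, limit=2)
seen = set(train)

frequent_row = encode_with_infrequent_bucket("a", frequent, seen)
infrequent_row = encode_with_infrequent_bucket("c", frequent, seen)
unseen_row = encode_with_infrequent_bucket("z", frequent, seen)
```

With this layout the rare values "c" and "d" share the last slot, so they stay distinguishable from a value like "z" that was never seen.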
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
    for column in limit:
        if column not in columns:
nit:
if any(column not in columns for column in limit)
This is not easily converted to a comprehension, as we need to print out the related error message.
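For reference, a sketch of the two shapes being discussed. The function names and the error wording are hypothetical, not the text used in this PR. The first version mirrors the loop in the diff; the second collects the offending columns first, which keeps the check compact while still naming them in the message.

```python
def validate_limit_loop(columns, limit):
    """Raise on the first `limit` key that is not one of `columns`."""
    for column in limit:
        if column not in columns:
            raise ValueError(
                f"limit refers to column {column!r}, "
                f"which is not in columns {columns!r}"
            )


def validate_limit_collected(columns, limit):
    """Raise once, listing every `limit` key that is not one of `columns`."""
    unknown = [column for column in limit if column not in columns]
    if unknown:
        raise ValueError(
            f"limit refers to columns {unknown!r}, "
            f"which are not in columns {columns!r}"
        )
```

A bare `any(...)` check only reports that some column is invalid, not which one, which is the trade-off raised in the reply above.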
Discussed with Xiaowei. Seems like the scikit-learn API has a more complex API for this (via |
    categorical variables. The less frequent ones will result in all
    the encoded column values being 0. This is a dict of column to
scikit-learn adds a new column to represent the infrequent categories. This PR essentially drops the infrequent categories. In scikit-learn, all zeros are usually used to represent "unknown categories" when handle_unknown="ignore". Unknown categories are categories seen in the test set but not in the training set.
In Ray's implementation, what happens for unknown categories?
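To make the concern concrete, here is a minimal sketch in plain Python (not Ray's or scikit-learn's code) of an encoder that drops infrequent categories. The function name and sample values are hypothetical, and it assumes unseen values also encode to all zeros, which is exactly the open question above.

```python
def encode_dropping_infrequent(value, frequent):
    """One-hot encode `value`; anything outside `frequent` becomes all zeros."""
    return [1 if value == category else 0 for category in frequent]


# Hypothetical: a limit of 2 kept only the two most frequent categories.
frequent = ["a", "b"]

kept_row = encode_dropping_infrequent("a", frequent)
infrequent_row = encode_dropping_infrequent("c", frequent)  # rare in training
unknown_row = encode_dropping_infrequent("z", frequent)  # never seen in training
```

Under that assumption the rare value and the never-seen value produce the same row, so a downstream model cannot tell them apart.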
Why are these changes needed?
This is useful when a user wants to limit how many columns are worth encoding with OHE. The remaining long tail can be discarded.
This is a portion of https://github.com/ray-project/ray/pull/24638/files. It also implements the feedback from @Yard1 about combining the two cases: finding the unique values and finding the values with the top K frequencies.
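A rough sketch of how the two cases can share one code path, using only the standard library. The helper name `collect_categories` and the row format are hypothetical and are not the functions in this PR; the point is that a per-column frequency count covers both "all unique values" and "top K values".

```python
from collections import Counter


def collect_categories(rows, columns, limit=None):
    """Return the categories to encode for each column.

    `limit` maps a column name to the number of most frequent values to keep;
    columns without an entry keep every unique value.
    """
    limit = limit or {}
    categories = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        # most_common(None) returns every value; most_common(k) the top k.
        top = counts.most_common(limit.get(column))
        categories[column] = sorted(value for value, _ in top)
    return categories


# Hypothetical rows: "color" has a long tail, "size" does not.
rows = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "L"},
    {"color": "blue", "size": "M"},
    {"color": "green", "size": "S"},
]
result = collect_categories(rows, ["color", "size"], limit={"color": 2})
```

Here "color" is limited to its two most frequent values while "size" keeps all of its unique values.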
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.