Currently if one wants to get levels in their appropriate order, but add `missing` if present, the only solution is to do something like `union(levels(x), unique(x))`, which is inefficient. Support `skipmissing=false` to allow doing this in a single pass over the data. Use `@inline` to ensure that the return type can be inferred when the value of `skipmissing` is known statically. Also fix a type instability which existed for ranges.
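As a rough illustration of the behavior described above, here is a minimal sketch of a single-pass fallback. The name `sketch_levels` is hypothetical and the body is a simplified stand-in written for this example, not the actual DataAPI source; it assumes a Julia version where `nonmissingtype` is exported (1.3 or later).

```julia
# Hypothetical stand-in for the fallback method; NOT the actual DataAPI code.
function sketch_levels(x; skipmissing::Bool=true)
    # Collect the distinct values once; `collect` guarantees a mutable Vector
    # even when `unique` returns its input unchanged (as it can for ranges).
    u = collect(unique(x))
    if skipmissing
        # Drop `missing` and narrow the element type accordingly.
        T = nonmissingtype(eltype(x))
        levs = convert(Vector{T}, filter(!ismissing, u))
    else
        # Keep `missing` among the levels if it is present in the data.
        levs = u
    end
    # Sort when the values are sortable; otherwise keep order of appearance.
    try
        sort!(levs)
    catch
    end
    return levs
end

x = [3, missing, 1, 3, missing, 2]

without_missing = sketch_levels(x)                    # levels only
with_missing = sketch_levels(x; skipmissing=false)    # levels plus `missing`

# The workaround mentioned above, which needs an extra `unique` call:
workaround = union(sketch_levels(x), unique(x))
```

With `skipmissing=false`, `missing` ends up last because `isless` treats it as greater than any other value.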
Codecov Report
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   92.59%   95.12%   +2.52%
==========================================
  Files           1        1
  Lines          27       41      +14
==========================================
+ Hits           25       39      +14
  Misses          2        2
CI on 1.0 fails.
bkamins
left a comment
Implementation looks good. The only problem is the one that you have noted: we cannot rely on this behavior in library code anyway.
Maybe we can have this for users' code, but then add something like `levelsmissing` that is internal and intended for library code to use (your current `_levels_missing`, which is internal but not intended to be used).
So the advantage would be that we would be sure that it doesn't throw an error? If a type that uses a custom order of levels doesn't implement it, it would return them in their order of appearance though. Not sure whether it's better to throw an explicit error so that users can complain or give them suboptimal results...
Having thought of it again, actually your solution is OK. We only would need in The point is that we would get to these calls only if package defines
Ah, interesting. You mean that
In The point is that if
Ah right. Though doing this would mean that DataFrames would have to call
No it would not. Note what would happen under my approach if you called
This would not be super fast, but this would be a default fallback and I think it would be good enough. However, I think your option 2 is also OK, and we could just add a second function so as not to complicate things, if you prefer. Having said that, the question is if
OK, great, I hadn't realized that dispatch would choose the method which supports keyword arguments. I've pushed a commit to do that. Unfortunately,
Yes, that's the debate I mentioned above. For now I didn't decide anything, given that we only define the fallback method here, which is documented to be equivalent to
I know. I just did not want to forget about it. Let us agree that
I agree this is tricky (I just wanted to show it so that we consider such an approach and decide if we want it). If you prefer option 2 (as I wrote above), i.e. having two separate methods, I think that is also OK.
OK, I've found a solution to make the function inferrable using a different type for the default value.
    elseif any(ismissing, x)
        return [levels(x); missing]
    else
        return convert(AbstractArray{eltype(x)}, levels(x))
This line does not seem to be covered by tests.
bkamins
left a comment
Looks good. Thank you. Only test coverage needs improvement.
quinnj
left a comment
LGTM; if you get a chance, would you mind running Arrow.jl test suite to ensure everything works there?
Arrow tests pass.
The argument is added by DataAPI 1.10 (JuliaData/DataAPI.jl#46). When `skipmissing=true`, the method for `CategoricalArray` can be slightly more efficient than the fallback defined in DataAPI as it avoids calling `unique`.
Currently if one wants to get levels in their appropriate order, but add `missing` if present, the only solution is to do something like `union(levels(x), unique(x))`, which is inefficient.
Support `skipmissing=false` to allow doing this in a single pass over the data.
Use `@inline` to ensure that the return type can be inferred when the value of `skipmissing` is known statically. Also fix a type instability which existed for ranges.
Of course this new feature won't work for custom types which override this method (like `CategoricalArray`) until packages implement it. Unfortunately there's no way for packages which would like to rely on it (like DataFrames) to require an appropriate version.
There will also be a decision to make in CategoricalArrays as to whether `missing` should be returned only when present in the data (like the method defined here) or all the time as long as the eltype allows for it (like for other levels, which is more efficient).
Fixes #44.