-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4560: [R] array() needs to take single input, not ... #3635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-4560: [R] array() needs to take single input, not ... #3635
Conversation
|
Also planning to work on https://issues.apache.org/jira/browse/ARROW-3810 in this PR |
|
To handle the library(arrow, warn.conflicts = FALSE)
type(1:10)
#> arrow::Int32
#> int32
type(1)
#> arrow::Float64
#> double
type("")
#> arrow::Utf8
#> string
type(iris$Species)
#> arrow::DictionaryType
#> dictionary<values=string, indices=int8, ordered=0>Created on 2019-02-14 by the reprex package (v0.2.1.9000) |
0829631 to
916f76a
Compare
|
I'm now fairly confident about this PR. library(arrow, warn.conflicts = FALSE)
a <- array(1:10, type = int64())
a$type
#> arrow::Int64
#> int64
a$as_vector()
#> integer64
#> [1] 1 2 3 4 5 6 7 8 9 10
a <- array(1:10, type = float64())
a$type
#> arrow::Float64
#> double
a$as_vector()
#> [1] 1 2 3 4 5 6 7 8 9 10The type of array that is made is governed by the library(arrow, warn.conflicts = FALSE)
a <- array(rnorm(10))
a$type
#> arrow::Float64
#> double
a$as_vector()
#> [1] 0.48428301 0.74399694 0.48991163 0.35701469 0.60517996
#> [6] 0.53463545 0.12579375 0.02414986 -0.49127583 0.17194012The library(arrow, warn.conflicts = FALSE)
a <- chunked_array(rnorm(10), 1:10)
a$type
#> arrow::Float64
#> double
a$as_vector()
#> [1] 1.64642028 -1.47423016 0.84996187 1.21151724 -1.52303727
#> [6] -0.04387242 -0.47798708 -0.18693768 -0.98903429 0.30376938
#> [11] 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000
#> [16] 6.00000000 7.00000000 8.00000000 9.00000000 10.00000000
a <- chunked_array(1:5, 1:10, type = int64())
a$type
#> arrow::Int64
#> int64
a$as_vector()
#> integer64
#> [1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10 |
… factors/dictionaries
…since array() no longer deals with ...
4650fc0 to
08b2953
Compare
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this looks nice. I'm going to rebase to fix any linting issue since I just added the cpplint checks
|
|
||
| test_that("type() infers from R type", { | ||
| expect_equal(type(1:10), int32()) | ||
| expect_equal(type(1), float64()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optics of this are a bit odd =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah maybe that can be infer_type() or something
| expect_equal(type(""), utf8()) | ||
| expect_equal( | ||
| type(iris$Species), | ||
| dictionary(int8(), array(levels(iris$Species)), FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe test for ordered factors at some point also?
|
Oops I missed array_from_vector.cpp in my code review because it was hidden. silly GitHub. I'll take a skim through anyway |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments. I'll do a better job reviewing next time...
| null_bitmap_writer.Finish(); | ||
| } | ||
|
|
||
| // ----- data buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is efficient to visit the string elements twice? I think you should have some benchmarks about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so we only AllocateBuffer the buffer with the final size we need. The LENGTH() here is quick, R strings know their size: https://purrple.cat/blog/2018/03/05/strings-know-their-own-length/
Is the alternative to grow a buffer as we go ? Maybe there is a StringArrayBuilder I can use
| // catch up | ||
| for (R_xlen_t j = 0; j < i; j++, null_bitmap_writer.Next()) { | ||
| null_bitmap_writer.Set(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you might want to turn some of this into helper functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. We can revisit next time on this.
| } else { | ||
| null_bitmap_writer.Set(); | ||
| *p_indices = *p_factor - 1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to write this without a branch
|
|
||
| virtual Status GetResult(std::shared_ptr<arrow::Array>* result) { | ||
| RETURN_NOT_OK(builder_->Finish(result)); | ||
| return Status::OK(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can return builder_->Finish(result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Will do in a next PR.
This will simplify the handling of the
typeargument.jira: https://issues.apache.org/jira/browse/ARROW-4560?filter=12344983