Skip to content

Conversation

@romainfrancois
Copy link
Contributor

This will simplify the handling of the type argument.

jira: https://issues.apache.org/jira/browse/ARROW-4560?filter=12344983

@romainfrancois
Copy link
Contributor Author

Also planning to work on https://issues.apache.org/jira/browse/ARROW-3810 in this PR

@romainfrancois
Copy link
Contributor Author

To handle the type= argument in array() we need to be able to infer arrow types from R objects, so I've added the type() function:

library(arrow, warn.conflicts = FALSE)

type(1:10)
#> arrow::Int32 
#> int32
type(1)
#> arrow::Float64 
#> double
type("")
#> arrow::Utf8 
#> string
type(iris$Species)
#> arrow::DictionaryType 
#> dictionary<values=string, indices=int8, ordered=0>

Created on 2019-02-14 by the reprex package (v0.2.1.9000)

@romainfrancois romainfrancois force-pushed the ARROW-4560/array_no_dots branch 2 times, most recently from 0829631 to 916f76a Compare February 19, 2019 15:39
@romainfrancois romainfrancois added ready-for-review and removed WIP PR is work in progress labels Feb 20, 2019
@romainfrancois
Copy link
Contributor Author

I'm now fairly confident about this PR. array() used to take ... and rely on vctrs to first combine to a common type using vctrs type system, though the .ptype. I've changed that so that array() only take one vector x and the type= argument is an arrow logical type, e.g. int32().

library(arrow, warn.conflicts = FALSE)

a <- array(1:10, type = int64())
a$type
#> arrow::Int64 
#> int64
a$as_vector()
#> integer64
#>  [1] 1  2  3  4  5  6  7  8  9  10

a <- array(1:10, type = float64())
a$type
#> arrow::Float64 
#> double
a$as_vector()
#>  [1]  1  2  3  4  5  6  7  8  9 10

The type of array that is made is governed by the type= argument. If missing, the type is inferred from the data, e.g.

library(arrow, warn.conflicts = FALSE)

a <- array(rnorm(10))
a$type
#> arrow::Float64 
#> double
a$as_vector()
#>  [1]  0.48428301  0.74399694  0.48991163  0.35701469  0.60517996
#>  [6]  0.53463545  0.12579375  0.02414986 -0.49127583  0.17194012

The chunked_array() factory handles ... and a type argument too:

library(arrow, warn.conflicts = FALSE)

a <- chunked_array(rnorm(10), 1:10)
a$type
#> arrow::Float64 
#> double
a$as_vector()
#>  [1]  1.64642028 -1.47423016  0.84996187  1.21151724 -1.52303727
#>  [6] -0.04387242 -0.47798708 -0.18693768 -0.98903429  0.30376938
#> [11]  1.00000000  2.00000000  3.00000000  4.00000000  5.00000000
#> [16]  6.00000000  7.00000000  8.00000000  9.00000000 10.00000000

a <- chunked_array(1:5, 1:10, type = int64())
a$type
#> arrow::Int64 
#> int64
a$as_vector()
#> integer64
#>  [1] 1  2  3  4  5  1  2  3  4  5  6  7  8  9  10

@romainfrancois romainfrancois requested a review from wesm February 20, 2019 09:26
@romainfrancois romainfrancois force-pushed the ARROW-4560/array_no_dots branch from 4650fc0 to 08b2953 Compare February 26, 2019 08:40
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this looks nice. I'm going to rebase to fix any linting issue since I just added the cpplint checks


test_that("type() infers from R type", {
expect_equal(type(1:10), int32())
expect_equal(type(1), float64())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optics of this are a bit odd =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah maybe that can be infer_type() or something

expect_equal(type(""), utf8())
expect_equal(
type(iris$Species),
dictionary(int8(), array(levels(iris$Species)), FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe test for ordered factors at some point also?

@wesm wesm closed this in 2a14c7b Feb 27, 2019
@wesm
Copy link
Member

wesm commented Feb 27, 2019

Oops I missed array_from_vector.cpp in my code review because it was hidden. silly GitHub. I'll take a skim through anyway

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. I'll do a better job reviewing next time...

null_bitmap_writer.Finish();
}

// ----- data buffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is efficient to visit the string elements twice? I think you should have some benchmarks about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so we only AllocateBuffer the buffer with the final size we need. The LENGTH() here is quick, R strings know their size: https://purrple.cat/blog/2018/03/05/strings-know-their-own-length/

Is the alternative to grow a buffer as we go ? Maybe there is a StringArrayBuilder I can use

// catch up
for (R_xlen_t j = 0; j < i; j++, null_bitmap_writer.Next()) {
null_bitmap_writer.Set();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you might want to turn some of this into helper functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. We can revisit next time on this.

} else {
null_bitmap_writer.Set();
*p_indices = *p_factor - 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to write this without a branch


virtual Status GetResult(std::shared_ptr<arrow::Array>* result) {
RETURN_NOT_OK(builder_->Finish(result));
return Status::OK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return builder_->Finish(result)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will do in a next PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants