Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion r/src/arrow_cpp11.h
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,13 @@ inline R_xlen_t r_string_size(SEXP s) {
} // namespace unsafe

inline SEXP utf8_strings(SEXP x) {
return cpp11::unwind_protect([x] {
return cpp11::unwind_protect([&] {
// ensure that x is not actually altrep first this also ensures that
// x is not altrep even after it is materialized
bool was_altrep = ALTREP(x);
if (was_altrep) {
x = PROTECT(Rf_duplicate(x));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment about why we have to duplicate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll expand what I have there

}
R_xlen_t n = XLENGTH(x);

// if `x` is an altrep of some sort, this will
Expand All @@ -152,6 +158,9 @@ inline SEXP utf8_strings(SEXP x) {
SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we want to check whether Rf_translateCharUTF8() actually modified anything? Or do we trust that SET_STRING_ELT is a no-op in that case? I would imagine that in most cases, we already have ascii/utf-8 strings, so this whole function should be basically free. That should be easily verified by microbenchmarking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this from above?

SEXP new_s = Rf_translateCharUTF8(s);
if (new_s == s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s, CE_UTF8));
}

Yeah, that's probably good. It'll also make this slightly more inline with what was there before (checking that not utf8)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this, but after getting it to work with a bit of type faffing, doing the translate first we get errors with our utf string tests again.

}
}
if (was_altrep) {
UNPROTECT(1);
}
return x;
});
}
Expand Down
17 changes: 17 additions & 0 deletions r/tests/testthat/test-csv.R
Original file line number Diff line number Diff line change
Expand Up @@ -738,5 +738,22 @@ test_that("read_csv2_arrow correctly parses comma decimals", {
tf <- tempfile()
writeLines("x;y\n1,2;c", con = tf)
expect_equal(read_csv2_arrow(tf), tibble(x = 1.2, y = "c"))
})

test_that("altrep columns can roundtrip to table", {
tf <- tempfile()
on.exit(unlink(tf))
write.csv(tbl, tf, row.names = FALSE)

# read in, some columns will be altrep by default
new_df <- read_csv_arrow(tf)
expect_equal(tbl, as_tibble(arrow_table(new_df)))

# but also if we materialize the vector
# this could also be accomplished with printing
new_df <- read_csv_arrow(tf)
test_arrow_altrep_force_materialize(new_df$chr)

# we should still be able to turn this into a table
expect_equal(tbl, as_tibble(arrow_table(new_df)))
})