Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,10 @@

# arrow 16.1.0.9000

# arrow 16.1.0

* R functions that users write that use functions that Arrow supports in dataset queries now can be used in queries too. Previously, only functions that used arithmetic operators worked. For example, `time_hours <- function(mins) mins / 60` worked, but `time_hours_rounded <- function(mins) round(mins / 60)` did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see `register_scalar_function()`. (#41223)
* `summarize()` supports more complex expressions, and correctly handles cases where column names are reused in expressions.
* The `na_matches` argument to the `dplyr::*_join()` functions is now supported. This argument controls whether `NA` values are considered equal when joining. (#41358)
* R metadata, stored in the Arrow schema to support round-tripping data between R and Arrow/Parquet, is now serialized and deserialized more strictly. This makes it safer to load data from files from unknown sources into R data.frames. (#41969)

# arrow 16.1.0

Expand Down
2 changes: 1 addition & 1 deletion r/R/extension.R
Original file line number Diff line number Diff line change
Expand Up @@ -429,7 +429,7 @@ VctrsExtensionType <- R6Class("VctrsExtensionType",
paste0(capture.output(print(self$ptype())), collapse = "\n")
},
deserialize_instance = function() {
private$.ptype <- unserialize(self$extension_metadata())
private$.ptype <- safe_r_metadata(safe_unserialize(self$extension_metadata()))
},
ExtensionEquals = function(other) {
inherits(other, "VctrsExtensionType") && identical(self$ptype(), other$ptype())
Expand Down
109 changes: 98 additions & 11 deletions r/R/metadata.R
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
}
}

out <- serialize(x, NULL, ascii = TRUE)
out <- serialize(safe_r_metadata(x, on_save = TRUE), NULL, ascii = TRUE)

# if the metadata is over 100 kB, compress
if (option_compress_metadata() && object.size(out) > 100000) {
Expand All @@ -44,23 +44,110 @@
}

.deserialize_arrow_r_metadata <- function(x) {
tryCatch(
expr = {
out <- unserialize(charToRaw(x))

# if this is still raw, try decompressing
if (is.raw(out)) {
out <- unserialize(memDecompress(out, type = "gzip"))
}
out
},
tryCatch(unserialize_r_metadata(x),
error = function(e) {
if (getOption("arrow.debug", FALSE)) {
print(conditionMessage(e))
}
warning("Invalid metadata$r", call. = FALSE)
NULL
}
)
}

unserialize_r_metadata <- function(x) {
# Check that this is ASCII serialized data (as in, what we wrote)
if (!identical(substr(unclass(x), 1, 1), "A")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not strictly necessary, but would asserting that this is ARROW be a bit more obvious?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually about how base::serialize() works, signifying that it is ASCII:

The format consists of a single line followed by the data: the first line contains a single character: X for binary serialization and A for ASCII serialization, followed by a new line.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AAAAAH Maybe a comment X for binary serialization and A for ASCII serialization there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment on the line above not enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe it is. Though I did read it when reviewing and thought we were testing that the string started with ARROW so it wasn't when I was reading it last night. Not a huge deal either way, I think if someone needs to know this, they would poke at it more

stop("Invalid serialized data")
}
out <- safe_unserialize(charToRaw(x))
# If it's still raw, decompress and unserialize again
if (is.raw(out)) {
decompressed <- memDecompress(out, type = "gzip")
if (!identical(rawToChar(decompressed[1]), "A")) {
stop("Invalid serialized compressed data")
}
out <- safe_unserialize(decompressed)
}
if (!is.list(out)) {
stop("Invalid serialized data: must be a list")
}
safe_r_metadata(out)
}

safe_unserialize <- function(x) {
# By capturing the data in a list, we can inspect it for promises without
# triggering their evaluation.
Comment on lines +79 to +80
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# By capturing the data in a list, we can inspect it for promises without
# triggering their evaluation.
# By capturing the data in a list, we can minimize the possibility
# that R internals will evaluate any promises present before it
# can be inspected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is more accurate--I can obj <- deserialize(charToRaw()) the data in https://github.com/apache/arrow/pull/41969/files#diff-0386351ec2a20934987de3d32d4aee6fc609fbfbe3af3bf287a66941e8d563a7R121-R141 and the promise doesn't evaluate; it only evaluates if I touch obj. (This is on R 4.3.)

out <- list(unserialize(x))
if (typeof(out[[1]]) == "promise") {
stop("Serialized data contains a promise object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stop("Serialized data contains a promise object")
stop("Invalid serialized data: Serialized data contains a promise object")

Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization, any errors are just trapped if it fails to deserialize and we return NULL with that warning. Happy to revisit that though if there's interest.

}
out[[1]]
}

safe_r_metadata <- function(metadata, on_save = FALSE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on_save is to make if safe_r_metadata is saving metadata or loading metadata, yeah? It would be nice to either have doc strings (even just as a comment) explaining that, or maybe saving = FALSE is slightly more transparent to me?

Copy link
Member Author

@nealrichardson nealrichardson Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1057b78

# This function recurses through the metadata list and checks that all
# elements are of types that are allowed in R metadata.
# If it finds an element that is not allowed, it removes it.
#
# This function is used both when saving and loading metadata.
# @param on_save: If TRUE, the function will not warn if it removes elements:
# we're just cleaning up the metadata for saving. If FALSE, it means we're
# loading the metadata, and we'll warn if we find invalid elements.
#
# When loading metadata, you can optionally keep the invalid elements by
# setting `options(arrow.unsafe_metadata = TRUE)`. It will still check
# for invalid elements and warn if any are found, though.

# This variable will be used to store the types of elements that were removed,
# if any, so we can give an informative warning if needed.
types_removed <- c()

# Internal function that we'll recursively apply,
# and mutate the `types_removed` variable outside of it.
check_r_metadata_types_recursive <- function(x) {
allowed_types <- c("character", "double", "integer", "logical", "complex", "list", "NULL")
if (is.list(x)) {
types <- map_chr(x, typeof)
x[types == "list"] <- map(x[types == "list"], check_r_metadata_types_recursive)
ok <- types %in% allowed_types
if (!all(ok)) {
# Record the invalid types, then remove the offending elements
types_removed <<- c(types_removed, setdiff(types, allowed_types))
x <- x[ok]
}
}
x
}
new <- check_r_metadata_types_recursive(metadata)

# On save: don't warn, just save the filtered metadata
if (on_save) {
return(new)
}
# On load: warn if any elements were removed
if (length(types_removed)) {
types_msg <- paste("Type:", oxford_paste(unique(types_removed)))
if (getOption("arrow.unsafe_metadata", FALSE)) {
# We've opted-in to unsafe metadata, so warn but return the original metadata
rlang::warn(
"R metadata may have unsafe or invalid elements",
body = c("i" = types_msg)
)
new <- metadata
} else {
rlang::warn(
"Potentially unsafe or invalid elements have been discarded from R metadata.",
body = c(
"i" = types_msg,
">" = "If you trust the source, you can set `options(arrow.unsafe_metadata = TRUE)` to preserve them."
)
)
}
}
new
}

#' @importFrom rlang trace_back
apply_arrow_r_metadata <- function(x, r_metadata) {
if (is.null(r_metadata)) {
Expand Down
69 changes: 69 additions & 0 deletions r/tests/testthat/test-metadata.R
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,73 @@ test_that("Garbage R metadata doesn't break things", {
"Invalid metadata$r",
fixed = TRUE
)

bad <- new.env(parent = emptyenv())
makeActiveBinding("columns", function() stop("This should not run"), bad)
tab$metadata <- list(r = rawToChar(serialize(bad, NULL, ascii = TRUE)))
expect_warning(
as.data.frame(tab),
"Invalid metadata$r",
fixed = TRUE
)

# https://hiddenlayer.com/research/r-bitrary-code-execution/
tab$metadata <- list(r = "A
3
262913
197888
5
UTF-8
5
252
6
1
262153
7
message
2
16
1
262153
32
arbitrary\040code\040was\040just\040executed
254
")
expect_message(
expect_warning(
as.data.frame(tab),
"Invalid metadata$r",
fixed = TRUE
),
NA
)
})

test_that("Complex or unsafe attributes are pruned from R metadata, if they exist", {
tab <- Table$create(example_data[1:6])
bad <- new.env()
makeActiveBinding("class", function() stop("This should not run"), bad)
tab$metadata <- list(r = rawToChar(serialize(list(attributes = bad), NULL, ascii = TRUE)))
expect_warning(
as.data.frame(tab),
"Potentially unsafe or invalid elements have been discarded from R metadata.
i Type: \"environment\"
> If you trust the source, you can set `options(arrow.unsafe_metadata = TRUE)` to preserve them.",
fixed = TRUE
)
# You can set an option to allow them through.
# It still warns, just differently, and it doesn't prune the attributes
withr::local_options(list("arrow.unsafe_metadata" = TRUE))
expect_warning(
expect_warning(
as.data.frame(tab),
"R metadata may have unsafe or invalid elements
i Type: \"environment\""
),
# This particular example ultimately fails because it's not a list
"Invalid metadata$r",
fixed = TRUE
)
})

test_that("Metadata serialization compression", {
Expand Down Expand Up @@ -254,6 +321,8 @@ test_that("Row-level metadata (does not) roundtrip in datasets", {
skip_if_not_available("dataset")
skip_if_not_available("parquet")

library(dplyr, warn.conflicts = FALSE)

df <- tibble::tibble(
metadata = list(
structure(1, my_value_as_attr = 1),
Expand Down