-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-42143: [R] Sanitize R metadata #41969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
67ce55e
b46db7e
3f0773f
df7075f
7f0271c
860333b
f1c4cdd
07155c8
d69c316
e64b85f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -30,7 +30,7 @@ | |||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| out <- serialize(x, NULL, ascii = TRUE) | ||||||||||||
| out <- serialize(safe_r_metadata(x, on_save = TRUE), NULL, ascii = TRUE) | ||||||||||||
|
|
||||||||||||
| # if the metadata is over 100 kB, compress | ||||||||||||
| if (option_compress_metadata() && object.size(out) > 100000) { | ||||||||||||
|
|
@@ -44,23 +44,110 @@ | |||||||||||
| } | ||||||||||||
|
|
||||||||||||
| .deserialize_arrow_r_metadata <- function(x) { | ||||||||||||
| tryCatch( | ||||||||||||
| expr = { | ||||||||||||
| out <- unserialize(charToRaw(x)) | ||||||||||||
|
|
||||||||||||
| # if this is still raw, try decompressing | ||||||||||||
| if (is.raw(out)) { | ||||||||||||
| out <- unserialize(memDecompress(out, type = "gzip")) | ||||||||||||
| } | ||||||||||||
| out | ||||||||||||
| }, | ||||||||||||
| tryCatch(unserialize_r_metadata(x), | ||||||||||||
| error = function(e) { | ||||||||||||
| if (getOption("arrow.debug", FALSE)) { | ||||||||||||
| print(conditionMessage(e)) | ||||||||||||
| } | ||||||||||||
| warning("Invalid metadata$r", call. = FALSE) | ||||||||||||
| NULL | ||||||||||||
| } | ||||||||||||
| ) | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| unserialize_r_metadata <- function(x) { | ||||||||||||
| # Check that this is ASCII serialized data (as in, what we wrote) | ||||||||||||
| if (!identical(substr(unclass(x), 1, 1), "A")) { | ||||||||||||
| stop("Invalid serialized data") | ||||||||||||
| } | ||||||||||||
| out <- safe_unserialize(charToRaw(x)) | ||||||||||||
| # If it's still raw, decompress and unserialize again | ||||||||||||
| if (is.raw(out)) { | ||||||||||||
| decompressed <- memDecompress(out, type = "gzip") | ||||||||||||
| if (!identical(rawToChar(decompressed[1]), "A")) { | ||||||||||||
| stop("Invalid serialized compressed data") | ||||||||||||
| } | ||||||||||||
| out <- safe_unserialize(decompressed) | ||||||||||||
| } | ||||||||||||
| if (!is.list(out)) { | ||||||||||||
| stop("Invalid serialized data: must be a list") | ||||||||||||
| } | ||||||||||||
| safe_r_metadata(out) | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| safe_unserialize <- function(x) { | ||||||||||||
| # By capturing the data in a list, we can inspect it for promises without | ||||||||||||
| # triggering their evaluation. | ||||||||||||
|
||||||||||||
| # By capturing the data in a list, we can inspect it for promises without | |
| # triggering their evaluation. | |
| # By capturing the data in a list, we can minimize the possibility | |
| # that R internals will evaluate any promises present before it | |
| # can be inspected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is more accurate--I can obj <- deserialize(charToRaw()) the data in https://github.com/apache/arrow/pull/41969/files#diff-0386351ec2a20934987de3d32d4aee6fc609fbfbe3af3bf287a66941e8d563a7R121-R141 and the promise doesn't evaluate; it only evaluates if I touch obj. (This is on R 4.3.)
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| stop("Serialized data contains a promise object") | |
| stop("Invalid serialized data: Serialized data contains a promise object") |
Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization, any errors are just trapped if it fails to deserialize and we return NULL with that warning. Happy to revisit that though if there's interest.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on_save is to make if safe_r_metadata is saving metadata or loading metadata, yeah? It would be nice to either have doc strings (even just as a comment) explaining that, or maybe saving = FALSE is slightly more transparent to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 1057b78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not strictly necessary, but would asserting that this is
ARROWbe a bit more obvious?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually about how
base::serialize()works, signifying that it is ASCII:https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AAAAAH Maybe a comment
X for binary serialization and A for ASCII serializationthere?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment on the line above not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe it is. Though I did read it when reviewing and thought we were testing that the string started with
ARROWso it wasn't when I was reading it last night. Not a huge deal either way, I think if someone needs to know this, they would poke at it more