[R] passing a partition as a schema leads to segfaults #27610

@asfimport

Description

The command to open a dataset in R, open_dataset(), accepts both a schema and a partitioning argument. If one accidentally passes a partitioning object as the schema, the call appears to succeed, but any later operation on the dataset segfaults.

Though this is an input error, we should add validation that checks the schema argument is, in fact, a Schema object and raises an error if it is not, so that no one is confronted with a segfault later.
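A minimal sketch of what such a check could look like, in base R. The helper name assert_is_schema() is hypothetical and not part of arrow. The two stand-in objects only mimic the class attributes that real Schema and partitioning objects are assumed to carry, so the snippet runs without arrow installed. It also assumes a NULL schema (the default) should pass through untouched.

```r
# Hypothetical helper; not an existing arrow function.
assert_is_schema <- function(schema) {
  # Assumption: NULL means "infer the schema", so it is allowed through.
  if (!is.null(schema) && !inherits(schema, "Schema")) {
    stop(
      "`schema` must be a Schema object, not an object of class ",
      class(schema)[1],
      call. = FALSE
    )
  }
  invisible(schema)
}

# Stand-ins for illustration only; real objects would come from
# arrow::schema() and arrow::hive_partition(). The class names are assumptions.
fake_schema <- structure(list(), class = "Schema")
fake_partitioning <- structure(list(), class = c("HivePartitioning", "Partitioning"))

# A Schema-classed object passes silently.
assert_is_schema(fake_schema)

# A partitioning object passed by mistake now fails early with a clear message.
msg <- tryCatch(
  assert_is_schema(fake_partitioning),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Calling a check like this at the top of the dataset-opening code path would turn the positional-argument mistake shown below into an immediate R error instead of a later segfault.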

### begin setup 
# note: this exact code is called in test-dataset.R (lines 18-87), so when adding
# the test to that file, you don't need to copy this, but can use the code at
# the bottom of this chunk in that test if you want.
library(arrow) # provides write_parquet(), open_dataset(), hive_partition(), utf8(), uint8()
library(dplyr)

make_temp_dir <- function() {
  path <- tempfile()
  dir.create(path)
  normalizePath(path, winslash = "/")
}

hive_dir <- make_temp_dir()

first_date <- lubridate::ymd_hms("2015-04-29 03:12:39")
df1 <- tibble(
  int = 1:10,
  dbl = as.numeric(1:10),
  lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
  chr = letters[1:10],
  fct = factor(LETTERS[1:10]),
  ts = first_date + lubridate::days(1:10)
)

second_date <- lubridate::ymd_hms("2017-03-09 07:01:02")
df2 <- tibble(
  int = 101:110,
  dbl = c(as.numeric(51:59), NaN),
  lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
  chr = letters[10:1],
  fct = factor(LETTERS[10:1]),
  ts = second_date + lubridate::days(10:1)
)

dir.create(file.path(hive_dir, "subdir", "group=1", "other=xxx"), recursive = TRUE)
dir.create(file.path(hive_dir, "subdir", "group=2", "other=yyy"), recursive = TRUE)
write_parquet(df1, file.path(hive_dir, "subdir", "group=1", "other=xxx", "file1.parquet"))
write_parquet(df2, file.path(hive_dir, "subdir", "group=2", "other=yyy", "file2.parquet"))

### end setup

# This (the correct specification) works just fine
ds <- open_dataset(hive_dir, partitioning = hive_partition(other = utf8(), group = uint8()))
ds$schema

# But if you aren't explicit with the argument names, it looks like everything works...
ds <- open_dataset(hive_dir, hive_partition(other = utf8(), group = uint8()))

# ...but the dataset is malformed and will segfault when you try to interact with it, for example:
ds$schema

Reporter: Jonathan Keane / @jonkeane
Assignee: Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev

PRs and other links:

Note: This issue was originally created as ARROW-11756. Please see the migration documentation for further details.
