Open
Description
The encoding option is applied when a single file is read with read_delim_arrow, but not when a folder is opened with open_dataset.
read_delim_arrow creates a reader using CsvTableReader$create (which is the path covered by the package's tests).
open_dataset creates a factory instead, and I'm unable to follow what happens when $Finish() is called.
Also, the documentation ("CsvReadOptions" page) lists the "encoding" option under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
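The snippets below assume a folder "test" containing a Latin-1 encoded file "Test1.txt", whose contents are not included in this report. As a minimal sketch for recreating that setup, the following base R code writes a hypothetical stand-in file as ISO-8859-1 bytes. Only the file name, the two column names and the ";" delimiter come from the report; the data rows are invented for illustration.

```r
# Hypothetical fixture: the real contents of Test1.txt are not given in the report.
dir.create("test", showWarnings = FALSE)

# A header row plus two rows containing non-ASCII characters (e-acute, i-diaeresis).
# The \u escapes produce UTF-8 strings in R.
lines <- c(
  "Column1;Column2",
  "1;caf\u00e9",
  "2;na\u00efve"
)

# Re-encode each line from UTF-8 to ISO-8859-1.
latin1_lines <- iconv(lines, from = "UTF-8", to = "ISO-8859-1")

# Write the bytes unchanged; a binary connection avoids newline translation
# and useBytes = TRUE stops R from re-encoding the strings on output.
con <- file("test/Test1.txt", open = "wb")
writeLines(latin1_lines, con, useBytes = TRUE)
close(con)
```

In ISO-8859-1 the character "é" is the single byte 0xE9, which is not valid UTF-8 on its own, so a reader that ignores the encoding option cannot interpret Column2 as a UTF-8 string.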
library(dplyr)
library(arrow)
# Opens one file just fine:
one_file <- arrow::read_delim_arrow(
  "test/Test1.txt",
  as_data_frame = FALSE,
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_file)
# Opening the folder that contains "Test1.txt" does not work properly:
# Column2 ends up typed as binary
one_folder <- arrow::open_dataset(
  "test",
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_folder)
# Same problem even when the schema is specified explicitly
one_folder_w_schema <- arrow::open_dataset(
  "test",
  schema = Schema$create(Column1 = string(), Column2 = string()),
  format = FileFormat$create(
    "text",
    skip_rows = 1L,
    delimiter = ";",
    column_names = c("Column1", "Column2"),
    read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
  )
)
collect(one_folder_w_schema)
Reporter: Gregoire Leleu
Related issues:
- [C++][Dataset] Support Latin-1 encoding (is blocked by)
Note: This issue was originally created as ARROW-15992. Please see the migration documentation for further details.