Open
Description
This came from a Twitter question: "How can I make arrow's CSV reader not infer int64 for integers?" The underlying scenario turns out to be a directory of CSVs in which a column holds only integer values in some files but decimals in others, so the files can't be used together in a dataset.
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))
ds <- open_dataset(ds_dir, format = "csv")
ds
#> FileSystemDataset with 2 csv files
#> a: int64
## It just picked the schema of the first file
collect(ds)
#> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
#> ../src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, &value)
#> ../src/arrow/csv/parser.h:123 status
#> ../src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit)
#> ../src/arrow/csv/reader.cc:462 internal::UnwrapOrRaise(maybe_decoded_arrays)
#> ../src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next()
#> ../src/arrow/record_batch.cc:318 ReadNext(&batch)
#> ../src/arrow/record_batch.cc:329 ReadAll(&batches)
## Let's try again and tell it to unify schemas. Should result in a float64 type
ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
#> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
#> ../src/arrow/type.cc:1621 fields_[i]->MergeWith(field)
#> ../src/arrow/type.cc:1684 AddField(field)
#> ../src/arrow/type.cc:1755 builder.AddSchema(schema)
#> ../src/arrow/dataset/discovery.cc:251 Inspect(options.inspect_options)

Reporter: Neal Richardson / @nealrichardson
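
Until schema unification can promote int64 to double, one possible workaround for the reproduction above is to skip type inference and pass an explicit schema. This is a minimal sketch, not a confirmed fix: it assumes that `open_dataset()` forwards `skip = 1` to the CSV reader (older arrow releases may need `skip_rows = 1` instead), and that the header row has to be skipped by hand once a schema is supplied.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Recreate the two files from the report: one integer-only, one with a decimal
ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

# Assumption: an explicit schema replaces per-file type inference, so both
# files are read as float64. With a schema supplied, the header row is no
# longer consumed as column names and must be skipped explicitly.
ds <- open_dataset(
  ds_dir,
  format = "csv",
  schema = schema(a = float64()),
  skip = 1
)

result <- collect(ds)
```

The drawback is that the column types must be known up front, which is exactly what `unify_schemas = TRUE` would ideally make unnecessary.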
Related issues:
- [C++] allow unify schema to coalesce int64 and float64 (duplicates)
- [R] Add option to attempt 32-bit integer type inference in CSV reader (relates to)
PRs and other links:
Note: This issue was originally created as ARROW-14705. Please see the migration documentation for further details.