[C++] unify_schemas can't handle int64 + double, affects CSV dataset #30245

@asfimport

Description

A question on Twitter asked "how can I make arrow's csv reader not make int64 for integers". It turned out to originate from a scenario where some CSV files in a directory have only integer values in a column while others have decimals in that column, so the files can't be used together in a dataset.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

ds <- open_dataset(ds_dir, format = "csv")
ds
#> FileSystemDataset with 2 csv files
#> a: int64

## It just picked the schema of the first file
collect(ds)
#> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
#> ../src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
#> ../src/arrow/csv/parser.h:123  status
#> ../src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
#> ../src/arrow/csv/reader.cc:462  internal::UnwrapOrRaise(maybe_decoded_arrays)
#> ../src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
#> ../src/arrow/record_batch.cc:318  ReadNext(&batch)
#> ../src/arrow/record_batch.cc:329  ReadAll(&batches)

## Let's try again and tell it to unify schemas. This should result in a float64 type
ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
#> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
#> ../src/arrow/type.cc:1621  fields_[i]->MergeWith(field)
#> ../src/arrow/type.cc:1684  AddField(field)
#> ../src/arrow/type.cc:1755  builder.AddSchema(schema)
#> ../src/arrow/dataset/discovery.cc:251  Inspect(options.inspect_options)
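
A possible workaround, independent of what `unify_schemas` does, is to pass an explicit schema so that no per-file type inference happens. The following is only a minimal sketch: it assumes an arrow version whose `open_dataset()` forwards the readr-style `skip` argument to the CSV reader (older releases may spell this option differently), and it assumes `skip = 1` is needed because the header line would otherwise be read as data once a schema is supplied.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Recreate the two files from the report above
ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

# Assumption: an explicit schema replaces per-file type inference,
# so both files are read with column a as double
ds <- open_dataset(
  ds_dir,
  format = "csv",
  schema = schema(a = float64()),
  skip = 1  # skip the header row; column names come from the schema
)

result <- collect(ds)
```

This only sidesteps the problem for columns whose type is known up front; it does not change how schemas are merged during discovery.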

Reporter: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-14705. Please see the migration documentation for further details.
