Closed
Description
I discovered this while working on #10191. Projecting new columns when writing a dataset works only if the columns they are derived from are also included in the output; otherwise the new columns are written as all nulls. Here's an R-based example:
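library(arrow)
library(dplyr)

# Directory to write the dataset to (not defined in the original report;
# a temporary path is assumed here)
ds_dir <- tempfile()

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

tab <- Table$create(a = 1:5)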
tab %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       a
#   <int>
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls
tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works
tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 2
#       a     b
#   <int> <int>
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure
tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5 NA

Reporter: Neal Richardson / @nealrichardson
Assignee: David Li / @lidavidm
PRs and other links:
Note: This issue was originally created as ARROW-12620. Please see the migration documentation for further details.