ARROW-15409: [C++] The C++ API for writing datasets could be improved #13959
Conversation
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// Options for individual fragment writing.
    - std::shared_ptr<FileWriteOptions> file_write_options;
    + std::shared_ptr<FileWriteOptions> file_write_options =
    +     CsvFileFormat().DefaultWriteOptions();
default to ArrowFileFormat, since CSV is optional? (Though I guess, maybe you can't build datasets without CSV.)
I think parquet is probably the best default format for writing datasets as it is generally going to be friendlier on the disk size. Although I could be convinced that IPC is better since, as you point out, parquet is also an optional component. I very much agree it should not be CSV.
It seems this discussion advocates for there not being a default at all. These file formats have quite different characteristics and users should probably make a conscious choice about them.
In pyarrow we default to parquet but perhaps that is for legacy reasons. I'm fine with no default here. However, we should then get rid of the implicit no-arg constructor and add a single-argument constructor taking in write options since this will now have a required argument. E.g.
    FileSystemDatasetWriteOptions(std::shared_ptr<FileWriteOptions> file_write_options)
        : file_write_options(std::move(file_write_options)) {}
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// The final row group size may be less than this value and other options such as
      /// `max_open_files` or `max_rows_per_file` lead to smaller row group sizes.
    - uint64_t min_rows_per_group = 0;
    + uint64_t min_rows_per_group = 10;
Why 10 and not something larger? Alternatively, why set this by default at all?
If partitioning, it is very possible to end up with tiny row groups. For example, if we partition by year and a batch comes in with 1 million rows spread across 1000 years, you would end up with 1000 row groups of 1000 rows each, which is undesirable.
However, the default here should be 1 << 20 (1Mi).
According to the ticket https://issues.apache.org/jira/browse/ARROW-15409?filter=-1, it suggests setting a value higher than 0. I used 10 as I saw it in some tests; what value would be appropriate here?
@westonpace max_rows_per_group is already 1 << 20, so maybe I need a lower value; for now I will use 1 << 18.
It's fine for max_rows_per_group and min_rows_per_group to have the same value, I think, and 1 << 18 is probably ok. The unit tests used 10 because I needed to test the various features without generating a whole bunch of data (which would be time consuming), but 10 would have poor performance in practice because it means we would need to write a big block of metadata every 10 rows.
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// Template string used to generate fragment basenames.
      /// {i} will be replaced by an auto incremented integer.
    - std::string basename_template;
    + std::string basename_template = "data_{i}.arrow";
The extension should be based on the format. We should add a const std::string& default_extension() method to FileFormat. Then we should default this to the empty string. Then, in the dataset writer, if this is an empty string, we should use "part-{i}." + format.default_extension(). This mimics what is done in python (and we could probably remove the python logic too). Then we should update the docstring for this field to reflect this behavior.
FileFormat has a function called type_name() which currently returns the dataset file format's name, so I think default_extension() is not necessary.
type_name() almost works. However, for IpcFileFormat the type_name is ipc and the extension should be arrow. We could probably update the type_name to be arrow though. It would technically be a backwards incompatible change but I'm not sure if anyone uses this field today.
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.
Add defaults to FileSystemDatasetWriteOptions file_write_options, filesystem, partitioning, basename_template