
Conversation

@AlvinJ15 (Contributor):

Add defaults to FileSystemDatasetWriteOptions: file_write_options, filesystem, partitioning, basename_template

@github-actions (bot):

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@AlvinJ15 changed the title from "ARROW-14742: [C++] Allow ParquetWriter to take a RecordBatchReader as input" to "ARROW-15409: [C++] The C++ API for writing datasets could be improved" on Aug 25, 2022.

  /// Options for individual fragment writing.
- std::shared_ptr<FileWriteOptions> file_write_options;
+ std::shared_ptr<FileWriteOptions> file_write_options =
+     CsvFileFormat().DefaultWriteOptions();
Member:

default to ArrowFileFormat, since CSV is optional? (Though I guess, maybe you can't build datasets without CSV.)

Member:

I think parquet is probably the best default format for writing datasets as it is generally going to be friendlier on the disk size. Although I could be convinced that IPC is better since, as you point out, parquet is also an optional component. I very much agree it should not be CSV.

Member:

It seems this discussion advocates for there not being a default at all. These file formats have quite different characteristics and users should probably make a conscious choice about them.

Member:

In pyarrow we default to parquet but perhaps that is for legacy reasons. I'm fine with no default here. However, we should then get rid of the implicit no-arg constructor and add a single-argument constructor taking in write options since this will now have a required argument. E.g.

FileSystemDatasetWriteOptions(std::shared_ptr<FileWriteOptions> file_write_options) : file_write_options(std::move(file_write_options)) {}
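For illustration, a call site under that change might look like this (a sketch only; the choice of ParquetFileFormat is an example, not part of the proposal, and the constructor used is the one suggested above, which does not exist yet):

#include <memory>

#include "arrow/dataset/file_base.h"     // FileSystemDatasetWriteOptions
#include "arrow/dataset/file_parquet.h"  // ParquetFileFormat

// Construct write options explicitly, since there would no longer be a
// default format. DefaultWriteOptions() is an existing FileFormat method.
auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
arrow::dataset::FileSystemDatasetWriteOptions write_options(
    format->DefaultWriteOptions());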

  /// The final row group size may be less than this value and other options such as
  /// `max_open_files` or `max_rows_per_file` lead to smaller row group sizes.
- uint64_t min_rows_per_group = 0;
+ uint64_t min_rows_per_group = 10;
Member:

Why 10 and not something larger? Alternatively, why set this by default at all?

Member:

If partitioning, it is very possible to end up with tiny row groups. For example, if we partition by year and a batch comes in with 1 million rows spread across 1000 years, you would end up with 1000 row groups of 1000 rows each, which is undesirable.

However, the default here should be 1 << 20 (1Mi)
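To make the scenario concrete, a sketch of the guard (HivePartitioning and the field name "year" are illustrative; min_rows_per_group is the field under discussion):

#include <memory>

#include "arrow/api.h"
#include "arrow/dataset/file_base.h"
#include "arrow/dataset/partition.h"

// Partitioning on "year" can split one large batch across many directories;
// min_rows_per_group makes the writer accumulate rows per partition instead
// of flushing a tiny row group for each slice.
arrow::dataset::FileSystemDatasetWriteOptions options;
options.partitioning = std::make_shared<arrow::dataset::HivePartitioning>(
    arrow::schema({arrow::field("year", arrow::int32())}));
options.min_rows_per_group = 1 << 20;  // the 1Mi default suggested above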

@AlvinJ15 (Contributor Author) · Aug 26, 2022:

The ticket https://issues.apache.org/jira/browse/ARROW-15409?filter=-1 suggests setting a value higher than 0. I used 10 because I saw it in some tests; what value would be appropriate here?
@westonpace max_rows_per_group is already 1 << 20, so maybe I should use a lower value; for now I will use 1 << 18.

Member:

It's fine for max_rows_per_group and min_rows_per_group to have the same value I think. 1 << 18 is probably ok. The unit tests used 10 because I needed to be able to test the various features without generating a whole bunch of data (which would be time consuming) but 10 would have poor performance in practice because that means we would need to write a big block of metadata every 10 rows.
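Putting the numbers from this exchange together (a sketch of the values under discussion, not final committed defaults):

// Values discussed above: min may equal max; 10 is only suitable for tests.
arrow::dataset::FileSystemDatasetWriteOptions options;
options.min_rows_per_group = 1 << 18;  // 256Ki, per the follow-up above
options.max_rows_per_group = 1 << 20;  // 1Mi, the existing default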

  /// Template string used to generate fragment basenames.
  /// {i} will be replaced by an auto incremented integer.
- std::string basename_template;
+ std::string basename_template = "data_{i}.arrow";
Member:

The extension should be based on the format. We should add a const std::string& default_extension() method to FileFormat. Then we should default this to the empty string. Then, in the dataset writer, if this is an empty string, we should use "part-{i}." + format.default_extension(). This mimics what is done in python (and we could probably remove the python logic too). Then we should update the docstring for this field to reflect this behavior.
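A sketch of that fallback as it might sit in the dataset writer (default_extension() is the method proposed in this comment, not an existing FileFormat member; write_options and format stand for the surrounding writer state):

// If the user left basename_template empty, derive one from the format,
// mirroring the pyarrow behavior mentioned above.
std::string basename_template = write_options.basename_template;
if (basename_template.empty()) {
  basename_template = "part-{i}." + format->default_extension();
}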

@AlvinJ15 (Contributor Author):

FileFormat has a function called type_name() which currently returns the dataset file format name, so I think default_extension() is not necessary.

Member:

type_name() almost works. However, for IpcFileFormat the type_name is ipc and the extension should be arrow. We could probably update the type_name to be arrow though. It would technically be a backwards incompatible change but I'm not sure if anyone uses this field today.
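The mismatch in concrete terms (type_name() values as of this thread):

arrow::dataset::IpcFileFormat ipc;
// ipc.type_name() returns "ipc", so "data_{i}." + ipc.type_name() would yield
// "data_{i}.ipc", while the conventional extension for IPC files is ".arrow".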

@github-actions (bot):

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

github-actions bot added the Status: stale-warning label on Nov 18, 2025.
github-actions bot closed this on Dec 7, 2025.