ARROW-15409: [C++] The C++ API for writing datasets could be improved #13959
Conversation
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// Options for individual fragment writing.
    - std::shared_ptr<FileWriteOptions> file_write_options;
    + std::shared_ptr<FileWriteOptions> file_write_options =
    +     CsvFileFormat().DefaultWriteOptions();
default to ArrowFileFormat, since CSV is optional? (Though I guess, maybe you can't build datasets without CSV.)
I think parquet is probably the best default format for writing datasets as it is generally going to be friendlier on the disk size. Although I could be convinced that IPC is better since, as you point out, parquet is also an optional component. I very much agree it should not be CSV.
It seems this discussion advocates for there not being a default at all. These file formats have quite different characteristics and users should probably make a conscious choice about them.
In pyarrow we default to parquet but perhaps that is for legacy reasons. I'm fine with no default here. However, we should then get rid of the implicit no-arg constructor and add a single-argument constructor taking in write options since this will now have a required argument. E.g.
    FileSystemDatasetWriteOptions(std::shared_ptr<FileWriteOptions> file_write_options)
        : file_write_options(std::move(file_write_options)) {}
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// The final row group size may be less than this value and other options such as
      /// `max_open_files` or `max_rows_per_file` lead to smaller row group sizes.
    - uint64_t min_rows_per_group = 0;
    + uint64_t min_rows_per_group = 10;
Why 10 and not something larger? Alternatively, why set this by default at all?
If partitioning, it is very possible to end up with tiny row groups. For example, if we partition by year and a batch comes in with 1 million rows spread across 1000 years, you would end up with 1000 row groups of 1000 rows each, which is undesirable.
However, the default here should be 1 << 20 (1Mi).
According to the ticket https://issues.apache.org/jira/browse/ARROW-15409?filter=-1, it suggests setting a value higher than 0. I used 10 as I saw it in some tests; what value would be appropriate here?
@westonpace max_rows_per_group is already 1 << 20, so maybe I need a lower value; for now I will use 1 << 18.
It's fine for max_rows_per_group and min_rows_per_group to have the same value, I think, and 1 << 18 is probably ok. The unit tests used 10 because I needed to test the various features without generating a whole bunch of data (which would be time consuming), but 10 would have poor performance in practice because it means we would need to write a big block of metadata every 10 rows.
cpp/src/arrow/dataset/file_base.h (Outdated)

      /// Template string used to generate fragment basenames.
      /// {i} will be replaced by an auto incremented integer.
    - std::string basename_template;
    + std::string basename_template = "data_{i}.arrow";
The extension should be based on the format. We should add a const std::string& default_extension() method to FileFormat. Then we should default this to the empty string. Then, in the dataset writer, if this is an empty string, we should use "part-{i}." + format.default_extension(). This mimics what is done in python (and we could probably remove the python logic too). Then we should update the docstring for this field to reflect this behavior.
FileFormat has a function called type_name() which currently returns the dataset file format's name, so I think default_extension() is not necessary.
type_name() almost works. However, for IpcFileFormat the type_name is ipc and the extension should be arrow. We could probably update the type_name to be arrow though. It would technically be a backwards incompatible change but I'm not sure if anyone uses this field today.
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.
Add defaults to FileSystemDatasetWriteOptions file_write_options, filesystem, partitioning, basename_template