GH-15256: [C++][Dataset] Add support for writing with Partitioning::Default() #33674

kou · 2023-01-15T13:37:41Z

What changes are included in this PR?

It writes all data into one directory.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

Closes: [C++][Dataset] arrow::dataset::Partitioning::Default() can't be used for writing dataset #15256

github-actions · 2023-01-15T13:38:02Z

Closes: [C++][Dataset] arrow::dataset::Partitioning::Default() can't be used for writing dataset #15256

github-actions · 2023-01-15T13:38:04Z

⚠️ GitHub issue #15256 has been automatically assigned in GitHub to PR creator.

github-actions · 2023-01-15T13:38:04Z

⚠️ GitHub issue #15256 has no components, please add labels for components.

pitrou · 2023-01-17T10:41:34Z

Shouldn't this be given a more descriptive name than "default"?

kou · 2023-01-17T15:37:35Z

"flat"? "nothing"?

jorisvandenbossche · 2023-01-17T16:07:04Z

"Flat" sounds good to me.

cc @westonpace

westonpace · 2023-01-20T19:00:11Z

I'm fine with default. I think I'd prefer "none" over "flat". "flat" implies to me that something is still happening. E.g. there is still some kind of partitioning.

We currently have HivePartitioning and DirectoryPartitioning so how about NoPartitioning?

kou · 2023-01-21T08:08:31Z

I'm OK with NoPartitioning.

jorisvandenbossche · 2023-01-21T09:06:10Z

> I think I'd prefer "none" over "flat". "flat" implies to me that something is still happening. E.g. there is still some kind of partitioning.

There also is still some kind of partitioning, I think? I.e. a single flat directory? I would interpret "No" partitioning as a single file.

westonpace · 2023-01-21T14:36:54Z

@jorisvandenbossche you might be thinking of FilenamePartitioning (which I forgot to mention) which gives you:

x=7_chunk0.parquet
x=7_chunk1.parquet
x=10_chunk0.parquet

This partitioning is only going to split up files when there are too many rows. So, if you set max_rows_per_file to unlimited then you would get a single file for each write. The output will be:

chunk0.parquet
chunk1.parquet
chunk2.parquet

...and there will be no meaningful information in the filenames.
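As a rough illustration of the splitting behavior described above, here is a toy model in plain Python. It does not call Arrow. The function name `chunk_filenames`, its parameters, and the `chunk{i}.parquet` template are assumptions taken from the example listing in this comment, not Arrow's real API or its default file names.

```python
import math


def chunk_filenames(num_rows, max_rows_per_file=None, template="chunk{i}.parquet"):
    """Toy model of a writer that only splits output when there are too many rows.

    max_rows_per_file=None stands in for "unlimited" (hypothetical convention).
    """
    if num_rows <= 0:
        return []
    if max_rows_per_file is None:
        # No row limit: one write produces a single file.
        n_files = 1
    else:
        # Otherwise split into as many files as needed to respect the limit.
        n_files = math.ceil(num_rows / max_rows_per_file)
    # File names carry only a counter, no partition information.
    return [template.format(i=i) for i in range(n_files)]


print(chunk_filenames(2500, max_rows_per_file=1000))
print(chunk_filenames(2500))
```

With a limit of 1000 rows, 2500 rows land in three files; with no limit, the same write yields one file.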

jorisvandenbossche · 2023-01-21T14:57:52Z

No, I was thinking about the latter.
How I interpret it, with the datasets API, we always create a partitioned dataset, since we always create a (possibly nested) directory of files (in contrast to the Parquet write_table API which writes single files). Even if the different files / parts have no meaningful additional information embedded in the name or path, I personally still consider that as partitioned (but again, that's my interpretation, I don't know if that's the common interpretation of "partitioned")

westonpace · 2023-01-21T15:13:33Z

@pitrou can be tiebreaker then :). I don't like FlatPartitioning (I would think this was equivalent to filename partitioning) and Joris is not a fan of NoPartitioning (since he would expect no files). Maybe ChunkPartitioning (since files will still potentially be broken into chunks if needed)?

kou · 2023-01-22T00:07:46Z

Can we use FilenamePartitioning with an empty schema as the default partitioning?

westonpace · 2023-01-22T18:34:33Z

I suppose all partitioning schemes, given an empty schema, should behave exactly the same. That might be a better solution.

For example, someone working with Spark will always want to use the hive partitioning scheme. Sometimes there might not be any partitioning columns. They still would think they are working with "the hive scheme with no columns".

I'm not sure how much this scenario is tested.
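The claim that every scheme collapses to the same layout when the schema is empty can be sketched with a small plain-Python model. This is not Arrow code: `relative_path`, the scheme labels, and the path formats are hypothetical and simply mirror the layouts shown earlier in this thread (including the `x=7_` filename prefix), which may differ from Arrow's exact output.

```python
def relative_path(scheme, fields, values, basename="chunk0.parquet"):
    """Toy model: build the output path for one file under a partitioning scheme.

    fields is the ordered list of partition column names (the "schema");
    values maps each column name to the value for this file.
    """
    if scheme == "directory":
        # One directory level per column, named by the value only.
        segments = [str(values[name]) for name in fields]
        return "/".join(segments + [basename])
    if scheme == "hive":
        # One directory level per column, named key=value.
        segments = [f"{name}={values[name]}" for name in fields]
        return "/".join(segments + [basename])
    if scheme == "filename":
        # No directories; the partition information is a file name prefix.
        prefix = "".join(f"{name}={values[name]}_" for name in fields)
        return prefix + basename
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("directory", "hive", "filename"):
    # With a partition column the schemes differ; with none they agree.
    print(scheme, relative_path(scheme, ["x"], {"x": 7}), relative_path(scheme, [], {}))
```

In this model, an empty field list leaves nothing to encode, so all three schemes write `chunk0.parquet` directly into the base directory.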

jorisvandenbossche · 2023-01-23T09:00:59Z

> I'm not sure how much this scenario is tested.

From Python that is certainly tested, since if you don't pass any partitioning columns in pyarrow.dataset.write_dataset, we create a default partitioning object with an empty schema (and the default is DirectoryPartitioning).

> Maybe ChunkPartitioning (since files will still potentially be broken into chunks if needed)?

The downside is that other schemes like HivePartitioning also break files into chunks, in addition to creating the hive-like directories, so chunking is not a distinguishing feature.

Maybe the original "Default" partitioning is a decent name in the end, since "default" is ambiguous enough to avoid such conflicting interpretations of "flat" or "no" .. ;)

cpcloud · 2023-03-30T16:03:34Z

@kou Thanks for the PR! This has been open for some months now without activity, so I'm going to close it out!

westonpace · 2023-03-31T16:57:22Z

I didn't mean to reopen. @kou can reopen if desired. However, I do think it would be good to resolve this issue.

kou · 2023-03-31T20:57:02Z

@westonpace OK! We need to find a consensus approach to resolve this.
How about having Partitioning::Default() return a DirectoryPartitioning with an empty schema (like PyArrow) instead of an internally defined DefaultPartitioning?

westonpace · 2023-03-31T21:00:52Z

> @westonpace OK! We need to find a consensus approach to resolve this.
> How about having Partitioning::Default() return a DirectoryPartitioning with an empty schema (like PyArrow) instead of an internally defined DefaultPartitioning?

Yes. That will work.

kou · 2023-03-31T22:15:45Z

OK. I'll do it.

kou · 2023-04-01T14:03:04Z

@westonpace Could you review this?

CI failures are unrelated:

* R failures are caused by GH-15280: [C++][Python][GLib] add libarrow_acero containing everything previously in compute/exec #34711
* MinGW failures will be fixed by [C++][Gandiva] Accept LLVM 16 #34768

westonpace

Thanks! We might need #34872 to make sure the tests run.

cpp/src/arrow/dataset/partition_test.cc

westonpace · 2023-04-05T20:55:26Z

CI failures are unrelated.

ursabot · 2023-04-06T16:29:00Z

Benchmark runs are scheduled for baseline = c219863 and contender = 8d8d21f. 8d8d21f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.26%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8d8d21ff ec2-t3-xlarge-us-east-2
[Failed] 8d8d21ff test-mac-arm
[Failed] 8d8d21ff ursa-i9-9960x
[Failed] 8d8d21ff ursa-thinkcentre-m75q
[Finished] c2198630 ec2-t3-xlarge-us-east-2
[Failed] c2198630 test-mac-arm
[Finished] c2198630 ursa-i9-9960x
[Failed] c2198630 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…ing::Default() (apache#33674)

### What changes are included in this PR?

It writes all data into one directory.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: apache#15256

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>

kou requested a review from westonpace January 15, 2023 13:37

github-actions bot added the Component: C++ label Jan 15, 2023

cpcloud closed this Mar 30, 2023

westonpace reopened this Mar 31, 2023

westonpace closed this Mar 31, 2023

kou reopened this Mar 31, 2023

kou force-pushed the cpp-dataset-defaults-write branch from 329d1f5 to 0842f39 Compare March 31, 2023 22:44

github-actions bot added the awaiting review Awaiting review label Mar 31, 2023

kou mentioned this pull request Mar 31, 2023

[C++] The C++ API for writing datasets could be improved #30891

Open

github-actions bot added the Component: GLib label Apr 1, 2023

kou requested a review from AlenkaF as a code owner April 1, 2023 12:52

github-actions bot added the Component: Python label Apr 1, 2023

westonpace approved these changes Apr 3, 2023

cpp/src/arrow/dataset/partition_test.cc (outdated, resolved)

github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 3, 2023

kou and others added 5 commits April 4, 2023 14:01

apacheGH-15256: [C++][Dataset] Add support for writing with Partition…

4ff2d44

…ing::Default() It writes all data into one directory.

Use DictionaryPartitioning with an empty schema

316272c

Remove GADatasetDefaultPartitioning

510a695

Follow default partitioning change

1ced09b

Add empty directory partitioning test

87408a4

Co-Authored-By: Weston Pace <weston.pace@gmail.com>

kou force-pushed the cpp-dataset-defaults-write branch from 2d5e254 to 87408a4 Compare April 4, 2023 05:13

github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Apr 4, 2023

westonpace merged commit 8d8d21f into apache:main Apr 5, 2023

kou deleted the cpp-dataset-defaults-write branch April 6, 2023 02:12
