Skip to content

Conversation

@kou
Copy link
Member

@kou kou commented Jan 15, 2023

What changes are included in this PR?

It writes all data into one directory.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou requested a review from westonpace January 15, 2023 13:37
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #15256 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #15256 has no components, please add labels for components.

@pitrou
Copy link
Member

pitrou commented Jan 17, 2023

Shouldn't this be given a more descriptive name than "default"?

@kou
Copy link
Member Author

kou commented Jan 17, 2023

"flat"? "nothing"?

@jorisvandenbossche
Copy link
Member

"Flat" sounds good to me.

cc @westonpace

@westonpace
Copy link
Member

I'm fine with default. I think I'd prefer "none" over "flat". "flat" implies to me that something is still happening. E.g. there is still some kind of partitioning.

We currently have HivePartitioning and DirectoryPartitioning so how about NoPartitioning?

@kou
Copy link
Member Author

kou commented Jan 21, 2023

I'm OK with NoPartitioning.

@jorisvandenbossche
Copy link
Member

I think I'd prefer "none" over "flat". "flat" implies to me that something is still happening. E.g. there is still some kind of partitioning.

There also is still some kind of partitioning, I think? I.e. a single flat directory? I would interpret "No" partitioning as a single file.

@westonpace
Copy link
Member

@jorisvandenbossche you might be thinking of FilenamePartitioning (which I forgot to mention) which gives you:

x=7_chunk0.parquet
x=7_chunk1.parquet
x=10_chunk0.parquet

This partitioning is only going to split up files when there are too many rows. So, if you set max_rows_per_file to unlimited then you would get a single file for each write. The output will be:

chunk0.parquet
chunk1.parquet
chunk2.parquet

...and there will be no meaningful information in the filenames.

@jorisvandenbossche
Copy link
Member

No, I was thinking about the latter.
How I interpret it, with the datasets API, we always create a partitioned dataset, since we always create a (possibly nested) directory of files (in contrast to the Parquet write_table API which writes single files). Even if the different files / parts have no meaningful additional information embedded in the name or path, I personally still consider that as partitioned (but again, that's my interpretation, I don't know if that's the common interpretation of "partitioned")

@westonpace
Copy link
Member

@pitrou can be tiebreaker then :). I don't like FlatPartitioning (I would think this was equivalent to filename partitioning) and Joris is not a fan of NoPartitioning (since he would expect no files). Maybe ChunkPartitioning (since files will still potentially be broken into chunks if needed)?

@kou
Copy link
Member Author

kou commented Jan 22, 2023

Can we use FilenamePartitioning with an empty schema as the default partitioning?

@westonpace
Copy link
Member

I suppose all partitioning schemes, given an empty schema, should behave exactly the same. That might be a better solution.

For example, someone working with Spark will always want to use the hive partitioning scheme. Sometimes there might not be any partitioning columns. They still would think they are working with "the hive scheme with no columns".

I'm not sure how much this scenario is tested.

@jorisvandenbossche
Copy link
Member

I'm not sure how much this scenario is tested.

From Python that is certainly tested, since if you don't pass any partitioning columns in pyarrow.dataset.write_dataset, we create a default partitioning object with an empty schema (and the default is DirectoryPartitioning).


Maybe ChunkPartitioning (since files will still potentially be broken into chunks if needed)?

The downside of that is that also for other schemes like HivePartitioning files also get broken into chunks in addition to the hive-like directories, so that is not a distinguishing feature.

Maybe the original "Default" partitioning is a decent name in the end, since "default" is ambiguous enough to avoid such conflicting interpretations of "flat" or "no" .. ;)

@cpcloud
Copy link
Contributor

cpcloud commented Mar 30, 2023

@kou Thanks for the PR! This has been open for some months now without activity, so I'm going to close it out!

@cpcloud cpcloud closed this Mar 30, 2023
@westonpace westonpace reopened this Mar 31, 2023
@westonpace westonpace closed this Mar 31, 2023
@westonpace
Copy link
Member

I didn't mean to reopen. @kou can reopen if desired. However, I do think it would be good to resolve this issue.

@kou
Copy link
Member Author

kou commented Mar 31, 2023

@westonpace OK! We need to find a consensus approach to resolve this.
How about returning a DictionaryPartitioning with an empty schema (like PyArrow) by Partitioning::Default() instead of a DefaultPartitioning defined internally?

@westonpace
Copy link
Member

@westonpace OK! We need to find a consensus approach to resolve this.
How about returning a DictionaryPartitioning with an empty schema (like PyArrow) by Partitioning::Default() instead of a DefaultPartitioning defined internally?

Yes. That will work.

@kou
Copy link
Member Author

kou commented Mar 31, 2023

OK. I'll do it.

@kou
Copy link
Member Author

kou commented Apr 1, 2023

@westonpace Could you review this?

CI failures are unrelated:

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We might need #34872 to make sure the tests run.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 3, 2023
@kou kou force-pushed the cpp-dataset-defaults-write branch from 2d5e254 to 87408a4 Compare April 4, 2023 05:13
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Apr 4, 2023
@westonpace
Copy link
Member

CI failures are unrelated.

@westonpace westonpace merged commit 8d8d21f into apache:main Apr 5, 2023
@kou kou deleted the cpp-dataset-defaults-write branch April 6, 2023 02:12
@ursabot
Copy link

ursabot commented Apr 6, 2023

Benchmark runs are scheduled for baseline = c219863 and contender = 8d8d21f. 8d8d21f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.26%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8d8d21ff ec2-t3-xlarge-us-east-2
[Failed] 8d8d21ff test-mac-arm
[Failed] 8d8d21ff ursa-i9-9960x
[Failed] 8d8d21ff ursa-thinkcentre-m75q
[Finished] c2198630 ec2-t3-xlarge-us-east-2
[Failed] c2198630 test-mac-arm
[Finished] c2198630 ursa-i9-9960x
[Failed] c2198630 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…ing::Default() (apache#33674)

### What changes are included in this PR?

It writes all data into one directory.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: apache#15256

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Dataset] arrow::dataset::Partitioning::Default() can't be used for writing dataset

6 participants