Description
Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset.
Small code example:
In [1]: import pyarrow as pa
In [2]: import pyarrow.dataset as ds
In [3]: table = pa.table({"a": range(10)})
In [4]: table.to_pandas()
Out[4]:
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
In [5]: batches = table.to_batches(max_chunksize=2)
In [6]: ds.write_dataset(batches, "test_dataset_order", format="parquet")
In [7]: ds.dataset("test_dataset_order").to_table().to_pandas()
Out[7]:
a
0 4
1 5
2 8
3 9
4 6
5 7
6 2
7 3
8 0
9 1
Although this might seem normal in the SQL world, typical dataframe users (R, pandas/dask, etc.) will expect row order to be preserved.
Some applications might also rely on this; e.g., with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way. (Note: the dask parquet writer itself doesn't use pyarrow.dataset.write_dataset, so it isn't impacted by this.)
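In the meantime, one possible workaround (a sketch of my own, not an official pyarrow feature; the "__row_idx" helper column name is hypothetical, and this assumes a pyarrow version that has Table.sort_by, 7.0+) is to attach an explicit row-index column before writing and sort on it after reading:

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})

# Attach an explicit row index; it travels with the data through any reordering.
indexed = table.append_column("__row_idx", pa.array(range(len(table))))

ds.write_dataset(indexed.to_batches(max_chunksize=2),
                 "test_dataset_order_indexed", format="parquet")

# Sort on the helper column after reading, then drop it again.
restored = (
    ds.dataset("test_dataset_order_indexed")
    .to_table()
    .sort_by("__row_idx")
    .drop(["__row_idx"])
)
assert restored.column("a").to_pylist() == list(range(10))

This restores the original order regardless of how the writer or scanner reorders batches, at the cost of an extra column and a sort.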
Some discussion about this started in #8305 (ARROW-9782), which changed the writer to write all fragments to a single file instead of one file per fragment.
I am not fully sure what the best way to solve this is, but IMO at least having the option to preserve the order would be good.
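Related to the duplicate report listed below (which observed the reordering with use_threads=True), one mitigation that may help in practice, though as far as I know it is not a documented ordering guarantee, is to disable threading on both the write and the read path:

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})
batches = table.to_batches(max_chunksize=2)

# Single-threaded write and single-threaded scan; in practice this tends to
# keep the batch order, but it is an assumption, not a guaranteed contract.
ds.write_dataset(batches, "test_dataset_order_st", format="parquet",
                 use_threads=False)
result = ds.dataset("test_dataset_order_st").to_table(use_threads=False)
print(result.to_pandas())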
cc @bkietz
Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok
Related issues:
- Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True (is duplicated by)
- [C++] Add ordering information to exec batches (requires)
Note: This issue was originally created as ARROW-10883. Please see the migration documentation for further details.