[Python] Underscores in partition (string) values are dropped when reading dataset

When reading a partitioned dataset in which the partition column contains string values with underscores, pyarrow seems to drop the underscores from the resulting values.

For example, if I write and then read a dataset as follows:
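```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # provides write_to_dataset and ParquetDataset

df = pd.DataFrame({
    "year_week": ["2019_2", "2019_3"],
    "value": [1, 2]
})

# Write a dataset partitioned on the string column "year_week"
table = pa.Table.from_pandas(df.head())
pq.write_to_dataset(table, 'test', partition_cols=["year_week"])

# Read the partitioned dataset back
table2 = pq.ParquetDataset('test').read()
```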
The resulting `year_week` column in `table2` has lost the underscores, and its values come back as integers (`int64`) rather than strings:
```python
table2[1]  # Gives:

<Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
[

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      0
    ],

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      1
    ]
]
```
Is this intentional behaviour or is this a bug in arrow?
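
One possible explanation, which is an assumption on my part and not confirmed against the pyarrow source: since Python 3.6, `int()` accepts underscores as digit separators (PEP 515), so `int("2019_2")` succeeds and returns `20192`. If partition values are type-inferred by attempting an integer conversion first, strings like these would silently become integers, which matches the `int64` dictionary shown above. The sketch below illustrates this with a hypothetical helper, `infer_partition_value`, which is not pyarrow's actual code:

```python
def infer_partition_value(raw):
    """Hypothetical helper: try an integer conversion, else keep the string."""
    try:
        # int() accepts "_" between digits (PEP 515), so "2019_2" parses fine
        return int(raw)
    except ValueError:
        return raw


values = [infer_partition_value(v) for v in ["2019_2", "2019_3", "week_a"]]
print(values)  # [20192, 20193, 'week_a']
```

Only values that contain something other than digits and underscores between digits (such as `"week_a"`) survive this kind of inference as strings.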

**Reporter**: [Julian de Ruiter](https://issues.apache.org/jira/browse/ARROW-5666)
**Assignee**: [Joris Van den Bossche](https://issues.apache.org/jira/browse/ARROW-5666) / @jorisvandenbossche
#### Related issues:
- [[Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset](https://github.com/apache/arrow/issues/22510) (relates to)
- [[Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim](https://github.com/apache/arrow/issues/17077) (depends upon)

<sub>**Note**: *This issue was originally created as [ARROW-5666](https://issues.apache.org/jira/browse/ARROW-5666). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
