[Python] add option for taking all columns from all files in pa.dataset #25528

Description

When PyArrow's dataset function is given a list of Parquet files whose schemas differ, it always infers the dataset schema from the first file in the list, silently ignoring any columns that the first file does not have. Getting all columns from all the files into the same dataset currently requires passing a schema manually or constructing one by iterating over the files and inspecting their columns (a sketch of that workaround follows the reproduction below).

It would be nicer if PyArrow's dataset class had an option to automatically take the union of all columns across the files from which it is constructed.

import numpy as np
import pandas as pd
import pyarrow.dataset as pds

# Two frames sharing col1; df1 additionally has col2, df2 has col3
df1 = pd.DataFrame({
    "col1": np.arange(10),
    "col2": np.random.choice(["a", "b"], size=10)
})
df2 = pd.DataFrame({
    "col1": np.arange(10, 20),
    "col3": np.random.random(size=10)
})
df1.to_parquet("df1.parquet")
df2.to_parquet("df2.parquet")

ff = ["df1.parquet", "df2.parquet"]
# Produces a DataFrame with col1 and col2 but no col3:
# the schema is inferred from df1.parquet alone
pds.dataset(ff, format="parquet").to_table().to_pandas()
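For reference, the manual workaround mentioned above could look like the sketch below. This assumes a PyArrow version that provides pyarrow.unify_schemas, and it requires the column types to be compatible across files.

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

ff = ["df1.parquet", "df2.parquet"]

# Read only each file's footer to get its schema, then merge the schemas
schemas = [pq.read_schema(f) for f in ff]
unified = pa.unify_schemas(schemas)

# With the unified schema, the dataset exposes col1, col2 and col3;
# rows from files lacking a column come back as nulls
pds.dataset(ff, format="parquet", schema=unified).to_table().to_pandas()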

Reporter: David Cortes

Note: This issue was originally created as ARROW-9455. Please see the migration documentation for further details.
