Type Safety: is silent truncation a bug? #2217

@dhirschfeld

Happy to move to JIRA if this is confirmed as a bug

In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
   ...: df
Out[9]:
   A  B
0  a  0
1  b  1
2  c  2

In [10]: schema = arw.schema([
    ...:     arw.field('A', arw.string()),
    ...:     arw.field('B', arw.int32()),
    ...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True

...so if the schema matches the pandas datatypes all is well - we can roundtrip the DataFrame.

Now, say we have some bad data such that column 'B' is of type float64. The datatypes of the DataFrame no longer match the explicitly supplied schema object, but rather than raising a TypeError the data is silently truncated: the round-tripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!

In [13]: df['B'].iloc[0] = 1.23
    ...: df
Out[13]:
   A     B
0  a  1.23
1  b  1.00
2  c  2.00

In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
    ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
            b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
Out[15]:
   A  B
0  a  1
1  b  1
2  c  2

To be clear, I would really like Table.from_pandas to raise a TypeError if the DataFrame types don't match an explicitly supplied schema and would hope this current behaviour would be considered a bug.

win64/py36
arrow 0.9.0
