Closed
Description
When writing a uint32 column, Parquet's logical type is not written, limiting interoperability with other engines.
Minimal Python reproduction:
import pyarrow as pa
import pyarrow.parquet as pq

data = {"uint32": [1, None, 0]}
schema = pa.schema([pa.field("uint32", pa.uint32())])
t = pa.table(data, schema=schema)
pq.write_table(t, "bla.parquet")
Inspecting it with Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("bla.parquet")
print(df.select("uint32").schema)
shows StructType(List(StructField(uint32,LongType,true))). "LongType" indicates that the field is interpreted as a signed 64-bit integer. Further inspection of the metadata shows that neither convertedType nor logicalType is set. Note that this is independent of the Arrow-specific schema written in the metadata.
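A possible workaround, suggested by the related issue below but not confirmed in this report, is to request a newer Parquet format version when writing, so that pyarrow can annotate the column with the unsigned-integer logical type. The version string "2.4" is an assumption based on that issue (older pyarrow releases used "2.0"):

import pyarrow as pa
import pyarrow.parquet as pq

data = {"uint32": [1, None, 0]}
schema = pa.schema([pa.field("uint32", pa.uint32())])
t = pa.table(data, schema=schema)

# Requesting format version 2.4 should allow the unsigned-integer
# logical type annotation to be written (assumption, see above).
pq.write_table(t, "bla_v2.parquet", version="2.4")

# Inspect the Parquet schema directly (without Spark) to check
# whether a logical type annotation was written for the column.
col = pq.ParquetFile("bla_v2.parquet").schema.column(0)
print(col.name, col.physical_type, col.logical_type)

With the annotation present, engines that honor the logical type should read the column back as an unsigned 32-bit integer rather than widening it to a signed 64-bit one.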
Reporter: Jorge Leitão / @jorgecarleitao
Related issues:
- [C++][Python] Switch default Parquet version to 2.4 (is related to)
Note: This issue was originally created as ARROW-12201. Please see the migration documentation for further details.