Skip to content

[Python] pandas to_parquet() writes Timestamps in a method unreadable by Redshift Spectrum #38000

@IkeNefcy

Description

@IkeNefcy

Describe the usage question you have. Please include as many useful details as possible.

Steps to reproduce:

  1. Run a python script that will write timestamps to a parquet via s3.
  2. Due to possible spectrum scan errors, each column should have the type defined. This means using df.['col1'] = pd.datetime(df.['col1'])
  3. Once all types are assigned, Upload to s3, this can be via boto3 or by hand, it doesn't matter which.
  4. From s3, perform a redshift copy (spectrum operation) (Table would be pre-created with a timestamp type column etc).
    example:
copy
    schema.test_table
from
    's3://bucket/test/test'
iam_role 'Spectrum_Role_neded'
FORMAT AS PARQUET;
  1. Query the data, the timestamps appear corrupted.

From a deep dive it appears that an integer longer than the usual Unix timestamp is being written and this is causing the date to hundreds of thousands of years in the future.
For example a usual timestamp might look like 2023-09-11 10:39:00.000000
But the queried timestamp is 161065-10-07 20:24:18.134352 for the same uploaded timestamp.

This doesn't happen in Athena if you crawl the data and query via glue. Although, the crawler will be tricked as well and assume the type is "bigint".
In this case, for example
A normal timestamp if I change the schema to show that the bigint is really a time stamp, I can get 2023-10-02 00:00:00.000
The same number without this change, is 1696204800000000000, this looks similar to a unix timestamp, but with nano seconds.

It could be that nano seconds are part of the latest update and that aws redshift_spectrum is just not tracking on this change.
It would be nice to know if this is an intended change or not though, just opening this to confirm if this is expected behavior.
It's not super clear at a glance but I am seeing websites updated to recognize nano seconds. I'm wondering also if this is related to the epochalypse.

Component(s)

Python

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions