Description
Steps to reproduce:
- Run a Python script that writes timestamps to a Parquet file bound for S3 (see the sketch after these steps).
- Due to possible Spectrum scan errors, each column should have its type explicitly defined. This means using df['col1'] = pd.to_datetime(df['col1']).
- Once all types are assigned, upload the file to S3; this can be done via boto3 or by hand, it doesn't matter which.
- From S3, perform a Redshift COPY (Spectrum operation); the table would be pre-created with a timestamp-typed column, etc. For example:
COPY schema.test_table
FROM 's3://bucket/test/test'
IAM_ROLE 'Spectrum_Role_needed'
FORMAT AS PARQUET;
- Query the data; the timestamps appear corrupted.
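A minimal sketch of the write/upload side of these steps (the bucket, key, and column names are placeholders, and boto3 is only one way to do the upload):

import boto3
import pandas as pd

# Placeholder data; the real script writes many more rows and columns.
df = pd.DataFrame({"col1": ["2023-09-11 10:39:00", "2023-10-02 00:00:00"]})

# Assign an explicit datetime type so Spectrum doesn't have to guess.
# pandas stores this as a datetime64[ns] (nanosecond-precision) column.
df["col1"] = pd.to_datetime(df["col1"])

# Write Parquet with the pyarrow engine and upload to S3.
df.to_parquet("test.parquet", engine="pyarrow")
boto3.client("s3").upload_file("test.parquet", "bucket", "test/test")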
From a deep dive, it appears that an integer longer than the usual Unix timestamp is being written, and this pushes the date hundreds of thousands of years into the future.
For example, a usual timestamp might look like 2023-09-11 10:39:00.000000, but the queried timestamp for the same uploaded value is 161065-10-07 20:24:18.134352.
This doesn't happen in Athena if you crawl the data and query via Glue, although the crawler is tricked as well and assumes the type is "bigint".
In this case, for example, if I change the schema to declare that the bigint is really a timestamp, I get a normal value of 2023-10-02 00:00:00.000. Without this change, the same number is 1696204800000000000, which looks like a Unix timestamp, but with nanoseconds.
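As a quick sanity check of the unit mismatch, using the value quoted above (this is not meant to reproduce the exact date Spectrum returns, which depends on what unit the reader assumes):

import pandas as pd

raw = 1696204800000000000

# Interpreted as nanoseconds since the Unix epoch, the integer is a sane timestamp.
print(pd.Timestamp(raw, unit="ns"))      # 2023-10-02 00:00:00

# If a reader treats the same integer as microseconds (or any coarser unit),
# it lands tens of thousands of years in the future instead of ~53 years
# past the epoch.
print(raw / 1_000_000 / 86400 / 365.25)  # ~53,700 (years since 1970)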
It could be that nanosecond timestamps are part of the latest update and that AWS Redshift Spectrum is simply not tracking this change yet.
It would be nice to know whether this is an intended change; I'm opening this issue to confirm whether this is expected behavior.
It's not super clear at a glance, but I am seeing websites updated to recognize nanoseconds. I'm also wondering whether this is related to the epochalypse.
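For what it's worth, a possible workaround (I haven't tested it against Spectrum yet) would be to coerce the timestamps to microseconds at write time; the extra keyword arguments below are passed through by pandas to pyarrow.parquet.write_table:

# Same df as in the sketch above; coerce nanosecond timestamps down to
# microseconds so the Parquet file no longer carries a nanosecond unit.
df.to_parquet(
    "test.parquet",
    engine="pyarrow",
    coerce_timestamps="us",            # write timestamps with microsecond unit
    allow_truncated_timestamps=True,   # drop the sub-microsecond part without raising
)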
Component(s)
Python