@hussein-awala
Member

closes: #34483

  o = convert_to_utc(o)

- tz = o.tzname()
+ tz = o.tzinfo.name  # type: ignore
Contributor

Dumb solution which should work for different types of tzinfo

Suggested change
- tz = o.tzinfo.name # type: ignore
+ from pendulum.tz.timezone import FixedTimezone, Timezone
+ tzi = o.tzinfo
+ if isinstance(tzi, FixedTimezone):
+     tz = tzi.offset or "UTC"
+ elif isinstance(tzi, Timezone):
+     tz = tzi.name
+ else:
+     tz = int(o.utcoffset().total_seconds()) or "UTC"

Contributor

And just for the record, I don't think we should use local imports for datetime and pendulum in this module, because we already import airflow.utils.timezone, which imports all of datetime and some modules from pendulum (at least Timezone/Datetime).

@bolkedebruin
Contributor

We should probably re-use the code in the timezone serializer (timezone.py), which does the right thing. It's close to @Taragolis's implementation, but I would prefer to rely on one and not two.

@bolkedebruin
Contributor

bolkedebruin commented Sep 20, 2023

My suggestion:

  1. Use serialize from timezone.py to serialize the timezone information
  2. Increment the version
  3. For the older version, use the old-style deserialization AND, to be nice to people experiencing issues now, provide a mapping that is applied when needed (i.e. when the tz name is one of these):
Eastern = USTimeZone(-5, "Eastern", "EST", "EDT")
Central = USTimeZone(-6, "Central", "CST", "CDT")
Mountain = USTimeZone(-7, "Mountain", "MST", "MDT")
Pacific = USTimeZone(-8, "Pacific", "PST", "PDT")

(source: https://github.com/stub42/pytz/blob/1acdc7f5ab3f4af063ddcd435f79a48a4a8ce079/src/pytz/reference.py#L137)
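
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the code that ended up in the serializer: the payload shape ({"timestamp": ..., "tz": ...}), the name LEGACY_TZ_OFFSET_HOURS, and the function deserialize_datetime are hypothetical, and the standard library's datetime.timezone stands in for a Pendulum timezone so the snippet has no third-party dependencies.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping for illustration: legacy US abbreviations -> UTC offset in hours.
LEGACY_TZ_OFFSET_HOURS = {
    "EST": -5, "EDT": -4,
    "CST": -6, "CDT": -5,
    "MST": -7, "MDT": -6,
    "PST": -8, "PDT": -7,
}


def deserialize_datetime(version: int, data: dict) -> datetime:
    """Rebuild an aware datetime from a simplified {"timestamp", "tz"} payload."""
    tz = data["tz"]
    if version == 1 and tz in LEGACY_TZ_OFFSET_HOURS:
        # Version 1 payloads may carry a tzname() abbreviation; map it to a fixed offset.
        tzinfo = timezone(timedelta(hours=LEGACY_TZ_OFFSET_HOURS[tz]))
    elif isinstance(tz, int):
        # Assumption: newer payloads carry the UTC offset in seconds.
        tzinfo = timezone(timedelta(seconds=tz))
    elif tz == "UTC":
        tzinfo = timezone.utc
    else:
        raise ValueError(f"unsupported timezone in this sketch: {tz!r}")
    return datetime.fromtimestamp(data["timestamp"], tz=tzinfo)


legacy = deserialize_datetime(1, {"timestamp": 1657505443.0, "tz": "EDT"})
modern = deserialize_datetime(2, {"timestamp": 0.0, "tz": 3600})
```

Keeping the abbreviation mapping behind the version check means new payloads never depend on ambiguous short names.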

@bolkedebruin
Contributor

are you up for that @hussein-awala ?

@Taragolis
Contributor

Taragolis commented Sep 20, 2023

Initially I thought that it would also be a good idea to reuse the serializer; however, there is one small but important thing.

When you have a timezone (a subclass of tzinfo), there are only a limited number of ways to extract the exact timezone, because this depends on the implementation. E.g. classes from pendulum.tz.timezone could store either an IANA timezone (Timezone) or an offset from UTC (FixedTimezone), and it's pretty logical that we only support the Pendulum implementation.

However, if we work with a datetime-aware (updated from datetime-naive 🤦) datetime and we don't know how to extract the name (datetime.timezone, zoneinfo, pytz, etc.), we could always calculate the exact offset to UTC by calling the utcoffset() method, and this value could be used in deserialization to a Pendulum Timezone.

So if we would like to create a generic timezone extraction, we could create this method in airflow.utils.timezone and extract depending on the input type: tzinfo or datetime.datetime.
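
A rough sketch of such a generic helper, dispatching on the input type. The function name extract_timezone is hypothetical, and the name lookup is duck-typed (a `name` or `key` attribute) instead of importing Pendulum, so a real version would probably check Pendulum's FixedTimezone and Timezone explicitly, as in the suggested change earlier in this thread.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone, tzinfo


def extract_timezone(obj: datetime | tzinfo) -> str | int:
    """Return a timezone name if one is exposed, else the UTC offset in seconds."""
    if isinstance(obj, datetime):
        tzi = obj.tzinfo
        if tzi is None:
            raise ValueError("naive datetime has no timezone")
        offset = obj.utcoffset()
    elif isinstance(obj, tzinfo):
        tzi = obj
        # Without a datetime, only fixed-offset implementations can answer this.
        offset = obj.utcoffset(None)
    else:
        raise TypeError(f"unsupported type: {type(obj).__name__}")

    # Duck typing: some implementations expose the zone name as .name or .key.
    for attr in ("name", "key"):
        name = getattr(tzi, attr, None)
        if isinstance(name, str):
            return name

    if offset is None:
        raise ValueError("offset cannot be determined without a datetime")
    # Mirror the suggestion above: a zero offset is reported as "UTC".
    return int(offset.total_seconds()) or "UTC"


from_datetime = extract_timezone(datetime(2021, 6, 1, tzinfo=timezone(timedelta(hours=1))))
from_tzinfo = extract_timezone(timezone(timedelta(hours=-4)))
from_utc = extract_timezone(timezone.utc)
```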

@bolkedebruin
Contributor

> Initially I thought that it would also be a good idea to reuse the serializer; however, there is one small but important thing.
>
> When you have a timezone (a subclass of tzinfo), there are only a limited number of ways to extract the exact timezone, because this depends on the implementation. E.g. classes from pendulum.tz.timezone could store either an IANA timezone (Timezone) or an offset from UTC (FixedTimezone), and it's pretty logical that we only support the Pendulum implementation.
>
> However, if we work with a datetime-naive datetime and we don't know how to extract the name (datetime.timezone, zoneinfo, pytz, etc.), we could always calculate the exact offset to UTC by calling the utcoffset() method, and this value could be used in deserialization to a Pendulum Timezone.
>
> So if we would like to create a generic timezone extraction, we could create this method in airflow.utils.timezone and extract depending on the input type: tzinfo or datetime.datetime.

I'm not sure if I follow. We already convert naive dates to UTC (pendulum) datetimes. And we explicitly do not support other libraries than Pendulum since the integration of timezones into Airflow. So I don't think your second case makes sense?

Refactoring into airflow.utils.timezone does make sense. We would end up with a wrapper that basically does the same as FixedTimezone and Timezone

@Taragolis
Contributor

@bolkedebruin Sorry for the confusion. I meant datetime-aware, but for some reason unknown to me I wrote naive 🤦 🤦

@bolkedebruin
Contributor

no worries! maybe we should see a first iteration and then figure out if we want to support non-pendulum timezone aware datetimes. I do not think we should, due to all the issues with the other implementations, but open to discussion.

@hussein-awala
Member Author

> are you up for that @hussein-awala ?

Yes, I will try to implement what you both suggested and add some unit tests

@Taragolis
Contributor

> maybe we should see a first iteration and then figure out if we want to support non-pendulum timezone aware datetimes. I do not think we should, due to all the issues with the other implementations, but open to discussion.

I'm just worried that if we only serialize datetime.datetime and pendulum.DateTime with a Pendulum Timezone, we could potentially break someone's pipeline, because right now it is possible to serialize a datetime with a UTC timezone without any issues, regardless of the tzinfo implementation. And different libraries use different tzinfo implementations for timezones, for example:

  • boto3: python-dateutil
  • psycopg2: own implementation
  • psycopg (formally v3): zoneinfo.ZoneInfo and backports.zoneinfo.ZoneInfo (Python < 3.9)

In general, you can compare datetime.datetime objects with different tzinfo implementations:

from dateutil.tz import tzutc
from datetime import datetime, timezone, timedelta
from pendulum.tz.timezone import Timezone
from zoneinfo import ZoneInfo
from pendulum.tz import timezone as ptimezone
from psycopg2.tz import FixedOffsetTimezone
from pytz import UTC

d1 = datetime(2021, 1, 1, tzinfo=timezone.utc)
d2 = datetime(2021, 1, 1, tzinfo=Timezone("UTC"))
d3 = datetime(2021, 1, 1, tzinfo=tzutc())
d4 = datetime(2021, 1, 1, tzinfo=ZoneInfo("UTC"))
d5 = datetime(2021, 1, 1, tzinfo=FixedOffsetTimezone(offset=0))
d6 = datetime(2021, 1, 1, tzinfo=UTC)

assert d1 == d2 == d3 == d4 == d5 == d6
assert d1.tzname() == d2.tzname() == d3.tzname() == d4.tzname() == d6.tzname()
assert d5.tzname() == "+00"  # psycopg2 uses Python 2.3 implementation ;-)

assert d1.utcoffset() == d2.utcoffset()
assert d1.utcoffset() == d3.utcoffset()
assert d1.utcoffset() == d4.utcoffset()
assert d1.utcoffset() == d5.utcoffset()
assert d1.utcoffset() == d6.utcoffset()

dl1 = datetime(2021, 6, 1, tzinfo=Timezone("Europe/London"))
dl2 = datetime(2021, 6, 1, tzinfo=ZoneInfo("Europe/London"))
dl3 = datetime(2021, 6, 1, tzinfo=timezone(timedelta(hours=1)))

assert dl1 == dl2 == dl3

dl1_from_name = datetime(2021, 6, 1, tzinfo=ptimezone(dl1.tzinfo.name))
dl1_from_offset = datetime(2021, 6, 1, tzinfo=ptimezone(int(dl1.utcoffset().total_seconds())))

assert dl1 == dl1_from_name == dl1_from_offset

dl2_from_name = datetime(2021, 6, 1, tzinfo=ptimezone(dl2.tzinfo.key))
dl2_from_offset = datetime(2021, 6, 1, tzinfo=ptimezone(int(dl2.utcoffset().total_seconds())))

assert dl2 == dl2_from_name == dl2_from_offset

dl3_from_offset = datetime(2021, 6, 1, tzinfo=ptimezone(int(dl3.utcoffset().total_seconds())))
assert dl3 == dl3_from_offset

Maybe we should try to serialize any datetime.datetime, with the result of deserialization always using a Pendulum timezone regardless of the original one.

When serializing just a timezone, I think it is fine to serialize only the Pendulum one, and once our minimum is Python 3.9 we could discuss whether we should also serialize zoneinfo.ZoneInfo.

@hussein-awala hussein-awala force-pushed the fix_date_deserialization branch from b4b1ee0 to 1bcb521 Compare September 20, 2023 21:27
Comment on lines +86 to +91
s["__data__"]["tz"] = "EDT"
d = deserialize(s)
assert d.timestamp() == 1657505443.0
assert d.tzinfo.name == "-04:00"
# assert that it's serializable with the new format
assert deserialize(serialize(d)) == d
Member Author

This test ensures that the current version of the datetime serializer fixes the bug by deserializing the unsupported US timezones, and that the deserialized values are serializable with the new format (for example, if the user reads an XCom serialized with version 1 and returns it or sends it to a new XCom).

@hussein-awala
Member Author

@Taragolis for your last comment; based on this:

> We already convert naive dates to UTC (pendulum) datetimes. And we explicitly do not support other libraries than Pendulum since the integration of timezones into Airflow.

I wonder whether we should support these classes or not and, if so, whether we should include that in this bug fix or add it in a separate PR as a new feature.

@hussein-awala hussein-awala marked this pull request as ready for review September 20, 2023 21:51
@Taragolis
Contributor

I don’t think we should support all these classes:

  • We don’t know the exact number of these classes
  • Not all of them are direct Airflow dependencies
  • Some of them might be deprecated
  • I think it could be a problem if we add ZoneInfo now and a user serializes it in Python 3.8 (backport package) and afterwards tries to deserialize it in Python 3.9+

What we could do better is serialize the offset in seconds, which could afterwards be deserialized as a Pendulum Timezone.

@Taragolis
Contributor

But we could only retrieve the offset from a datetime object.
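
A small illustration of why the offset needs a datetime. ToySummerTime is a made-up tzinfo subclass (not a real zone) whose offset depends on the month, which is enough to show that a rule-based timezone cannot report a single offset on its own.

```python
from datetime import datetime, timedelta, tzinfo


class ToySummerTime(tzinfo):
    """Hypothetical zone: UTC+1 from April through September, UTC+0 otherwise."""

    def utcoffset(self, dt):
        if dt is None:
            # No datetime given, so the seasonal rule cannot be evaluated.
            return None
        return timedelta(hours=1) if 4 <= dt.month <= 9 else timedelta(0)

    def dst(self, dt):
        return self.utcoffset(dt)

    def tzname(self, dt):
        return "TOY"


zone = ToySummerTime()
summer = datetime(2021, 6, 1, tzinfo=zone)
winter = datetime(2021, 1, 1, tzinfo=zone)

bare_offset = zone.utcoffset(None)  # the timezone alone has no single offset
summer_seconds = int(summer.utcoffset().total_seconds())
winter_seconds = int(winter.utcoffset().total_seconds())
```

The same timezone object yields two different offsets here, so an offset in seconds describes one datetime, not the zone's rules.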

@Taragolis
Contributor

And in general I think we should fix only the current issue. Afterwards we can discuss whether we should make further changes.

Contributor

@bolkedebruin bolkedebruin left a comment

Seems pretty good already! Some questions / nits.

@bolkedebruin
Contributor

If you can adjust the nit, I'm okay with merging @hussein-awala

@hussein-awala
Member Author

> If you can adjust the nit, I'm okay with merging @hussein-awala

I removed pytz, should I remove the deserialization for the unsupported timezones or keep them?

Change naming to properly reflect the mapping
@eladkal eladkal added this to the Airflow 2.7.2 milestone Sep 28, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Sep 28, 2023
@bolkedebruin bolkedebruin merged commit a3c06c0 into apache:main Sep 28, 2023
@ferruzzi
Contributor

ferruzzi commented Sep 28, 2023

The DynamoDB to S3 system test has been failing with an "invalid timezone" exception since this got merged. I'm looking into it.



ERROR    airflow.executors.debug_executor.DebugExecutor:debug_executor.py:92 Failed to execute task: Invalid timezone.
--
Traceback (most recent call last):
  File "/opt/airflow/airflow/executors/debug_executor.py", line 86, in _run_task
    ti.run(job_id=ti.job_id, **params)
  File "/opt/airflow/airflow/utils/session.py", line 79, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/airflow/airflow/models/taskinstance.py", line 2507, in run
    self._run_raw_task(
  File "/opt/airflow/airflow/utils/session.py", line 76, in wrapper
    return func(*args, **kwargs)
  File "/opt/airflow/airflow/models/taskinstance.py", line 2246, in _run_raw_task
    self._execute_task_with_callbacks(context, test_mode, session=session)
  File "/opt/airflow/airflow/models/taskinstance.py", line 2375, in _execute_task_with_callbacks
    task_orig = self.render_templates(context=context)
  File "/opt/airflow/airflow/models/taskinstance.py", line 2787, in render_templates
    original_task.render_template_fields(context)
  File "/opt/airflow/airflow/models/baseoperator.py", line 1248, in render_template_fields
    self._do_render_template_fields(self, self.template_fields, context, jinja_env, set())
  File "/opt/airflow/airflow/utils/session.py", line 79, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/airflow/airflow/models/abstractoperator.py", line 699, in _do_render_template_fields
    rendered_content = self.render_template(
  File "/opt/airflow/airflow/template/templater.py", line 157, in render_template
    return value.resolve(context)
  File "/opt/airflow/airflow/utils/session.py", line 79, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/airflow/airflow/models/xcom_arg.py", line 417, in resolve
    result = ti.xcom_pull(
  File "/opt/airflow/airflow/utils/session.py", line 76, in wrapper
    return func(*args, **kwargs)
  File "/opt/airflow/airflow/models/taskinstance.py", line 2967, in xcom_pull
    return XCom.deserialize_value(first)
  File "/opt/airflow/airflow/models/xcom.py", line 696, in deserialize_value
    return BaseXCom._deserialize_value(result, False)
  File "/opt/airflow/airflow/models/xcom.py", line 689, in _deserialize_value
    return json.loads(result.value.decode("UTF-8"), cls=XComDecoder, object_hook=object_hook)
  File "/usr/local/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/opt/airflow/airflow/utils/json.py", line 117, in object_hook
    return deserialize(dct)
  File "/opt/airflow/airflow/serialization/serde.py", line 255, in deserialize
    return _deserializers[classname].deserialize(classname, version, deserialize(value))
  File "/opt/airflow/airflow/serialization/serializers/datetime.py", line 86, in deserialize
    tz = deserialize_timezone(data[TIMEZONE][1], data[TIMEZONE][2], data[TIMEZONE][0])
  File "/opt/airflow/airflow/serialization/serializers/timezone.py", line 72, in deserialize
    return timezone(data)
  File "/usr/local/lib/python3.8/site-packages/pendulum/tz/__init__.py", line 37, in timezone
    tz = _Timezone(name, extended=extended)
  File "/usr/local/lib/python3.8/site-packages/pendulum/tz/timezone.py", line 40, in __init__
    tz = read(name, extend=extended)
  File "/usr/local/lib/python3.8/site-packages/pendulum/tz/zoneinfo/__init__.py", line 9, in read
    return Reader(extend=extend).read_for(name)
  File "/usr/local/lib/python3.8/site-packages/pendulum/tz/zoneinfo/reader.py", line 50, in read_for
    file_path = pytzdata.tz_path(timezone)
  File "/usr/local/lib/python3.8/site-packages/pytzdata/__init__.py", line 63, in tz_path
    raise ValueError('Invalid timezone')
ValueError: Invalid timezone

@Taragolis
Contributor

I think that happens because boto3 / botocore use the dateutil tzinfo implementation, see: #34492 (comment)

@ferruzzi
Contributor

Looks like maybe we just need to tweak how DynamoDBToS3Operator handles the export_time argument to fix the issue I am seeing, but heads up that other things may have been broken.

ERROR airflow.task.operators:abstractoperator.py:707 Exception rendering Jinja template for task 'backup_db_to_point_in_time', field 'export_time'. Template: XComArg(<Task(_PythonDecoratedOperator): get_export_time>)

ephraimbuddy pushed a commit that referenced this pull request Oct 5, 2023
tzname() does not return full timezones and returned short hand notations are not deterministic. This changes the serialization to be deterministic and adds some logic to deal with serialized short-hand US Timezones and CEST.

---------

Co-authored-by: bolkedebruin <bolkedebruin@users.noreply.github.com>
(cherry picked from commit a3c06c0)
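
As a minimal illustration of why short-hand names are a poor serialization key: the abbreviation "CST" is commonly used for both US Central Standard Time and China Standard Time. The two fixed-offset zones below are constructed by hand with the standard library purely for demonstration.

```python
from datetime import datetime, timedelta, timezone

# Hand-built fixed-offset zones that share the same abbreviation.
us_central = timezone(timedelta(hours=-6), "CST")  # Central Standard Time
china = timezone(timedelta(hours=8), "CST")  # China Standard Time

a = datetime(2021, 1, 1, 12, tzinfo=us_central)
b = datetime(2021, 1, 1, 12, tzinfo=china)

same_name = a.tzname() == b.tzname()  # both report "CST"
same_offset = a.utcoffset() == b.utcoffset()  # but the offsets differ by 14 hours
```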
Successfully merging this pull request may close these issues.

pendulum.DateTime objects now being serialized as python objects with non-standard timezones