Skip to content

Conversation

@shyft-mike
Copy link

This was created to address issues that occurred in a FIPS enabled environment. This allows you to set the CACHING_HASH_METHOD config value to override the default of md5. Also updates the serialized_dag hash code to work even in a FIPS environment.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Jan 11, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But since this is sensitive it needs another approval

@potiuk
Copy link
Member

potiuk commented Jan 29, 2023

Looks good (again). One NIT only to separate out the formatting changes applied to Yaml code to a separate PR.

@potiuk potiuk requested a review from jedcunningham January 29, 2023 08:26
@potiuk
Copy link
Member

potiuk commented Jan 29, 2023

Ah. One more small thing please @shyft-mike if I may - can you please add a newsfragment (misc?) explaining that on Python 3.9+ airflow is now FIPS-compatible when it comes to using hashing algorithms? I think it's worth mentioning and if you add a newsftagment (see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#step-4-prepare-pr for details) it will make its way to release notes in a more prominent way.

@potiuk potiuk requested a review from Taragolis January 29, 2023 08:30
Copy link
Member

@uranusjr uranusjr Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should match the original signature as much as possible, namely

  1. Match the usedforsecurity argument name
  2. Name this module hashlib (removing the _wrapper part)

Also would be a good idea to make usedforsecurity a keyword-only argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think data should not be Any? The stdlib documentation seems to indicate it must be bytes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I can definitely update the typing.
To your 1st point though, it feels wrong to give it the same name as an existing stdlib module. I'd rather be explicit with what we are trying to do. Also, used_for_security felt more pythonic and easier to read. But I am by no means a python expert, so I could be swayed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename the usedforsecurity but wouldn't remove the _wrapper (or rather rename _wrapper to _compat if the wrapper name is not nice. I think this is usually the approach we take and I think there is a value in making sure that we know it's not "the" hashlib library. Might avoid some surprises (you woll only see the difference if you scroll up to the imports) and we might not forget to remove it when we bump min-airflow version to 3.9 (which is still long time from now)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made these suggested changes, leaving _wrapper at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally like _compat over _wrapper.

Fwiw, this isn't our convention (at least looking at existing compat stuff). We have functools, not functools_compat.

I'd rather see

from airflow.compat.hashlib import md5
md5(...)

than

from airflow.compat import hashlib_compat
hashlib_compat.md5(...)

My 2c.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming a submodule with the same name as a top-level module is not a problem (not in Python 3 anyway) since airflow.compat.hashlib and hashlib can never conflict. The _wrapper or _compat suffix is simply redundant. If you really feel strongly about having a suffix (this particular case is slightly different from functools since the md5 function is a wrapper, not a shim), you can choose to put the module in airflow.utils or even directly under airflow (a la airflow.typing_compat) instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I thought I responded. I moved things under utils.

Comment on lines +27 to +34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn’t seem to be necessary; we could use hashlib.new() or even getattr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm unless I'm misunderstanding what you mean, this is needed because the flask_caching config takes a reference to the hashlib function itself. I was also recommended against hashlib.new() since it is much less performant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getattr is the right answer here then 👍

Copy link
Member

@uranusjr uranusjr Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashlib.new() should not be preferred when you know what argument you’re using. hashlib.md5(...) is more performant than hashlib.new("md5", ...), but putting the functions behind a dict and getattr(hashlib, "md5")(...) both negate most of the performance gain.

To put it in another way:

hashlib.md5(...)  # Better.
hashlib.new("md5")(...)
HASH_METHOD_MAPPING = {
    "md5": hashlib.md5,
    ...
}
HASH_METHOD_MAPPING[key]  # Better.

HASH_METHOD_MAPPING = {
    "md5": hashlib.new("md5"),
    ...
}
HASH_METHOD_MAPPING[key]

# Assuming HASH_METHOD_MAPPING[key]

But these three are more or less the same.

HASH_METHOD_MAPPING = {
    "md5": hashlib.md5,
    ...
}
HASH_METHOD_MAPPING[key]

hashlib.new(key)

getattr(hashlib, key)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not planning on changing anything related to this. The mapping gives us a reference to the actual function needed, which hashlib.new() doesn't do, and if I used getattr, I could give a config value of "name", or "new", or "file" and it would just pass that into flask cache. I'd rather just have a whitelist and give a descriptive error message rather than depending on it blowing up flask cache.

Comment on lines +27 to +34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getattr is the right answer here then 👍

@Taragolis Taragolis added the full tests needed We need to run full set of tests for this PR to merge label Feb 18, 2023
@Taragolis
Copy link
Contributor

Taragolis commented Feb 18, 2023

This feature have different behaviour in python 3.9+ and our regular pipeline run in 3.7, so just in case I ran full test

@Taragolis
Copy link
Contributor

@uranusjr @jedcunningham are you satisfied with the changes?

@jasonwashburn
Copy link
Contributor

Looking forward to this update, is there still more to do here? Help needed?

@potiuk
Copy link
Member

potiuk commented Mar 21, 2023

Looking forward to this update, is there still more to do here? Help needed?

Conflicts resolution by @shyft-mike and confirmation from @uranusjr and @jedcunningham

@uranusjr
Copy link
Member

I don’t have further comments but also lack proper knowledge to actually approve, so it’s better to leave this to others.

@potiuk
Copy link
Member

potiuk commented Mar 22, 2023

Happy to approve and merge then as soon as conflicts get resolved @shyft-mike

@potiuk
Copy link
Member

potiuk commented Apr 1, 2023

@shyft-mike ?

@vchiapaikeo
Copy link
Contributor

@eladkal asked me to pick up because it seems like some folks are waiting on it and the PR went stale. I cherry picked and fix merge conflicts in this PR. Not sure where we landed with the getattr vs hashlib.new() vs the dictionary indexing debate. Happy to do what the maintainers agree to but for now, I left @shyft-mike's approach as is.

Moving this conversation to this PR: #30675

@potiuk potiuk closed this Apr 17, 2023
@potiuk
Copy link
Member

potiuk commented Apr 17, 2023

Implemented in #30675

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:serialization area:webserver Webserver related Issues full tests needed We need to run full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants