-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Updated app to support configuring the caching hash method for FIPS #28846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But since this is sensitive it needs another approval
|
Looks good (again). One NIT only to separate out the formatting changes applied to Yaml code to a separate PR. |
|
Ah. One more small thing please @shyft-mike if I may - can you please add a newsfragment (misc?) explaining that on Python 3.9+ airflow is now FIPS-compatible when it comes to using hashing algorithms? I think it's worth mentioning and if you add a newsftagment (see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#step-4-prepare-pr for details) it will make its way to release notes in a more prominent way. |
airflow/compat/hashlib_wrapper.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we should match the original signature as much as possible, namely
- Match the
usedforsecurityargument name - Name this module
hashlib(removing the_wrapperpart)
Also would be a good idea to make usedforsecurity a keyword-only argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think data should not be Any? The stdlib documentation seems to indicate it must be bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I can definitely update the typing.
To your 1st point though, it feels wrong to give it the same name as an existing stdlib module. I'd rather be explicit with what we are trying to do. Also, used_for_security felt more pythonic and easier to read. But I am by no means a python expert, so I could be swayed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename the usedforsecurity but wouldn't remove the _wrapper (or rather rename _wrapper to _compat if the wrapper name is not nice. I think this is usually the approach we take and I think there is a value in making sure that we know it's not "the" hashlib library. Might avoid some surprises (you woll only see the difference if you scroll up to the imports) and we might not forget to remove it when we bump min-airflow version to 3.9 (which is still long time from now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made these suggested changes, leaving _wrapper at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally like _compat over _wrapper.
Fwiw, this isn't our convention (at least looking at existing compat stuff). We have functools, not functools_compat.
I'd rather see
from airflow.compat.hashlib import md5
md5(...)
than
from airflow.compat import hashlib_compat
hashlib_compat.md5(...)
My 2c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming a submodule with the same name as a top-level module is not a problem (not in Python 3 anyway) since airflow.compat.hashlib and hashlib can never conflict. The _wrapper or _compat suffix is simply redundant. If you really feel strongly about having a suffix (this particular case is slightly different from functools since the md5 function is a wrapper, not a shim), you can choose to put the module in airflow.utils or even directly under airflow (a la airflow.typing_compat) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I thought I responded. I moved things under utils.
airflow/www/extensions/init_cache.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn’t seem to be necessary; we could use hashlib.new() or even getattr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm unless I'm misunderstanding what you mean, this is needed because the flask_caching config takes a reference to the hashlib function itself. I was also recommended against hashlib.new() since it is much less performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getattr is the right answer here then 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hashlib.new() should not be preferred when you know what argument you’re using. hashlib.md5(...) is more performant than hashlib.new("md5", ...), but putting the functions behind a dict and getattr(hashlib, "md5")(...) both negate most of the performance gain.
To put it in another way:
hashlib.md5(...) # Better.
hashlib.new("md5")(...)HASH_METHOD_MAPPING = {
"md5": hashlib.md5,
...
}
HASH_METHOD_MAPPING[key] # Better.
HASH_METHOD_MAPPING = {
"md5": hashlib.new("md5"),
...
}
HASH_METHOD_MAPPING[key]
# Assuming HASH_METHOD_MAPPING[key]But these three are more or less the same.
HASH_METHOD_MAPPING = {
"md5": hashlib.md5,
...
}
HASH_METHOD_MAPPING[key]
hashlib.new(key)
getattr(hashlib, key)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not planning on changing anything related to this. The mapping gives us a reference to the actual function needed, which hashlib.new() doesn't do, and if I used getattr, I could give a config value of "name", or "new", or "file" and it would just pass that into flask cache. I'd rather just have a whitelist and give a descriptive error message rather than depending on it blowing up flask cache.
airflow/www/extensions/init_cache.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getattr is the right answer here then 👍
Updated app to support configuring caching hash method Added tests
- Cleaned up app cache code
- Fixed unintended change to config.yml
- Added ignore for mypy
Moved hashlib_wrapper from compat to utils. Used PY39 variable.
|
This feature have different behaviour in python 3.9+ and our regular pipeline run in 3.7, so just in case I ran full test |
|
@uranusjr @jedcunningham are you satisfied with the changes? |
|
Looking forward to this update, is there still more to do here? Help needed? |
Conflicts resolution by @shyft-mike and confirmation from @uranusjr and @jedcunningham |
|
I don’t have further comments but also lack proper knowledge to actually approve, so it’s better to leave this to others. |
|
Happy to approve and merge then as soon as conflicts get resolved @shyft-mike |
|
@eladkal asked me to pick up because it seems like some folks are waiting on it and the PR went stale. I cherry picked and fix merge conflicts in this PR. Not sure where we landed with the Moving this conversation to this PR: #30675 |
|
Implemented in #30675 |
This was created to address issues that occurred in a FIPS enabled environment. This allows you to set the CACHING_HASH_METHOD config value to override the default of md5. Also updates the serialized_dag hash code to work even in a FIPS environment.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in newsfragments.