-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Add encryption methods to serde for primitives #35867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return classname, version, data | ||
|
|
||
|
|
||
| def encrypt(value: P) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if i understand what's going on. To be honest now that I look at this code, i have several doubts about the approach. I am not sur for example why we have iceberg and deltalake in "airflow" and not providers? Not sure what's the reasoning there. Why do we actually need to implement somewhat complex way of encryption.
I am not saying I am against this, but I have hard time commenting it without understanding the context. Maybe we could indeed start the discussion we talked about about general use case for serde, why it is useful, where we want to use it etc. That would likely help to put it in context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I understand, but let's not mix several things here. Serde is being used to serialize XCom values since 2.5 and in that way there is nothing new about it. It was also quite elaborately discussed on the PRs at the time. It was done without putting things into providers. Arguably, it should be the case (as I mentioned elsewhere), but it isn't now.
Now some background on this PR:
XCom exposes values to the UI as you can view those values. This means that everyone having access to the Airflow UI can see what has been shared between tasks, including potential sensitive values - like access keys, secret keys, etc. We limit visibility of this information for connections by encrypting them, here we are doing the same.
Both the datalake table and iceberg serializer already included encryption (see the change), because they store access credentials in their objects. ObjectStoragePath can also keep storage options which potentially includes access key and secret key. Hence, I decided to generalize and standardize the encryption as it is error prone.
A protocol is required, because fernet cannot serialize anything else as bytes and thus loses type information which is required for deserialization. I relied mostly on what pickle does to keep things standard and understood.
So, yes certain serializers could be moved to providers. It would require making the API of serde public. That is not the scope of this PR. And yes, a discussion should happen but I also think that should not hold back this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I REALLY wanted to know about is wjy we are doing it this way. Why encrypting selected fields during serialization and not thw whole serialized form. It seems reasonable to encrypt the resulting serialized value (whole of it) - why do we do it selectively in the first place? What's the use case? How about security properties of it ?
Encruption in general is very easy to get wrong and unsecure and this one seems to introduce our own encryption method which is iikely prone to an easy way of loosing encryption properties.
For example when we try to encrypt indiviidual small values, it's very eeasy to loose encryption propertes. Fernet is a symmetric encryption protocol and we ar not using random seeds to do encryption. This basically means tthat by encrypting 65535 int, you can easily create a dictionary and build a reverse-encryption that will allow you to decrypt all 655356 first int values. That's why almost always when you encrypt something, you encryptt more than jus small pieces of data.
That's why It seem reasonable (also for perofmrnace reasons, not only seceurity properties) to encrypt whole serialized "bllob", rather than individual values. Maybe I simply miss the use case and maybe there is a reason why want to build encryption into internal serialization protocol, risking that we are introducing insecure way of doing things, but welll.... I miss the use case. I do not understand why we want to encrypt individual values here. this is what I am after, If I were to review it.
This is what I am trying to understand - not because I am somehow mean, but because this is is really security. IIf have tomorrow someone who raises a securtiy issue to us, that when we think something is secure, that something is not and can be easily broken.
My security guu Bruce Schneirer's always says if you think you want to do your own security/encryption mechanism because you think you can do better, you are probably wrong. And for me this smells like something we do without fully understanding of all the security properties of.
So why we are doing it. Can we get to the basica and explain why are we trying to encrypt individual fields / primitives and not the whole serialized blob? Have we thoght about security consequences of it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes - I do not know too much about serde. I have not been paying too much attention to it so far - but when I am asked to review it, my natural instict is try to understand what we are doing and why - and when I started looking more closely - and I see things like iceberg being part of the "core" of airflow where we have "providers" dedicated for years to keep things dedicated to integration of airflow, and serializing primitives (which seems quite wrong from the security properties point of view) my natural instict is to try to understand why we are doing it all this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And just to add a bit of more context here... I woudl like to become serde expert.
We seem to have a bit shortage of these - I think. Very recently it turned out that "likely" we've done something wrong when reviewing providers code that was supposed to use serde even if we did not understand it (I still don't see why we would use it even after carefully reading #35885 - which is BTW. an excellent description of what serde does and how it can be used and we should get it in soon - with the caveat of not being part of airflow "user" documentatiom, more "collaborator/contributor" as I commented there.
So my goal is to help to make serde first-class citizen and really used - but in order to do that I need to understand
it - i am missing a lot of "whys" - i.e. context. alternatives, why we selected certain ways of doing it and whether we fully understand consequences of what we are doing - especially that what was simple "Seriialize object to bytes" now starts getting security "properties" and evidently we want it to be used everywhere.
Again - I am not saying it's wrong, I am saying - I do not understand it , and I would like to - especially if I am to review the code.
We can do it in a number of ways. One of the ways is to discuss every single PR and get back-forth discussions like this where I might be asking possibly stupid questions. Another way (and I precticed it - for example - in Breeze/CI where I brought more people in and what we have is set of Architecture Decision Records https://github.com/apache/airflow/tree/main/dev/breeze/doc/adr - where I can easily send anyone to if they want to know answers why we have made certain decisions - with "context", "decision" and "consequences" put together.
Of course it is not read on a regular basis, but at least anyone can refer to it and find out why some important decisions have been mande and - more importantly - we can question our past individual decision and modify them - by knowing why we made the decisions, what were the decisions and consequences. Very useful especially when there are new people and we want them to be able to make reviews or potentially revert/question/change decisions made in the past because we have new circumstances.
I am not saying we need exactly this format, but I think IF we want serde really used, we need to understand some of the decisions made there - especially if we are going to add security to the mix and likely in a form that is approachable by anyone (like me) who had not participated in past decisions and discussions.
I want to understand, that's it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll reach out separately on the understanding part.
For this PR:
No, there is no introduction of a separate encryption. It works with fernet exactly as the other places in airflow. It uses a protocol to store the type information because using fernet encryption that is lost or cannot be done (ie. Direct encryption of an int results in a tyoeerror).
It is done selectively because you might want to see e.g. the "conn_id" in the ui if you need to debug. Just like how we encrypt sensitive information of connections but not all information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the explanation now from the discussion in devlist.
Following on that - should not the decrypt/encrypt be a generic utility in this case rather than part of serde? I see these two methods as part of general functionality exposed to any kind of utilities - not only serde?
What got me confused was that encrypt/decrypt was part of serde, which what made me think this is something only for serde but I see it now as "encrypt/decrypt" util used by particular, dedicated serializers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we could do that. I understand that confusion and it was a point of controversy for myself. It was tied to what was needed to 'store' correctly, but you are right it isn't truly part of serde. I'll split it out.
I think you are right to research what it does particularly with 'simple' fields like int.
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
Several serialized object could expose sensitive information in the UI
through XCom values. This generalizes how encryption works,
including versioning of the protocol.
This is a prerequisite for: #35820
cc @eladkal @Taragolis @potiuk
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in newsfragments.