
Conversation

@hussein-awala
Member

related: #36492

This PR encrypts all the string attributes in the trigger instead of encrypting only the ones that start with the encrypted__ prefix; it also encrypts the string values stored in dicts, lists, tuples, or any nested combination of these types.

cc: @potiuk @jscheffl @dstandish @ephraimbuddy

Comment on lines +720 to +733
def _decrypt(_value: Any) -> Any:
    # Decrypt plain strings and recurse into dicts, lists and tuples so that
    # every nested string value is decrypted; other types are returned as-is.
    if isinstance(_value, str):
        return fernet.decrypt(_value.encode("utf-8")).decode("utf-8")
    if isinstance(_value, dict):
        return {k: _decrypt(v) for k, v in _value.items()}
    if isinstance(_value, list):
        return [_decrypt(v) for v in _value]
    if isinstance(_value, tuple):
        return tuple(_decrypt(v) for v in _value)
    return _value

decrypted_kwargs = {}
for key, value in trigger_row.kwargs.items():
    decrypted_kwargs[key] = _decrypt(value)
Member Author

Why not all the attributes?

Fernet can encrypt/decrypt only bytes; it would be complicated to convert all the other types to bytes (without a 3rd party lib like pickle), and we cannot simply convert the dict to a string because there is no guarantee that it's JSON serializable.
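
For reference, the encryption side of this approach would be symmetric. A minimal sketch of what the matching helper could look like (the helper name and the example kwargs are assumptions, not taken from the PR diff):

from typing import Any

from airflow.models.crypto import get_fernet

fernet = get_fernet()


def _encrypt(_value: Any) -> Any:
    # Mirror image of _decrypt above: encrypt plain strings and recurse into
    # dicts, lists and tuples; any other type is stored unchanged.
    if isinstance(_value, str):
        return fernet.encrypt(_value.encode("utf-8")).decode("utf-8")
    if isinstance(_value, dict):
        return {k: _encrypt(v) for k, v in _value.items()}
    if isinstance(_value, list):
        return [_encrypt(v) for v in _value]
    if isinstance(_value, tuple):
        return tuple(_encrypt(v) for v in _value)
    return _value


# Example: only the string values end up encrypted; the int stays in plain form.
encrypted_kwargs = {k: _encrypt(v) for k, v in {"token": "secret", "poll_interval": 30}.items()}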

@potiuk potiuk Jan 15, 2024
Member

Hmm. I am not sure. That will complicate rotation of the fernet key.

Actually, if you look at the kwargs definition in the Trigger model, this field is ExtendedJSON, which already serializes whatever is passed to it as a string (and this is how it stores it).

So what we would need to do, I think, is modify ExtendedJSON to encrypt/decrypt the serialized string just before saving / right after retrieving it. There, fernet key rotation could also be implemented much faster (without having to deserialize the data - it would just need to be re-encrypted). This ExtendedJSON field type could also handle migration from a previous Airflow version by simply using the original string if decryption fails.
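
A minimal sketch of that idea, assuming a SQLAlchemy TypeDecorator that encrypts the serialized payload on write and decrypts it on read (the class and helper names are invented, and plain json stands in for Airflow's own serialization; this is not the actual implementation):

import json

from cryptography.fernet import InvalidToken
from sqlalchemy.types import Text, TypeDecorator

from airflow.models.crypto import get_fernet


class EncryptedExtendedJSON(TypeDecorator):
    """Store a JSON-serializable value as a single fernet-encrypted string."""

    impl = Text
    cache_ok = True

    def process_bind_param(self, value, dialect):
        if value is None:
            return None
        # Serialize first, then encrypt the whole blob in one operation.
        serialized = json.dumps(value)
        return get_fernet().encrypt(serialized.encode("utf-8")).decode("utf-8")

    def process_result_value(self, value, dialect):
        if value is None:
            return None
        try:
            decrypted = get_fernet().decrypt(value.encode("utf-8")).decode("utf-8")
        except InvalidToken:
            # Row written by a previous Airflow version (or without encryption):
            # fall back to the stored plain-text JSON.
            decrypted = value
        return json.loads(decrypted)

Key rotation could then re-encrypt the stored string directly, without deserializing the payload at all.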

Member

Also one more thing here - encrypting all strings vs. the whole blob might be a performance hit, and it will produce strange results if the strings are small. I had a discussion about it with @bolkedebruin: if you have quite a number of small strings with predictable values encrypted, there might be ways this can be utilised to guess the encryption keys (rainbow tables and the like). In this case it's not likely to be possible because a fernet key is generally a rather long random byte array, but I think a good security practice is to encrypt the whole blob rather than individual fields within it (especially if some of those fields are small and predictable).

I do not have an exact attack in mind - it's just an intelligent guess/intuition that in this case encrypting all kwargs fields serialized as a single blob will have much better performance and security properties.

Member Author

Yeah, I agree with the performance argument; I'll try to implement it that way.


Triggers are serialized and stored in the database, so they can be re-instantiated on any triggerer process. This means that any sensitive information you pass to a trigger will be stored in the database.
If you want to pass sensitive information to a trigger, you can encrypt it before passing it to the trigger and decrypt it inside the trigger, or add the ``encrypted__`` prefix to the argument name in the ``serialize`` method; Airflow will then automatically encrypt the argument before storing it in the database and decrypt it when it is read from the database.
Since Airflow 2.8.1, all string values in the serialized trigger are encrypted before being stored in the database and decrypted when they are read from the database. This means that sensitive information is no longer stored in plain text in the database.
Member Author

@ephraimbuddy in case you want to move it to 2.9.0, we need to commit this suggestion before merging.

Suggested change
Since Airflow 2.8.1, all string values in the serialized trigger are encrypted before being stored in the database and decrypted when they are read from the database. This means that sensitive information is no longer stored in plain text in the database.
Since Airflow 2.9.0, all string values in the serialized trigger are encrypted before being stored in the database and decrypted when they are read from the database. This means that sensitive information is no longer stored in plain text in the database.
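
As an illustration of the first option mentioned in the documentation excerpt above (encrypt before passing, decrypt inside the trigger), here is a minimal sketch; the trigger class, module path and argument names are invented, and the key handling is deliberately simplified:

from __future__ import annotations

from typing import Any, AsyncIterator

from cryptography.fernet import Fernet

from airflow.triggers.base import BaseTrigger, TriggerEvent

# In practice this key would come from a secrets backend reachable by both the
# worker that defers the task and the triggerer; it is generated inline only to
# keep the sketch self-contained.
USER_MANAGED_KEY = Fernet.generate_key()


class MyServiceTrigger(BaseTrigger):
    """Hypothetical trigger that receives an already-encrypted token."""

    def __init__(self, endpoint: str, encrypted_token: str):
        super().__init__()
        self.endpoint = endpoint
        self.encrypted_token = encrypted_token

    def serialize(self) -> tuple[str, dict[str, Any]]:
        # Only the ciphertext ever reaches the trigger table.
        return (
            "my_provider.triggers.my_service.MyServiceTrigger",
            {"endpoint": self.endpoint, "encrypted_token": self.encrypted_token},
        )

    async def run(self) -> AsyncIterator[TriggerEvent]:
        # Decrypt only inside the triggerer process, right before use.
        token = Fernet(USER_MANAGED_KEY).decrypt(self.encrypted_token.encode()).decode()
        yield TriggerEvent({"status": "connected", "token_length": len(token)})


# In the operator, encrypt before handing the value to the trigger:
encrypted = Fernet(USER_MANAGED_KEY).encrypt(b"my-secret-token").decode()
trigger = MyServiceTrigger(endpoint="https://example.com", encrypted_token=encrypted)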

@potiuk
Member

potiuk commented Jan 15, 2024

Two things:

  1. we need to figure out what to do with plain-text data already stored in the DB. It might well be that someone migrates to 2.8.1 while some triggers are running.
  2. rotate_fernet_key should also be handled (see the sketch below).
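
For the second point, a minimal sketch of what blob-level rotation could look like, assuming the trigger kwargs were stored as a single fernet-encrypted string (they are not stored that way at the time of this PR, so the attribute usage here is purely illustrative):

from sqlalchemy.orm import Session

from airflow.models.crypto import get_fernet
from airflow.models.trigger import Trigger


def rotate_trigger_fernet_keys(session: Session) -> None:
    # Re-encrypt every stored trigger payload with the newest fernet key.
    # MultiFernet.rotate() decrypts with any configured key and re-encrypts
    # with the primary one, so the kwargs never need to be deserialized.
    # Assumes a fernet key is configured and kwargs holds the encrypted string.
    fernet = get_fernet()
    for trigger in session.query(Trigger):
        trigger.kwargs = fernet.rotate(trigger.kwargs.encode("utf-8")).decode("utf-8")
    session.commit()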

@jscheffl jscheffl left a comment
Contributor

Two things that are not 100% clear to me:

  1. Where should the key rotation mentioned by @potiuk be handled?
  2. What about existing triggers if Airflow is upgraded - will running/open tasks deferred on triggers fail after the upgrade because the previously stored data is not encrypted?

@jscheffl
Contributor

Two things:

1. we need to figure out what to do with plain-text data already stored in the DB. It might well be that someone migrates to 2.8.1 while some triggers are running.

2. rotate_fernet_key should also be handled.

Haha, hitting the same two things in parallel :-D

@hussein-awala
Member Author

  1. we need to figure out what to do with plain-text data already stored in the DB. It might well be that someone migrates to 2.8.1 while some triggers are running.

Given the doc you created about the upgrade process, I think the safest solution is to keep this improvement for 2.9.0, since we recommend shutting down the Airflow components when upgrading to a different minor version.

  2. rotate_fernet_key should also be handled.

I will check it.

@potiuk potiuk left a comment
Member

I have a feeling we should do it more "calmly" for 2.8.2.

Also, I think if we do this automated encryption, it's not a feature - merely an implementation detail - whereas the encrypted__ prefix was a new feature.

Why? Because if any Trigger wanted to store things with the encrypted__ prefix and rely on them being, well, encrypted, it would have to declare (somehow) a >= VERSION_OF_AIRFLOW_WHERE_IT_WAS_RELEASED dependency.

If we encrypt by default, then all Triggers will use it automatically and this becomes an implementation detail (and essentially a bugfix).

@potiuk
Member

potiuk commented Jan 15, 2024

Given the doc you created about the upgrade process, I think the safest solution is to keep this improvement for 2.9.0, since we recommend shutting down the Airflow components when upgrading to a different minor version.

See my comments above. I think we can have both - target 2.8.2 and handle the migration transparently with relative ease - if we implement it at the ExtendedJSON level.

Member Author

I think this can solve the migration problem

@hussein-awala
Member Author

Given the doc you created about the upgrade process, I think the safest solution is to keep this improvement for 2.9.0, since we recommend shutting down the Airflow components when upgrading to a different minor version.

See my comments above. I think we can have both - target 2.8.2 and handle the migration transparently with relative ease - if we implement it at the ExtendedJSON level.

Yes, let's keep it for 2.8.2. I will try to implement full-dict encryption to improve the performance, and for the migration, could you check the migration script? (I will update it after updating the encryption method.)

@Taragolis
Contributor

What if we encrypt the entire payload, the same way as we did for Connection Extra? Doing that would require converting the existing field type in the DB from JSON (or text, depending on the DB) to a binary type, and this could be done during a migration, with conversion/encryption of the existing payloads.

@potiuk
Member

potiuk commented Jan 16, 2024

What if we encrypt the entire payload, the same way as we did for Connection Extra? Doing that would require converting the existing field type in the DB from JSON (or text, depending on the DB) to a binary type, and this could be done during a migration, with conversion/encryption of the existing payloads.

Quite agree. I also thought the same - see my comment #36801 (comment) and our earlier discussion with @jscheffl in #36492 (comment).

First of all, I think there is indeed no particular reason to use the JSON type here. We are using the ExtendedJSON type, which implements Airflow serialization, but that is weird on its own: our serialization serializes more than JSON, and it uses a rather proprietary JSON that is generally not suitable as a JSON source of data. The way it is implemented (and the fact that it uses JSON) is merely an implementation detail, so using a JSON field to store it makes very little sense: as @jscheffl mentioned in his comments, the data anyhow has to be deserialized by our Python code to be "useful", so we generally lose any DB-level support to read and query that JSON data. We are not relying anywhere on the fact that it is a JSON field, so IMHO we should convert it to a simple TEXT or even BLOB field.

Secondly, the security/performance properties - let me repeat what I wrote above. I have no hard data, but intuitively it's much more secure and faster to encrypt the whole blob rather than multiple individual fields in it. It likely takes a lot more memory and CPU (and entropy, which is also an important performance factor for encryption) to convert and encrypt multiple individual (sometimes short) strings in a data structure than to encrypt the whole string representation in a single operation. Especially if you consider the usual padding for short strings, and the fact that encrypting many short, predictable strings generally weakens their security properties (think dictionary/rainbow attacks), it makes much more sense to encrypt the whole blob - especially if you have no need or possibility to read the remaining unencrypted fields anyway.

I made the same comments in #35867 by @bolkedebruin, who wants to add encryption for serde primitives, which I think is generally a bad idea; we should instead encrypt the whole serialized blob. It's faster, more secure, and we could even add a nice tool that decrypts it for diagnostic purposes if needed.

And most of all, it is far more future-proof, because you do not have to rely on the individuals who write the serialization to mark all the fields that need to be encrypted (that last point is somewhat mitigated by @hussein-awala here by encrypting all strings, but I think it solves just a subset of the problem).
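
To make the performance point concrete, here is a rough, illustrative micro-benchmark (plain Fernet with a throwaway key; the payload size and iteration counts are arbitrary, and this is not Airflow code):

import json
import timeit

from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())
kwargs = {f"param_{i}": f"value_{i}" for i in range(200)}


def encrypt_each_string():
    # One fernet operation per value, as in the per-string approach.
    return {k: fernet.encrypt(v.encode("utf-8")).decode("utf-8") for k, v in kwargs.items()}


def encrypt_whole_blob():
    # A single fernet operation over the serialized payload.
    return fernet.encrypt(json.dumps(kwargs).encode("utf-8")).decode("utf-8")


print("per-string:", timeit.timeit(encrypt_each_string, number=100))
print("whole blob:", timeit.timeit(encrypt_whole_blob, number=100))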

@github-actions

github-actions bot commented Mar 2, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Mar 2, 2024
hussein-awala removed the stale label on Mar 3, 2024
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Apr 18, 2024
@hussein-awala
Member Author

Closed in favor of #38233


Labels

area:providers, area:Scheduler, area:Triggerer, kind:documentation, provider:amazon, stale
