Describe the bug
If several applications/workers share the same key ring, e.g. using the Redis provider, the DataProtection keys of the various instances get out of sync if the backing storage is purged (e.g. by a Redis server restart) and one of the workers is restarted (or another instance is started).
We had this issue happen on a standard Azure hosting setup using a Redis server and a web app with automatic scaling. The Redis server lost its cache entries, probably because of a maintenance restart. The running instances were unaffected, as they use the key ring loaded on startup and the keys were not due to expire for a long time. However, when a new instance was started, it did not find the key ring in the Redis cache and created a new key, which was activated immediately. Data encrypted by this new instance could then no longer be decrypted by the previous instances, causing massive failures on the production system ("Key {...} was not found in the key ring").
I understand that purging the key ring is not a directly "supported" scenario and that at the very least it will cause some encrypted data to be permanently lost. However, I do believe that the framework should be robust enough to recover from this situation within a reasonable amount of time, especially when it can occur in a very common hosting setup. The only way to get our instances back in sync was to restart all instances manually. Otherwise the issue would probably have persisted for several days, until either the original keys had expired or all instances had been recycled for other reasons.
To Reproduce
Steps to reproduce the behavior:
- Using the current version of ASP.NET Core
- Run multiple instances of an application using Redis to synchronize data protection keys
- Restart the Redis server or otherwise clear the keys
- Start another instance of the application
- Data protected by the new instance is no longer readable by the old instances
Expected behavior
Errors may occur, but a recovery mechanism is in place that automatically reloads the key ring if necessary.
I see that similar issues were discussed in issue #3975 and that code was added to allow automatic key refresh in the first minutes of an application's lifetime. However, this addition is not enough to fix the problem described here, as it does not occur during application startup. IIRC there were also problems described in the referenced issue just after key expiration, which might also not be addressed by the original fix.
As I understand it, we do not want to simply refresh the key ring whenever an unknown key is found, as this could open the door to a DoS attack: an attacker could send a lot of unknown keys, forcing the DP framework to constantly go to the backing store.
My proposal is to use a sliding window approach to mitigate this concern. The first time we encounter an unknown key, we refresh the key ring from the backing store and then block additional refreshes for a certain amount of time (e.g. 5 minutes). This would fix this and related issues and still prevent DoS attacks.
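To make the proposal concrete, here is a minimal, language-neutral sketch in Python. It is not the DataProtection implementation: `ThrottledKeyRing`, `load_keys`, `cooldown_seconds` and `clock` are hypothetical names chosen for illustration, and the key ring is modeled as a plain dictionary from key id to key material.

```python
import threading
import time
from typing import Callable, Dict, Optional


class ThrottledKeyRing:
    """Hypothetical cache that reloads the key ring when it sees an unknown
    key id, but at most once per cooldown period."""

    def __init__(
        self,
        load_keys: Callable[[], Dict[str, bytes]],  # assumed: reads the backing store
        cooldown_seconds: float = 300.0,            # e.g. 5 minutes, as proposed
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_keys = load_keys
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._keys = load_keys()  # initial load at startup
        self._last_forced_refresh: Optional[float] = None

    def get_key(self, key_id: str) -> Optional[bytes]:
        with self._lock:
            key = self._keys.get(key_id)
            if key is not None:
                return key

            # Unknown key: refresh only if we are outside the cooldown window.
            now = self._clock()
            if (
                self._last_forced_refresh is not None
                and now - self._last_forced_refresh < self._cooldown
            ):
                return None  # still blocked; do not touch the backing store

            self._last_forced_refresh = now
            self._keys = self._load_keys()
            return self._keys.get(key_id)
```

With this shape, a flood of requests carrying unknown key ids triggers at most one backing-store read per cooldown period, while an instance that is genuinely out of sync picks up new keys no later than one cooldown period after they appear.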
I would be willing to work on a PR if the overall idea is acceptable to you.