What did you do?
I was looking at the Helm operator manager sync code, where it determines whether a release is installed: https://github.com/operator-framework/operator-sdk/blob/v1.2.0/internal/helm/release/manager.go#L113-#L122
It relies on an installed Helm release always having a release in the deployed state available in storage.
However, looking at the Helm code, that does not appear to hold in at least one case. During a Helm upgrade, the existing release seems to be moved into another state before the new deployed state is written, which creates a very small timing window in which the Helm operator may incorrectly determine that the release is not installed:
https://github.com/helm/helm/blob/v3.4.1/pkg/action/upgrade.go#L345-#L346
https://github.com/helm/helm/blob/v3.4.1/pkg/action/upgrade.go#L137
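To illustrate the window, here is a minimal self-contained sketch (the status names and `isInstalled` helper are illustrative, not Helm's or the operator's actual identifiers): if the old revision is already superseded and the new one is still pending, a check that only looks for a deployed revision finds nothing.

```go
package main

import "fmt"

// Status models release states; the names mirror Helm's storage
// semantics but are illustrative constants, not Helm's own API.
type Status string

const (
	StatusDeployed       Status = "deployed"
	StatusPendingUpgrade Status = "pending-upgrade"
	StatusSuperseded     Status = "superseded"
)

// Release is a stripped-down stand-in for a stored release revision.
type Release struct {
	Name    string
	Version int
	Status  Status
}

// isInstalled mimics a check that only considers a revision in the
// deployed state, as the manager's sync code appears to do.
func isInstalled(history []Release) bool {
	for _, r := range history {
		if r.Status == StatusDeployed {
			return true
		}
	}
	return false
}

func main() {
	// Mid-upgrade window: the old revision is already superseded and the
	// new revision is still pending-upgrade, so no revision is deployed.
	history := []Release{
		{Name: "app", Version: 1, Status: StatusSuperseded},
		{Name: "app", Version: 2, Status: StatusPendingUpgrade},
	}
	fmt.Println(isInstalled(history)) // prints "false": release looks uninstalled
}
```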
The consequence appears to be that when the Helm operator thinks the release is not installed, it attempts an install; if that install fails, it performs an uninstall as rollback, which can cause data loss even if the install is recovered in the next reconcile:
https://github.com/operator-framework/operator-sdk/blob/v1.2.0/internal/helm/controller/reconcile.go#L174-#L179
https://github.com/operator-framework/operator-sdk/blob/v1.2.0/internal/helm/release/manager.go#L170-#L175
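A toy model of that failure path (the `fakeCluster` type and its methods are hypothetical stand-ins for Helm storage, not operator-sdk code): installing over an existing release fails, and the rollback uninstall then deletes the release that was already there.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCluster is a hypothetical stand-in for Helm release storage.
type fakeCluster struct {
	releases map[string]bool
}

// install fails when a release with the same name already exists,
// mirroring Helm's refusal to reuse an in-use release name.
func (c *fakeCluster) install(name string) error {
	if c.releases[name] {
		return errors.New("cannot re-use a name that is still in use")
	}
	c.releases[name] = true
	return nil
}

// uninstall removes the release and, with it, its resources.
func (c *fakeCluster) uninstall(name string) {
	delete(c.releases, name)
}

// reconcileInstall sketches the operator's flow: on install failure it
// uninstalls as rollback, which is destructive when a release already existed.
func reconcileInstall(c *fakeCluster, name string) error {
	if err := c.install(name); err != nil {
		c.uninstall(name) // rollback deletes the pre-existing release too
		return err
	}
	return nil
}

func main() {
	c := &fakeCluster{releases: map[string]bool{"app": true}}
	err := reconcileInstall(c, "app") // mistaken "not installed" decision
	fmt.Println(err, c.releases["app"])
}
```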
What did you expect to see?
When a Helm release is installed, the Helm operator should correctly determine that it is installed, or at least double-check before performing the install -> uninstall rollback.
What did you see instead?
In some extremely rare cases, the Helm operator incorrectly determines that an already-installed release is not installed.
Environment
Operator type:
/language helm
Kubernetes cluster type:
Kind
$ operator-sdk version
Master branch
Possible Solution
Could we double-check against the CR status and see whether deployedRelease is populated, to verify that the release is indeed not installed?
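The proposed guard could look something like this sketch (the `shouldInstall` function and its parameters are hypothetical, not existing operator-sdk API): only treat the release as missing when both Helm storage and the CR status agree.

```go
package main

import "fmt"

// shouldInstall is a hypothetical guard: install only when neither Helm
// storage nor the CR status records an existing deployed release.
func shouldInstall(deployedFoundInStorage, crStatusHasDeployedRelease bool) bool {
	return !deployedFoundInStorage && !crStatusHasDeployedRelease
}

func main() {
	// Mid-upgrade window: storage shows no deployed revision, but the CR
	// status still records a deployedRelease, so the install is skipped.
	fmt.Println(shouldInstall(false, true)) // prints "false": do not install

	// Genuinely not installed: neither source records a deployed release.
	fmt.Println(shouldInstall(false, false)) // prints "true": safe to install
}
```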