fix(controller): Use node.Name instead of node.DisplayName for onExit nodes #5486
Signed-off-by: Simon Behar <simbeh7@gmail.com>
  completed = false
  } else if childNode.Completed() {
- hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, step.Name, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
+ hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
`parentDisplayName` is not unique, so nodes that share a display name (e.g. when using `withParam`) generate onExit nodes with the same name. Once the first onExit node has been created, any later onExit node requested by a different parent node with the same display name fails to be created because a node with that name already exists. Use `parentNodeName` instead, which is always unique.
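To make the collision concrete, here is a minimal, self-contained sketch. It is not code from this PR: the `node` type, the sample node names, and the `"<parent>.onExit"` naming convention in `onExitName` are assumptions made for illustration only.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for a workflow node.
type node struct {
	Name        string // full path, assumed unique within a workflow
	DisplayName string // short name, may repeat across expanded parents
}

// onExitName mirrors an assumed "<parent>.onExit" naming convention.
func onExitName(parent string) string {
	return parent + ".onExit"
}

// countDistinct returns how many distinct onExit names are produced
// when the parent key is derived from each node by the key function.
func countDistinct(nodes []node, key func(node) string) int {
	seen := map[string]bool{}
	for _, n := range nodes {
		seen[onExitName(key(n))] = true
	}
	return len(seen)
}

// sampleNodes returns two illustrative nodes created by different expanded
// parents; their full names differ but their display names are identical.
func sampleNodes() []node {
	return []node{
		{Name: "wf[0].outer(0:a)[0].inner", DisplayName: "inner"},
		{Name: "wf[0].outer(1:b)[0].inner", DisplayName: "inner"},
	}
}

func main() {
	nodes := sampleNodes()
	byDisplay := countDistinct(nodes, func(n node) string { return n.DisplayName })
	byName := countDistinct(nodes, func(n node) string { return n.Name })
	fmt.Println("distinct onExit names keyed by DisplayName:", byDisplay)
	fmt.Println("distinct onExit names keyed by Name:", byName)
}
```

Keyed by display name, both parents map to a single onExit name, so the second creation would clash with the first; keyed by full name, each parent gets its own.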
Codecov Report
@@ Coverage Diff @@
## master #5486 +/- ##
==========================================
+ Coverage 46.54% 46.74% +0.20%
==========================================
Files 240 240
Lines 15001 15004 +3
==========================================
+ Hits 6982 7014 +32
+ Misses 7119 7091 -28
+ Partials 900 899 -1
|
|
What happens to in-flight workflows at the time of upgrade? Do they end up with 2x on-exit nodes? |
Yes, there is a window of time in which onExit nodes could be duplicated if an upgrade happens while workflows are running on the cluster. Note that this window only lasts from the time an onExit node with the old name is created until the time its boundary node (Step Group, DAG) completes. Once the boundary node completes, we won't recurse down to the onExit node again. I don't believe we make guarantees on the consistency of a running workflow if there is a version upgrade while it's running. |
|
I don't think many users expect an upgrade to typically break workflows. Is there an alternative you can see? |
@alexec I don't believe so. This is a narrow issue: are we using the right name or not? We are currently not using the right name, which means there are bugs in the code (see #5486 (comment)). If it's critical that no running workflows are broken during the upgrade, I could introduce some code that:
We'd have to keep this scaffold code in the codebase for a couple of versions to ensure that users upgrading and skipping some patch versions (e.g. |
|
Also, perhaps we could include a notice in the release notes to advise users not to upgrade with running workflows, and we can include that notice whenever applicable. It seems to me that we must have already made similar bug fixes in the past that would have broken running nodes. This could be a good middle ground between users not seeing unexpected, one-time issues during upgrades and us being able to fix bugs that currently exist in the codebase. |
|
Decision: add scaffolding code |
|
|
||
| // createRunningPods creates the pods that are marked as running in a given test so that they can be accessed by the | ||
| // pod assessor | ||
| func createRunningPods(ctx context.Context, woc *wfOperationCtx, with ...with) { |
Useful utility testing function to create pods in the client that are supposed to be running according to the test workflow status.
There was a problem hiding this comment.
Only the with parameter was unused. Removed.
| // Previously we used `depNode.DisplayName` to generate all onExit node names. However, as these can be non-unique | ||
| // we transitioned to using `depNode.Name` instead, which are guaranteed to be unique. In order to not disrupt | ||
| // running workflows during upgrade time, we do an additional check to see if there is an onExit node with the old | ||
| // name (`depNode.DisplayName`). | ||
| // TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
| // See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
| legacyOnExitNodeName := common.GenerateOnExitNodeName(depNode.DisplayName) | ||
| if onExitNode := d.wf.GetNodeByName(legacyOnExitNodeName); onExitNode != nil && d.wf.GetNodeByName(depNode.Name).HasChild(onExitNode.ID) { |
This is tested by TestOnExitDAGStatusCompatibility in this PR
| } | ||
|
|
||
| const testOnExitNameBackwardsCompatibility = ` | ||
| apiVersion: argoproj.io/v1alpha1 |
I've already ensured these test workflows are minimal.
| // Previously we used `parentDisplayName` to generate all onExit node names. However, as these can be non-unique | ||
| // we transitioned to using `parentNodeName` instead, which are guaranteed to be unique. In order to not disrupt | ||
| // running workflows during upgrade time, we first check if there is an onExit node that currently exists with the | ||
| // legacy name AND said node is a child of the parent node. If it does, we continue execution with the legacy name. | ||
| // If it doesn't, we use the new (and unique) name for all operations henceforth. | ||
| // TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
| // When the scaffold code is removed, we should only have the following: | ||
| // | ||
| // onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
| // | ||
| // See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
| onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
| legacyOnExitNodeName := common.GenerateOnExitNodeName(parentDisplayName) | ||
| if legacyNameNode := woc.wf.GetNodeByName(legacyOnExitNodeName); legacyNameNode != nil && woc.wf.GetNodeByName(parentNodeName).HasChild(legacyNameNode.ID) { | ||
| onExitNodeName = legacyOnExitNodeName | ||
| } |
This is tested by TestOnExitNameBackwardsCompatibility in this PR
There was a problem hiding this comment.
the block of code looks to be duplicated - maybe create a utility func?
There was a problem hiding this comment.
I had a utility function before, but once all the parameters were taken into account it didn't really reduce complexity, so I decided to keep the two blocks as they are.
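For readers following the thread, the fallback rule described in the code comment above can be sketched in isolation. This is an illustrative sketch only, not the helper that was considered in this PR: the `nodeStatus` and `workflow` types, the `resolveOnExitNodeName` function, and the `"<parent>.onExit"` naming are hypothetical stand-ins. Unlike the snippet under review, the sketch also guards against the parent lookup returning nil.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, stripped-down stand-in for a workflow node.
type nodeStatus struct {
	ID       string
	Name     string
	Children []string
}

// hasChild reports whether childID is listed among the node's children.
func (n nodeStatus) hasChild(childID string) bool {
	for _, id := range n.Children {
		if id == childID {
			return true
		}
	}
	return false
}

// workflow holds nodes keyed by ID (an assumption for this sketch).
type workflow struct {
	nodes map[string]nodeStatus
}

// getNodeByName scans the nodes by name and returns nil when none matches.
func (w workflow) getNodeByName(name string) *nodeStatus {
	for _, n := range w.nodes {
		if n.Name == name {
			found := n
			return &found
		}
	}
	return nil
}

// resolveOnExitNodeName keeps the legacy (display-name based) onExit name only
// when a node with that name exists AND is a child of the given parent;
// otherwise it returns the new, unique (node-name based) name.
func resolveOnExitNodeName(w workflow, parentNodeName, parentDisplayName string) string {
	newName := parentNodeName + ".onExit"
	legacyName := parentDisplayName + ".onExit"
	legacy := w.getNodeByName(legacyName)
	parent := w.getNodeByName(parentNodeName)
	if legacy != nil && parent != nil && parent.hasChild(legacy.ID) {
		return legacyName
	}
	return newName
}

func main() {
	// In-flight workflow: the legacy onExit node already hangs off the parent.
	inFlight := workflow{nodes: map[string]nodeStatus{
		"id-1": {ID: "id-1", Name: "wf[0].outer(0:a)[0].inner", Children: []string{"id-2"}},
		"id-2": {ID: "id-2", Name: "inner.onExit"},
	}}
	fmt.Println(resolveOnExitNodeName(inFlight, "wf[0].outer(0:a)[0].inner", "inner"))

	// Fresh workflow: no legacy node exists, so the new name is used.
	fresh := workflow{nodes: map[string]nodeStatus{
		"id-1": {ID: "id-1", Name: "wf[0].outer(0:a)[0].inner"},
	}}
	fmt.Println(resolveOnExitNodeName(fresh, "wf[0].outer(0:a)[0].inner", "inner"))
}
```

The child check is what keeps a legacy-named node owned by a different parent from being picked up by mistake.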
| } | ||
|
|
||
| func (n NodeStatus) HasChild(childID string) bool { | ||
| for _, nodeID := range n.Children { |
Working on it right now, will push soon
… nodes (#5486) Signed-off-by: Simon Behar <simbeh7@gmail.com>
Split from #5478