Fix operators showing "Running" state after workflow completion #3463
This PR fixes a bug where the frontend incorrectly shows an operator as still running, even though the backend execution has completed.
Timeline for this bug
This issue was originally identified and fixed in #2411. However, it re-emerged after we migrated our RPC layer to gRPC in #2950. A minor mistake during the migration caused the originally chained future to be overridden by an EmptyResponse as the return value.
Root Cause
The core issue is that the controllerInitiatedQueryStats call returns immediately instead of waiting for the worker's response. Previously, we relied on this call to collect worker stats after execution, and used the updated stats to infer workflow completion. Because the call now returns immediately, the stats are never updated.
Additionally, our current design infers region completion from port completion and workflow completion from region completion. This decouples workflow execution state from the actual state of the workers. As a result, a workflow may be marked as complete while the frontend still shows an operator as running, even though the worker has in fact finished execution.
Example Scenario
Here’s how the issue manifests when the last operator finishes:
1. QueryStats control message is sent to the worker that reported port completion (response pending).
2. QueryStats message (response also pending).
3. QueryStats messages are lost because the worker is already shut down.
The Fix
The solution is straightforward: correctly chain and return the future so that the control message awaits the worker’s response before proceeding.
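To illustrate the difference between a handler that returns immediately and one that returns the chained future, here is a minimal sketch in Python's asyncio. It is not the project's code: the Worker and Controller classes, the method names, and the "EmptyResponse" string are hypothetical stand-ins for the control-message handler described above.

```python
import asyncio


class Worker:
    """Hypothetical worker that answers a stats query after a short delay."""

    async def query_stats(self):
        await asyncio.sleep(0.01)  # simulate the round trip to the worker
        return {"processed": 42}


class Controller:
    """Hypothetical controller that stores the latest worker stats."""

    def __init__(self, worker):
        self.worker = worker
        self.stats = {}
        self.pending = None

    async def _collect(self):
        # Ask the worker for stats and record the reply.
        self.stats = await self.worker.query_stats()

    async def query_stats_buggy(self):
        # Bug pattern: the request is started, but the handler returns
        # right away, so the caller proceeds before the stats arrive.
        self.pending = asyncio.create_task(self._collect())
        return "EmptyResponse"

    async def query_stats_fixed(self):
        # Fix pattern: the handler completes only after the worker replied.
        await self._collect()
        return "EmptyResponse"


async def _run(fixed):
    controller = Controller(Worker())
    handler = controller.query_stats_fixed if fixed else controller.query_stats_buggy
    await handler()
    # Snapshot what the controller knows at the moment the handler returns.
    snapshot = dict(controller.stats)
    if controller.pending is not None:
        await controller.pending  # let the stray task finish cleanly
    return snapshot


def stats_when_handler_returns(fixed):
    return asyncio.run(_run(fixed))
```

In this sketch the buggy handler returns while the stats are still empty, whereas the fixed handler returns only once the worker's reply has been recorded. That is the ordering the real fix restores by returning the chained future.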
How the fix is verified
The bug is not always reproducible, because the execution is sometimes killed only after the states have been updated. I therefore inspected the WebSocket event sequence. In the correct behavior, the workflow state is set to complete only after all the stats updates.
Before the fix:

After the fix:

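The ordering check used for verification could be automated along the following lines. This is only a sketch: the event shape (dictionaries with "type" and "value" keys) and the type names "state" and "stats" are assumptions made for illustration, not the actual WebSocket message format.

```python
def completes_after_all_stats(events):
    """Return True if the 'complete' state event follows every stats update.

    `events` is a list of dicts in arrival order (hypothetical shape).
    """
    complete_idx = None
    last_stats_idx = -1
    for i, event in enumerate(events):
        if event["type"] == "state" and event.get("value") == "complete":
            if complete_idx is None:
                complete_idx = i  # remember the first completion event
        elif event["type"] == "stats":
            last_stats_idx = i  # remember the most recent stats update
    return complete_idx is not None and complete_idx > last_stats_idx


# Hypothetical sequences: stats arriving after completion vs. before it.
bad_sequence = [
    {"type": "stats"},
    {"type": "state", "value": "complete"},
    {"type": "stats"},
]
good_sequence = [
    {"type": "stats"},
    {"type": "stats"},
    {"type": "state", "value": "complete"},
]
```

A captured event log that fails this check would correspond to the buggy ordering, where a stats update arrives (or is lost) after the workflow has already been marked complete.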