Skip to content

Conversation

@shengquan-ni
Copy link
Contributor

@shengquan-ni shengquan-ni commented Jun 5, 2025

This PR fixes a bug where the frontend incorrectly shows an operator as still running, even though the backend execution has completed.

截屏2025-06-05 下午3 44 23

Timeline for this bug

This issue was originally identified and fixed in #2411. However, it re-emerged after we migrated our RPC layer to gRPC in #2950. A minor mistake during the migration caused the originally chained future to be overridden by an EmptyResponse as the return value.

Root Cause

The core issue is that the controllerInitiatedQueryStats call returns immediately instead of waiting for the worker's response. Previously, we relied on this call to collect worker stats after execution, and used the updated stats to infer workflow completion. Because the call now returns immediately, the stats are never updated.

Additionally, our current design infers region completion from port completion and workflow completion from region completion. This decouples workflow execution state from the actual state of the workers. As a result, a workflow may be marked as complete even though the frontend still shows an operator as running, when in fact, the worker has already finished execution.

Example Scenario

Here’s how the issue manifests when the last operator finishes:

  1. A port finishes, triggering a handler that marks the port as completed.
  2. A QueryStats control message is sent to the worker that reported port completion (response pending).
  3. We check if all ports in the region are complete—if so, the region is marked complete.
  4. Separately, we receive a worker execution completion event and send another QueryStats message (response also pending).
  5. Since all ports are marked complete, the workflow is marked complete and the entire execution is terminated.
  6. Both pending QueryStats messages are lost because the worker is already shut down.

The Fix

The solution is straightforward: correctly chain and return the future so that the control message awaits the worker’s response before proceeding.

How the fix is verified

The bug is not always reproducible because sometimes the execution is killed after updating the states. So I looked at the WebSocket event sequence. In the correct behavior, setting the workflow state to complete should be after all the stats updates.

Before the fix:
截屏2025-06-05 下午3 50 34

After the fix:
截屏2025-06-05 下午3 52 05

Copy link
Contributor

@bobbai00 bobbai00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive PR description! LGTM!

@Yicong-Huang Yicong-Huang self-requested a review June 6, 2025 15:46
@shengquan-ni shengquan-ni merged commit 22641d0 into master Jun 6, 2025
9 checks passed
@shengquan-ni shengquan-ni deleted the shengquan-fix-incorrect-stats branch June 6, 2025 18:08
@Yicong-Huang Yicong-Huang linked an issue Jun 13, 2025 that may be closed by this pull request
@Yicong-Huang Yicong-Huang removed the bug label Oct 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The workflow stats display is wrong

3 participants