Skip to content

Conversation

@shengquan-ni
Copy link
Contributor

This PR addresses an issue where operators are not marked as finished upon the completion of a workflow. The underlying problem stems from the premature sending of PortComplete messages before WorkerComplete messages. Specifically, within the PortCompleteHandler, the current region is marked as completed. Consequently, when WorkerCompleteHandler executes, it attempts to query stats for the currently running region, which has already been marked as completed due to the prior PortComplete call. To resolve this issue, this PR introduces an additional call to querystats before a region is marked as completed, ensuring that the operators are correctly finished after workflow completion.

@shengquan-ni shengquan-ni self-assigned this Feb 25, 2024
Copy link
Contributor

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shengquan-ni shengquan-ni merged commit 0f77dff into master Feb 25, 2024
@shengquan-ni shengquan-ni deleted the shengquan-fix-port-complete branch February 25, 2024 23:17
shengquan-ni added a commit that referenced this pull request Jun 6, 2025
This PR fixes a bug where the frontend incorrectly shows an operator as
still running, even though the backend execution has completed.

<img width="80%" alt="截屏2025-06-05 下午3 44 23"
src="https://github.com/user-attachments/assets/63865886-1b24-4f36-8ebf-155c2748f98e"
/>



### Timeline for this bug
This issue was originally identified and fixed in #2411. However, it
re-emerged after we migrated our RPC layer to gRPC in #2950. A minor
mistake during the migration caused the originally chained future to be
overridden by an `EmptyResponse` as the return value.

### Root Cause
The core issue is that the `controllerInitiatedQueryStats` call returns
immediately instead of waiting for the worker's response. Previously, we
relied on this call to collect worker stats after execution, and used
the updated stats to infer workflow completion. Because the call now
returns immediately, the stats are never updated.

Additionally, our current design infers region completion from port
completion and workflow completion from region completion. This
decouples workflow execution state from the actual state of the workers.
As a result, a workflow may be marked as complete even though the
frontend still shows an operator as running, when in fact, the worker
has already finished execution.

### Example Scenario
Here’s how the issue manifests when the last operator finishes:
1. A port finishes, triggering a handler that marks the port as
completed.
2. A `QueryStats` control message is sent to the worker that reported
port completion (response pending).
3. We check if all ports in the region are complete—if so, the region is
marked complete.
4. Separately, we receive a worker execution completion event and send
another `QueryStats` message (response also pending).
5. Since all ports are marked complete, the workflow is marked complete
and the entire execution is terminated.
6. Both pending `QueryStats` messages are lost because the worker is
already shut down.

### The Fix
The solution is straightforward: correctly chain and return the future
so that the control message awaits the worker’s response before
proceeding.

### How the fix is verified
The bug is not always reproducible because sometimes the execution is
killed after updating the states. So I looked at the WebSocket event
sequence. In the correct behavior, setting the workflow state to
`complete` should be after all the stats updates.

Before the fix:
<img width="100%" alt="截屏2025-06-05 下午3 50 34"
src="https://github.com/user-attachments/assets/da996255-743d-4cda-8a07-85bd9a9d1d3d"
/>

After the fix:
<img width="825" alt="截屏2025-06-05 下午3 52 05"
src="https://github.com/user-attachments/assets/6b64660b-547f-4f19-8769-16a7f0676031"
/>
@Yicong-Huang Yicong-Huang removed the bug label Oct 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants