Don't return update handles until desired stage reached #2066
Sushisource merged 10 commits into temporalio:master
Conversation
@WorkflowInterface
public interface SimpleWorkflowWithUpdate {

UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
} while (result.getStage().getNumber() < input.getWaitPolicy().getLifecycleStage().getNumber()
Do we set a default for input.getWaitPolicy()?
Per @drewhoskins-temporal's latest requirements, we want wait-for-stage to be a required field for start. Also, we should call it "wait-for-stage" IMO to match Python and future SDKs (or if we don't like that term, we should call it something else and be consistent across SDKs with what it is called).
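For illustration, a caller-side sketch of what a required wait-for-stage could look like. UpdateOptions, WorkflowUpdateStage, and this startUpdate overload are assumptions about the eventual API shape, not the merged code.

// Hypothetical caller-side shape; names are illustrative, not the merged API.
WorkflowStub stub = client.newUntypedWorkflowStub("my-workflow-id");
UpdateHandle<String> handle =
    stub.startUpdate(
        UpdateOptions.<String>newBuilder()
            .setUpdateName("myUpdate")
            // Required: no default, per the requirement discussed above.
            .setWaitForStage(WorkflowUpdateStage.ACCEPTED)
            .setResultClass(String.class)
            .build(),
        "update-arg");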
@@ -334,8 +334,17 @@ public <R> StartUpdateOutput<R> startUpdate(StartUpdateInput<R> input) {
        .setRequest(request)
        .build();
Deadline pollTimeoutDeadline = Deadline.after(POLL_UPDATE_TIMEOUT_S, TimeUnit.SECONDS);
Shouldn't the deadline be in the loop?
Arguably it doesn't need to be set at all
I unset this in the most recent commit - but I'm not sure having a super long timeout by default is what we want to do? OTOH, I don't have a firm reason why not, I suppose.
I have no strong opinion so long as it's always longer than the server's by enough to let the server return an empty response on its own timeout.
We should treat start update as a long poll, hence the long timeout
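For comparison, a sketch of the in-loop variant raised earlier in this thread, reusing identifiers from the diff above; whether each attempt should get a fresh deadline is exactly the open question here.

// Sketch: give each long-poll attempt its own gRPC deadline instead of one
// deadline computed before the loop (identifiers from the diff above).
UpdateWorkflowExecutionResponse result;
do {
  Deadline attemptDeadline = Deadline.after(POLL_UPDATE_TIMEOUT_S, TimeUnit.SECONDS);
  result = genericClient.update(updateRequest, attemptDeadline);
} while (result.getStage().getNumber()
    < input.getWaitPolicy().getLifecycleStage().getNumber());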
workflowClientInvoker.pollWorkflowUpdate(
    new WorkflowClientCallsInterceptor.PollWorkflowUpdateInput<>(
        execution, updateName, id, resultClass, resultType, timeout, unit));
WorkflowClientCallsInterceptor.PollWorkflowUpdateOutput<T> pollCall;
Sorry, I have read this a few times and I am not sure what this logic is trying to accomplish?
Before the handle is returned from start, when the user says complete, there might already be a result from polling; if there is, we want to use that, otherwise try polling - but then we need to wipe that result in case getResult gets called again.
Hm, this seems racy - if you have two concurrent calls, isn't it possible for pollCall to be null if two threads interleave in the right way?
Ah yes, it is; I didn't think about this being called concurrently - too used to Rust.
Simplified this (works better as a cache now too)
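A minimal sketch of one race-free shape for this, assuming a generic result type and a hypothetical startNewPoll() helper; the PR's actual simplification may differ.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Sketch (not the PR's code): a consume-once cache for the pending poll call.
class PollCache<T> {
  private final AtomicReference<CompletableFuture<T>> waitCompletedPollCall =
      new AtomicReference<>();

  CompletableFuture<T> takeOrStartPoll() {
    // getAndSet(null) atomically claims the cached poll, so two interleaving
    // threads can never both observe it, nor both observe null after a claim.
    CompletableFuture<T> cached = waitCompletedPollCall.getAndSet(null);
    return cached != null ? cached : startNewPoll();
  }

  void cacheFromStart(CompletableFuture<T> poll) {
    waitCompletedPollCall.set(poll);
  }

  private CompletableFuture<T> startNewPoll() {
    return new CompletableFuture<>(); // stand-in for the real poll RPC
  }
}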
@@ -23,7 +23,19 @@
import io.temporal.api.enums.v1.UpdateWorkflowExecutionLifecycleStage;

public enum UpdateWaitPolicy {
Hmm, in Python it looks like we changed the name to WorkflowUpdateStage. I think we should do the same here, because this enum will also be used when describing an update's stage.
👍 And the docs here about what each enum value means only apply to starting an update, so maybe they should move there (but maybe not).
K, I'm down to change the name
Docs-wise, weirdly, even the proto APIs say nothing about what the stages mean beyond their use as input to requests. That would be good to fix.
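A sketch of the rename plus the missing stage documentation; the doc wording below is an informal editorial guess at the semantics, not taken from the protos.

// Sketch: WorkflowUpdateStage (renamed from UpdateWaitPolicy), with doc text
// that is an informal description of the stages, not proto-sourced.
public enum WorkflowUpdateStage {
  /** The server has received (admitted) the update request. */
  ADMITTED,
  /** The update passed validation in the workflow and will be processed. */
  ACCEPTED,
  /** The update handler finished and an outcome (result or failure) exists. */
  COMPLETED,
}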
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
} while (result.getStage().getNumber() < input.getWaitPolicy().getLifecycleStage().getNumber()
I think the latest requirement for start was that, if the wait stage is COMPLETED, after ACCEPTED you switch to polling for the response before returning from the start call. Can you confirm that occurs, at least from the user's perspective?
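A sketch of the flow this comment asks about, reusing identifiers from the diff above; whether the PR does exactly this is the open question, and pollUntilOutcome is a hypothetical helper standing in for the PR's poll path.

// Sketch: long-poll until at least ACCEPTED, then, if the caller asked for
// COMPLETED and no outcome came back yet, poll for the result before start
// returns. pollUntilOutcome is a hypothetical helper.
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
} while (result.getStage().getNumber()
    < UpdateWorkflowExecutionLifecycleStage
        .UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_ACCEPTED_VALUE);

boolean wantsCompleted =
    input.getWaitPolicy().getLifecycleStage()
        == UpdateWorkflowExecutionLifecycleStage
            .UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED;
if (wantsCompleted && !result.hasOutcome()) {
  result = pollUntilOutcome(result.getUpdateRef()); // hypothetical helper
}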
// by the user.
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
Can you make sure down below in pollWorkflowUpdateHelper that you remove the logic that retries on gRPC deadline exceeded error? That should no longer occur, we should just be bubbling all errors out
Wait, isn't the Python one doing that when it passes retry=True to the service client? Or, if that doesn't retry timeouts, then where is that happening? Because https://github.com/temporalio/sdk-python/blob/1a2acd59634a3b1d694937b8a8433c0014247370/temporalio/client.py#L4303 says it will, but there's no explicit handling of timeouts here: https://github.com/temporalio/sdk-python/blob/1a2acd59634a3b1d694937b8a8433c0014247370/temporalio/client.py#L4359
It's easy to change Java to not do this and just default to the max timeout for getResult calls, but I'm not sure that's the right thing to do.
(I committed it so we can see what I mean - it works fine, but maybe it isn't right? At minimum, what the Python doc says vs. what the code does is either inconsistent, or the loop is not needed, or it's not the same as what I've just done here.)
Ok, I need to update that Python doc to remove that last sentence (I fixed logic but forgot about docs). We are no longer using timeout/exceptions to drive the loop.
Just need to remove the idea that deadline exceeded means something special in the start/poll loop. Let all RPC exceptions bubble out as they always would, and change the code to only care about the successful result, instead of today's whenComplete, which cares about either result or failure (not sure what the combinator is for success-only).
Yeah, that's done now. All it's doing is interpreting the failure code into the right exception type, which makes sense to me.
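A rough sketch of "interpret the failure code into the right exception type"; the chosen target exception types are illustrative, not the SDK's exact ones.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeoutException;

// Sketch: map only the codes the client cares about; everything else
// bubbles out as-is.
static Exception mapUpdateFailure(StatusRuntimeException e) {
  Status.Code code = e.getStatus().getCode();
  if (code == Status.Code.DEADLINE_EXCEEDED || code == Status.Code.CANCELLED) {
    // The caller-supplied timeout elapsed while waiting on the long poll.
    return new TimeoutException("timed out waiting for workflow update");
  }
  return e; // all other RPC failures bubble out unchanged
}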
.getResult()
.exceptionally(
    failure -> {
      // If the poll didn't find the completion successfully, reset the previous poll call
Just because the poll call failed does not mean it shouldn't be cached, right? An update rejection would also complete the poll future exceptionally, if I recall correctly. I would probably drop caching, since we don't cache the workflow result either, and if we get user feedback, address all these functions with a consistent strategy.
Sure, I could still just delete it every time. I need something like it for the don't-return-handle-until-completed case, and I figured why not cache it for the success case, since that's definitely not going to change.
I think for the don't-return-handle-until-completed case you can take the result and put it in the CompletedUpdateHandleImpl
The reason I'd done it this way is to still get all the exception conversion stuff for free (and the encapsulation of the parameters). So, I'll just wipe out the cache every time.
K, I've done a much more targeted version of this.
We still do want to avoid caching most exceptions, though; just the ones from the update outcome should be cached.
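A sketch of that distinction: cache terminal update outcomes, never transient RPC errors or empty responses, so a later getResult can retry. The cachedOutcome field and the pollFuture variable are illustrative stand-ins.

// Sketch: only an actual outcome (result or update failure baked into the
// outcome) is terminal and safe to serve to later callers.
pollFuture.whenComplete(
    (response, error) -> {
      if (error == null && response.hasOutcome()) {
        cachedOutcome = response.getOutcome(); // terminal, safe to reuse
      }
      // else (network error, timeout, empty response): cache nothing and
      // let the next getResult call issue a fresh poll
    });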
WorkflowClientCallsInterceptor.PollWorkflowUpdateOutput<T> pollUntilComplete(
    long timeout, TimeUnit unit) {
  synchronized (this) {
Using synchronized like this can be really problematic with virtual threads because the virtual thread will be pinned while executing pollWorkflowUpdate https://mikemybytes.com/2024/02/28/curiosities-of-java-virtual-threads-pinning-with-synchronized/
Ah, interesting didn't realize that.
Yeah, it is a very unfortunate issue with the current virtual thread limitation. My current stance is to avoid I/O in synchronized blocks and remove any I see.
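A sketch of the usual workaround: a ReentrantLock parks a blocked virtual thread instead of pinning its carrier the way synchronized can on current JDKs, and the blocking RPC moves outside the critical section. The two helpers named here are hypothetical.

import java.util.concurrent.locks.ReentrantLock;

// Sketch: only cheap cache bookkeeping happens under the lock; the blocking
// poll happens afterwards, unlocked.
private final ReentrantLock lock = new ReentrantLock();

WorkflowClientCallsInterceptor.PollWorkflowUpdateOutput<T> pollUntilComplete(
    long timeout, TimeUnit unit) {
  CompletableFuture<T> pollCall;
  lock.lock();
  try {
    pollCall = claimCachedPollOrStartNew(); // hypothetical, non-blocking
  } finally {
    lock.unlock();
  }
  return waitForResult(pollCall, timeout, unit); // hypothetical, blocks here
}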
    == Status.Code.DEADLINE_EXCEEDED)
|| pollTimeoutDeadline.isExpired()
|| deadline.isExpired()
|| (e == null && !r.hasOutcome())) {
I think this is supposed to recurse in this situation (keeps retrying until outcome is present)
Right, OK. I keep getting confused about the situations where the server could return no outcome, but it's like long polling on tasks.
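A sketch of the long-poll retry shape this thread describes: an empty success (no outcome) just means the server's own poll timeout fired, so poll again; only the caller's overall deadline ends the wait. pollUpdateAsync is a hypothetical helper returning a CompletableFuture of the poll response.

// Sketch: recurse until an outcome is present or our deadline expires.
void pollWorkflowUpdateHelper(
    CompletableFuture<PollWorkflowExecutionUpdateResponse> resultCF,
    PollWorkflowExecutionUpdateRequest request,
    Deadline deadline) {
  pollUpdateAsync(request, deadline)
      .thenAccept(
          r -> {
            if (r.hasOutcome()) {
              resultCF.complete(r);
            } else if (deadline.isExpired()) {
              resultCF.completeExceptionally(
                  new TimeoutException("update did not complete within deadline"));
            } else {
              // Empty response before our deadline: poll again.
              pollWorkflowUpdateHelper(resultCF, request, deadline);
            }
          });
}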
Force-pushed from c9584e5 to 6e9cb0c
Don't we also need to update the test server to return the update's current lifecycle stage? Or does the test server never actually return an empty response?
I have changed it to do so - here for example: https://github.com/temporalio/sdk-java/pull/2066/files#diff-809c076b3ee441df02cf0c4566a20f7abc69c94c15d8b689875350ae3fcdbfd9R1808
}
    || deadline.isExpired()) {
  resultCF.completeExceptionally(
      new TimeoutException(
Force-pushed from d9c9d0b to 0a7499e
synchronized (this) {
  if (waitCompletedPollCall != null) {
    pollCall = waitCompletedPollCall;
    waitCompletedPollCall = null;
setFromWaitCompleted is never changed to true. I think the intention was to do that here?
Yeah, this looks like a bug.
The Java StartUpdate code path is different from the other SDKs in a few subtle ways as well; I'll try to align it with the other SDKs.
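A minimal sketch of the fix the thread suggests, setting the flag at the point where the cached poll is consumed; field and variable names follow the diff above.

// Sketch: mark that the claimed poll came from the start-time wait.
synchronized (this) {
  if (waitCompletedPollCall != null) {
    pollCall = waitCompletedPollCall;
    waitCompletedPollCall = null;
    setFromWaitCompleted = true; // was never set before, per the review
  }
}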
What was changed
Why?
See temporalio/features#432
Checklist
Closes [Feature Request] SDK should not return an update handle if the update has not reached the desired state #2002
How was this tested:
Existing / new tests
Any docs updates needed?