-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Standalone Activity #8875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standalone Activity #8875
Conversation
Fixup feature branch after rebasing on `main` and pulling in latest `api` protos: - Rename `GetActivityResult` => `GetActivityExecutionResult` - Add `WorkflowHandler` stubs Foundations for CHASM activity - built
Adds a state machine defining valid transitions for a CHASM activity. The state machine does not include transitions related to cancel and pause states: we can add those when we implement those features. Part of foundations for CHASM activity implementation / Standalone Activity. - Built (a unit test was written but not committed)
Added support for starting a standalone activity execution. Required API for standalone activity feature. Focus here is plumbing in entity creation and handler boiler plate. - [X] built - [X] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) Integration tests to be added later when we have all the RPCs handled and integrated. Tasks not plumbed in yet to support full interaction with services. --------- Co-authored-by: Dan Davison <dan.davison@temporal.io> Co-authored-by: Roey Berman <roey@temporal.io>
Consolidated and refactored activity request validations to the chasm package. Added additional tests for the validator. Existing workflow activities and standalone activities should share the same validation code. Standalone activities also directly process the frontend request and therefore has additional fields to validate compared to embedded activities. - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) This refactors the existing stack, but the validation is basically exactly copied over and all relevant tests are passing. The standalone start activity request is now cloned before sanitizing the request attributes to preserve idempotent behavior during retries, but can potentially impact performance if there are large inputs. --------- Co-authored-by: Roey Berman <roey.berman@gmail.com>
Added standalone activity Chasm tasks. Added handling of start activity and e2e implementation of standalone activity start execution with existing services. Updated protos related to standalone activities. The Chasm tasks are needed to kick off standalone activity execution via the existing services. Proto changes needed to so that the component ref can be passed and handled via service stack. - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) --------- Co-authored-by: Roey Berman <roey.berman@gmail.com> Co-authored-by: Dan Davison <dan.davison@temporal.io>
Update to latest protos from temporalio/api#640
## What changed? Added timeout tasks for standalone activities. Refactored existing backoff algorithm to common. Added outcome field. Fixed a few errors as the result of main rebase. Regen go.sum. Fixed broken redirection tests. Uncommented `LifecycleStateTimedout` and added supporting logic in `closeTransactionHandleRootLifecycleChange` ## Why? We need schedule-to-start, schedule-to-close, and start-to-close timeout support for standalone activities. Also reusing existing activity retry backoff code, so refactored it to a common package. We need to bring back LifecycleStateTimedout because when a timeout lifecycle occurs the workflow status needs to update to `WORKFLOW_EXECUTION_STATUS_TIMED_OUT`, else we encounter an invalidate state change: `unable to change workflow state from Created to Completed, status Failed` ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) ## Potential risks E2E timeout tests not added yet as we need complete workflow API plumbed before we can test this. --------- Co-authored-by: Roey Berman <roey.berman@gmail.com> Co-authored-by: Dan Davison <dan.davison@temporal.io> Co-authored-by: Roey Berman <roey@temporal.io>
) ## What changed? Added request ID transition option when creating activity entity. ## Why? The requestID is set by the matching service to a UUID, allowing safe retries if the response is lost. Here we add the request ID transition option so chasm engine handles it for us under the hood. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
- Add `history.ChasmNotifier` for subscribing to CHASM execution state transitions - Implement `chasm.PollComponent` - Add `PollActivityExecution` API handler - Needed for standalone activity - Needed for long-poll of other CHASM archetypes - [x] built - [x] added new unit test(s) - [x] added new functional test(s)
Added standalone activity completion and failure handling. Refactored existing timeout failure handling. Refactored existing check for retry method. Needed to support standalone activities full operation. - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s) --------- Co-authored-by: Dan Davison <dan.davison@temporal.io>
## What changed? Added standalone activity termination handling. ## Why? Needed to support standalone activities full operation. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Adds termination support for standalone activities across API, frontend/handler, state machine, and tests. > > - **API/Proto**: > - Add `TerminateActivityExecution` RPC: request/response messages in `proto/v1/request_response.proto` and service in `proto/v1/service.proto`. > - Regenerate clients/servers: `activitypb` request/response, service, layered client, and gRPC stubs. > - **Frontend**: > - Implement `TerminateActivityExecution` in `frontend.go` to resolve namespace and forward to activity service. > - **Backend/Handler**: > - Add handler `TerminateActivityExecution` to update component via `(*Activity).handleTerminated`. > - Add `Activity.handleTerminated` to apply termination transition. > - **State Machine**: > - Enhance `TransitionTerminated` to finalize execution, set `LastWorkerIdentity`, and record `TerminatedFailureInfo` in outcome. > - **Tests**: > - Add unit test `TestTransitionTerminated`. > - Add functional tests for terminating an activity and preventing termination after completion. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit de5e4f2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Dan Davison <dan.davison@temporal.io>
Added standalone activity cancellation request and respond handling. Needed to support standalone activities full operation. - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s) Co-authored-by: Dan Davison <dan.davison@temporal.io>
## What changed? Add handling of business ID policy for standalone activities. Refactored standalone activity validations. ## Why? Required so that the chasm engine will handle the business ID policies based on the RPC request. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s) ## Potential risks Will need to add more tests once we rebase standalone-activity with main. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Adds business ID reuse/conflict policy handling for standalone activities, refactors validation to operate on the full request with sane defaults, and updates tests accordingly. > > - **Activity Backend**: > - Map `enumspb` activity ID reuse/conflict policies to chasm policies and pass via `chasm.WithBusinessIDPolicy` in `StartActivityExecution`. > - Propagate `RequestId` via `chasm.WithRequestID`. > - **Validation**: > - Refactor `ValidateStandaloneActivity` to accept the entire `workflowservice.StartActivityExecutionRequest` and mutate in place (request ID, ID policies, input size, search attributes). > - Add `normalizeAndValidateIDPolicy` with defaults and incompatibility checks; require request ID when attaching completion callbacks. > - Minor tweaks to input/search attributes validation paths. > - **Frontend**: > - Update `validateAndPopulateStartRequest` to use new validation signatures and remove redundant vars. > - **Tests**: > - Add/extend unit tests for ID policy defaults/mismatch, input size warning/error, and request ID length. > - Add functional tests covering ID reuse (`REJECT_DUPLICATE`, `ALLOW_DUPLICATE_FAILED_ONLY`) and conflict policy failure on existing execution. > - Adjust cancellation tests to use unique `RequestId`; remove default `RequestId` from `startActivity` helper. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1225ccc. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Dan Davison <dan.davison@temporal.io>
) ## What changed? - Fix bug: schedule-to-close timer task validator was incorrectly requiring activity attempt at task execution time to be equal to activity attempt at task creation - Add test of schedule-to-close timeout that fails with the bug fix reverted - Do not set empty struct as outcome failure on attempt failure when retries are exhausted. - Improve doc comments ## Why? - Standalone activity schedule-to-close was incorrect: would not have fired after attempt 1 without this fix - Setting empty struct on attempt failure when retries are exhausted should not be necessary and it is fragile to introduce special values that code might start to rely on. ## How did you test it? - [x] built - [x] added new functional test(s)
…#8723) ## What changed? Bug fix: with schedule-to-close timeout not set by the caller, we were not scheduling retries ## Why? Required for CHASM activity correctness (e.g. standalone activity) ## How did you test it? - [x] built - [x] added new functional test(s)
All changes needed to make tests compile and pass after merging main into standalone-activity.
…onOutcome (#8771) - Add new public gRPC methods `DescribeActivityExecution` and `GetActivityExecutionOutcome`. See temporalio/api#673 - These replace the previous `PollActivityExecution` - Respond to additional API changes: inlining of `ActivityOptions` - Configure quota for the new methods in their blocking and non-blocking forms - Update test suite - Implements agreed Standalone Activity design - [x] built - [x] covered by existing tests - [x] added new functional test(s)
## What changed? Change dev environment to enable Chasm ## Why? Chasm should be enabled for development by default so devs don't have to explicitly set the config. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
Implement heartbeating for CHASM activities (standalone activity) - [x] built - [x] added new functional test(s)
## What changed? Add standalone activity metrics ## Why? Need standalone activity metrics for observability purposes ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Adds comprehensive metrics and payload-size recording for standalone activities, introducing contextual request wrappers and wiring metrics through state transitions, timeout executors, and history APIs. > > - **Activity component/state machine**: > - Introduces `RequestWithContext` carrying `Token`, `MetricsHandler`, `NamespaceName`, and `BreakdownMetricsByTaskQueue` and replaces prior request wrappers. > - Records metrics on schedule, attempts, and terminal transitions (success, fail, cancel, timeouts), including `ActivityStartToCloseLatency`, `ActivityScheduleToCloseLatency`, success/fail/cancel/timeout counters, and per-timeout tags. > - Adds `recordPayloadSize(...)` to emit payload sizes for input, heartbeat details, results, and failures. > - **Timeout and dispatch executors (`chasm/lib/activity/activity_tasks.go`)**: > - Add `timeoutTaskExecutorOptions` (dynamic config, metrics, namespace registry); resolve namespace and emit timeout metrics during schedule/start/close/heartbeat timeouts and retries. > - **Activity handler**: > - Injects `metrics.Handler` and `namespace.Registry`; `StartActivityExecution` emits input payload size on scheduling. > - **History APIs (record/respond activity ops)**: > - Pass `RequestWithContext` (token, metrics handler, namespace, breakdown setting) into chasm component calls. > - **Tests**: > - Update/add unit tests to validate metric emissions and payload-size recording across transitions and timeouts. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 02c6aba. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
## What changed? Add dynamic config to toggle standalone activity functionality. Refactored frontend dynamic config into activity config. Removed standalone activity dc prefix `chasm` ## Why? We need the ability to toggle standalone activity functionality as we go to prod. We also agreed to remove the `chasm` prefix from any chasm originated dynamic config keys ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? After merging `main` into `standalone-activity` (see 88b0e52) a couple of things were broken: - The standalone activity protos needed rebasing on `main`'s protos - `GetActivityExecutionOutcome` needed renaming to `PollActivityExecution` (respond to upstream proto change) - The way we obtain the namespace name in the worker API handlers shared with workflow activities needed changing, to pass the correct business ID to `api.GetActiveNamespace` ## How did you test it? - [x] built - [x] covered by existing tests
## What changed?
Add standalone activity terminate request ID handling.
## Why?
On termination retry, it's nice if a duplicate req ID won't fail if the
termination happened on first attempt.
## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [X] added new functional test(s)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> Introduce request ID–based idempotency for standalone activity
termination, validating/populating RequestId at the frontend and
persisting it in a new terminate_state, with accompanying proto changes
and tests.
>
> - **Activity/state machine**:
> - Implement idempotent termination in `handleTerminated`: no-op on
same `request_id`, `FailedPrecondition` on different ID.
> - `TransitionTerminated` now sets
`ActivityState.terminate_state.request_id` and records terminated
failure.
> - **Frontend/API**:
> - `TerminateActivityExecution` clones request, validates `request_id`
length, auto-generates UUID when missing, and forwards to service.
> - **Proto/serialization**:
> - Add `message ActivityTerminateState { string request_id = 1; }` and
`ActivityState.terminate_state` field.
> - Regenerate Go code and helper methods; fix "cancelation" →
"cancellation".
> - **Tests**:
> - Unit: assert `terminate_state.request_id` set on termination.
> - Functional: duplicate `RequestId` termination succeeds; different
`RequestId` fails with `FailedPrecondition`.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
1452ec2. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
## What changed? Add standalone heartbeat by ID ## Why? We want to support existing RPCs to activity heartbeat ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [X] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Enable `RecordActivityTaskHeartbeatById` to heartbeat standalone activities (no `workflowId`) by embedding a `componentRef` in the activity task token; add a functional test verifying heartbeats keep the activity alive. > > - **Frontend** > - `service/frontend/workflow_handler.go` > - `RecordActivityTaskHeartbeatById`: accept empty `workflowId` for standalone activities; construct and serialize `componentRef` via `chasm.NewComponentRef[*activity.Activity]` and include it in the activity `taskToken`. > - **Tests** > - `tests/standalone_activity_test.go` > - Add `RecordHeartbeatByIDStaysAlive` to verify heartbeating-by-ID prevents timeout and activity completes successfully. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c2a4310. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
Implement standalone activity ListActivityExecutions and CountActivityExecutions.
## What changed? Added support for ID conflict use existing. Added error details on conflict policy fail. ## Why? We now have the use existing policy support on chasm engine, so we can support it. Also need to pass back error details on conflict policy fail in struct ActivityExecutionAlreadyStartedError ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [X] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > <sup>[Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) is generating a summary for commit f264bba. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
## What changed? In `chasm.PollComponent`, if the passed component ref has empty run ID, resolve this to the current run ID. ## Why? The previous code state was broken: it was possible to submit `PollActivityExecution` with no run ID, which led to `Subscribe(key)` being called where key lacked a run ID. But the notifications were being sent to a different key (one with the run ID) which led to the long-poll never receiving any notification and hence always timing out. Existing chasm APIs `ReadComponent` and `UpdateComponent` both accept empty run ID and silently resolve it to current using `getExecutionLease`. It was a bug that `PollComponent` resolved to current using `getExecutionLease`, but then subscribed to notifications using a key that did not contain the resolved run ID. At the external API level, in several of our non-CHASM user-facing gRPC APIs, empty run ID means "use latest at the time the request is processed" (e.g. `GetWorkflowExecutionHistory`). And we want these semantics for CHASM APIs such as `PollActivityExecution`. This change gives those semantics. ## How did you test it? - [x] built - [x] added new unit test(s) - [x] added new functional test(s)
## What changed? Cleaner implementation of logic involved when obtaining activity outcome for `DescribeActivityExecution` and `PollActivityExecution`. ## Why? Code cleanup. Suggested at #8771 (comment) ## How did you test it? - [x] built - [x] covered by existing tests
## What changed? Filled remaining ActivityExecutionInfo fields. Enabled test that was blocked by chasm bug. Added schedule-to-start-timeout test. Test refactor. ## Why? We need to return all the fields for GET RPCs. Test cleanup/completion before merge to main. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) --------- Co-authored-by: Dan Davison <dan.davison@temporal.io>
## What changed? Added alias to imports in activity.go ## Why? When `make lint-code` is run locally, it is insistent on aliasing the activity API import to `activitypb "go.temporal.io/api/activity/v1"`. This is because the activity model `"go.temporal.io/server/chasm/lib/activity/gen/activitypb/v1"` also has the name. I've renamed both so lint-code won't keep modifying it. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? If the group by key is returned as `[]byte`, convert it to string in `CountWorkflowExecutions` and `CountCHASMExecutions`. ## Why? The Go driver for MySQL returns `VARCHAR` columns as `[]byte`. This is apparently not true of our other persistence backends. ## How did you test it? - [x] built - [x] added new functional test(s) ## Potential risks Could break workflow visibility queries if this change is not correct.
| state *activitypb.ActivityState, | ||
| parent ActivityStore, | ||
| ) { | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Exported function has empty body and silently no-ops
The NewEmbeddedActivity function is exported but has an empty body. Any code calling this function to create an embedded activity would silently receive nothing - the function takes parameters (ctx, state, parent) but performs no operations and returns no value. If this is intentionally a placeholder for future implementation, it would be better to panic with "not implemented" or make it unexported to prevent accidental usage.
| &activitypb.HeartbeatTimeoutTask{ | ||
| Attempt: a.LastAttempt.Get(ctx).GetCount(), | ||
| }, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Missing heartbeat timeout check causes spurious timeout tasks
RecordHeartbeat unconditionally schedules a HeartbeatTimeoutTask without checking if HeartbeatTimeout is configured (greater than 0). When HeartbeatTimeout is 0 (not set), the task is scheduled for immediate execution at ctx.Now(a).Add(0), which could trigger spurious heartbeat timeout failures. This is inconsistent with the state machine code in statemachine.go which correctly checks if heartbeatTimeout > 0 before scheduling the initial heartbeat timeout task.
22de093 to
5852b15
Compare
| FrontendResponse: &workflowservice.DescribeActivityExecutionResponse{}, | ||
| }, nil | ||
| } | ||
| return response, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Long-poll timeout returns nil response instead of empty response
The condition err != nil && ctx.Err() != nil fails to handle the documented PollComponent timeout behavior. According to the docstring in engine.go, PollComponent returns (nil, nil, nil) on server-imposed timeout as "an indication that the caller should continue long-polling." Since err is nil in this case, the condition is false and the handler returns (nil, nil) instead of the intended empty response. This causes callers to receive a nil FrontendResponse, potentially causing nil pointer issues downstream. The condition should check ctx.Err() != nil without requiring err != nil.
Additional Locations (1)
## What changed? Update api protos to use api-go main ## Why? #8875 accidentally left the proto dependency pointing at a non-main commit in api-go. ## How did you test it? - [x] built - [x] covered by existing tests
What changed?
Standalone Activity feature branch
Why?
Adds Standalone Activity feature (CHASM Activity archetype)
Note
Adds standalone Activities powered by CHASM, including new Activity APIs, engine/notifier/polling, visibility, persistence/token wiring, and end-to-end tests.
Activitycomponent, state machine, tasks, handlers, metrics, config, and FX modules.PollComponent, execution notifications, structured refs, transition API changes.Start/Describe/Poll/Terminate/RequestCancel/DeleteActivityExecution, plusList/CountActivityExecutions.component_ref.component_reftoTaskInfoand task token.component_refthrough add/poll paths.activity.enableStandalone,history.enableChasm), long-poll timeouts/buffers.Written by Cursor Bugbot for commit 5852b15. This will update automatically on new commits. Configure here.