🦋 Changeset detected. Latest commit: 0058bb4. The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages.
📊 Benchmark Results
Each benchmark lists 💻 Local Development and ▲ Production (Vercel) results, with 🔍 Observability links for Nitro, Express, and Next.js (Turbopack). [Result tables omitted.]
- workflow with no steps
- workflow with 1 step
- workflow with 10 sequential steps
- workflow with 25 sequential steps
- workflow with 50 sequential steps
- Promise.all with 10 concurrent steps
- Promise.all with 25 concurrent steps
- Promise.all with 50 concurrent steps
- Promise.race with 10 concurrent steps
- Promise.race with 25 concurrent steps
- Promise.race with 50 concurrent steps
Stream Benchmarks (includes TTFB metrics)
- workflow with stream
Summary
- Fastest Framework by World (winner determined by most benchmark wins)
- Fastest World by Framework (winner determined by most benchmark wins)
🧪 E2E Test Results
❌ Some tests failed
Failed tests: 🌍 Community Worlds (169 failed)
- mongodb: 42 failed
- redis: 42 failed
- starter: 43 failed
- turso: 42 failed
Details by category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
Pull request overview
This PR optimizes the step handler performance by parallelizing independent async operations to reduce step execution overhead. The changes introduce three strategic optimizations that overlap HTTP calls with CPU-bound work and run independent async operations concurrently.
Changes:
- Parallelize initialization of port, span kind, and step entity fetching at handler startup
- Overlap step_started event creation with CPU-bound argument hydration
- Run step_completed event and trace serialization concurrently before queueing workflow continuation
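The idea behind the first item can be sketched as follows. This is a minimal illustration rather than the code in step-handler.ts: `getPort` and `getSpanKind` below are stand-in stubs whose signatures and return values are assumed, and `delay` only simulates I/O latency.

```typescript
// Hypothetical stand-ins for the real helpers; signatures and values are assumed.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function getPort(): Promise<number> {
  await delay(20);
  return 3000;
}

async function getSpanKind(): Promise<string> {
  await delay(20);
  return "consumer";
}

// Sequential: the second call starts only after the first resolves,
// so total latency is roughly the sum of both calls.
async function initSequential(): Promise<{ port: number; spanKind: string }> {
  const port = await getPort();
  const spanKind = await getSpanKind();
  return { port, spanKind };
}

// Parallel: both calls start immediately and are awaited together,
// so total latency is roughly that of the slower call.
async function initParallel(): Promise<{ port: number; spanKind: string }> {
  const [port, spanKind] = await Promise.all([getPort(), getSpanKind()]);
  return { port, spanKind };
}
```

Both functions return the same values; only the scheduling differs. This is safe only when the operations are truly independent, since `Promise.all` rejects as soon as any input rejects.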
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/runtime/step-handler.ts | Implements three parallelization optimizations to reduce step execution latency by overlapping independent async operations |
| .changeset/step-handler-parallelization.md | Documents the patch-level change for release notes |
- Parallelize getPort(), getSpanKind(), and world.steps.get() calls
- Start step_started event creation while hydrating arguments (CPU work)
- Parallelize step_completed event with trace serialization
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverts Optimization 1 to fix a race condition where hydrateStepArguments() could throw before stepStartedPromise was awaited, causing stale step.attempt in the catch handler and potentially allowing extra retries. Optimizations 0 and 2 are preserved as they don't have this issue. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
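The race described in this commit can be reproduced in miniature. The sketch below is an assumption-laden model, not the real handler: `Step`, `createStepStarted`, and the JSON-based `hydrateStepArguments` are simplified stand-ins, and each function returns the attempt count that its error handler would observe.

```typescript
interface Step {
  attempt: number;
}

// Stand-in for the step_started call: the server increments the attempt counter.
async function createStepStarted(step: Step): Promise<Step> {
  await Promise.resolve();
  return { attempt: step.attempt + 1 };
}

// Stand-in for hydration: throws synchronously on malformed input.
function hydrateStepArguments(raw: string): unknown {
  return JSON.parse(raw);
}

// Overlapped: step_started is in flight while hydration runs. If hydration
// throws, the catch block runs before the updated step was ever read.
async function attemptSeenOverlapped(step: Step, rawArgs: string): Promise<number> {
  let current = step;
  const started = createStepStarted(step);
  try {
    hydrateStepArguments(rawArgs);
    current = await started;
    return current.attempt;
  } catch {
    return current.attempt; // stale: still the pre-start value
  }
}

// Ordered: step_started is awaited first, so the catch block sees the
// incremented attempt even when hydration throws.
async function attemptSeenOrdered(step: Step, rawArgs: string): Promise<number> {
  let current = step;
  try {
    current = await createStepStarted(step);
    hydrateStepArguments(rawArgs);
    return current.attempt;
  } catch {
    return current.attempt;
  }
}
```

With malformed arguments such as `"{"`, the overlapped variant reports the old attempt count while the ordered variant reports the incremented one, which is why a retry limit based on that count could be exceeded in the overlapped version.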
Eliminate the world.steps.get() HTTP call by calling step_started first and relying on server-side validation. This saves 50-80ms per step execution by removing one HTTP round-trip.
The server (workflow-server) now validates:
- Step not in terminal state (returns 409)
- retryAfter timestamp reached (returns 425 with Retry-After header)
- Workflow still active (returns 410 if completed)
Changes:
- Remove world.steps.get() from initial Promise.all
- Call step_started first to get step entity and validate state
- Handle 409 (terminal state) by re-queueing workflow
- Handle 425 (retryAfter not reached) by returning timeout
- Handle 410 (workflow gone) as no-op
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
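The status handling listed in this commit can be summarized as a small dispatch function. This is a sketch under assumptions: the `StepStartOutcome` type, the action names, and `handleStepStartedStatus` are hypothetical and only mirror the three cases described above.

```typescript
// Hypothetical outcome type; the real handler's return values may differ.
type StepStartOutcome =
  | { action: "execute" }
  | { action: "requeue-workflow" }
  | { action: "retry-later"; timeoutSeconds: number }
  | { action: "noop" };

function handleStepStartedStatus(
  status: number,
  retryAfterSeconds: number
): StepStartOutcome {
  switch (status) {
    case 409:
      // Step is already in a terminal state: re-queue the workflow.
      return { action: "requeue-workflow" };
    case 425:
      // retryAfter has not been reached: come back after the timeout.
      return { action: "retry-later", timeoutSeconds: retryAfterSeconds };
    case 410:
      // Workflow is gone: nothing left to do.
      return { action: "noop" };
    default:
      if (status >= 200 && status < 300) {
        return { action: "execute" };
      }
      throw new Error(`Unexpected step_started status: ${status}`);
  }
}
```

Because the server now performs the validation, the client no longer needs a separate read before deciding what to do; the response status alone selects the branch.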
Add server-side retryAfter validation to match workflow-server behavior:
- Check retryAfter timestamp before allowing step_started
- Return HTTP 425 with retryAfter timestamp in response meta
- Clear retryAfter field when step starts successfully
This ensures consistent behavior across all world implementations and allows the step-handler optimization to work correctly.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
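A minimal sketch of the validation described in this commit, assuming a simplified step record. `StepRecord`, `StartResult`, and `validateStepStart` are illustrative names, not the world implementations' actual types.

```typescript
// Simplified, assumed shape of a stored step.
interface StepRecord {
  status: "pending" | "running";
  retryAfter?: Date;
}

type StartResult =
  | { ok: true; step: StepRecord }
  | { ok: false; status: 425; meta: { retryAfter: string } };

function validateStepStart(step: StepRecord, now: Date): StartResult {
  // Reject with 425 while the retryAfter timestamp is still in the future.
  if (step.retryAfter !== undefined && step.retryAfter.getTime() > now.getTime()) {
    return {
      ok: false,
      status: 425,
      meta: { retryAfter: step.retryAfter.toISOString() },
    };
  }
  // Otherwise start the step and clear retryAfter.
  const started: StepRecord = { ...step, status: "running" };
  delete started.retryAfter;
  return { ok: true, step: started };
}
```

Passing `now` as a parameter keeps the check deterministic, which makes the early and due cases easy to test.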
5d150b3 to cc840ff
Aligns local and postgres worlds with workflow-server, which returns 409 via InvalidOperationStateError for step in terminal state. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The world-vercel package was creating spans under a separate 'workflow-world-vercel' service name, causing HTTP spans for workflow-server API calls (step_started, step_completed) to be filtered out when viewing traces for the main application service. Now uses the same 'workflow' tracer name as @workflow/core to ensure all spans are reported under the parent application's service.
…-parallelization
* origin/main:
  Add x-workflow-run-id header to queue messages (#922)
  Bump Next.js and React in workbenches (#944)
  Add subpath export resolution for package IDs (#901)
  Consolidate console logging to structured logger utility (#935)
# Conflicts:
#	packages/core/src/runtime/step-handler.ts
- Start Jaeger container for local trace visualization
- Configure OTEL exporter environment variables for dev server
- Open Jaeger UI automatically
- Add documentation about available trace attributes
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add telemetry.ts and instrumentObject.ts to world-local for tracing parity with world-vercel (world.runs, world.steps, world.events, world.hooks spans)
- Change workflow span name from uppercase "WORKFLOW" to lowercase "workflow" for consistency with step spans and OTEL naming conventions
- Add step.execute child span to trace actual user step function execution separately from step handler infrastructure
These changes enable local development to have the same observability as production deployments, making performance analysis and debugging easier.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…spans
HTTP spans use uppercase methods like "GET /path" and "POST /path". Following the same convention, workflow and step spans now use:
- WORKFLOW <workflow-name>
- STEP <step-name>
Child spans (workflow.run, workflow.loadEvents, step.execute, world.events.create) remain lowercase as they represent internal operations, not top-level entries.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
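The naming convention above amounts to an uppercase kind prefix followed by the entity name. A small sketch, where `topLevelSpanName` is a hypothetical helper and the example names are made up:

```typescript
type TopLevelKind = "workflow" | "step";

// Builds a top-level span name in the same shape as HTTP spans ("GET /path").
function topLevelSpanName(kind: TopLevelKind, name: string): string {
  return `${kind.toUpperCase()} ${name}`;
}

// Example names are illustrative only.
const workflowSpan = topLevelSpanName("workflow", "onboardUser");
const stepSpan = topLevelSpanName("step", "sendWelcomeEmail");
```

Child spans such as `step.execute` keep their lowercase dotted names and do not go through a helper like this.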
Include traceparent and tracestate headers when queueing step execution messages. This enables automatic trace propagation by Vercel's infrastructure, potentially linking step invocation spans to the parent workflow trace.
The trace carrier is now serialized once and included in both:
- Payload: for manual context restoration in step handler
- Headers: for automatic HTTP-based trace propagation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
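The "serialize once, attach twice" approach might look like the sketch below. The `TraceCarrier` and `QueueMessage` shapes and the `buildStepMessage` helper are assumptions for illustration; only the `traceparent` and `tracestate` header names come from the W3C Trace Context format.

```typescript
// Assumed shape of a serialized trace carrier.
interface TraceCarrier {
  traceparent: string;
  tracestate?: string;
}

// Assumed shape of a queue message; the real message has more fields.
interface QueueMessage {
  headers: Record<string, string>;
  payload: { stepId: string; traceCarrier: TraceCarrier };
}

function buildStepMessage(stepId: string, carrier: TraceCarrier): QueueMessage {
  // Headers: picked up by HTTP-based propagation.
  const headers: Record<string, string> = { traceparent: carrier.traceparent };
  if (carrier.tracestate !== undefined) {
    headers.tracestate = carrier.tracestate;
  }
  // Payload: lets the step handler restore context manually.
  return { headers, payload: { stepId, traceCarrier: carrier } };
}

// traceparent layout: version-traceid-spanid-flags (sample value from the W3C spec).
const message = buildStepMessage("step_1", {
  traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
});
```

Reusing the same carrier object for both locations guarantees the header and payload can never disagree about the parent span.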
…ation
- Add peer.service attributes for workflow-server and VQS for Datadog service maps
- Rename queueMessage span to queue.publish for consistency
- Add step.hydrate, step.dehydrate, and workflow.replay spans
- Include event type in world.events.create span names (e.g., "world.events.create step_started")
- Add span.recordException() for errors with category classification (fatal/retryable/transient)
- Add span events for milestones: retry.scheduled, step.skipped, step.delayed
- Add HTTP semantic conventions with peer.service for world-vercel HTTP calls
- Add baggage propagation for workflow context (run_id, workflow_name)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
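For readers unfamiliar with baggage, the wire format is a comma-separated list of percent-encoded key=value pairs in a `baggage` header. The sketch below shows that format with plain string handling; it is not the PR's implementation, which presumably goes through the OpenTelemetry propagation API, and the helper names and sample values are hypothetical.

```typescript
// Serializes entries into a W3C-style baggage header value.
function serializeBaggage(entries: Record<string, string>): string {
  return Object.entries(entries)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(",");
}

// Parses a baggage header value, ignoring optional ";property" suffixes.
function parseBaggage(header: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const member of header.split(",")) {
    const pair = member.split(";")[0] ?? "";
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed members
    const key = pair.slice(0, eq).trim();
    result[key] = decodeURIComponent(pair.slice(eq + 1).trim());
  }
  return result;
}

// Sample values are made up for illustration.
const baggageHeader = serializeBaggage({
  "workflow.run_id": "run_123",
  "workflow.name": "send welcome email",
});
```

Unlike span attributes, baggage travels with the request, so downstream services can read the run ID and workflow name without access to the parent span.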
- step-handler-parallelization.md: Add race condition fix and 409 status code fix
- world-vercel-telemetry-tracer.md: Add peer.service and event type in span names
- otel-tracing-improvements.md: New changeset for comprehensive OTEL improvements
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Datadog derives the resource name from rpc.method attribute. Updated to use the full span name (which includes event type) instead of just the method name, so Datadog shows "world.events.create step_started" instead of "world.events.create workflow-server". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Skip expensive S3 ref resolution (~200-460ms) for event types where the client doesn't use the response entity data (step_created, step_completed, step_failed, run_completed, etc). Only resolve refs for run_created, run_started, and step_started where the client reads the response. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
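The selection logic described here can be sketched as a set lookup. The set name `eventsNeedingResolve` and the `'resolve'`/`'lazy'` values appear in this PR; the `RemoteRefBehavior` type alias and the `remoteRefBehaviorFor` helper are hypothetical.

```typescript
type RemoteRefBehavior = "resolve" | "lazy";

// Event types whose response entity the client actually reads.
const eventsNeedingResolve = new Set<string>([
  "run_created",
  "run_started",
  "step_started",
]);

// Every other event type can skip ref resolution on the server.
function remoteRefBehaviorFor(eventType: string): RemoteRefBehavior {
  return eventsNeedingResolve.has(eventType) ? "resolve" : "lazy";
}
```

Listing the events that need resolution, rather than the ones that do not, means a newly added event type defaults to the cheaper `'lazy'` path; whether that default is the right one for a given event is a judgment call for the author.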
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
}
// ============================================================
// Baggage Propagation Utilities
TIL about OTEL baggage. Naming is hard
// Events where the client uses the response entity data need 'resolve' (default).
// Events where the client discards the response can use 'lazy' to skip expensive
// S3 ref resolution on the server, saving ~200-460ms per event.
const eventsNeedingResolve = new Set([
Could be moved outside of the function, maybe even to the world package or similar, since the way we expect event responses to look is consistent across worlds. Could be helpful as a reference for third-party worlds, mostly, if the comment strings are framed appropriately.
Separately, it looks like we're missing fine-grained control for resolution (e.g. resolve run but not step, resolve step but not run, etc.), which could be a future optimization
Moved to module scope — good call, no reason to recreate it on every call.
Moving this to @workflow/world so all world implementations can share it, and adding fine-grained per-entity resolution control (e.g. resolve run but not step) are both great follow-up ideas. Will track those separately.
  throw new WorkflowAPIError(
    `Cannot modify step in terminal state "${validatedStep.status}"`,
-   { status: 410 }
+   { status: 409 }
A lot of other similar conflicts in the local world still throw 410. Should those be changed too?
Good question! I audited all three implementations (workflow-server, world-local, world-postgres) to verify status code consistency.
Summary of the error semantics:
| Condition | workflow-server error | HTTP Status | Meaning |
|---|---|---|---|
| Run state transition on terminal run | InvalidOperationStateError | 409 Conflict | Can't re-complete/re-fail a finished run |
| Creating step/hook on terminal run | InvalidOperationStateError | 409 Conflict | Can't create new work on a finished run |
| Step in terminal state | InvalidOperationStateError | 409 Conflict | Step already completed/failed |
| Non-running step on terminal run | Not applicable (falls through) | 410 Gone | Run is done, step wasn't even started |
| Run not in 'running' state | WorkflowNotRunningError | 410 Gone | Run hasn't started or is gone |
| retryAfter not reached | RetryAfterNotReachedError | 425 Too Early | Wait before retrying |
| Step/run/hook not found | EntityNotFoundError | 404 Not Found | Entity doesn't exist |
Findings: Both world-local and world-postgres were using 410 (Gone) for cases that should be 409 (Conflict), specifically:
- "Cannot transition run from terminal state": should be 409 (matches InvalidOperationStateError in workflow-server)
- "Cannot create new entities on terminal run": should be 409 (same reason)
- step_completed/step_failed fallbacks for step in terminal state (world-postgres only): should be 409
The remaining 410s are correct:
- "Cannot modify non-running step on run in terminal state" = the run is gone, 410 is appropriate (matches WorkflowNotRunningError semantics)
All mismatches have been fixed in this PR.
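For reference, the error-to-status mapping in the table above can be captured in a single lookup. The error names and status codes are taken from the table; the `STATUS_BY_ERROR` constant and `statusFor` helper are hypothetical and not part of any of the three implementations.

```typescript
// Status codes per workflow-server error, as listed in the audit table.
const STATUS_BY_ERROR = {
  InvalidOperationStateError: 409, // Conflict: entity is already in a terminal state
  WorkflowNotRunningError: 410, // Gone: run hasn't started or is gone
  RetryAfterNotReachedError: 425, // Too Early: wait before retrying
  EntityNotFoundError: 404, // Not Found: entity doesn't exist
} as const;

type KnownErrorName = keyof typeof STATUS_BY_ERROR;

function statusFor(errorName: KnownErrorName): number {
  return STATUS_BY_ERROR[errorName];
}
```

A shared table like this is one way a world implementation could avoid drifting from workflow-server's codes, since each throw site would look up the status rather than hard-code it.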
…us code audit
- Move eventsNeedingResolve Set to module scope (PR review #2)
- Rewrite changelog-style OPTIMIZATION comments as current-state docs (PR review #4)
- Fix run terminal state errors: 410 → 409 in world-local and world-postgres to match workflow-server's InvalidOperationStateError (409) (PR review #3)
- Fix remaining step terminal state 410 → 409 in world-postgres fallback paths
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR includes performance optimizations for the step handler and comprehensive OpenTelemetry tracing improvements.
Performance Optimizations
- Remove the world.steps.get() call and rely on step_started to return the step entity
- Run step_completed event creation and serializeTraceCarrier() concurrently
- Use remoteRefBehavior: 'lazy' for event types where the client discards the response data (e.g. step_created, step_completed, step_failed, run_completed), skipping expensive S3 ref resolution (~200-460ms savings per event)
Bug Fixes
- Await step_started before hydration to ensure correct attempt count in error handlers
OpenTelemetry Improvements
- Add traceparent/tracestate headers to step queue messages for cross-service trace linking
- Add peer.service and RPC semantic conventions for Datadog service maps (workflow-server, VQS)
- Add step.hydrate, step.dehydrate, workflow.replay spans
- Rename queueMessage → queue.publish, use uppercase WORKFLOW/STEP span names
- Add baggage propagation for workflow context (workflow.run_id, workflow.name)
- Add span events: retry.scheduled, step.skipped, step.delayed
- Add recordException() with error categorization (fatal/retryable/transient)
- world.events.create spans now include event type (e.g., world.events.create step_started)
Estimated savings: 50-80ms per step (one fewer HTTP round-trip) + 200-460ms per fire-and-forget event (skip S3 ref resolution)
Related
Test plan
- pnpm test (304 tests passing)
Observability Impact
After these changes, Datadog traces will show: