8 changes: 8 additions & 0 deletions .changeset/throttle-retry-handling.md
@@ -0,0 +1,8 @@
---
"@workflow/errors": patch
"@workflow/world": patch
"@workflow/world-vercel": patch
"@workflow/core": patch
---

Add 429 throttle retry handling and 500 server error retry with exponential backoff to the workflow and step runtimes
382 changes: 213 additions & 169 deletions packages/core/src/runtime.ts

Large diffs are not rendered by default.
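
The runtime.ts diff above is where the 500-level retry described in the changeset lives, so the mechanism itself is not visible here. A minimal sketch of what that path could look like, assuming it is built on the payload's new serverErrorRetryCount field and the queue's delaySeconds option; MAX_SERVER_ERROR_RETRIES, BASE_DELAY_SECONDS, and serverErrorBackoff are illustrative names, not code from this PR:

```ts
import { WorkflowAPIError } from '@workflow/errors';

// Illustrative constants; the real limits live in the unrendered runtime.ts diff.
const MAX_SERVER_ERROR_RETRIES = 5;
const BASE_DELAY_SECONDS = 5;

// Returns the fields a caller could use to re-enqueue the same message with a
// delay, or undefined when the error is not retryable this way.
export function serverErrorBackoff(
  err: unknown,
  payload: { serverErrorRetryCount?: number }
): { delaySeconds: number; serverErrorRetryCount: number } | undefined {
  // Only 5xx responses from the workflow-server API are retried here.
  if (!WorkflowAPIError.is(err) || (err.status ?? 0) < 500) return undefined;
  const attempt = payload.serverErrorRetryCount ?? 0;
  if (attempt >= MAX_SERVER_ERROR_RETRIES) return undefined;
  // Exponential backoff: 5s, 10s, 20s, 40s, ...
  return {
    delaySeconds: BASE_DELAY_SECONDS * 2 ** attempt,
    serverErrorRetryCount: attempt + 1,
  };
}
```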

71 changes: 70 additions & 1 deletion packages/core/src/runtime/helpers.ts
@@ -1,3 +1,4 @@
import { WorkflowAPIError } from '@workflow/errors';
import type {
Event,
HealthCheckPayload,
@@ -6,6 +7,7 @@ import type {
} from '@workflow/world';
import { HealthCheckPayloadSchema } from '@workflow/world';
import { monotonicFactory } from 'ulid';
import { runtimeLogger } from '../logger.js';
import * as Attribute from '../telemetry/semantic-conventions.js';
import { getSpanKind, trace } from '../telemetry.js';
import { getWorld } from './world.js';
@@ -17,7 +19,7 @@ const DEFAULT_HEALTH_CHECK_TIMEOUT = 30_000;
* Pattern for safe workflow names. Only allows alphanumeric characters,
* underscores, hyphens, dots, and forward slashes (for namespaced workflows).
*/
const SAFE_WORKFLOW_NAME_PATTERN = /^[a-zA-Z0-9_\-.\/]+$/;
const SAFE_WORKFLOW_NAME_PATTERN = /^[a-zA-Z0-9_\-./]+$/;
Member Author: I'm confused by this, but apparently it works?

Collaborator: Yeah, I double-checked it. Seems to work fine.


/**
* Validates a workflow name and returns the corresponding queue name.
@@ -398,3 +400,70 @@ export function getQueueOverhead(message: { requestedAt?: Date }) {
return;
}
}

/**
* Wraps a queue handler with HTTP 429 throttle retry logic.
* - retryAfter < 10s: waits in-process via setTimeout, then retries once
* - retryAfter >= 10s: returns { timeoutSeconds } to defer to the queue
*
* Safe to retry the entire handler because 429 is sent from server middleware
* before the request is processed — no server state has changed.
*/
Collaborator:
The comment says "Safe to retry the entire handler because 429 is sent from server middleware before the request is processed — no server state has changed."

This holds if the 429 always hits on the first API call in the handler. But the handler makes multiple world API calls sequentially (e.g. runs.get → events.create(run_started) → replay → events.create(run_completed)). If a later call gets 429'd, the retry re-executes everything from the top.

For the workflow handler this is probably fine, since replay is deterministic and events are idempotent.

For the step handler this is more concerning — the retry would re-execute user step code, which may not be idempotent. Is the assumption that the workflow-server's rate-limiting middleware rejects at the connection level (so all calls in one handler invocation either succeed or all fail)? If so, worth documenting that assumption.

Member Author: Good catch, fixed.

// biome-ignore lint/suspicious/noConfusingVoidType: matches Queue handler return type
export async function withThrottleRetry(
fn: () => Promise<void | { timeoutSeconds: number }>
): Promise<void | { timeoutSeconds: number }> {
try {
return await fn();
} catch (err) {
if (WorkflowAPIError.is(err) && err.status === 429) {
const retryAfterSeconds = Math.max(
// If we don't have a retry-after value, 30s seems a reasonable default
// to avoid re-trying during the unknown rate-limiting period.
1,
typeof err.retryAfter === 'number' ? err.retryAfter : 30
);

if (retryAfterSeconds < 10) {
runtimeLogger.warn(
'Throttled by workflow-server (429), retrying in-process',
{
Collaborator: Nit: if retryAfter is undefined (header missing), this defaults to 1 second. For 429 responses without a Retry-After header, 1 second might be too aggressive. A slightly higher default (e.g., 3-5s) would be more conservative and reduce the chance of hitting the server again immediately while it's under load.

retryAfterSeconds,
url: err.url,
}
);
// Short wait: sleep in-process, then retry once
await new Promise((resolve) =>
setTimeout(resolve, retryAfterSeconds * 1000)
Collaborator: Should we account for function execution time limits, specifically in the case of the Vercel world? If the serverless function is already close to the end of its limit and the workflow server throws a 429, adding a 10-second sleep could exceed the function execution limit and the function could get SIGKILLed midway.

Member Author: The workflow layer should never take much more than a few seconds, so it's highly unlikely we'd run into timeouts. I'm not too worried about this, but it's technically a concern.

);
try {
return await fn();
} catch (retryErr) {
// If the retry also gets throttled, defer to queue
if (WorkflowAPIError.is(retryErr) && retryErr.status === 429) {
const retryRetryAfter = Math.max(
1,
typeof retryErr.retryAfter === 'number' ? retryErr.retryAfter : 1
);
runtimeLogger.warn('Throttled again on retry, deferring to queue', {
retryAfterSeconds: retryRetryAfter,
});
return { timeoutSeconds: retryRetryAfter };
}
throw retryErr;
}
}

// Long wait: defer to queue infrastructure
runtimeLogger.warn(
'Throttled by workflow-server (429), deferring to queue',
{
retryAfterSeconds,
url: err.url,
}
);
return { timeoutSeconds: retryAfterSeconds };
}
throw err;
}
}
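
For context on how this helper is meant to be used, a hedged sketch of wrapping a handler body; processWorkflowMessage and the surrounding wiring are placeholders, not code from this PR:

```ts
import { withThrottleRetry } from './helpers.js';

// Placeholder for the real workflow message handler body.
declare function processWorkflowMessage(
  payload: unknown
): Promise<void | { timeoutSeconds: number }>;

// Returning { timeoutSeconds } asks the queue to redeliver the message after
// a delay; returning nothing means the message was handled.
const handleWorkflowMessage = (payload: unknown) =>
  withThrottleRetry(() => processWorkflowMessage(payload));
```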
13 changes: 13 additions & 0 deletions packages/core/src/runtime/step-handler.ts
@@ -126,6 +126,19 @@ const stepHandler = getWorldHandlers().createQueueHandler(
step = startResult.step;
} catch (err) {
if (WorkflowAPIError.is(err)) {
if (WorkflowAPIError.is(err) && err.status === 429) {
Collaborator: The step handler has 429 handling here inside the step_started catch block, but unlike the workflow handler, it does not use withThrottleRetry and does not have the short-wait in-process retry path. This means a brief 429 (e.g., retryAfter=2s) will always defer to the queue rather than sleeping in-process. Is that intentional? If steps should also get the in-process retry for short waits, consider wrapping with withThrottleRetry similarly to the workflow handler.

const retryRetryAfter = Math.max(
1,
typeof err.retryAfter === 'number' ? err.retryAfter : 1
);
runtimeLogger.warn(
'Throttled again on retry, deferring to queue',
{
Collaborator: Redundant check: this is already inside if (WorkflowAPIError.is(err)), so the second WorkflowAPIError.is(err) on this line is always true. Should just be if (err.status === 429).

retryAfterSeconds: retryRetryAfter,
}
);
return { timeoutSeconds: retryRetryAfter };
}
// 410 Gone: Workflow has already completed
if (err.status === 410) {
console.warn(
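A sketch of the simplification the reviewers are suggesting, i.e. rely on the outer WorkflowAPIError.is(err) narrowing and only check the status; this is illustrative, not the code as merged:

```ts
import { WorkflowAPIError } from '@workflow/errors';

// Inside the step_started catch block, `err` is already narrowed by the outer
// WorkflowAPIError.is(err) check, so only the status needs testing here.
function deferIfThrottled(
  err: WorkflowAPIError
): { timeoutSeconds: number } | undefined {
  if (err.status !== 429) return undefined;
  // Use the server-provided Retry-After when present, with a 1-second floor.
  return {
    timeoutSeconds: Math.max(
      1,
      typeof err.retryAfter === 'number' ? err.retryAfter : 1
    ),
  };
}
```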
11 changes: 10 additions & 1 deletion packages/errors/src/index.ts
@@ -101,10 +101,18 @@ export class WorkflowAPIError extends WorkflowError {
status?: number;
code?: string;
url?: string;
/** Retry-After value in seconds, present on 429 responses */
retryAfter?: number;

constructor(
message: string,
options?: { status?: number; url?: string; code?: string; cause?: unknown }
options?: {
status?: number;
url?: string;
code?: string;
retryAfter?: number;
cause?: unknown;
}
) {
super(message, {
cause: options?.cause,
@@ -113,6 +121,7 @@
this.status = options?.status;
this.code = options?.code;
this.url = options?.url;
this.retryAfter = options?.retryAfter;
}

static is(value: unknown): value is WorkflowAPIError {
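A minimal consumer-side sketch of the new retryAfter field; callWorldApi is a placeholder for any request that goes through the world API client:

```ts
import { WorkflowAPIError } from '@workflow/errors';

// Placeholder for any call that may be rejected with a 429.
declare function callWorldApi(): Promise<void>;

async function example(): Promise<void> {
  try {
    await callWorldApi();
  } catch (err) {
    if (WorkflowAPIError.is(err) && err.status === 429) {
      // retryAfter is only set when the response carried a numeric
      // Retry-After header, so callers still need their own fallback.
      const waitSeconds = err.retryAfter ?? 30;
      console.warn(`Throttled, backing off for ${waitSeconds}s`);
    }
  }
}
```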
4 changes: 2 additions & 2 deletions packages/world-vercel/src/queue.ts
@@ -51,7 +51,7 @@ const MAX_DELAY_SECONDS = Number(
type QueueFunction = (
queueName: ValidQueueName,
payload: QueuePayload,
opts?: QueueOptions & { delaySeconds?: number }
opts?: QueueOptions
) => ReturnType<Queue['queue']>;

export function createQueue(config?: APIConfig): Queue {
@@ -71,7 +71,7 @@ export function createQueue(config?: APIConfig): Queue {
const queue: QueueFunction = async (
queueName,
payload,
opts?: QueueOptions & { delaySeconds?: number }
opts?: QueueOptions
) => {
// Check if we have a deployment ID either from options or environment
const deploymentId = opts?.deploymentId ?? process.env.VERCEL_DEPLOYMENT_ID;
27 changes: 20 additions & 7 deletions packages/world-vercel/src/utils.ts
@@ -6,18 +6,18 @@ import { type StructuredError, StructuredErrorSchema } from '@workflow/world';
import { decode, encode } from 'cbor-x';
import type { z } from 'zod';
import {
trace,
ErrorType,
getSpanKind,
HttpRequestMethod,
HttpResponseStatusCode,
UrlFull,
PeerService,
RpcService,
RpcSystem,
ServerAddress,
ServerPort,
ErrorType,
trace,
UrlFull,
WorldParseFormat,
PeerService,
RpcSystem,
RpcService,
Collaborator: Note: the HTTP Retry-After header can also be an HTTP-date (e.g., Wed, 21 Oct 2015 07:28:00 GMT). The parseInt approach correctly returns NaN for date values, falling back to undefined. This is fine since the workflow-server likely only sends numeric values, but a comment noting this would be helpful.

} from './telemetry.js';
import { version } from './version.js';

@@ -292,10 +292,23 @@ export async function makeRequest<T>({
`Failed to fetch, reproduce with:\ncurl -X ${request.method} ${stringifiedHeaders} "${url}"`
);
}

// Parse Retry-After header for 429 responses (value is in seconds)
let retryAfter: number | undefined;
if (response.status === 429) {
const retryAfterHeader = response.headers.get('Retry-After');
if (retryAfterHeader) {
const parsed = parseInt(retryAfterHeader, 10);
if (!Number.isNaN(parsed)) {
retryAfter = parsed;
}
}
}

const error = new WorkflowAPIError(
errorData.message ||
`${request.method} ${endpoint} -> HTTP ${response.status}: ${response.statusText}`,
{ url, status: response.status, code: errorData.code }
{ url, status: response.status, code: errorData.code, retryAfter }
);
// Record error attributes per OTEL conventions
span?.setAttributes({
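Following up on the reviewer note about HTTP-date values: the PR's parsing intentionally handles only the delta-seconds form. If date support ever becomes necessary, a broader parser could look roughly like this (a sketch, not part of this change):

```ts
// Accepts both "120" (delta-seconds) and "Wed, 21 Oct 2015 07:28:00 GMT"
// (HTTP-date) forms of the Retry-After header.
function parseRetryAfterSeconds(header: string | null): number | undefined {
  if (!header) return undefined;
  const seconds = Number.parseInt(header, 10);
  if (!Number.isNaN(seconds)) return seconds;
  const dateMs = Date.parse(header);
  if (Number.isNaN(dateMs)) return undefined;
  // Convert the absolute date into a relative delay, clamped at zero.
  return Math.max(0, Math.ceil((dateMs - Date.now()) / 1000));
}
```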
4 changes: 4 additions & 0 deletions packages/world/src/queue.ts
@@ -25,6 +25,8 @@ export const WorkflowInvokePayloadSchema = z.object({
runId: z.string(),
traceCarrier: TraceCarrierSchema.optional(),
requestedAt: z.coerce.date().optional(),
/** Number of times this message has been re-enqueued due to server errors (5xx) */
serverErrorRetryCount: z.number().int().optional(),
});

export const StepInvokePayloadSchema = z.object({
@@ -60,6 +62,8 @@ export interface QueueOptions {
deploymentId?: string;
idempotencyKey?: string;
headers?: Record<string, string>;
/** Delay message delivery by this many seconds */
delaySeconds?: number;
Collaborator: Good that this was promoted from the Vercel-specific type to the shared interface. Note that world-local currently ignores delaySeconds entirely — so 5xx retries in local dev will fire immediately instead of with backoff. Not a blocker, but worth a follow-up or at minimum a comment.

Member Author: Seems fine to ignore for local world, since there are no 429s.

Collaborator: We probably do want to support delaySeconds in local world anyway, as we start using this option in queue more often for more use cases. Even if it's not implemented in this PR, I think we should leave an explicit comment that local world ignores it, since it's a nuance. We should later have an e2e test that checks this behaviour and would fail on local world without a proper delaySeconds implementation (cc @TooTallNate).

}

export interface Queue {
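On the world-local point above: a minimal sketch of how an in-memory queue could honor delaySeconds, assuming QueueOptions is exported from @workflow/world and deliver() stands in for whatever local delivery function exists; none of this is code from world-local:

```ts
import type { QueueOptions } from '@workflow/world';

type Deliver = () => Promise<void>;

// Delaying local delivery keeps dev behaviour close to the hosted queue;
// without it, 5xx backoff retries fire immediately in local dev.
function enqueueLocal(deliver: Deliver, opts?: QueueOptions): void {
  const delayMs = (opts?.delaySeconds ?? 0) * 1000;
  setTimeout(() => {
    void deliver().catch((err) => {
      console.error('local queue delivery failed', err);
    });
  }, delayMs);
}
```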