CPU-bound steps trigger phantom retries due to visibility timeout failure #925

@aluxf

Description

When a workflow step performs CPU-intensive work (string manipulation, data processing) for more than ~30 seconds without I/O operations, Vercel Workflows triggers a phantom retry even though the original execution is still running successfully.

Reproduction

We have a step that:

  1. Fetches data from blob storage (I/O) ✓
  2. Processes ~25,000 anchors with string manipulation (~57 seconds of pure CPU)
  3. Uploads results to blob storage (I/O) ✓

The step runs for ~1 minute total. At ~37 seconds, while the CPU work is still in progress, a second execution starts.

Observed Behavior

From logs:

22:35:29 - First execution starts
22:36:06 - Second execution starts (~37s later, while first is still running!)
22:36:26 - First completes successfully
22:36:26 - "Failed to extend visibility for message XXX"
22:37:03 - Second execution fails: "Cannot set output - output already exists"

Key error:

Failed to extend visibility for message 019c25a5-7494-76bc-86d5-0165dcb91e3a: 
Error [MessageNotAvailableError]: Message not available for processing
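
The intervals in the log above can be checked with a small helper. This is only an illustrative sketch: `toSeconds` and `elapsedSeconds` are hypothetical names, and the code assumes all timestamps fall on the same day.

```typescript
// Convert an "HH:MM:SS" timestamp into seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Seconds between two same-day timestamps.
function elapsedSeconds(start: string, end: string): number {
  return toSeconds(end) - toSeconds(start);
}

// First execution start -> second execution start
console.log(elapsedSeconds("22:35:29", "22:36:06")); // 37
// First execution start -> first execution completes
console.log(elapsedSeconds("22:35:29", "22:36:26")); // 57
// Second execution start -> second execution fails
console.log(elapsedSeconds("22:36:06", "22:37:03")); // 57
```

Both executions ran for about 57 seconds, and the phantom retry started 37 seconds into the first one.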

Analysis

  • Vercel Workflows appears to use a message queue with visibility timeout (~30s)
  • During step execution, there's a mechanism to extend visibility
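  • CPU-bound work without I/O prevents visibility extension from succeeding, presumably because the synchronous work blocks the event loop and the extension call cannot run until it finishes (the "Failed to extend visibility" error is logged in the same second the first execution completes)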
  • When extension fails, the message becomes visible again → phantom retry starts
  • Both executions race to complete → conflicts
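
One way to see why this could happen: in Node.js, timer callbacks cannot fire while synchronous code is running. The sketch below is not Vercel Workflows code. It assumes the visibility extension is driven by something like a timer on the same event loop as the step, which this issue has not confirmed. `busyWork` and `ticksDuringBusyWork` are hypothetical names.

```typescript
// Burn CPU synchronously for roughly `ms` milliseconds,
// standing in for the CPU-bound string manipulation.
function busyWork(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Start a "heartbeat" timer, run synchronous work, then report
// how many heartbeats fired while the work was in progress.
function ticksDuringBusyWork(busyMs: number, intervalMs: number): number {
  let ticks = 0;
  const timer = setInterval(() => {
    ticks += 1;
  }, intervalMs);
  busyWork(busyMs);
  clearInterval(timer);
  return ticks;
}

// Prints 0: the timer never gets a chance to run during the busy loop.
console.log(ticksDuringBusyWork(200, 10));
```

If the extension mechanism works this way, a 57-second synchronous section would starve it for the whole 57 seconds, regardless of how short the extension interval is.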

Comparison

Works fine (3+ minutes): processFiguresStep - constant I/O (fetching images, calling external APIs)

Fails (~37s): injectAnchorsAndDescriptionsStep - has a 57-second CPU-bound section

The difference isn't total duration - it's whether the step performs I/O often enough for the visibility extension to run.

Workarounds Attempted

  1. allowOverwrite: true on blob uploads - prevents blob conflict but doesn't stop the retry
  2. HEAD request heartbeats between functions - doesn't help if a single function call takes >30s
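
A variation we have not verified on Vercel Workflows is to yield inside the long-running function itself, rather than between functions. The sketch below processes items in batches and returns control to the event loop after each batch. `yieldToEventLoop`, `processInBatches`, and the batch size are hypothetical, and this only helps if the visibility extension runs on the same event loop as the step.

```typescript
// Yield to the event loop so pending timers and I/O callbacks can run.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Apply a synchronous `processOne` to every item, pausing after each
// batch so that the CPU-bound work never holds the event loop for long.
async function processInBatches<T, R>(
  items: readonly T[],
  processOne: (item: T) => R,
  batchSize: number,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    for (const item of items.slice(i, i + batchSize)) {
      results.push(processOne(item));
    }
    await yieldToEventLoop();
  }
  return results;
}
```

For the ~25,000 anchors in our step, this would look like `await processInBatches(anchors, injectAnchor, 500)`, where `injectAnchor` stands in for the per-anchor string manipulation. The batch size should be small enough that a single batch finishes well under the ~30 second timeout.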

Expected Behavior

  • CPU-bound steps should not trigger phantom retries
  • Visibility timeout should be configurable, or extension should work without I/O
  • At minimum, document this limitation so developers know to split CPU-heavy steps

Environment

  • Vercel Workflows (latest)
  • Next.js 16.1.1
  • Deployed on Vercel

Questions

  1. Is there a way to configure the visibility timeout?
  2. Is this expected behavior, or a bug in the visibility extension mechanism?
  3. What's the recommended pattern for CPU-intensive steps?
