Open
Description
When a workflow step performs CPU-intensive work (string manipulation, data processing) for more than ~30 seconds without I/O operations, Vercel Workflows triggers a phantom retry even though the original execution is still running successfully.
Reproduction
We have a step that:
- Fetches data from blob storage (I/O) ✓
- Processes ~25,000 anchors with string manipulation (~57 seconds of pure CPU)
- Uploads results to blob storage (I/O) ✓
The step runs for ~1 minute total. At ~37 seconds, while the CPU work is still in progress, a second execution starts.
Observed Behavior
From logs:
22:35:29 - First execution starts
22:36:06 - Second execution starts (~37s later, while first is still running!)
22:36:26 - First completes successfully
22:36:26 - "Failed to extend visibility for message XXX"
22:37:03 - Second execution fails: "Cannot set output - output already exists"
Key error:
Failed to extend visibility for message 019c25a5-7494-76bc-86d5-0165dcb91e3a:
Error [MessageNotAvailableError]: Message not available for processing
Analysis
- Vercel Workflows appears to use a message queue with a visibility timeout (~30s)
- During step execution, there's a mechanism to extend visibility
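- CPU-bound work without I/O prevents visibility extension from succeeding, presumably because synchronous work blocks the Node.js event loop so the extension call cannot run until the work finishes (consistent with the "Failed to extend visibility" error appearing only at 22:36:26, when the first execution completes)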
- When extension fails, the message becomes visible again → phantom retry starts
- Both executions race to complete → conflicts
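The analysis above matches how Node.js timers behave if the visibility extension is driven by a timer on the same event loop as the step. That is an assumption on our part, not something we have confirmed in the Workflows source. A minimal sketch of the effect, where `busyWait`, `sleep`, and `countTicksDuring` are hypothetical helpers and the interval callback stands in for the extension heartbeat:

```typescript
// Hypothetical stand-in for CPU-bound work: spins without yielding to the event loop.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // simulate synchronous string processing
  }
}

// Hypothetical stand-in for I/O-bound work: waits on a timer, leaving the event loop free.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(() => resolve(), ms));
}

// Counts how many times a 10 ms "heartbeat" timer fires while `work` runs.
// The heartbeat is an assumed model of a visibility-extension timer.
async function countTicksDuring(work: () => void | Promise<void>): Promise<number> {
  let ticks = 0;
  const timer = setInterval(() => {
    ticks++;
  }, 10);
  await work();
  clearInterval(timer);
  return ticks;
}

async function demo(): Promise<void> {
  const duringCpu = await countTicksDuring(() => busyWait(200));
  const duringIo = await countTicksDuring(() => sleep(200));
  console.log({ duringCpu, duringIo });
}

demo();
```

In this sketch the heartbeat never fires during the synchronous section, while it fires repeatedly during the timer-based wait. That would explain why the I/O-heavy step survives for 3+ minutes and the CPU-bound step does not.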
Comparison
Works fine (3+ minutes): processFiguresStep - constant I/O (fetching images, calling external APIs)
Fails (~37s): injectAnchorsAndDescriptionsStep - has a 57-second CPU-bound section
The difference isn't total duration - it's whether there's I/O activity for visibility extension.
Workarounds Attempted
- allowOverwrite: true on blob uploads - prevents the blob conflict but doesn't stop the retry
- HEAD request heartbeats between functions - doesn't help if a single function call takes >30s
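A further workaround we have not verified is to break the CPU-bound loop into chunks and yield to the event loop between them, so that any pending timers or network callbacks get a chance to run. The sketch below assumes the visibility extension runs on the same Node.js event loop as the step; `processInChunks` and `yieldToEventLoop` are hypothetical names, and the chunk size is arbitrary:

```typescript
// Yield to the Node.js event loop so pending timers and I/O callbacks can run.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setImmediate(() => resolve()));
}

// Applies `fn` to every item, pausing after each chunk instead of
// running the whole loop synchronously.
async function processInChunks<T, R>(
  items: readonly T[],
  fn: (item: T) => R,
  chunkSize = 500,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(fn(item));
    }
    await yieldToEventLoop(); // give other callbacks a turn
  }
  return results;
}

// Usage: stand-in data for the ~25,000 anchors mentioned above.
async function main(): Promise<void> {
  const anchors = Array.from({ length: 25_000 }, (_, i) => `anchor-${i}`);
  const processed = await processInChunks(anchors, (a) => a.toUpperCase());
  console.log(processed.length);
}

main();
```

This only helps if no single call to `fn` itself runs for longer than the timeout, and whether it is enough to keep the message invisible in Vercel Workflows is something we would like confirmed.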
Expected Behavior
- CPU-bound steps should not trigger phantom retries
- Visibility timeout should be configurable, or extension should work without I/O
- At minimum, document this limitation so developers know to split CPU-heavy steps
Environment
- Vercel Workflows (latest)
- Next.js 16.1.1
- Deployed on Vercel
Questions
- Is there a way to configure the visibility timeout?
- Is this expected behavior, or a bug in the visibility extension mechanism?
- What's the recommended pattern for CPU-intensive steps?