Provide a configuration option to enable a "fail fast" development mode #1274

kkersten · 2023-01-30T20:18:44Z

Problem: the server can be configured in a way that causes an indefinitely hanging job
The current FLARE controller lets you set the minimum number of required clients along with a server timeout. When min_clients equals the total number of available clients and server_timeout=0, a single failed client causes the server workflow to hang indefinitely.

This behavior is useful for production use cases, where the server workflow should be resilient to temporary interruptions in client communication, allowing clients to fail temporarily and reconnect.

But when a client has failed and is unrecoverable, the server workflow should time out, independently of the controller workflow configuration. This would also allow a "development mode" in which any client failure causes the server workflow to terminate.

Potential solution
A separate server timeout configuration could be implemented independent of the controller configuration (for example in the server communication layer). This could be configured as a server job timeout, where

a timeout of 0 could trigger immediate failure (development mode)
a timeout of -1 (inf) would result in current behavior (production mode)
a positive timeout would make the server wait that long for a failed client before terminating the job, depending on your level of patience
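
To make the three cases concrete, here is a rough sketch of the decision the server could make. This is not existing FLARE code: the function name `should_abort_job`, its parameters, and the idea of tracking the time since a client failure are all hypothetical, chosen only to illustrate the proposed semantics.

```python
from typing import Optional


def should_abort_job(job_timeout: float,
                     time_since_failure: Optional[float]) -> bool:
    """Decide whether the server should terminate the job.

    Hypothetical helper, not part of FLARE.

    job_timeout        -- proposed server job timeout:
                          0  -> fail immediately (development mode)
                          <0 -> never time out (current behavior)
                          >0 -> wait this long for the client to recover
    time_since_failure -- time elapsed since a client was detected as
                          failed, or None while all clients are healthy
    """
    if time_since_failure is None:
        return False  # no client has failed, nothing to do
    if job_timeout < 0:
        return False  # -1 (inf): keep waiting, as the server does today
    # A timeout of 0 is reached as soon as any failure is detected.
    return time_since_failure >= job_timeout
```

With this, `should_abort_job(0, 0.0)` aborts right away, while `should_abort_job(-1, ...)` never does, whatever the elapsed time.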
