Only error tasks if the failure is within the task implementation #300

nsaje · 2023-09-27T10:02:06Z

Errors like TaskImportError or errors stemming from a broken child context manager shouldn't move the task to the ERROR state - it should remain in QUEUED.

This way when we somehow break TaskTiger outside of the task implementation we don't need to worry about which tasks should be retried and make recovery far easier.

To implement this we'd need a way for the child process to signal an "abort". Currently we return exit code 0 in case of success and 1 in case of error. We could add exit code 2 to signify an abortion of task processing, in which case the parent process could move the task back into the QUEUED state.

WDYT @thomasst ?

The text was updated successfully, but these errors were encountered:

tsx · 2023-10-02T08:47:55Z

Not all TaskImportErrors are equal. There are two kinds:

When we expect the task to become importable on its own - this typically happens as race conditions during deployments, for example when a deployment that schedules a new task was rolled out but the workers that handle that task are not aware of it just yet. These will recover automatically, so leaving them in Queued state would be fine.
When the task reference is actually broken, like it was removed or moved and there were in-flight tasks that reference the old (now not existing) name. These need manual intervention and leaving them in Queued means we'll ignore a real problem.

Unless we have a reliable way to distinguish between the two kinds of failures (honestly I don't see how a worker can know whether the task is expected to come in the future or not - we don't have the crystal balls necessary for that), I wouldn't recommend leaving those tasks as Queued.

That said, keeping a list of tasks that have failed with an import error, and retrying them on next worker restart (redeployment presumably) might make recovery (in both scenarios) easier by reducing the need for manually tracking down tasks and removing them.

As for the signalling mechanism - I don't think using exit codes is the best idea. I think the child process can just catch the appropriate ImportError exceptions and write some richer state/report to TT redis instead of trying to communicate through a 1-bit channel. It already stores the execution data from the forked child, so why not add it there.

nsaje · 2023-10-02T09:09:02Z

I think both of those cases would be better served by leaving the task in Queued state but still loudly logging an error (rollbar etc.). This way the manual intervention needed would be reduced to just fixing the issue in the code, whereas right now you also have to figure out exactly which tasks failed because of this issue and retry them manually.

The parent process handles the state/queue transitions so we do need to get that info back from the child somehow. Passing it via a Redis entry sounds fine as well, I don't have a strong opinion on it.

thomasst · 2023-10-04T16:06:09Z

Do you have an example where a broken child context manager caused a TaskImportError?

nsaje · 2023-10-04T17:55:23Z

No, there's an "or" in there :-) e.g. either TaskImportError or any error caused by the context managers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only error tasks if the failure is within the task implementation #300

Only error tasks if the failure is within the task implementation #300

nsaje commented Sep 27, 2023

tsx commented Oct 2, 2023

nsaje commented Oct 2, 2023

thomasst commented Oct 4, 2023

nsaje commented Oct 4, 2023

Only error tasks if the failure is within the task implementation #300

Only error tasks if the failure is within the task implementation #300

Comments

nsaje commented Sep 27, 2023

tsx commented Oct 2, 2023

nsaje commented Oct 2, 2023

thomasst commented Oct 4, 2023

nsaje commented Oct 4, 2023