-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only error tasks if the failure is within the task implementation #300
Comments
Not all
Unless we have a reliable way to distinguish between the two kinds of failures (honestly I don't see how a worker can know whether the task is expected to come in the future or not - we don't have the crystal balls necessary for that), I wouldn't recommend leaving those tasks as Queued. That said, keeping a list of tasks that have failed with an import error, and retrying them on next worker restart (redeployment presumably) might make recovery (in both scenarios) easier by reducing the need for manually tracking down tasks and removing them. As for the signalling mechanism - I don't think using exit codes is the best idea. I think the child process can just catch the appropriate ImportError exceptions and write some richer state/report to TT redis instead of trying to communicate through a 1-bit channel. It already stores the execution data from the forked child, so why not add it there. |
I think both of those cases would be better served by leaving the task in Queued state but still loudly logging an error (rollbar etc.). This way the manual intervention needed would be reduced to just fixing the issue in the code, whereas right now you also have to figure out exactly which tasks failed because of this issue and retry them manually. The parent process handles the state/queue transitions so we do need to get that info back from the child somehow. Passing it via a Redis entry sounds fine as well, I don't have a strong opinion on it. |
Do you have an example where a broken child context manager caused a |
No, there's an "or" in there :-) e.g. either TaskImportError or any error caused by the context managers. |
Errors like
TaskImportError
or errors stemming from a broken child context manager shouldn't move the task to the ERROR state - it should remain in QUEUED.This way when we somehow break TaskTiger outside of the task implementation we don't need to worry about which tasks should be retried and make recovery far easier.
To implement this we'd need a way for the child process to signal an "abort". Currently we return exit code 0 in case of success and 1 in case of error. We could add exit code 2 to signify an abortion of task processing, in which case the parent process could move the task back into the QUEUED state.
WDYT @thomasst ?
The text was updated successfully, but these errors were encountered: