
Retry failed jobs #21

Open
gravis opened this issue Sep 3, 2014 · 5 comments


gravis commented Sep 3, 2014

It would be nice to have the features that Sidekiq provides (https://github.com/mperham/sidekiq/wiki/Error-Handling), especially retrying failed jobs.
Something like:

"If you don't fix the bug within 25 retries (about 21 days), Sidekiq will stop retrying and move your job to the Dead Job Queue. You can fix the bug and retry the job manually anytime within the next 6 months using the Web UI."


cdrage commented Mar 9, 2015

Doesn't Resque do this though?

Or do a rescue in Go?

@rohit4813

Hi,
I am experimenting with the goworker library.
I have a requirement of stopping and starting jobs.

Is it possible with the current version? Can anyone tell any workaround for it?


mingan commented Nov 6, 2017

@rohit4813 The current implementation listens for a few signals; when it receives one, it stops picking up new jobs but lets the running jobs finish.

I'm not sure I understand exactly what you're trying to do, but here's our use: we have a scenario where each job is potentially quite long but has natural stopping points. For this, we create a channel in the main function that gets written into when a signal is received (basically the same code that is in goworker already). Then we create another channel, this time buffered (capacity = number of workers), and pass that channel to each worker. Workers then select from that channel at their natural stopping points. In a separate goroutine (kicked off from the main function), we read from the signals channel and write N times (= number of workers) to the workers' channel.

The whole flow looks like:

  1. The process receives a signal
  2. Both our signals channel and goworker's signals channels are written into
  3. Goworker stops enqueuing new jobs
  4. We copy the event N times
  5. When any worker finishes, it's handled normally
  6. When a worker gets to a checkpoint where it checks the channel, it returns and goworker takes care of it

@rohit4813

@mingan Thanks for the great explanation.

If I understand correctly, all the workers will read from the workers' channel (which gets populated from the signals channel)?

My use case is to stop one or more specific workers, say ones that have been running for a very long time; but with this approach, all the workers will stop once the signal is passed to the channel.

I can identify the worker on which the job is running for a very long time. How can I send the signal to this particular worker?

I hope this is not confusing; or am I missing something?


mingan commented Nov 6, 2017

@rohit4813 Our use case just creates breakpoints in long-running jobs so that when we need to restart the process, we don't have to wait (tens of) minutes for the whole job to finish.

If you needed to discriminate between workers, I guess you could do that by sending some meaningful value through the channel and then the worker would decide "this msg is meant for me, I'll stop" or "this is meant for the slow one over there, I can keep running". Though, I can't imagine the use case for such behaviour.
