stackdriver: Crash when transient error or rate limiting happens. #89
Comments
Hi @philwo, could you please share how you got this to work in GCP? I am also using GCP, but I can't get my Horizontal Pod Autoscaler to work using the metrics from buildkite-agent metrics.
Hi @dmoxyeze, I'm sorry, I don't remember if I ever got this to work reliably. Whatever I tried at the time definitely didn't work well for auto-scaling. I think that was also because auto-scaling on GCP at the time didn't support a good way to signal which machines are "safe to shut down", so it often picked the ones that were still running jobs. Sorry that I can't be of more help here, hope you can figure it out! Philipp
Hi Philipp,
Thanks for the help all the same.
Best regards,
Success.
Spotted today in our logs:
We can work around this by restarting the daemon via systemd on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file, otherwise the daemon will immediately crash again, because Stackdriver is rate limiting the updates.
(The daemon should probably also handle these rate limiting errors better.)