stackdriver: Crash when transient error or rate limiting happens. #89

Open
philwo opened this issue Jun 13, 2019 · 3 comments

Comments

@philwo
Contributor

philwo commented Jun 13, 2019

Spotted today in our logs:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Failed with result 'exit-code'.

We can work around this by having systemd restart the daemon on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file. Otherwise the daemon will immediately crash again, because Stackdriver is still rate limiting the updates:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Failed with result 'exit-code'.
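For anyone wanting to apply the workaround, a minimal sketch of the relevant unit settings follows. The drop-in location in the comment is an assumption (any override mechanism works); only Restart= and RestartSec= are needed, and the 10 second value is the one suggested above.

```ini
# Hypothetical drop-in file, e.g. /etc/systemd/system/<your-unit>.service.d/restart.conf
[Service]
# Restart the daemon whenever it exits with a non-zero status.
Restart=on-failure
# Wait before restarting so the next write is not rejected as too frequent.
RestartSec=10
```

Run `systemctl daemon-reload` after adding the file so systemd picks up the change.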

(The daemon should probably also handle these rate limiting errors better.)

@dmoxyeze
Hi @philwo, could you please share how you got this to work on GCP? I am also using GCP, but I can't get my Horizontal Pod Autoscaler to work with the metrics from buildkite-agent-metrics.

@philwo
Contributor Author

philwo commented Apr 17, 2024

Hi @dmoxyeze,

I'm sorry, I don't remember if I ever got this to work reliably. Whatever I tried at the time definitely didn't work well for auto-scaling. I think that was also because auto-scaling on GCP at the time had no good way to signal which machines are "safe to shut down", so it often picked ones that were still running jobs.

Sorry that I can't be of more help here, hope you can figure it out!

Philipp

@dmoxyeze

dmoxyeze commented Apr 17, 2024 via email
