stackdriver: Crash when transient error or rate limiting happens. #89

philwo · 2019-06-13T09:01:27Z

Spotted today in our logs:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5922]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount] value [24], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/TotalAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:17.000', New: '2019/06/12-09:51:18.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Failed with result 'exit-code'.

We can work around this by having systemd restart the daemon on failure. Note to others: it's important to set RestartSec=10 (or higher) in the systemd unit file; otherwise the daemon immediately crashes again because Stackdriver rate-limits the updates:

Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: 2019/06/12 16:51:19 [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics buildkite-agent-metrics[5928]: [Collect] could not write metric [custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount] value [15], rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/buildkite/bazel_testing/BusyAgentCount, Timestamps: {Youngest Existing: '2019/06/12-09:51:18.000', New: '2019/06/12-09:51:19.000'}}: timeSeries[0]
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Unit entered failed state.
Jun 12 16:51:19 buildkite-agent-metrics systemd[1]: [email protected]: Failed with result 'exit-code'.
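
For anyone applying the workaround, a minimal sketch of the relevant unit settings might look like the following. The drop-in path is hypothetical and the unit name is left as a placeholder; only Restart= and RestartSec= matter here.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/<your-unit>.service.d/restart.conf
[Service]
# Restart the daemon whenever it exits with a non-zero status.
Restart=on-failure
# Wait 10 seconds (or more) before restarting, so the first write after the
# restart does not land right behind the previous point and get rejected again.
RestartSec=10
```

After adding the drop-in, run `systemctl daemon-reload` and restart the unit.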

(The daemon should probably also handle these rate limiting errors better.)

dmoxyeze · 2024-04-11T05:44:15Z

Hi @philwo, could you please share how you got this to work on GCP? I am also using GCP, but I can't get my Horizontal Pod Autoscaler to work with the metrics from buildkite-agent-metrics.

philwo · 2024-04-17T01:01:25Z

Hi @dmoxyeze,

I'm sorry, I don't remember if I ever got this to work reliably. Whatever I tried at the time definitely didn't work well for auto-scaling. I think that was partly because auto-scaling on GCP at the time had no good way to signal which machines were "safe to shut down", so it often picked ones that were still running jobs.

Sorry that I can't be of more help here, hope you can figure it out!

Philipp

dmoxyeze · 2024-04-17T06:56:36Z

Hi Philipp, thanks for the help all the same. Best regards, Success.
