Reporting transient or permanent failures, and job retries #1100

Open
stephen-soltesz opened this issue Jul 29, 2022 · 0 comments
Labels
review/triage Team should review and assign priority

Comments

@stephen-soltesz
Contributor

Recently, the ParserFailureRateTooHighOrMissing alert fired (https://github.com/m-lab/dev-tracker/issues/727) due to an actual spike in task errors (failures on individual archives).

Investigation traced the spike to ETLSourceError, which can be caused by transient connectivity problems between the parser and the GCS API servers. This is something we cannot control directly. The alert resolved on its own once connectivity was restored.

Ideally:

  • the data pipeline alerts should not fire on recoverable, transient events.
  • the data pipeline should differentiate between transient and permanent errors (where possible).
  • the data pipeline (parser or gardener, as appropriate) should retry until some absolute threshold is reached, and then abandon the task.
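
The retry behavior described in the list above could look roughly like the sketch below. It is illustrative Python, not the pipeline's actual code: `TransientError`, `PermanentError`, `TaskResult`, and `run_with_retries` are hypothetical names, and the attempt limit and backoff values are arbitrary assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (e.g. lost connectivity)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. a corrupt archive)."""


@dataclass
class TaskResult:
    ok: bool
    attempts: int
    # None on success; "permanent" or "transient_exhausted" on failure.
    error_kind: Optional[str] = None


def run_with_retries(
    task: Callable[[], None],
    max_attempts: int = 5,          # assumed absolute threshold
    base_delay: float = 1.0,        # assumed initial backoff in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> TaskResult:
    """Run task, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return TaskResult(ok=True, attempts=attempt)
        except PermanentError:
            # Retrying cannot help, so give up immediately.
            return TaskResult(ok=False, attempts=attempt, error_kind="permanent")
        except TransientError:
            if attempt == max_attempts:
                # Threshold reached: abandon the task.
                return TaskResult(ok=False, attempts=attempt,
                                  error_kind="transient_exhausted")
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With a result like this, an alert could count only the `permanent` and `transient_exhausted` outcomes, so that a failure which later succeeds on retry does not contribute to the failure rate.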

Currently:

  • the parser tries to open a task archive; if that fails, it stops processing that task and does not always report the error to gardener.
  • the gardener only appears to update task state with the states Parsing and ParsingComplete, along with periodic heartbeats.
  • the gardener will retry failed bq jobs, but does not appear to retry tasks issued by the /v2/jobs/next API (or the control path is very opaque).
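
One way to close the gaps listed above would be for the parser to report explicit failure states and for the tracker to re-queue or abandon tasks accordingly. The sketch below is a simplified model in Python, not gardener's actual tracker or API: only `Parsing` and `ParsingComplete` come from this issue, while the other states, the `Tracker` class, and `max_failures` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    PARSING = "Parsing"                    # named in this issue
    PARSING_COMPLETE = "ParsingComplete"   # named in this issue
    FAILED_TRANSIENT = "FailedTransient"   # hypothetical
    FAILED_PERMANENT = "FailedPermanent"   # hypothetical
    ABANDONED = "Abandoned"                # hypothetical


@dataclass
class Task:
    name: str
    state: State = State.PARSING
    failures: int = 0


@dataclass
class Tracker:
    max_failures: int = 3  # assumed absolute retry threshold
    tasks: Dict[str, Task] = field(default_factory=dict)
    retry_queue: List[str] = field(default_factory=list)

    def start(self, name: str) -> None:
        """Record a newly issued task in the Parsing state."""
        self.tasks[name] = Task(name)

    def report(self, name: str, state: State) -> State:
        """Apply a state update from the parser and return the stored state."""
        task = self.tasks[name]
        if state is State.FAILED_TRANSIENT:
            task.failures += 1
            if task.failures >= self.max_failures:
                # Too many transient failures: stop retrying.
                task.state = State.ABANDONED
            else:
                task.state = State.FAILED_TRANSIENT
                self.retry_queue.append(name)  # eligible to be issued again
        else:
            task.state = state
        return task.state
```

In this model every failure reaches the tracker, transient failures are re-queued until the threshold is hit, and permanent failures are recorded without a retry.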
@autolabel autolabel bot added the review/triage Team should review and assign priority label Jul 29, 2022