
Pipeline: Annotation Service (grafana) Clean up various latency metrics #121

Open · gfr10598 opened this issue Dec 2, 2018 · 3 comments
Labels: 2019 on-call (Issues resulting from or relevant to on-call responsibilities), Sprint 4, Story

gfr10598 commented Dec 2, 2018

The latency metrics are horribly confusing. Some come from etl metrics, and some from annotator metrics, so "service" means different things in different panels.

Some of the metrics show ridiculous values, on the order of tens or hundreds of minutes, so there are clearly some errors, most likely in the queries.

gfr10598 commented Dec 2, 2018

It might help to break down by test_type; the etl metric already has this field.
The outlier seems to be the annotator's internal measurement. It is often many minutes, even though the etl timeout is 2 seconds.
So one thing that would help is to give request handling in the annotator an appropriate context.

gfr10598 commented Dec 6, 2018

The annotator currently running is 20181108t112031. There does not seem to be a corresponding Travis build, though. Unfortunately, annotation-service does not yet implement a useful status page, either.
The logs show a huge number of 499 responses with 7 second latency. The latency for successful requests is on the order of 0.7 to 1.4 seconds. For batch requests, we probably should increase the timeout beyond the current 2 seconds.

ALSO: The code does not specify the quantiles for the summary, so we are only getting the defaults (0.5, 0.9, 0.99). We should update this as well.
The median looks much more sensible than the average, suggesting that there are some very long duration outliers.

@kokosta added the Sprint 4 label and removed the review/triage (Team should review and assign priority) label Dec 10, 2018