COS / More alerts #517

Open
gustavosr98 opened this issue Nov 27, 2024 · 2 comments
Labels
bug Something isn't working

gustavosr98 commented Nov 27, 2024

Steps to reproduce

Follow tutorial for COS integration

Expected behavior

A few alerts I would like to see

  • Machine is down
  • Machine is up but service is down
  • Cluster is not writable
  • Cluster will not be writable if I lose one more node
  • Query latency goes over X ms
  • Number of connections is close to max connections limit

And any similar alerts related to degradation, so that problems can be caught before the system stops working as expected

Actual behavior

No errors are reported, but only two alerts are available
Pasted image 20241107183950

Versions

Juju 3.5.4

data-integrator                    active      1  data-integrator           latest/stable   41  no
grafana-agent                      active      3  grafana-agent             latest/edge    299  no       tracing: off
mongodb                            active      3  mongodb                   6/stable       199  no
self-signed-certificates           active      1  self-signed-certificates  latest/stable  155  no

gustavosr98 added the bug label Nov 27, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6073.

This message was autogenerated

MiaAltieri commented Dec 9, 2024

Hi @gustavosr98, I have just opened a PR (#521) for these desired alerts. Thanks for taking the time to outline your wishlist! <3

I've gone ahead and added most of them. The ones I did not add in that PR are:

  • Machine is down
  • Machine is up but service is down
    and
  • Query latency goes over X ms

I did not add the first two because they already exist (screenshots 1 and 2). I do not believe the third is possible, since I don't think Grafana supports alerts based on user-provided input (i.e. X). If you have a specific latency in mind, let me know and I can implement that for you ASAP.

Please note that the first one (Machine is down) will be further improved on the observability (o11y) side, since that team is currently working on it.

Screenshot 2024-12-09 at 14 18 40 Screenshot 2024-12-09 at 14 20 04

MiaAltieri added a commit that referenced this issue Dec 11, 2024
Addressing #517 by adding the following requested alerts:

- Cluster is not writable
- Cluster will not be writable if I lose one more node
- Number of connections is close to max connections limit

along with a few others from the Percona alert rules

## testing

- Cluster is not writable
<img width="1137" alt="Screenshot 2024-12-09 at 14 22 58"
src="https://github.com/user-attachments/assets/9deb7250-7701-4a9f-bdc0-ee74b5069641">

- Cluster will not be writable if I lose one more node - note this is
firing because it was deployed with a single replica, when the replica
set is scaled up it goes back to green
<img width="1148" alt="Screenshot 2024-12-09 at 14 22 04"
src="https://github.com/user-attachments/assets/50516710-97b5-4c08-a37d-37e43796bfb9">

- Number of connections is close to max connections limit (80%)
<img width="1117" alt="Screenshot 2024-12-09 at 14 23 32"
src="https://github.com/user-attachments/assets/14da278e-e9e7-42b6-ba69-11927f6c9b0e">
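
For readers following the thread, the arithmetic behind two of the alerts tested above can be sketched in plain Python. This is only an illustration, not the alert rules from the PR (those are not shown in this thread). The function names are hypothetical, the quorum check assumes the usual MongoDB rule that a majority of voting members is needed to elect a primary, and treating "close to the limit" as "at or above 80%" is an assumption about the exact comparison.

```python
def write_majority(voting_members: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    return voting_members // 2 + 1


def loses_writes_after_one_failure(voting_members: int, healthy_members: int) -> bool:
    """True if losing one more healthy member would leave fewer than a majority.

    Hypothetical helper: mirrors the idea behind the
    "will not be writable if I lose one more node" alert.
    """
    return healthy_members - 1 < write_majority(voting_members)


def connections_near_limit(current: int, maximum: int, ratio: float = 0.8) -> bool:
    """True once current connections reach the given share of the limit.

    The 0.8 default reflects the 80% figure from the commit message;
    the use of >= rather than > is an assumption.
    """
    return current >= ratio * maximum


if __name__ == "__main__":
    # Single replica: one more failure leaves zero members, so the alert fires.
    print(loses_writes_after_one_failure(voting_members=1, healthy_members=1))
    # Three healthy replicas: two remain after a failure, which is still a majority.
    print(loses_writes_after_one_failure(voting_members=3, healthy_members=3))
    print(connections_near_limit(current=850, maximum=1000))
```

With one replica the majority is 1, so any loss breaks it, which is consistent with the alert firing on the single-replica deployment and clearing once the replica set is scaled up.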