-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ADR - Options to prevent bugzilla from pausing webhook calls #775
Conversation
Thanks for this ADR @alexcottner ! This is certainly an issue worthy of an ADR. In terms of options--I think option 1, possibly with some slight addition would be a good approach. In terms of options 3, 4, 5: I'm hesitant for adding additional infra, like a redis-cache/queue, when there's already a "queue" in the system (the bugzilla webhook queue). I understand it would give JBI more control over the messages, but also is data redundant. In terms of option 1, depending on how we log, we could use those logs to automate alerts (https://cloud.google.com/logging/docs/alerting/log-based-alerts) which could then be sent to slack/email/other channels. That way we "log and move-on" but also have a breadcrumb trail in public space so our users could be aware that they might be affected by data-loss. The alert could reveal the particular configuration, and bug_id so that interested parties could jump if they were expecting something that did not appear. Thanks again for spinning up this ADR! 🙇 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #82
One main comment: the format would require a bit more structure to follow the arguments better (criteria/decision drivers in front of each proposed solution). See other ADRs in docs/adrs/
Also, there is another solution that Graham suggested a while ago: when onboarding a project on JBI, we assign a stakeholder that is in charge of setting up the webhook for their product in Bugzilla. When there is an error, this stakeholder is alerted, and is charge of fixing it to unblock their queue.
Co-authored-by: Mathieu Leplatre <[email protected]>
Co-authored-by: Mathieu Leplatre <[email protected]>
About option 1: log and move on. How long do we retain logs for? What about a solution where we log the error (+notification to devs on Slack) and have a super simple DLQ? |
Currently this is 30 days in log-explorer and 90 days in bigquery. But we could do this for as long or as little as we wanted, or even stream these specific logs to a bucket as a DLQ.
Appending to single a file is annoying when running parallel containers. But if the spirit of this is "send it to a DLQ that we could re-process later", then it's easy enough to send them to a bucket or a volume on K8s. It would probably be something like |
…tter-queue option and added a seperate option for dedicated queue
We had a discussion today where I think we're leaning towards one or both of these:
Both of these are relatively small efforts to implement with good outcomes. If we can easily generate alerts based on our simple DLQ (ex: an emailed report from a postgres table or messaging an IM channel based on a file being dropped into a bucket) then this could be a simple solution with high resiliency. |
💯 I vote to start with Option 4 with files on disk (mounted volume), and iterate if needed And with regards to alerting, we could have another dedicated ADR (bugzilla comment vs Slack vs email to workflow owner, etc.) |
I think we're underestimating the engineering effort of Option 4. By "engineering effort" here I mean code complexity, not necessarily infrastructure.
There's a lot packed in here. How will we decide when it's needed (or not needed) to replay a message from the DLQ? How will we avoid situations where, for example:
Also, something I saw on AWS docs on DLQs:
I think for now, that's us (illustrated by the example above). I think we could code our way around it in certain cases, but that adds to the complexity of this option |
Also, something I noticed in the ADR:
Based on our own criteria, Option 2 requires lower effort and maintenance that Option 4. Performance and intuitiveness are the same. What then, is compelling us to choose Option 4? Are there some other considerations that we haven't documented in the ADR that make Option 4 more attractive? |
Good points, processing data out of order can be just as dropping it. Gonna update the ADR with that as a risk for now. And since we're stateless we can't compare timestamps easily. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some feedback after re-reading again the whole doc...
I haven't fully understood how you would distinguish from blocking events from others. Is it based on Jira HTTP responses? (ie. 401/500 vs. 400)
How do we merge events? What is the cardinality? (product, bug #, event_type)
?
I know that this ADR is about choosing an overall strategy and not going into implementation details, but I believe that the chosen solution would be easier to understand if some aspects were detailed.
This commit adds the classes that allow us to manage our "Dead Letter Queue". See #775 for motivation. We also add a basic heartbeat check in this commit to assert that we are able to read from / write to the queue. In future commits, we'll introduce a method to process items in the queue and fully integrate the queue into the `/bugzilla_webhook` route. --------- Co-authored-by: Mathieu Leplatre <[email protected]> Co-authored-by: Alex Cottner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read again quickly, seems to match our implementation.
I think the format of this ADR is not exactly like others we have, where we list all options, and in the Chosen solution
we only say which one we chose and why.
Examples:
- https://github.com/mozilla/remote-settings/blob/main/docs/adr/adr_002_packaging_releasing_admin_UI.md#decision-outcome
- https://github.com/mozilla-services/telescope/blob/main/docs/adrs/adr_0001_history.md#decision-outcome
But this may not be very important :)
Co-authored-by: Mathieu Leplatre <[email protected]>
Added ADR with proposed options to prevent bugzilla from pausing webhook calls due to configuration errors.
Link to ADR for easy reference.
Issues this could solve (or affect) depending on solution we implement: