
ADR - Options to prevent bugzilla from pausing webhook calls #775

Merged · 24 commits · Apr 22, 2024

Conversation

alexcottner
Contributor

@alexcottner alexcottner commented Nov 27, 2023

Added ADR with proposed options to prevent bugzilla from pausing webhook calls due to configuration errors.

Link to ADR for easy reference.

Issues this could solve (or affect) depending on which solution we implement:

@alexcottner alexcottner requested a review from a team as a code owner November 27, 2023 20:15
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 27, 2023
@bsieber-mozilla
Contributor

Thanks for this ADR @alexcottner! This is certainly an issue worthy of an ADR. In terms of options, I think option 1, possibly with some slight additions, would be a good approach.

In terms of options 3, 4, and 5: I'm hesitant to add additional infra, like a Redis cache/queue, when there's already a "queue" in the system (the Bugzilla webhook queue). I understand it would give JBI more control over the messages, but the data would also be redundant.

In terms of option 1, depending on how we log, we could use those logs to automate alerts (https://cloud.google.com/logging/docs/alerting/log-based-alerts), which could then be sent to Slack, email, or other channels.

That way we "log and move on" but also leave a breadcrumb trail in a public space, so our users can be aware that they might be affected by data loss. The alert could reveal the particular configuration and bug_id, so that interested parties could jump in if they were expecting something that did not appear.
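As a rough sketch of what such a log record could look like (the logger name and field names below are hypothetical, not JBI's actual logging setup), the failing configuration and bug id would be attached as structured fields that a log-based alert can match on:

```python
import logging

logger = logging.getLogger("jbi.webhook")  # hypothetical logger name


def log_and_move_on(payload: dict, error: Exception) -> None:
    """Log a sync failure with enough context for a log-based alert to pick up."""
    logger.error(
        "webhook_sync_failed",  # an alert can match on this message
        extra={
            # hypothetical field names; the real webhook payload may differ
            "bug_id": payload.get("bug", {}).get("id"),
            "configuration": payload.get("action"),
            "error": str(error),
        },
    )
```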

Thanks again for spinning up this ADR! 🙇

Contributor

@leplatrem leplatrem left a comment

See #82

One main comment: the format would require a bit more structure to follow the arguments better (criteria/decision drivers in front of each proposed solution). See other ADRs in docs/adrs/

Also, there is another solution that Graham suggested a while ago: when onboarding a project on JBI, we assign a stakeholder who is in charge of setting up the webhook for their product in Bugzilla. When there is an error, this stakeholder is alerted and is in charge of fixing it to unblock their queue.

@grahamalama grahamalama requested a review from a team December 6, 2023 16:29
@leplatrem
Contributor

About option 1: log and move on. How long do we retain logs for?

What about a solution where we log the error (+notification to devs on Slack) and have a super simple DLQ?
For example, we would append failing webhook payloads to a file, and have a command to replay them and purge the file. Payloads failing again during retry would be added to the DLQ of the current day/month.
If file-based sounds too prehistoric and not flexible enough for statistics and stuff, we could have a simple BigQuery table, which, unlike logs, could be kept forever.

@alexcottner
Contributor Author

About option 1: log and move on. How long do we retain logs for?

Currently this is 30 days in Log Explorer and 90 days in BigQuery. But we could retain them for as long or as little as we wanted, or even stream these specific logs to a bucket as a DLQ.

What about a solution where we log the error (+notification to devs on Slack) and have a super simple DLQ? For example, we would append failing webhook payloads to a file, and have a command to replay them and purge the file. Payloads failing again during retry would be added to the DLQ of the current day/month. If file-based sounds too prehistoric and not flexible enough for statistics and stuff, we could have a simple BigQuery table, which, unlike logs, could be kept forever.

Appending to a single file is annoying when running parallel containers. But if the spirit of this is "send it to a DLQ that we could re-process later", then it's easy enough to send them to a bucket or a volume on K8s. It would probably be something like /{year}/{month}/{date}/{pod-id}-{message-id}.json.
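A minimal sketch of that layout, assuming a writable mount (or bucket sync point) at a hypothetical /dlq path and a JSON-serializable payload:

```python
import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path

DLQ_ROOT = Path("/dlq")  # hypothetical mounted volume or bucket sync point


def park_payload(payload: dict) -> Path:
    """Write a failed webhook payload to {year}/{month}/{date}/{pod-id}-{message-id}.json."""
    now = datetime.now(timezone.utc)
    pod_id = os.environ.get("HOSTNAME", "local")  # the pod name on Kubernetes
    target = DLQ_ROOT / f"{now:%Y}/{now:%m}/{now:%d}" / f"{pod_id}-{uuid.uuid4()}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target
```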

…tter-queue option and added a separate option for a dedicated queue
@alexcottner
Contributor Author

We had a discussion today where I think we're leaning towards one or both of these:

  • Option 2, ask a human to do something
  • Option 4, use a simple DLQ

Both of these are relatively small efforts to implement with good outcomes. If we can easily generate alerts based on our simple DLQ (ex: an emailed report from a postgres table or messaging an IM channel based on a file being dropped into a bucket) then this could be a simple solution with high resiliency.

@leplatrem
Contributor

Both of these are relatively small efforts to implement with good outcomes. If we can easily generate alerts based on our simple DLQ (ex: an emailed report from a postgres table or messaging an IM channel based on a file being dropped into a bucket) then this could be a simple solution with high resiliency.

💯

I vote to start with Option 4 with files on disk (mounted volume), and iterate if needed

And with regards to alerting, we could have another dedicated ADR (bugzilla comment vs Slack vs email to workflow owner, etc.)

@grahamalama
Contributor

grahamalama commented Feb 21, 2024

I think we're underestimating the engineering effort of Option 4. By "engineering effort" here I mean code complexity, not necessarily infrastructure.

We would always return 200, but any events that fail to process internally would get sent to a DLQ and be replayed later if needed. ... A scheduled Kubernetes job would then run to try and pick these up and reprocess them later (every 4 hours, for example).

There's a lot packed in here. How will we decide when it's needed (or not needed) to replay a message from the DLQ? How will we avoid situations where, for example:

  • user creates a bug, it syncs
  • user adds EDIT 1: ... to the summary
  • some network error happens, the bug fails to sync, and it's placed in the DLQ
  • the network error solves itself
  • user adds EDIT 2: ... to the summary. The summary is synced.
  • 4 hours later, we replay the message in the DLQ. The summary is synced, but it erases EDIT 2 since it wasn't on the message in the DLQ

Also, something I saw on AWS docs on DLQs:

Don't use a dead-letter queue with a FIFO queue if you don't want to break the exact order of messages or operations.

I think for now, that's us (illustrated by the example above). I think we could code our way around it in certain cases, but that adds to the complexity of this option.
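One hedged illustration of "coding around it": before replaying, compare the time the payload was parked against the bug's last_change_time from Bugzilla, and prefer the current state when the bug has moved on. The fetch_bug and sync callables below are stand-ins, not JBI's real API, and both timestamps are assumed to be ISO-8601 strings in UTC:

```python
from datetime import datetime
from typing import Callable


def replay(item: dict, fetch_bug: Callable, sync: Callable) -> None:
    """Replay a parked payload only if the bug has not changed since it was parked.

    fetch_bug and sync are stand-ins for whatever reads from Bugzilla and pushes
    to Jira; the names here are purely illustrative.
    """
    bug_id = item["payload"]["bug"]["id"]
    parked_at = datetime.fromisoformat(item["parked_at"])
    current = fetch_bug(bug_id)  # latest bug state from Bugzilla's REST API
    if datetime.fromisoformat(current["last_change_time"]) > parked_at:
        # The bug changed after the failure; replaying the stale payload would
        # overwrite newer edits (the "EDIT 2" problem above), so sync the
        # current state instead of the parked payload.
        sync(current)
    else:
        sync(item["payload"])
```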

@grahamalama
Contributor

Also, something I noticed in the ADR:

Option 2:
- Amount of initial engineering effort: low
- Amount of maintenance effort: low
Option 4:
- Amount of initial engineering effort: mid
- Amount of maintenance effort: mid-low

Based on our own criteria, Option 2 requires lower effort and maintenance than Option 4. Performance and intuitiveness are the same. What, then, is compelling us to choose Option 4? Are there some other considerations that we haven't documented in the ADR that make Option 4 more attractive?

@alexcottner
Contributor Author

alexcottner commented Feb 21, 2024

Good points, processing data out of order can be just as bad as dropping it. I'm going to update the ADR with that as a risk for now. And since we're stateless, we can't compare timestamps easily.

Contributor

@leplatrem leplatrem left a comment

Some feedback after re-reading the whole doc...

I haven't fully understood how you would distinguish blocking events from others. Is it based on Jira HTTP responses (i.e. 401/500 vs. 400)?

How do we merge events? What is the cardinality: (product, bug #, event_type)?

I know that this ADR is about choosing an overall strategy and not going into implementation details, but I believe that the chosen solution would be easier to understand if some aspects were detailed.
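To make the HTTP-status idea concrete, here is one possible classification, stated as an assumption for illustration rather than the rule the ADR settled on: 401 and 5xx responses from Jira would be treated as blocking and worth retrying, while other 4xx responses like 400 would be treated as permanent and only logged.

```python
def is_blocking(status_code: int) -> bool:
    """Classify a failed Jira response.

    True means the event is worth parking for retry (auth/server trouble);
    False means the request itself is invalid and retrying won't help.
    """
    return status_code == 401 or status_code >= 500
```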

grahamalama added a commit that referenced this pull request Apr 10, 2024
This commit adds the classes that allow us to manage our "Dead Letter Queue". See #775 for motivation.

We also add a basic heartbeat check in this commit to assert that we are able to read from / write to the queue. 

In future commits, we'll introduce a method to process items in the queue and fully integrate the queue into the `/bugzilla_webhook` route.

---------

Co-authored-by: Mathieu Leplatre <[email protected]>
Co-authored-by: Alex Cottner <[email protected]>
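For readers skimming this thread, a heartbeat check along those lines might look roughly like the sketch below; the queue object and its put/get/delete methods are stand-ins, not the classes added in that commit:

```python
def queue_heartbeat(queue) -> dict:
    """Confirm we can write to, read from, and clean up the dead letter queue."""
    probe_id = "heartbeat-probe"
    try:
        queue.put(probe_id, {"check": True})   # write a probe item
        ok = queue.get(probe_id) is not None   # read it back
        queue.delete(probe_id)                 # clean up after ourselves
        return {"up": ok}
    except Exception as exc:  # any backend failure marks the queue unhealthy
        return {"up": False, "error": str(exc)}
```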
Contributor

@leplatrem leplatrem left a comment

I read it again quickly; it seems to match our implementation.

I think the format of this ADR is not exactly like others we have, where we list all options, and in the Chosen solution we only say which one we chose and why.

Examples:

But this may not be very important :)
