
ADR - Options to prevent bugzilla from pausing webhook calls #775

Merged · 24 commits · Apr 22, 2024

Conversation

alexcottner
Contributor

@alexcottner alexcottner commented Nov 27, 2023

Added ADR with proposed options to prevent bugzilla from pausing webhook calls due to configuration errors.

Link to ADR for easy reference.

Issues this could solve (or affect) depending on which solution we implement:

@alexcottner alexcottner requested a review from a team as a code owner November 27, 2023 20:15
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 27, 2023
@bsieber-mozilla
Contributor

Thanks for this ADR @alexcottner! This is certainly an issue worthy of an ADR. In terms of options, I think option 1, possibly with some slight additions, would be a good approach.

In terms of options 3, 4, and 5: I'm hesitant to add additional infra, like a Redis cache/queue, when there's already a "queue" in the system (the Bugzilla webhook queue). I understand it would give JBI more control over the messages, but the data would also be redundant.

In terms of option 1, depending on how we log, we could use those logs to automate alerts (https://cloud.google.com/logging/docs/alerting/log-based-alerts), which could then be sent to Slack, email, or other channels.

That way we "log and move on" but also leave a breadcrumb trail in a public space, so our users can be aware that they might be affected by data loss. The alert could reveal the particular configuration and bug_id, so that interested parties could jump in if they were expecting something that did not appear.
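As a rough sketch of what such a log record could look like (the logger name and field names below are hypothetical, not JBI's actual logging setup), the failing configuration and bug id would be attached as structured fields that a log-based alert can match on:

```python
import logging

logger = logging.getLogger("jbi.webhook")  # hypothetical logger name


def log_and_move_on(payload: dict, error: Exception) -> None:
    """Log a sync failure with enough context for a log-based alert to pick up."""
    logger.error(
        "webhook_sync_failed",  # an alert can match on this message
        extra={
            # hypothetical field names; the real webhook payload may differ
            "bug_id": payload.get("bug", {}).get("id"),
            "configuration": payload.get("action"),
            "error": str(error),
        },
    )
```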

Thanks again for spinning up this ADR! 🙇

Contributor

@leplatrem leplatrem left a comment

See #82

One main comment: the format would require a bit more structure to follow the arguments better (criteria/decision drivers in front of each proposed solution). See other ADRs in docs/adrs/

Also, there is another solution that Graham suggested a while ago: when onboarding a project on JBI, we assign a stakeholder who is in charge of setting up the webhook for their product in Bugzilla. When there is an error, this stakeholder is alerted and is in charge of fixing it to unblock their queue.

@grahamalama grahamalama requested a review from a team December 6, 2023 16:29
@leplatrem
Contributor

About option 1: log and move on. How long do we retain logs for?

What about a solution where we log the error (+notification to devs on Slack) and have a super simple DLQ?
For example, we would append failing webhook payloads to a file, and have a command to replay them and purge the file. Payloads failing again during retry would be added to the DLQ of the current day/month.
If file-based sounds too prehistoric and not flexible enough for statistics and stuff, we could have a simple BigQuery table, which, unlike logs, could be kept forever.

@alexcottner
Contributor Author

About option 1: log and move on. How long do we retain logs for?

Currently this is 30 days in Log Explorer and 90 days in BigQuery. But we could retain them for as long or as little as we wanted, or even stream these specific logs to a bucket as a DLQ.

What about a solution where we log the error (+notification to devs on Slack) and have a super simple DLQ? For example, we would append failing webhook payloads to a file, and have a command to replay them and purge the file. Payloads failing again during retry would be added to the DLQ of the current day/month. If file-based sounds too prehistoric and not flexible enough for statistics and stuff, we could have a simple BigQuery table, which, unlike logs, could be kept forever.

Appending to a single file is annoying when running parallel containers. But if the spirit of this is "send it to a DLQ that we could re-process later", then it's easy enough to send them to a bucket or a volume on K8s. It would probably be something like /{year}/{month}/{date}/{pod-id}-{message-id}.json.
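A minimal sketch of that layout, assuming a writable mount (or bucket sync point) at a hypothetical /dlq path and a JSON-serializable payload:

```python
import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path

DLQ_ROOT = Path("/dlq")  # hypothetical mounted volume or bucket sync point


def park_payload(payload: dict) -> Path:
    """Write a failed webhook payload to {year}/{month}/{date}/{pod-id}-{message-id}.json."""
    now = datetime.now(timezone.utc)
    pod_id = os.environ.get("HOSTNAME", "local")  # the pod name on Kubernetes
    target = DLQ_ROOT / f"{now:%Y}/{now:%m}/{now:%d}" / f"{pod_id}-{uuid.uuid4()}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target
```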

…tter-queue option and added a separate option for a dedicated queue
@alexcottner
Contributor Author

We had a discussion today where I think we're leaning towards one or both of these:

  • Option 2, ask a human to do something
  • Option 4, use a simple DLQ

Both of these are relatively small efforts to implement with good outcomes. If we can easily generate alerts based on our simple DLQ (ex: an emailed report from a postgres table or messaging an IM channel based on a file being dropped into a bucket) then this could be a simple solution with high resiliency.

@leplatrem
Contributor

Both of these are relatively small efforts to implement with good outcomes. If we can easily generate alerts based on our simple DLQ (ex: an emailed report from a postgres table or messaging an IM channel based on a file being dropped into a bucket) then this could be a simple solution with high resiliency.

💯

I vote to start with Option 4 with files on disk (mounted volume), and iterate if needed

And with regards to alerting, we could have another dedicated ADR (bugzilla comment vs Slack vs email to workflow owner, etc.)

@grahamalama
Contributor

grahamalama commented Feb 21, 2024

I think we're underestimating the engineering effort of Option 4. By "engineering effort" here I mean code complexity, not necessarily infrastructure.

We would always return 200, but any events that fail to process internally would get sent to a DLQ and be replayed later if needed. ... A scheduled Kubernetes job would then run to try and pick these up and reprocess them later (every 4 hours, for example).

There's a lot packed in here. How will we decide when it's needed (or not needed) to replay a message from the DLQ? How will we avoid situations where, for example:

  • user creates a bug, it syncs
  • user adds EDIT 1: ... to the summary
  • some network error happens, the bug fails to sync, and it's placed in the DLQ
  • the network error solves itself
  • user adds EDIT 2: ... to the summary. The summary is synced.
  • 4 hours later, we replay the message in the DLQ. The summary is synced, but it erases EDIT 2 since it wasn't on the message in the DLQ

Also, something I saw on AWS docs on DLQs:

Don't use a dead-letter queue with a FIFO queue if you don't want to break the exact order of messages or operations.

I think for now, that's us (illustrated by the example above). I think we could code our way around it in certain cases, but that adds to the complexity of this option.
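One hedged illustration of "coding around it": before replaying, compare the time the payload was parked against the bug's last_change_time from Bugzilla, and prefer the current state when the bug has moved on. The fetch_bug and sync callables below are stand-ins, not JBI's real API, and both timestamps are assumed to be ISO-8601 strings in UTC:

```python
from datetime import datetime
from typing import Callable


def replay(item: dict, fetch_bug: Callable, sync: Callable) -> None:
    """Replay a parked payload only if the bug has not changed since it was parked.

    fetch_bug and sync are stand-ins for whatever reads from Bugzilla and pushes
    to Jira; the names here are purely illustrative.
    """
    bug_id = item["payload"]["bug"]["id"]
    parked_at = datetime.fromisoformat(item["parked_at"])
    current = fetch_bug(bug_id)  # latest bug state from Bugzilla's REST API
    if datetime.fromisoformat(current["last_change_time"]) > parked_at:
        # The bug changed after the failure; replaying the stale payload would
        # overwrite newer edits (the "EDIT 2" problem above), so sync the
        # current state instead of the parked payload.
        sync(current)
    else:
        sync(item["payload"])
```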

@grahamalama
Contributor

Also, something I noticed in the ADR:

Option 2:
- Amount of initial engineering effort: low
- Amount of maintenance effort: low
Option 4:
- Amount of initial engineering effort: mid
- Amount of maintenance effort: mid-low

Based on our own criteria, Option 2 requires lower effort and maintenance than Option 4. Performance and intuitiveness are the same. What, then, is compelling us to choose Option 4? Are there some other considerations that we haven't documented in the ADR that make Option 4 more attractive?

@alexcottner
Contributor Author

alexcottner commented Feb 21, 2024

Good points, processing data out of order can be just as bad as dropping it. I'm going to update the ADR with that as a risk for now. And since we're stateless, we can't compare timestamps easily.

Contributor

@leplatrem leplatrem left a comment

Some feedback after re-reading the whole doc...

I haven't fully understood how you would distinguish blocking events from others. Is it based on Jira HTTP responses (i.e. 401/500 vs. 400)?

How do we merge events? What is the cardinality: (product, bug #, event_type)?

I know that this ADR is about choosing an overall strategy and not going into implementation details, but I believe that the chosen solution would be easier to understand if some aspects were detailed.
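To make the HTTP-status idea concrete, here is one possible classification, stated as an assumption for illustration rather than the rule the ADR settled on: 401 and 5xx responses from Jira would be treated as blocking and worth retrying, while other 4xx responses like 400 would be treated as permanent and only logged.

```python
def is_blocking(status_code: int) -> bool:
    """Classify a failed Jira response.

    True means the event is worth parking for retry (auth/server trouble);
    False means the request itself is invalid and retrying won't help.
    """
    return status_code == 401 or status_code >= 500
```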

grahamalama added a commit that referenced this pull request Apr 10, 2024
This commit adds the classes that allow us to manage our "Dead Letter Queue". See #775 for motivation.

We also add a basic heartbeat check in this commit to assert that we are able to read from / write to the queue. 

In future commits, we'll introduce a method to process items in the queue and fully integrate the queue into the `/bugzilla_webhook` route.

---------

Co-authored-by: Mathieu Leplatre <[email protected]>
Co-authored-by: Alex Cottner <[email protected]>
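For readers skimming this thread, a heartbeat check along those lines might look roughly like the sketch below; the queue object and its put/get/delete methods are stand-ins, not the classes added in that commit:

```python
def queue_heartbeat(queue) -> dict:
    """Confirm we can write to, read from, and clean up the dead letter queue."""
    probe_id = "heartbeat-probe"
    try:
        queue.put(probe_id, {"check": True})   # write a probe item
        ok = queue.get(probe_id) is not None   # read it back
        queue.delete(probe_id)                 # clean up after ourselves
        return {"up": ok}
    except Exception as exc:  # any backend failure marks the queue unhealthy
        return {"up": False, "error": str(exc)}
```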
Contributor

@leplatrem leplatrem left a comment

I read it again quickly; it seems to match our implementation.

I think the format of this ADR is not exactly like others we have, where we list all options, and in the Chosen solution we only say which one we chose and why.

Examples:

But this may not be very important :)
