Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn about duplicated events received from GitHub via Admin Monitor #388

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

gbhat618
Copy link

@gbhat618 gbhat618 commented Feb 3, 2025

Problem
If duplicated GitHub events get delivered to Jenkins controller, Jenkins need to act on events both times (they are not automatically de-duplicated). These events may be a branch scan, build trigger etc.

further discussion of problem:

A branch scan event running twice may not be a problem, however build triggering twice could be a problem. Even the builds doesn't get in triggered twice due to quietPeriod: 5s (unless the quietPeriod was set to 0, or duplicates come after quietPeriod: 5s, which case two builds do get triggered).

Note: usually GH by itself won't automatically redeliver an event if failed, GH simply marks the event as failed (an event is marked failed if GH didn't receive a 200 OK response within 10s; reference)

The reason for such duplicated events is,

  • mostly, a wrong configuration in GitHub side, such as registering two webhooks (at different multiple places such as org, and repo levels) which are targeting the same controller.
  • if a custom setup is done to redeliver failed events if the redeliver is going wrong (this should be rare to never)

Although it is typically rare for this duplication issue to happen, if it happens, it could require quite some time to figure out what is going on. And in some case it may simply go unnoticed.

Logs showing build can trigger twice in case of `quietPeriod: 0`

Note: in the below logs, when the two events came

  • at exactly the same second, Jenkins didn't execute 2 builds, rather both the events corresponded to a single build; except in the build log we saw Started by GitHub push by gbhat618 twice (ref); however rest of logs showed build steps were executed only once; I verified by this by having a single executor.
  • within 1 second apart, Jenkins executed two builds one after other.
2025-02-06 03:20:19.393+0000 [id=87]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.146:35792 → 140.82.115.146 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:20:19.393+0000 [id=88]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.168:30022 → 140.82.115.168 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:20:19.432+0000 [id=88]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:20:19.432+0000 [id=87]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:20:20.133+0000 [id=90]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #6
2025-02-06 03:20:20.133+0000 [id=91]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #6
2025-02-06 03:20:35.660+0000 [id=123]	INFO	o.j.p.g.ApiRateLimitChecker$RateLimitCheckerAdapter#checkRateLimit: LocalChecker for rate limit was not set for this thread. Configured using system settings with API URL 'https://api.github.com'.
2025-02-06 03:21:12.640+0000 [id=19]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.172:50884 → 140.82.115.172 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:21:12.644+0000 [id=174]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.82:36500 → 140.82.115.82 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:21:12.649+0000 [id=174]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:21:12.649+0000 [id=19]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:21:13.301+0000 [id=185]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #7
2025-02-06 03:21:13.301+0000 [id=186]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #7
2025-02-06 03:21:30.552+0000 [id=122]	INFO	o.j.p.g.ApiRateLimitChecker$RateLimitCheckerAdapter#checkRateLimit: LocalChecker for rate limit was not set for this thread. Configured using system settings with API URL 'https://api.github.com'.
2025-02-06 03:21:47.226+0000 [id=19]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.11:19518 → 140.82.115.11 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:21:47.226+0000 [id=171]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber#onEvent: Received PushEvent for https://github.com/gbhat618/dummy from 140.82.115.73:36722 → 140.82.115.73 ⇒ https://localhost:8080/github-webhook/
2025-02-06 03:21:47.229+0000 [id=171]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:21:47.229+0000 [id=19]	INFO	o.j.p.g.w.s.DefaultPushGHEventSubscriber$1#run: Poked p1
2025-02-06 03:21:47.874+0000 [id=242]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #8
2025-02-06 03:21:47.889+0000 [id=241]	INFO	c.c.jenkins.GitHubPushTrigger$1#run: SCM changes detected in p1. Triggering #9

Proposed solution

An admin monitor to show duplicate events are being received - admin monitor goes away by itself after 24 hours if no more duplicates.

further discussion of solution: Duplication of event is identified by a header `X-GitHub-Delivery` ([reference](https://docs.github.com/en/webhooks/webhook-events-and-payloads#delivery-headers), also linked in javadoc), which is a unique identifier for an event in GitHub.

An admin monitor is shown to warn about duplicated events being received. If the duplicated events are to occur, they typically occur in a few seconds window itself, we track an event for duplicate checking for about 10 minutes - so it should be enough to catch the issue.

For the cleanup of cached event entries, currently implemented via a PeriodWork to avoid any slowness in the event processing code path.

The admin can find the last received duplicate payload via the Click Here button shown (see the screenshots in Testing section) -- it opens the payload JSON in the new browser tab next to current tab. (browsers like firefox and chrome are by default acting as pretty json viewer - due to content-type application/json)

I have considered alternative ways of presenting the event payload as well,

  • Logging the event payload
    • Typically users tend to take a look at the problem if it is shown on the UI.
    • In the logs events get missed. Besides we need to also consider while coding, payload is very huge for log, making sure to log the event only once (thread synchronization), how often to log the event. So logging doesn't seem like a good user experience for the admin.
  • Automatically downloading a JSON file upon Click Here: most of the admin can take a decision directly from seeing the payload in the browser (downloading to laptop - is another problem for the admin to cleanup the file later)
    • The payload is not that huge as in to not be visible in the UI (or crashing the browser etc.). It is within the limit of ~200 lines in pretty format.
    • There are not multiple files (such as how the jenkins support bundle would have)
    • Browsers like, firefox and chrome can automatically pretty format the JSON, so it is also acting as a JSON viewer in this case.

Testing done

  • Automated tests are submitted

    • mimicking the GH event sending via triggering webhook endpoint; and validating admin monitor appears in case of duplication
    • unit tests to assert logic of duplicate events computation and cleanup validation.
  • Manual testing

    • Registering two webhooks on a github repo, and trigger commits on a branch; jenkins received 2 events. Admin monitor showed. (also two builds do get triggered for setting quietPeriod: 0).
    • Background cleanup was monitored by setting up log recorder to see the FINE logs.
Screenshots
Desc Screenshot
Admin monitor appears when duplicate events discovered image
Click here button opens in the new browser tab with the event JSON image
If somehow admin goes to the above json payload page - but there are no duplicates image
Duplicate event subscriber cleaning up logs image

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@gbhat618
Copy link
Author

gbhat618 commented Feb 3, 2025

👋 @Vlatombe , @jglick

Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Made some suggestions.

Comment on lines 105 to 107
* At present, as the {@link #TTL_MILLIS} is set to 10 minutes, we consider half of it for cleanup.
* This recurrence period is chosen to balance removing stale entries from accumulating in memory vs.
* additional load on Jenkins due to a new periodic job execution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caffeine could be considered if you are looking for stronger assertions about the number of entries.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never knew about caffeine (checked now, it's good, thank you). But not sure if it is okay to bring in a new dependency into the plugin though.

The map here (Map<String, Long>) approximately taking 44 bytes per entry (String: storing GUID 36 bytes, Long: 8 bytes); with 1MB memory usage ~23k entries can be stored; this is seems like a sufficient for 10 minutes worth events from GH (for one controller) (?).

So thinking maybe bringing in caffeine would be adding another dependency in here.

@gbhat618 gbhat618 marked this pull request as ready for review February 6, 2025 04:58
@gbhat618 gbhat618 requested a review from a team as a code owner February 6, 2025 04:58
@gbhat618
Copy link
Author

gbhat618 commented Feb 7, 2025

I am fixing how the last duplicate event is presented to the user from, enable FINEST logger (current implementation) to click here to download

Edit: I made the payload open in a new browser tab since browsers like Firefox and Chrome format JSON nicely. With ~200 lines, it's easy to view without issues. The admin can still download it if needed or simply review it in the browser and proceed.

@gbhat618 gbhat618 marked this pull request as draft February 7, 2025 04:13
@gbhat618 gbhat618 marked this pull request as ready for review February 7, 2025 16:38
@gbhat618
Copy link
Author

gbhat618 commented Feb 7, 2025

I think this PR is now ready 😄 I feel like covered all the features, and automated tests and manual tests.
Please help review, ty

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants