Skip to content

Conversation

@adityathebe
Copy link
Member

@adityathebe adityathebe commented Aug 26, 2025

merge after: flanksource/duty#1533
resolves: flanksource/duty#1770

Summary by CodeRabbit

  • New Features

    • Stable, deterministic event IDs and resource-scoped identifiers for improved tracking and idempotent notification delivery
    • Events are automatically enriched with related resource context (components, checks, configs, canaries)
  • Refactor

    • Centralized event resource construction and unified event-queue conflict handling for more consistent processing
  • Tests

    • Comprehensive end-to-end tests for event resource generation and scheduler behavior

@adityathebe adityathebe force-pushed the fix/notification-event-payload branch from d8f62af to 31512d2 Compare August 26, 2025 06:11
@adityathebe adityathebe requested a review from Copilot August 26, 2025 06:12
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the notification event payload structure to support notification idempotency by separating the notification ID from the resource ID and adding automatic ID generation.

Key changes:

  • Splits the payload's ID field into ID (for notification identification) and ResourceID (for resource identification)
  • Adds WithIDSet() method to generate deterministic notification IDs based on notification context
  • Updates all references throughout the codebase to use the appropriate ID field

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
notification/send.go Refactors NotificationEventPayload struct to separate notification and resource IDs, adds ID generation logic
notification/job.go Updates logging and event processing to use ResourceID instead of generic ID
notification/events.go Updates all notification processing logic to use the correct ID fields and calls WithIDSet() for ID generation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@adityathebe adityathebe force-pushed the fix/notification-event-payload branch from 31512d2 to 819e0c8 Compare August 26, 2025 16:41
@adityathebe adityathebe requested a review from moshloop August 26, 2025 16:42
@adityathebe adityathebe force-pushed the fix/notification-event-payload branch 4 times, most recently from 0948ae5 to d1271f6 Compare August 29, 2025 05:54
@adityathebe adityathebe force-pushed the fix/notification-event-payload branch from f10666c to 638b796 Compare February 10, 2026 14:50
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
notification/events.go (1)

329-331: ⚠️ Potential issue | 🟡 Minor

Use duty.Now() instead of time.Now() for the NotBefore timestamp.

Line 331 uses time.Now() which should be duty.Now() per project conventions for database timestamps.

Suggested fix
-				NotBefore:                 lo.ToPtr(time.Now().Add(*n.WaitFor)),
+				NotBefore:                 lo.ToPtr(duty.Now().Add(*n.WaitFor)),

Based on learnings: Use duty.Now() instead of time.Now() for database timestamps and soft deletes

notification/job.go (1)

320-325: 🛠️ Refactor suggestion | 🟠 Major

Inconsistent event reconstruction: inline construction here vs. ParentEvent() in processPendingNotification.

Line 320 constructs the event inline while processPendingNotification (line 410) uses payload.ParentEvent(). The ParentEvent() method also sets Properties["config_id"] for config.changed/config.updated events, providing a safety net when payload.Properties is empty. Consider using ParentEvent() here too for consistency.

Suggested refactor
-	originalEvent := models.Event{Name: payload.EventName, EventID: payload.EventID, CreatedAt: payload.EventCreatedAt}
+	originalEvent := payload.ParentEvent()
🤖 Fix all issues with AI agents
In `@events/resource.go`:
- Line 42: The doc comment currently mentions the unexported name
buildEventResource while the function is exported; update the comment to use the
exported identifier and proper Go doc format—e.g., change the comment to start
with "BuildEventResource creates an EventResource..." and place it immediately
above the BuildEventResource function to match Go documentation conventions and
the exported symbol.
- Around line 46-60: The DB lookup error handling in the Event resource loader
(involving ctx.DB().Where(...).First(&eventResource.Check),
.First(&eventResource.Canary), etc.) currently replaces all errors with a
generic dutyAPI.ENOTFOUND; update each lookup to distinguish
gorm.ErrRecordNotFound (use errors.Is(err, gorm.ErrRecordNotFound)) from other
DB errors: if record-not-found return dutyAPI.Errorf(dutyAPI.ENOTFOUND, ...) as
before, but for any other err return that original err (or wrap it with context)
so the caller sees real connection/timeout/DB errors; apply the same change to
every call site in this function and keep duty.CheckSummary logic intact.

In `@go.mod`:
- Line 439: Remove or comment out the local replace directive "replace
github.com/flanksource/duty => ../duty" so the module doesn't require a local
checkout; instead update the go.mod requirement for github.com/flanksource/duty
to the published version that contains the changes from duty PR `#1533` (use the
released semver or commit hash for that PR). Ensure the replace line is not left
active and run "go mod tidy" to verify dependency resolution succeeds without
the local path.

In `@incidents/responder/responder_events.go`:
- Around line 61-62: generateCommentAddedAsyncEvent reads event.EventID into
commentID but when building downstream events (the ones that set
Properties["id"]) it never populates the EventID field, causing handlers like
reconcileCommentEvent to receive a zero UUID; update
generateCommentAddedAsyncEvent so each created models.Event has its EventID set
to the original event.EventID (or parsed from commentID) before
publishing/returning, ensuring EventID and Properties["id"] remain consistent
for functions such as reconcileCommentEvent to operate correctly.
- Line 32: The downstream events currently only put the responder ID into
Properties["id"] which lets GORM generate a different api.Event.ID, so
reconcileResponderEvent reads the wrong ID; in generateResponderAddedAsyncEvent
(and the equivalent responder comment event generator) parse the responder ID
string into a UUID and assign it to the api.Event.ID field (instead of leaving
it unset) before creating the event, and handle parse errors; this ensures
reconcileResponderEvent (which uses event.ID/EventID.String() semantics) can
look up the responder correctly.

In `@notification/events.go`:
- Around line 920-924: The code reads configID from event.Properties for events
named api.EventConfigChanged or api.EventConfigUpdated and then calls
query.GetCachedConfig; add an explicit non-empty check for configID before
invoking query.GetCachedConfig to avoid silent failures—if configID == "" skip
the cache lookup (or return/log an error) so you don't call
query.GetCachedConfig with an empty string; reference the event.Name check and
the variables configID and the call to query.GetCachedConfig (and note
ParentEvent() in send.go is not sufficient to guarantee properties) when
applying the guard.

In `@playbook/scheduler_test.go`:
- Around line 447-473: The test name is misleading: the TestCase.Name says
"config.updated" but the test actually creates a models.ConfigChange in
DatabaseChange and asserts api.EventConfigChanged in ExpectedEvents; update the
TestCase.Name to reflect the actual behavior (e.g., "should generate correct
EventResource for organically triggered config.changed event" or similar) so it
matches the DatabaseChange logic and ExpectedEvents in the testEvent invocation.
- Around line 486-488: In the DatabaseChange callback (the anonymous func
DatabaseChange(ctx context.Context, res TestResource) error) replace use of
time.Now() with duty.Now() when updating the deleted_at column (i.e., change
UpdateColumn("deleted_at", time.Now()) to UpdateColumn("deleted_at",
duty.Now())); also ensure the duty package is imported where this test lives so
the symbol duty.Now() resolves.
🧹 Nitpick comments (2)
notification/send.go (1)

80-99: GenerateEventID silently discards the error from hash.DeterministicUUID and doesn't cover PlaybookID recipients.

Two observations:

  1. The error from hash.DeterministicUUID is silently discarded on line 97. If it ever fails, generated would be a zero UUID, potentially causing silent deduplication collisions.

  2. When PlaybookID is set as the recipient (no Connection, Person, Team, or CustomService), recipientSig remains empty, making the generated ID independent of the recipient. This may be intentional for playbooks (only one run per event), but worth confirming.

Suggested improvement for error handling
-	generated, _ := hash.DeterministicUUID(sig)
-	return generated
+	generated, err := hash.DeterministicUUID(sig)
+	if err != nil {
+		// Fallback to a random UUID if deterministic generation fails
+		return uuid.New()
+	}
+	return generated

,

events/resource.go (1)

67-80: Duplicate switch cases can be consolidated.

The EventConfigHealthy/Unhealthy/Warning/Unknown/Degraded case (Lines 67-70) and the EventConfigCreated/Deleted case (Lines 72-75) have identical bodies — both fetch a ConfigItem by event.EventID. Merging them into a single case reduces duplication.

Suggested consolidation
-	case api.EventConfigHealthy, api.EventConfigUnhealthy, api.EventConfigWarning, api.EventConfigUnknown, api.EventConfigDegraded:
-		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.EventID).First(&eventResource.Config).Error; err != nil {
-			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.EventID)
-		}
-
-	case api.EventConfigCreated, api.EventConfigDeleted:
+	case api.EventConfigHealthy, api.EventConfigUnhealthy, api.EventConfigWarning, api.EventConfigUnknown, api.EventConfigDegraded,
+		api.EventConfigCreated, api.EventConfigDeleted:
 		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.EventID).First(&eventResource.Config).Error; err != nil {
 			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.EventID)
 		}

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
playbook/events.go (1)

356-357: ⚠️ Potential issue | 🟡 Minor

Use duty.Now() instead of time.Now() for database timestamps.

These timestamps are stored in a PlaybookRun record. Per coding guidelines, database timestamps should use duty.Now().

♻️ Proposed fix
-				StartTime:  lo.ToPtr(time.Now()),
-				EndTime:    lo.ToPtr(time.Now()),
+				StartTime:  lo.ToPtr(duty.Now()),
+				EndTime:    lo.ToPtr(duty.Now()),

(Ensure duty is imported: "github.com/flanksource/duty")

As per coding guidelines, "Use duty.Now() instead of time.Now() for database timestamps and soft deletes".

🧹 Nitpick comments (3)
playbook/events.go (1)

219-226: Fallback chain is reasonable but verify behavior when all lookups are empty.

If playbook_id, id, and EventID are all zero/empty, playbookID will be "00000000-0000-0000-0000-000000000000", and the query on line 287 will return a "record not found" error. This is safe but produces a less descriptive error than an explicit validation. Consider adding a guard.

notification/send.go (1)

81-100: GenerateEventID produces a good deterministic signature, but CustomService.Name alone may not be unique.

If two custom notification services share the same Name but differ in configuration (URL, connection, etc.), they'd produce the same EventID, potentially colliding. This is likely a very rare edge case given that Name is typically unique within a notification's custom services.

playbook/actions/ai.go (1)

291-296: The "id" property is redundant alongside "playbook_id".

Both "id" and "playbook_id" are set to playbook.ID.String(). In playbook/events.go (lines 220-226), onNewRun checks playbook_id first, then falls back to id. The fallback exists for backward compatibility with events from other sources, but no other code in the codebase directly consumes Properties["id"]. Removing "id" from this code is safe and would clarify that playbook_id is the canonical property.

AI actions that trigger multiple context-provider playbooks all used the
same ActionID as event_id. Since event_queue has a unique constraint on
(name, event_id), the second child playbook insert would fail with a
constraint violation, silently preventing all but the first child run.

Derive a deterministic UUID from (ActionID, target playbook ID) so each
child gets its own dedupe slot while remaining idempotent on retry.
Plain Create(&event) on the playbook.run insert means a duplicate
dispatch replay (e.g. async consumer tx.Commit failure) would hard-fail
with a unique constraint error instead of being handled gracefully.

Switch to EventQueueOnConflictClause so replays reset attempts and
created_at rather than crashing.
Include source trigger context in notification.send event_id generation to reduce unintended collisions across event types and recipient configs.

Use EventQueueOnConflictClause when AI actions enqueue playbook.run events so retries are idempotent under (name,event_id) uniqueness.
@flanksource flanksource deleted a comment from coderabbitai bot Feb 11, 2026
@adityathebe adityathebe force-pushed the fix/notification-event-payload branch from df59fbf to 665f6c8 Compare February 11, 2026 07:18
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 11, 2026

Caution

Review failed

The head commit changed during the review from dc6a3fc to 90bc618.

Walkthrough

Adds a stable EventID to the Event model and updates event creation, queuing, resource resolution, notification payloads, playbook actions, and responder logic to use EventID and ResourceID. Introduces events.BuildEventResource and centralizes event-queue conflict handling.

Changes

Cohort / File(s) Summary
Event model
api/event.go
Added EventID uuid.UUID to Event and populated it in ToPostQEvent.
Event queue & resources
events/event_queue.go, events/resource.go
Switched OnConflict clause to dynamic models.EventQueueUniqueConstraint()/events.EventQueueOnConflictClause; added events.EventResource and BuildEventResource(ctx,event) with AsMap().
Notification payloads & send logic
notification/send.go, notification/events.go, notification/job.go, notification/silence.go
Replaced payload ID with ResourceID, added EventID, added GenerateEventID() and ParentEvent(), updated notification construction/lookup/logging to use EventID/ResourceID, and use centralized event-queue clause.
Notification command
cmd/notification.go
Populate EventID in event payloads; switch logs and payload fields from ID to ResourceID and EventID; include config_id when applicable.
Responder & incidents
incidents/responder/responder_events.go
Use event.EventID.String() for responder/comment lookups; create propagated api.Event with explicit EventID and Properties.
Playbook integration
playbook/actions/ai.go, playbook/events.go, playbook/runner/exec.go
Add actionID to NewAIAction, compute EventID for playbook events, remove local EventResource in favor of events.BuildEventResource, and update playbook ID extraction to prefer properties["playbook_id"] then EventID.
Tests
playbook/events_test.go, playbook/scheduler_test.go
Updated tests to use events.EventResource; added end-to-end scheduler tests validating EventResource generation for checks, components, and configs.
Module config
go.mod
Enabled local replace directive for github.com/flanksource/duty.

Sequence Diagram(s)

sequenceDiagram
    participant Source as Event Source
    participant DB as Database
    participant Events as events.BuildEventResource
    participant Notif as Notification Service
    participant Responder as Responder/Playbook

    Source->>DB: Create models.Event (includes EventID)
    DB-->>Source: Persisted event (queued)
    Events->>DB: Load related resources by EventID / properties
    DB-->>Events: Return Component/Config/Check/Canary
    Events->>Notif: Provide EventResource / map
    Notif->>DB: Create NotificationEventPayload (ResourceID + EventID) and ParentEvent()
    Notif->>DB: Enqueue/send notification using EventID
    Responder->>DB: Lookup responder/comment using EventID
    DB-->>Responder: Return responder/comment records
Loading

Possibly related PRs

Suggested labels

ready

Suggested reviewers

  • moshloop
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.04% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The PR title 'fix: notification event payload' is vague and generic. It does not clearly convey what specific issue is being fixed or what the main change entails. Revise the title to be more specific and descriptive, such as 'fix: refactor event identification to use EventID instead of properties[id]' or 'fix: add EventID field to Event model and update notification payload handling'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/notification-event-payload

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
incidents/responder/responder_events.go (1)

94-119: ⚠️ Potential issue | 🟠 Major

Multiple responders of the same type will collide in the event_queue due to identical (Name, EventID) pairs.

When iterating over responders for comment events (lines 94–119), all events of the same type share the same (Name, EventID). The ON CONFLICT clause resets attempts, last_attempt, created_at, and error, but does not update Properties, so only the first responder's responder_id is preserved. Currently, reconcileCommentEvent re-queries all responders from the database, masking the issue. However, the TODO at line 174 indicates an intended refactoring to use responder_id from event properties, which would expose this as a critical bug where only one responder per type is notified.

The same concern applies to responder-added events (lines 46–63), though having multiple integrations of the same type is less common there.

🤖 Fix all issues with AI agents
In `@events/resource.go`:
- Around line 47-67: The Find() calls are passing pointers to potentially nil
struct fields (e.g., eventResource.Check and eventResource.Canary), causing
possible nil-pointer panics when you later access .ID; instead, allocate local
non-nil variables (e.g., var chk models.Check; var canary models.Canary) and use
ctx.DB().Where(...).Limit(1).Find(&chk) (or First(&chk) and check for
gorm.ErrRecordNotFound), validate chk.ID against uuid.Nil and only then assign
eventResource.Check = chk (same pattern for eventResource.Canary,
eventResource.Component, and eventResource.Config); also preserve the existing
duty.CheckSummary call (use checkID.String()) and return the same dutyAPI errors
when records are missing.
🧹 Nitpick comments (2)
events/resource.go (2)

77-99: Config health events and config created/deleted share identical lookup logic.

Lines 77-83 and 85-91 are exact duplicates — both look up ConfigItem by event.EventID. Consider merging these cases to reduce duplication:

♻️ Suggested consolidation
-	case api.EventConfigHealthy, api.EventConfigUnhealthy, api.EventConfigWarning, api.EventConfigUnknown, api.EventConfigDegraded:
-		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.EventID).Limit(1).Find(&eventResource.Config).Error; err != nil {
-			return eventResource, err
-		}
-		if eventResource.Config.ID == uuid.Nil {
-			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.EventID)
-		}
-
-	case api.EventConfigCreated, api.EventConfigDeleted:
-		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.EventID).Limit(1).Find(&eventResource.Config).Error; err != nil {
-			return eventResource, err
-		}
-		if eventResource.Config.ID == uuid.Nil {
-			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.EventID)
-		}
+	case api.EventConfigHealthy, api.EventConfigUnhealthy, api.EventConfigWarning, api.EventConfigUnknown, api.EventConfigDegraded,
+		api.EventConfigCreated, api.EventConfigDeleted:
+		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.EventID).Limit(1).Find(&eventResource.Config).Error; err != nil {
+			return eventResource, err
+		}
+		if eventResource.Config.ID == uuid.Nil {
+			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.EventID)
+		}

93-99: Missing validation: config_id property must exist for config.changed/config.updated events.

If event.Properties["config_id"] is empty (missing key), the query Where("id = ?", "") will match zero rows and return a generic "not found" error. Consider validating the property exists upfront to provide a more actionable error message.

🛡️ Suggested validation
 	case api.EventConfigChanged, api.EventConfigUpdated:
+		configID := event.Properties["config_id"]
+		if configID == "" {
+			return eventResource, dutyAPI.Errorf(dutyAPI.EINVALID, "config_id property is required for %s event", event.Name)
+		}
-		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", event.Properties["config_id"]).Limit(1).Find(&eventResource.Config).Error; err != nil {
+		if err := ctx.DB().Model(&models.ConfigItem{}).Where("id = ?", configID).Limit(1).Find(&eventResource.Config).Error; err != nil {
 			return eventResource, err
 		}
 		if eventResource.Config.ID == uuid.Nil {
-			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", event.Properties["config_id"])
+			return eventResource, dutyAPI.Errorf(dutyAPI.ENOTFOUND, "config(id=%s) not found", configID)
 		}

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@notification/events.go`:
- Around line 849-851: The DB error returned by ctx.DB().Where("id = ?",
event.EventID).Find(&comment).Error is being discarded; update the error return
in the block that queries for comment (using ctx.DB().Where / Find and
event.EventID) to include and wrap the original err (e.g., use fmt.Errorf with
%w or include err in the formatted message) so the underlying database error is
preserved when returning from this function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Too large event queue properties for the PostgreSQL B-tree index event_queue_name_properties

2 participants