-
Notifications
You must be signed in to change notification settings - Fork 260
feat(metrics): add Git sync observability metrics to feature flag backend #4673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…lity by adding detailed synchronization metrics. Currently, failures during Git sync are only logged without metric visibility, limiting proactive monitoring and alerting capabilities. - Introduce new OpenTelemetry metrics for Git sync operations: - Last sync time as an observable gauge (timestamp). - Sync duration histogram. - Counters for number of flags fetched. - Success and failure counts with failure reason attributes. - Instrument the `SnapshotStore.update` method, the core sync loop, to record these metrics accurately on every sync attempt, including partial failures and cleanups. - Extend the `Snapshot` type with `TotalFlagsCount()` to count all flags across namespaces for metric reporting. - Integrate metrics initialization in app startup ensuring consistent telemetry setup. - Improve test coverage by suggesting strategies to verify metric emission and sync behavior. These metric additions enable operators to monitor Git sync health, detect failures promptly, and troubleshoot issues efficiently, significantly improving runtime observability and system reliability. Signed-off-by: Rohit Jaiswal <[email protected]>
583922f
to
35edf0c
Compare
Signed-off-by: Rohit Jaiswal <[email protected]>
internal/metrics/metrics.go
Outdated
func init() { | ||
if otel.GetMeterProvider() == nil { | ||
otel.SetMeterProvider(metricnoop.NewMeterProvider()) | ||
} | ||
} | ||
|
||
func InitMetrics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think these metrics should live here. we already have metrics for git in internal/storage/git/metrics.go
. could we not add the necessary metrics there instead? that way they dont need to be exported and we dont need the Init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry I didn't realize this was for v1 not v2 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markphelps yeah, these changes are for v1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markphelps I notice that the basic workflows like unit tests, lint, etc require approvals from maintainer to run. Could we make them run automatically on each branch push? It would be helpful to speed up the development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohitnarayan internal/cache/metrics.go
is a good example in v1
.
You could run linters and tests locally with mage
. Please run mage -l
to see all available tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do still think these git specific metrics should be moved to the git package (https://github.com/flipt-io/flipt/tree/main/internal/storage/fs/git)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that the basic workflows like unit tests, lint, etc require approvals from maintainer to run. Could we make them run automatically on each branch push? It would be helpful to speed up the development.
@rohitnarayan this is a standard practice in most open source projects on GitHub for first contributors. After your first PR is merged I dont think it will require approvals from maintainers to run the workflows
Signed-off-by: Rohit Jaiswal <[email protected]>
76f986f
to
d399be0
Compare
cmd/flipt/main.go
Outdated
@@ -105,6 +106,8 @@ func exec() error { | |||
return err | |||
} | |||
|
|||
metrics.InitMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer to not do this init here and just use the regular package level init
in the git metrics package
internal/metrics/metrics.go
Outdated
func init() { | ||
if otel.GetMeterProvider() == nil { | ||
otel.SetMeterProvider(metricnoop.NewMeterProvider()) | ||
} | ||
} | ||
|
||
func InitMetrics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do still think these git specific metrics should be moved to the git package (https://github.com/flipt-io/flipt/tree/main/internal/storage/fs/git)
internal/metrics/metrics.go
Outdated
func init() { | ||
if otel.GetMeterProvider() == nil { | ||
otel.SetMeterProvider(metricnoop.NewMeterProvider()) | ||
} | ||
} | ||
|
||
func InitMetrics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that the basic workflows like unit tests, lint, etc require approvals from maintainer to run. Could we make them run automatically on each branch push? It would be helpful to speed up the development.
@rohitnarayan this is a standard practice in most open source projects on GitHub for first contributors. After your first PR is merged I dont think it will require approvals from maintainers to run the workflows
Signed-off-by: Rohit Jaiswal <[email protected]>
@markphelps @erka I've addressed comments from both of you. Please review when you get time. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive Git synchronization observability metrics to Flipt's feature flag backend, enabling better monitoring and alerting for Git sync operations. Previously, sync failures were only logged without metric visibility.
- Introduces OpenTelemetry metrics including sync duration histograms, flag count counters, success/failure rates, and last sync timestamp gauge
- Instruments the core
SnapshotStore.update
method to emit metrics on every sync attempt - Adds
TotalFlagsCount()
method to theSnapshot
type for accurate flag counting across namespaces
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
internal/storage/fs/git/metrics.go | New metrics definitions and observation functions for Git sync operations |
internal/storage/fs/git/metrics_test.go | Test coverage for all metric observation functions |
internal/storage/fs/git/store.go | Instrumentation of sync operations with timing, flag counting, and error tracking |
internal/storage/fs/snapshot.go | Addition of TotalFlagsCount method for cross-namespace flag counting |
internal/storage/fs/snapshot_test.go | Test cases for the new TotalFlagsCount functionality |
internal/metrics/metrics.go | Export of Meter function for external metric creation |
DEVELOPMENT.md | Minor formatting improvement to numbered list |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
internal/storage/fs/git/metrics.go
Outdated
lastSyncTimeMu.RLock() | ||
value := lastSyncTimeValue | ||
lastSyncTimeMu.RUnlock() | ||
observer.ObserveInt64(LastTime, value/1e9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic number 1e9 should be replaced with a named constant like nanosPerSecond = 1e9
to improve code readability and maintainability.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
internal/storage/fs/git/store.go
Outdated
duration := time.Since(syncStart).Seconds() | ||
|
||
if len(errs) > 0 { | ||
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using fmt.Sprintf(\"%v\", errs)
creates unclear error messages. Consider using errors.Join(errs...).Error()
or a more descriptive error formatting approach that provides better insight into the failure reasons.
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) | |
ObserveSync(ctx, duration, flagsFetched, false, syncType, errors.Join(errs...).Error()) |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those errors probably shouldn’t be included in metrics at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed them.
internal/storage/fs/git/store.go
Outdated
if len(errs) > 0 { | ||
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) | ||
return true, errors.Join(errs...) | ||
} | ||
return true, errors.Join(errs...) | ||
|
||
ObserveSync(ctx, duration, flagsFetched, true, syncType, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The success metrics call on line 414 is unreachable code because the function returns on line 411 when errors exist. This means successful syncs after error recovery won't be properly recorded. Move the success case before the error handling block.
if len(errs) > 0 { | |
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) | |
return true, errors.Join(errs...) | |
} | |
return true, errors.Join(errs...) | |
ObserveSync(ctx, duration, flagsFetched, true, syncType, "") | |
ObserveSync(ctx, duration, flagsFetched, true, syncType, "") | |
if len(errs) > 0 { | |
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) | |
return true, errors.Join(errs...) | |
} |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
internal/storage/fs/git/store.go
Outdated
@@ -294,6 +302,10 @@ func (s *SnapshotStore) View(ctx context.Context, storeRef storage.Reference, fn | |||
return fn(snap) | |||
} | |||
|
|||
func (s *SnapshotStore) Resolve(ref string) (plumbing.Hash, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete it as it isn't in use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
internal/storage/fs/git/store.go
Outdated
duration := time.Since(syncStart).Seconds() | ||
|
||
if len(errs) > 0 { | ||
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those errors probably shouldn’t be included in metrics at all.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4673 +/- ##
=======================================
Coverage 63.83% 63.83%
=======================================
Files 171 172 +1
Lines 17617 17659 +42
=======================================
+ Hits 11245 11273 +28
- Misses 5700 5709 +9
- Partials 672 677 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Rohit Jaiswal <[email protected]>
Signed-off-by: Rohit Jaiswal <[email protected]>
Signed-off-by: Rohit Jaiswal <[email protected]>
Signed-off-by: Roman Dmytrenko <[email protected]>
@erka @markphelps Please can you approve the workflow. Thank you! |
@markphelps @erka All checks have passed. Can we merge this ? |
internal/storage/fs/git/store.go
Outdated
if !updated && fetchErr == nil { | ||
// No update and no error: record metrics for a successful no-change sync | ||
duration := time.Since(syncStart).Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be milliseconds? im not sure which is more common (sub second syncing or it taking longer than a second)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markphelps time.Since(syncStart).Seconds()
already provides sub-second precision as a float64
, capturing durations down to milliseconds and microseconds, which is sufficient for metrics without needing conversion.
Hey @rohitnarayan ! Thank you again for adding Git sync observability! Instead of commenting on each line/block I just figured I'd give an overview of requested changes in this comment, as there are a few inconsistencies with our existing metrics patterns Critical Issues1. Inconsistent Metric Naming Conventions
// Inconsistent - has redundant _count suffix
prometheus.BuildFQName(namespace, subsystem, "success_count")
prometheus.BuildFQName(namespace, subsystem, "failure_count")
// Also inconsistent - no suffix
prometheus.BuildFQName(namespace, subsystem, "flags_fetched") Expected Pattern (based on existing metrics in // Should be consistent without redundant suffixes
prometheus.BuildFQName(namespace, subsystem, "success")
prometheus.BuildFQName(namespace, subsystem, "error") // See next issue
prometheus.BuildFQName(namespace, subsystem, "flags_fetched") 2. Wrong Error Terminology
Should be: prometheus.BuildFQName(namespace, subsystem, "error") Major Issues3. Complex Observable Gauge Implementation
Current approach: func init() {
m := metrics.Meter()
// Complex manual setup with panic handling
} Existing pattern (see other metrics files): Simple variable declarations using helpers. 4. Missing Unit Specification
Duration = metrics.MustFloat64().
Histogram(
prometheus.BuildFQName(namespace, subsystem, "duration_seconds"),
metric.WithDescription("The duration of git sync operations in seconds"),
metric.WithUnit("s"), // Add this
) Minor Issues5. Missing Attribute Constants
// Add to top of file
var (
AttributeSyncType = attribute.Key("sync_type")
)
// Then use consistently
Success.Add(ctx, 1, metric.WithAttributeSet(
attribute.NewSet(AttributeSyncType.String(typ)),
)) 6. Inconsistent API Usage
Summary of Required Changes
|
Signed-off-by: Rohit Jaiswal <[email protected]>
Signed-off-by: Rohit Jaiswal <[email protected]>
@markphelps thanks for those comments. I've addressed them, please check. Thank you! |
refactor: simplify git sync metrics
@erka @markphelps please can you re-run the approval workflows. Thank you! |
Signed-off-by: Rohit Jaiswal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great to me! thank you @rohitnarayan for bearing with us!! and thank you for the contribution
Signed-off-by: Roman Dmytrenko <[email protected]>
@all-contributors please add @rohitnarayan for code |
I've put up a pull request to add @rohitnarayan! 🎉 |
This change enhances Flipt's Git-based feature flag backend observability by adding detailed synchronization metrics. Currently, failures during Git sync are only logged without metric visibility, limiting proactive monitoring and alerting capabilities.
These metric additions enable operators to monitor Git sync health, detect failures promptly, and troubleshoot issues efficiently, significantly improving runtime observability and system reliability