rohitnarayan

This change enhances Flipt's Git-based feature flag backend observability by adding detailed synchronization metrics. Currently, failures during Git sync are only logged without metric visibility, limiting proactive monitoring and alerting capabilities.

- Introduce new OpenTelemetry metrics for Git sync operations:
  - Last sync time as an observable gauge (timestamp).
  - Sync duration histogram.
  - Counters for number of flags fetched.
  - Success and failure counts with failure reason attributes.

- Instrument the `SnapshotStore.update` method, the core sync loop, to record these metrics accurately on every sync attempt, including partial failures and cleanups.

- Extend the `Snapshot` type with `TotalFlagsCount()` to count all flags across namespaces for metric reporting.

- Integrate metrics initialization in app startup ensuring consistent telemetry setup.

- Improve test coverage by suggesting strategies to verify metric emission and sync behavior.

These metric additions enable operators to monitor Git sync health, detect failures promptly, and troubleshoot issues efficiently, significantly improving runtime observability and system reliability.
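As a rough illustration of the `TotalFlagsCount()` idea described above, the sketch below sums flags across namespaces. The `snapshot` and `namespace` types and their fields are hypothetical stand-ins, not Flipt's actual `Snapshot` layout.

```go
package main

import "fmt"

// namespace is a hypothetical stand-in for one namespace's contents.
type namespace struct {
	flags map[string]struct{} // flag key -> placeholder value
}

// snapshot is a hypothetical stand-in for the real Snapshot type.
type snapshot struct {
	ns map[string]*namespace // namespace key -> contents
}

// totalFlagsCount sums the number of flags held by every namespace.
func (s *snapshot) totalFlagsCount() int {
	total := 0
	for _, n := range s.ns {
		if n == nil {
			continue // tolerate empty entries
		}
		total += len(n.flags)
	}
	return total
}

// newExampleSnapshot builds a snapshot with two flags in one namespace and one in another.
func newExampleSnapshot() *snapshot {
	return &snapshot{ns: map[string]*namespace{
		"default": {flags: map[string]struct{}{"a": {}, "b": {}}},
		"staging": {flags: map[string]struct{}{"c": {}}},
	}}
}

func main() {
	fmt.Println(newExampleSnapshot().totalFlagsCount())
}
```

The count is a plain sum, so a flag key that appears in two namespaces is counted twice in this sketch.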

@rohitnarayan rohitnarayan requested a review from a team as a code owner September 1, 2025 06:53
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Sep 1, 2025
@rohitnarayan rohitnarayan changed the base branch from v2 to main September 1, 2025 06:53
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 1, 2025

Signed-off-by: Rohit Jaiswal <[email protected]>
@rohitnarayan rohitnarayan force-pushed the flipt-4590_git_metrics branch from 583922f to 35edf0c Compare September 1, 2025 11:49
func init() {
	if otel.GetMeterProvider() == nil {
		otel.SetMeterProvider(metricnoop.NewMeterProvider())
	}
}

func InitMetrics() {
Collaborator

I don't think these metrics should live here. We already have metrics for git in internal/storage/git/metrics.go. Could we add the necessary metrics there instead? That way they don't need to be exported and we don't need the Init.

Collaborator

Oh sorry I didn't realize this was for v1 not v2 😬

Author

@markphelps yeah, these changes are for v1.

Author

@markphelps I notice that the basic workflows like unit tests, lint, etc require approvals from maintainer to run. Could we make them run automatically on each branch push? It would be helpful to speed up the development.

Contributor

@rohitnarayan internal/cache/metrics.go is a good example in v1.

You could run linters and tests locally with mage. Please run mage -l to see all available tasks.

Collaborator

I do still think these git specific metrics should be moved to the git package (https://github.com/flipt-io/flipt/tree/main/internal/storage/fs/git)

Collaborator

> I notice that the basic workflows like unit tests, lint, etc require approvals from maintainer to run. Could we make them run automatically on each branch push? It would be helpful to speed up the development.

@rohitnarayan this is standard practice in most open source projects on GitHub for first-time contributors. After your first PR is merged, I don't think the workflows will require maintainer approval to run.

@rohitnarayan rohitnarayan force-pushed the flipt-4590_git_metrics branch from 76f986f to d399be0 Compare September 2, 2025 01:41
@@ -105,6 +106,8 @@ func exec() error {
return err
}

metrics.InitMetrics()
Collaborator

I would prefer not to do this init here and instead use the regular package-level init in the git metrics package.

@rohitnarayan
Author

@markphelps @erka I've addressed comments from both of you. Please review when you get time. Thank you!

@erka erka requested a review from Copilot September 3, 2025 12:59
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive Git synchronization observability metrics to Flipt's feature flag backend, enabling better monitoring and alerting for Git sync operations. Previously, sync failures were only logged without metric visibility.

  • Introduces OpenTelemetry metrics including sync duration histograms, flag count counters, success/failure rates, and last sync timestamp gauge
  • Instruments the core SnapshotStore.update method to emit metrics on every sync attempt
  • Adds TotalFlagsCount() method to the Snapshot type for accurate flag counting across namespaces

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/storage/fs/git/metrics.go | New metrics definitions and observation functions for Git sync operations |
| internal/storage/fs/git/metrics_test.go | Test coverage for all metric observation functions |
| internal/storage/fs/git/store.go | Instrumentation of sync operations with timing, flag counting, and error tracking |
| internal/storage/fs/snapshot.go | Addition of TotalFlagsCount method for cross-namespace flag counting |
| internal/storage/fs/snapshot_test.go | Test cases for the new TotalFlagsCount functionality |
| internal/metrics/metrics.go | Export of Meter function for external metric creation |
| DEVELOPMENT.md | Minor formatting improvement to numbered list |

lastSyncTimeMu.RLock()
value := lastSyncTimeValue
lastSyncTimeMu.RUnlock()
observer.ObserveInt64(LastTime, value/1e9)
Copilot AI Sep 3, 2025

Magic number 1e9 should be replaced with a named constant like nanosPerSecond = 1e9 to improve code readability and maintainability.
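A minimal sketch of the suggestion, assuming the last sync time is kept as Unix nanoseconds behind a read/write mutex as in the snippet above. The `lastSync` type and its methods are hypothetical names used only for illustration; the constant is derived from `time.Second` rather than typed out as `1e9`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nanosPerSecond names the conversion factor instead of using a bare 1e9.
const nanosPerSecond = int64(time.Second / time.Nanosecond)

// lastSync holds the time of the most recent sync as Unix nanoseconds.
type lastSync struct {
	mu    sync.RWMutex
	nanos int64
}

// set stores the given time under the write lock.
func (l *lastSync) set(t time.Time) {
	l.mu.Lock()
	l.nanos = t.UnixNano()
	l.mu.Unlock()
}

// seconds returns the stored time as whole Unix seconds (fraction truncated).
func (l *lastSync) seconds() int64 {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.nanos / nanosPerSecond
}

// exampleLastSync returns a value set to a fixed instant with a half-second fraction.
func exampleLastSync() *lastSync {
	l := &lastSync{}
	l.set(time.Unix(1756900000, 500_000_000))
	return l
}

func main() {
	fmt.Println(exampleLastSync().seconds())
}
```

Storing `t.Unix()` directly would avoid the division altogether; the sketch keeps nanoseconds only to mirror the snippet under review.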

Author

Fixed

duration := time.Since(syncStart).Seconds()

if len(errs) > 0 {
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs))
Copilot AI Sep 3, 2025

Using fmt.Sprintf("%v", errs) creates unclear error messages. Consider using errors.Join(errs...).Error() or a more descriptive error formatting approach that provides better insight into the failure reasons.

Suggested change
- ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs))
+ ObserveSync(ctx, duration, flagsFetched, false, syncType, errors.Join(errs...).Error())
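To make the difference between the two formatting options concrete, here is a small standard-library sketch. The helper names and the sample errors are made up for illustration. Note that `errors.Join` returns nil when it receives no non-nil errors, so the sketch guards the `.Error()` call.

```go
package main

import (
	"errors"
	"fmt"
)

// formatWithSprintf renders the slice the way fmt.Sprintf("%v", errs) does.
func formatWithSprintf(errs []error) string {
	return fmt.Sprintf("%v", errs)
}

// formatWithJoin renders the slice via errors.Join, guarding against a nil result.
func formatWithJoin(errs []error) string {
	joined := errors.Join(errs...)
	if joined == nil {
		return "" // calling Error() on a nil error would panic
	}
	return joined.Error()
}

// exampleErrs returns two sample errors (hypothetical messages).
func exampleErrs() []error {
	return []error{errors.New("fetch failed"), errors.New("build failed")}
}

func main() {
	errs := exampleErrs()
	fmt.Printf("%q\n", formatWithSprintf(errs)) // "[fetch failed build failed]"
	fmt.Printf("%q\n", formatWithJoin(errs))    // "fetch failed\nbuild failed"
}
```

Either form yields free-text strings that change with the underlying errors, which is worth keeping in mind before using them as metric attribute values.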

Contributor

Those errors probably shouldn’t be included in metrics at all.

Author

Removed them.

Comment on lines 407 to 412
if len(errs) > 0 {
ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs))
return true, errors.Join(errs...)
}
return true, errors.Join(errs...)

ObserveSync(ctx, duration, flagsFetched, true, syncType, "")
Copilot AI Sep 3, 2025

The success metrics call on line 414 is unreachable code because the function returns on line 411 when errors exist. This means successful syncs after error recovery won't be properly recorded. Move the success case before the error handling block.

Suggested change
- if len(errs) > 0 {
- ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs))
- return true, errors.Join(errs...)
- }
- return true, errors.Join(errs...)
- ObserveSync(ctx, duration, flagsFetched, true, syncType, "")
+ ObserveSync(ctx, duration, flagsFetched, true, syncType, "")
+ if len(errs) > 0 {
+ ObserveSync(ctx, duration, flagsFetched, false, syncType, fmt.Sprintf("%v", errs))
+ return true, errors.Join(errs...)
+ }
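For reference, one way to structure the end of a sync attempt so that exactly one outcome is recorded is sketched below. This is not the code that was merged; `recorder` and `finishSync` are hypothetical stand-ins for the metric calls and the tail of the update method.

```go
package main

import (
	"errors"
	"fmt"
)

// recorder is a hypothetical stand-in for the metrics calls; it only counts outcomes.
type recorder struct {
	successes int
	failures  int
}

// observe records a single sync outcome.
func (r *recorder) observe(success bool) {
	if success {
		r.successes++
		return
	}
	r.failures++
}

// finishSync records one outcome per attempt: a failure when any error was
// collected (returning early), otherwise a success.
func finishSync(r *recorder, errs []error) error {
	if len(errs) > 0 {
		r.observe(false)
		return errors.Join(errs...)
	}
	r.observe(true)
	return nil
}

// exampleSyncErrs returns a sample error slice (hypothetical message).
func exampleSyncErrs() []error {
	return []error{errors.New("build snapshot failed")}
}

func main() {
	r := &recorder{}
	_ = finishSync(r, nil)
	_ = finishSync(r, exampleSyncErrs())
	fmt.Println(r.successes, r.failures)
}
```

With the early return in the failure branch, the success call is reached only when no errors were collected, so an attempt is never counted as both.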

Author

Fixed.

@@ -294,6 +302,10 @@ func (s *SnapshotStore) View(ctx context.Context, storeRef storage.Reference, fn
return fn(snap)
}

func (s *SnapshotStore) Resolve(ref string) (plumbing.Hash, error) {
Contributor

Please delete it as it isn't in use

Author

Removed

codecov bot commented Sep 3, 2025

Codecov Report

❌ Patch coverage is 70.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.83%. Comparing base (94735fc) to head (b8039c7).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/storage/fs/git/store.go | 57.14% | 7 Missing and 2 partials ⚠️ |
| internal/metrics/metrics.go | 40.00% | 4 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4673   +/-   ##
=======================================
  Coverage   63.83%   63.83%           
=======================================
  Files         171      172    +1     
  Lines       17617    17659   +42     
=======================================
+ Hits        11245    11273   +28     
- Misses       5700     5709    +9     
- Partials      672      677    +5     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 63.83% <70.00%> (+<0.01%) ⬆️ |

Rohit Jaiswal added 2 commits September 3, 2025 22:16
Rohit Jaiswal and others added 2 commits September 4, 2025 07:06
Signed-off-by: Rohit Jaiswal <[email protected]>
@rohitnarayan
Author

@erka @markphelps Please can you approve the workflow. Thank you!

@rohitnarayan
Author

@markphelps @erka All checks have passed. Can we merge this?

if !updated && fetchErr == nil {
// No update and no error: record metrics for a successful no-change sync
duration := time.Since(syncStart).Seconds()
Collaborator

Should this be milliseconds? I'm not sure which is more common (sub-second syncing or syncs taking longer than a second).

Author

@markphelps time.Since(syncStart).Seconds() already provides sub-second precision as a float64, capturing durations down to milliseconds and microseconds, which is sufficient for metrics without needing conversion.
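A short standard-library check of that claim: `time.Duration.Seconds()` returns a `float64` that keeps the fractional part, while `Milliseconds()` returns a truncated `int64`. The helper names and the sample duration are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// exampleDuration is a sample sub-second sync duration (1.5 ms).
func exampleDuration() time.Duration {
	return 1500 * time.Microsecond
}

// asSeconds returns the duration as fractional seconds.
func asSeconds(d time.Duration) float64 {
	return d.Seconds()
}

// asWholeMillis returns the duration as whole milliseconds, dropping the fraction.
func asWholeMillis(d time.Duration) int64 {
	return d.Milliseconds()
}

func main() {
	d := exampleDuration()
	fmt.Println(asSeconds(d))     // fractional seconds
	fmt.Println(asWholeMillis(d)) // whole milliseconds
}
```

Whether a histogram in seconds resolves sub-second syncs well also depends on its bucket boundaries, which this sketch does not cover.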

@markphelps
Collaborator

Hey @rohitnarayan !

Thank you again for adding Git sync observability! Instead of commenting on each line/block, I figured I'd give an overview of the requested changes in this comment, as there are a few inconsistencies with our existing metrics patterns.

Critical Issues

1. Inconsistent Metric Naming Conventions

internal/storage/fs/git/metrics.go Lines 40, 47: The metrics use inconsistent suffix patterns:

// Inconsistent - has redundant _count suffix
prometheus.BuildFQName(namespace, subsystem, "success_count")
prometheus.BuildFQName(namespace, subsystem, "failure_count") 

// Also inconsistent - no suffix  
prometheus.BuildFQName(namespace, subsystem, "flags_fetched")

Expected Pattern (based on existing metrics in /internal/server/metrics/metrics.go and /internal/cache/metrics.go):

// Should be consistent without redundant suffixes
prometheus.BuildFQName(namespace, subsystem, "success")
prometheus.BuildFQName(namespace, subsystem, "error")  // See next issue
prometheus.BuildFQName(namespace, subsystem, "flags_fetched")

2. Wrong Error Terminology

internal/storage/fs/git/metrics.go Line 47: Uses "failure_count" but our codebase consistently uses "error" for error metrics:

  • internal/cache/metrics.go:33: "error"
  • internal/server/metrics/metrics.go:21: "errors"
  • internal/server/metrics/metrics.go:38: "errors"

Should be:

prometheus.BuildFQName(namespace, subsystem, "error")
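To show what the renamed metrics would look like once joined, the sketch below uses a local stand-in that mimics how `prometheus.BuildFQName` joins its non-empty parts with underscores. The namespace `flipt` and subsystem `git_sync` are assumed values for illustration; the real constants in the package may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName is a local stand-in for prometheus.BuildFQName: it joins the
// non-empty parts with "_" and returns "" when name is empty.
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Assumed namespace and subsystem values, for illustration only.
	const namespace, subsystem = "flipt", "git_sync"
	for _, name := range []string{"success", "error", "flags_fetched"} {
		fmt.Println(buildFQName(namespace, subsystem, name))
	}
}
```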

Major Issues

3. Complex Observable Gauge Implementation

internal/storage/fs/git/metrics.go Lines 59-85: The init() function with manual gauge creation and global state management doesn't follow our existing patterns. All other metrics in the codebase use simple variable declarations with metrics.Must*() helpers.

Current approach:

func init() {
    m := metrics.Meter()
    // Complex manual setup with panic handling
}

Existing pattern (see other metrics files): Simple variable declarations using helpers.

4. Missing Unit Specification

internal/storage/fs/git/metrics.go Line 26: The duration metric should specify units:

Duration = metrics.MustFloat64().
    Histogram(
        prometheus.BuildFQName(namespace, subsystem, "duration_seconds"),
        metric.WithDescription("The duration of git sync operations in seconds"),
        metric.WithUnit("s"), // Add this
    )

Minor Issues

5. Missing Attribute Constants

internal/storage/fs/git/metrics.go Lines 89, 96, 103, 110: Should define attribute keys as constants following the pattern in /internal/server/metrics/metrics.go:58-64:

// Add to top of file
var (
    AttributeSyncType = attribute.Key("sync_type")
)

// Then use consistently
Success.Add(ctx, 1, metric.WithAttributeSet(
    attribute.NewSet(AttributeSyncType.String(typ)),
))

6. Inconsistent API Usage

internal/storage/fs/git/metrics.go Line 64: Uses direct m.Int64ObservableGauge() while other metrics use metrics.Must*() helpers. Should be consistent with existing patterns.

Summary of Required Changes

  1. internal/storage/fs/git/metrics.go Line 40: Change "success_count" → "success"
  2. internal/storage/fs/git/metrics.go Line 47: Change "failure_count" → "error"
  3. internal/storage/fs/git/metrics.go Line 26: Add metric.WithUnit("s") (or ms) if we decide milliseconds make more sense
  4. internal/storage/fs/git/metrics.go Lines 59-85: Simplify observable gauge to match existing patterns
  5. internal/storage/fs/git/metrics.go Lines 89, 96, 103, 110: Define and use attribute constants
  6. Variable name: Rename Failure → Error for consistency

@rohitnarayan
Author

@markphelps thanks for those comments. I've addressed them, please check. Thank you!

@rohitnarayan
Author

@erka @markphelps please can you re-run the approval workflows. Thank you!

Collaborator

@markphelps markphelps left a comment

looks great to me! thank you @rohitnarayan for bearing with us!! and thank you for the contribution

Signed-off-by: Roman Dmytrenko <[email protected]>
@erka erka added the automerge Used by Kodiak bot to automerge PRs label Sep 4, 2025
@kodiakhq kodiakhq bot merged commit 3fef0d1 into flipt-io:main Sep 4, 2025
36 checks passed
@erka
Contributor

erka commented Sep 4, 2025

@all-contributors please add @rohitnarayan for code

Contributor

@erka

I've put up a pull request to add @rohitnarayan! 🎉
