feat(telemetry)_: replace telemetry with prometheus metrics #6256

Open: adklempner wants to merge 1 commit into develop from feat/replace-telemetry-prometheus

Contributor

@adklempner adklempner commented Jan 15, 2025

Replace telemetry with local metrics using prometheus. Add parameters to InitializeApplication for enabling waku metrics over prometheus and specifying which port to use.

This commit replaces the telemetry functionality with a Prometheus client. Most of the metrics that were collected by telemetry now have corresponding Prometheus gauges, counters, and histograms.

The metrics still require a telemetry URL to be set in order to be enabled. Additionally, the WakuMetricsEnabled parameter must be set to true in the InitializeApplication request in order to start serving Prometheus metrics on port 9305 (the port can be changed using WakuMetricsPort).
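
As a rough illustration of the request parameters described above, the sketch below marshals only the two new fields, using the JSON tags that appear in this PR's diff of `InitializeApplication` (`wakuMetricsEnabled`, `wakuMetricsServerPort`). The struct name `metricsParams` and the helper `buildMetricsParams` are hypothetical; the real request carries other fields that are omitted here, and the field names may change during review.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsParams is a hypothetical stand-in for the metrics-related part of
// the InitializeApplication request. Only the JSON tags come from the diff.
type metricsParams struct {
	WakuMetricsEnabled bool `json:"wakuMetricsEnabled"`
	WakuMetricsPort    int  `json:"wakuMetricsServerPort"`
}

// buildMetricsParams returns the JSON fragment a caller might include in the
// request to turn metrics on and pick a port.
func buildMetricsParams(enabled bool, port int) (string, error) {
	b, err := json.Marshal(metricsParams{
		WakuMetricsEnabled: enabled,
		WakuMetricsPort:    port,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := buildMetricsParams(true, 9305)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```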

status-desktop Dogfooding PR: status-im/status-desktop#17020

Member

status-im-auto commented Jan 15, 2025

Jenkins Builds

Older builds (88):
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 7cf1f93 #1 2025-01-15 21:30:58 ~3 min macos 📦zip
✖️ 7cf1f93 #1 2025-01-15 21:31:03 ~4 min tests 📄log
✔️ 7cf1f93 #1 2025-01-15 21:31:28 ~4 min ios 📦zip
✔️ 7cf1f93 #1 2025-01-15 21:32:17 ~5 min linux 📦zip
✔️ 7cf1f93 #1 2025-01-15 21:32:30 ~5 min android 📦aar
✔️ 7cf1f93 #1 2025-01-15 21:32:32 ~5 min macos 📦zip
✔️ 7cf1f93 #1 2025-01-15 21:32:38 ~5 min windows 📦zip
✔️ 7cf1f93 #1 2025-01-15 21:33:32 ~6 min tests-rpc 📄log
✖️ 9c43971 #2 2025-01-16 00:30:24 ~2 min tests 📄log
✔️ 9c43971 #2 2025-01-16 00:31:32 ~3 min windows 📦zip
✔️ 9c43971 #2 2025-01-16 00:32:11 ~4 min macos 📦zip
✔️ 9c43971 #2 2025-01-16 00:32:28 ~4 min linux 📦zip
✔️ 9c43971 #2 2025-01-16 00:32:43 ~5 min ios 📦zip
✔️ 9c43971 #2 2025-01-16 00:32:57 ~5 min macos 📦zip
✔️ 9c43971 #2 2025-01-16 00:33:15 ~5 min android 📦aar
✔️ 9c43971 #2 2025-01-16 00:34:15 ~6 min tests-rpc 📄log
✔️ 7c4ff40 #3 2025-01-16 00:52:34 ~3 min windows 📦zip
✔️ 7c4ff40 #3 2025-01-16 00:52:55 ~4 min macos 📦zip
✔️ 7c4ff40 #3 2025-01-16 00:53:07 ~4 min ios 📦zip
✔️ 7c4ff40 #3 2025-01-16 00:53:20 ~4 min linux 📦zip
✔️ 7c4ff40 #3 2025-01-16 00:53:45 ~5 min macos 📦zip
✔️ 7c4ff40 #3 2025-01-16 00:54:09 ~5 min android 📦aar
✔️ 7c4ff40 #3 2025-01-16 00:54:41 ~6 min tests-rpc 📄log
✖️ 7c4ff40 #3 2025-01-16 01:18:34 ~29 min tests 📄log
✔️ 3763072 #4 2025-01-16 01:28:53 ~3 min windows 📦zip
✔️ 3763072 #4 2025-01-16 01:29:24 ~4 min macos 📦zip
✔️ 3763072 #4 2025-01-16 01:29:40 ~4 min linux 📦zip
✔️ 3763072 #4 2025-01-16 01:29:49 ~4 min ios 📦zip
✔️ 3763072 #4 2025-01-16 01:30:11 ~5 min macos 📦zip
✔️ 3763072 #4 2025-01-16 01:31:02 ~6 min android 📦aar
✔️ 3763072 #4 2025-01-16 01:31:19 ~6 min tests-rpc 📄log
✔️ 3763072 #4 2025-01-16 01:56:45 ~31 min tests 📄log
✔️ 5f8d5f6 #5 2025-01-16 22:52:02 ~3 min windows 📦zip
✔️ 5f8d5f6 #5 2025-01-16 22:53:00 ~4 min linux 📦zip
✔️ 5f8d5f6 #5 2025-01-16 22:53:17 ~5 min ios 📦zip
✔️ 5f8d5f6 #5 2025-01-16 22:53:29 ~5 min macos 📦zip
✔️ 5f8d5f6 #5 2025-01-16 22:54:17 ~6 min macos 📦zip
✔️ 5f8d5f6 #5 2025-01-16 22:54:32 ~6 min android 📦aar
✔️ 5f8d5f6 #5 2025-01-16 22:54:33 ~6 min tests-rpc 📄log
✖️ 5f8d5f6 #5 2025-01-16 23:18:05 ~29 min tests 📄log
✔️ 07798db #6 2025-01-16 23:05:31 ~3 min windows 📦zip
✔️ 07798db #6 2025-01-16 23:06:31 ~4 min linux 📦zip
✔️ 07798db #6 2025-01-16 23:07:07 ~5 min macos 📦zip
✔️ 07798db #6 2025-01-16 23:07:13 ~5 min ios 📦zip
✔️ 07798db #6 2025-01-16 23:07:28 ~5 min android 📦aar
✔️ 07798db #6 2025-01-16 23:07:35 ~5 min macos 📦zip
✖️ 07798db #6 2025-01-16 23:08:04 ~6 min tests-rpc 📄log
✖️ 07798db #6 2025-01-16 23:48:08 ~29 min tests 📄log
✔️ 534c2df #7 2025-01-21 20:29:02 ~3 min macos 📦zip
✔️ 534c2df #7 2025-01-21 20:30:06 ~5 min ios 📦zip
✔️ 534c2df #7 2025-01-21 20:30:14 ~5 min macos 📦zip
✔️ 534c2df #7 2025-01-21 20:30:35 ~5 min linux 📦zip
✔️ 534c2df #7 2025-01-21 20:31:05 ~6 min android 📦aar
✖️ 534c2df #7 2025-01-21 20:31:22 ~6 min tests-rpc 📄log
✔️ d8f0d5b #8 2025-01-21 20:29:25 ~3 min windows 📦zip
✖️ d8f0d5b #8 2025-01-21 20:55:09 ~29 min tests 📄log
0a81a87 #8 2025-01-21 20:31:12 ~2 min macos 📄log
0a81a87 #9 2025-01-21 20:31:40 ~2 min windows 📄log
0a81a87 #8 2025-01-21 20:31:54 ~1 min ios 📄log
0a81a87 #8 2025-01-21 20:32:55 ~1 min android 📄log
0a81a87 #8 2025-01-21 20:33:22 ~2 min linux 📄log
0a81a87 #8 2025-01-21 20:34:09 ~3 min macos 📄log
✖️ 0a81a87 #8 2025-01-21 20:34:34 ~2 min tests-rpc 📄log
✔️ a3d37c4 #10 2025-01-21 20:43:06 ~3 min windows 📦zip
✔️ a3d37c4 #9 2025-01-21 20:43:29 ~4 min macos 📦zip
✔️ a3d37c4 #9 2025-01-21 20:43:47 ~4 min ios 📦zip
✔️ a3d37c4 #9 2025-01-21 20:44:16 ~5 min linux 📦zip
✔️ a3d37c4 #9 2025-01-21 20:44:24 ~5 min macos 📦zip
✔️ a3d37c4 #9 2025-01-21 20:44:25 ~5 min android 📦aar
✔️ a3d37c4 #9 2025-01-21 20:45:18 ~6 min tests-rpc 📄log
✖️ a3d37c4 #9 2025-01-21 20:58:50 ~3 min tests 📄log
✔️ bdf78e5 #11 2025-01-21 21:41:07 ~3 min windows 📦zip
✔️ bdf78e5 #10 2025-01-21 21:41:39 ~4 min macos 📦zip
✔️ bdf78e5 #10 2025-01-21 21:41:49 ~4 min ios 📦zip
✔️ bdf78e5 #10 2025-01-21 21:42:30 ~5 min linux 📦zip
✔️ bdf78e5 #10 2025-01-21 21:42:39 ~5 min macos 📦zip
✔️ bdf78e5 #10 2025-01-21 21:42:39 ~5 min android 📦aar
✖️ bdf78e5 #10 2025-01-21 21:43:49 ~6 min tests-rpc 📄log
✖️ bdf78e5 #10 2025-01-21 22:07:12 ~29 min tests 📄log
✔️ bdf78e5 #11 2025-01-21 22:51:18 ~29 min tests 📄log
✖️ d31d899 #12 2025-01-21 23:34:51 ~2 min tests 📄log
✔️ d31d899 #12 2025-01-21 23:36:01 ~3 min windows 📦zip
✔️ d31d899 #11 2025-01-21 23:36:36 ~4 min macos 📦zip
✔️ d31d899 #11 2025-01-21 23:36:37 ~4 min ios 📦zip
✔️ d31d899 #11 2025-01-21 23:37:04 ~5 min linux 📦zip
✔️ d31d899 #11 2025-01-21 23:37:13 ~5 min macos 📦zip
✔️ d31d899 #11 2025-01-21 23:37:27 ~5 min android 📦aar
✖️ d31d899 #11 2025-01-21 23:38:05 ~5 min tests-rpc 📄log
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 5340c57 #13 2025-01-22 00:18:33 ~3 min windows 📦zip
✔️ 5340c57 #12 2025-01-22 00:18:59 ~4 min macos 📦zip
✔️ 5340c57 #12 2025-01-22 00:19:16 ~4 min ios 📦zip
✔️ 5340c57 #12 2025-01-22 00:19:50 ~5 min macos 📦zip
✔️ 5340c57 #12 2025-01-22 00:19:51 ~5 min linux 📦zip
✔️ 5340c57 #12 2025-01-22 00:20:10 ~5 min android 📦aar
✔️ 5340c57 #12 2025-01-22 00:20:51 ~6 min tests-rpc 📄log
✖️ 5340c57 #13 2025-01-22 00:43:45 ~29 min tests 📄log
✔️ 5340c57 #14 2025-01-22 02:29:45 ~29 min tests 📄log
✔️ 5340c57 #15 2025-01-22 02:58:54 ~29 min tests 📄log
✔️ 2032df2 #13 2025-01-28 00:58:39 ~4 min ios 📦zip
✔️ 2032df2 #13 2025-01-28 00:58:44 ~4 min macos 📦zip
✔️ 2032df2 #13 2025-01-28 00:59:23 ~5 min linux 📦zip
✔️ 2032df2 #14 2025-01-28 00:59:24 ~5 min windows 📦zip
✔️ 2032df2 #13 2025-01-28 00:59:30 ~5 min macos 📦zip
✔️ 2032df2 #13 2025-01-28 00:59:35 ~5 min android 📦aar
✔️ 2032df2 #13 2025-01-28 01:00:34 ~6 min tests-rpc 📄log
✖️ 2032df2 #16 2025-01-28 01:25:00 ~30 min tests 📄log
✔️ 2032df2 #17 2025-01-28 02:06:22 ~30 min tests 📄log

@adklempner adklempner force-pushed the feat/replace-telemetry-prometheus branch 3 times, most recently from 7c4ff40 to 3763072 January 16, 2025 01:24

codecov bot commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 51.23153% with 99 lines in your changes missing coverage. Please review.

Project coverage is 61.71%. Comparing base (3e0b1b2) to head (2032df2).
Report is 17 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| metrics/wakumetrics/client.go | 59.37% | 39 Missing ⚠️ |
| wakuv2/waku.go | 12.00% | 18 Missing and 4 partials ⚠️ |
| protocol/messenger.go | 21.42% | 10 Missing and 1 partial ⚠️ |
| metrics/metrics.go | 0.00% | 10 Missing ⚠️ |
| api/geth_backend.go | 0.00% | 6 Missing ⚠️ |
| mobile/status.go | 0.00% | 4 Missing and 1 partial ⚠️ |
| wakuv2/message_publishing.go | 0.00% | 2 Missing and 1 partial ⚠️ |
| protocol/messenger_pairing_and_syncing.go | 0.00% | 1 Missing and 1 partial ⚠️ |
| cmd/statusd/main.go | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6256      +/-   ##
===========================================
- Coverage    61.92%   61.71%   -0.22%     
===========================================
  Files          843      844       +1     
  Lines       111286   111103     -183     
===========================================
- Hits         68918    68567     -351     
- Misses       34388    34562     +174     
+ Partials      7980     7974       -6     

| Flag | Coverage Δ | |
|---|---|---|
| functional | 21.57% <3.46%> (-0.01%) | ⬇️ |
| unit | 60.21% <51.23%> (-0.20%) | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| metrics/wakumetrics/metrics.go | 100.00% <100.00%> (ø) | |
| protocol/requests/initialize_application.go | 50.00% <ø> (ø) | |
| wakuv2/common/metrics.go | 100.00% <100.00%> (ø) | |
| cmd/statusd/main.go | 6.54% <0.00%> (ø) | |
| protocol/messenger_pairing_and_syncing.go | 59.00% <0.00%> (ø) | |
| wakuv2/message_publishing.go | 72.30% <0.00%> (-3.08%) | ⬇️ |
| mobile/status.go | 9.70% <0.00%> (-0.04%) | ⬇️ |
| api/geth_backend.go | 54.03% <0.00%> (-0.21%) | ⬇️ |
| metrics/metrics.go | 0.00% <0.00%> (ø) | |
| protocol/messenger.go | 63.88% <21.42%> (-0.31%) | ⬇️ |

... and 2 more

... and 42 files with indirect coverage changes

@adklempner adklempner requested a review from a team January 16, 2025 16:39
)

var (
MessagesSentTotal = prometheus.NewHistogramVec(
Member

Why is SomethingSomethingTotal a histogram? Shouldn't it be just a normal counter? How do you plan to use it as a histogram?

Contributor Author

Updated all metrics to use gauge/counter as appropriate

},
)

PeersByOrigin = prometheus.NewGaugeVec(
Member

Are these currently connected peers or the overall number? I'd expect PeersByOrigin to only increase as new peers are discovered (i.e. a Counter).

Contributor Author

This is based on the currently connected peers, obtained by periodically checking Waku's peer store.
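
To make the gauge-versus-counter distinction concrete, here is a minimal sketch of counting connected peers per origin from a peer-store snapshot. The `peer` type, its fields, and the origin names are hypothetical and not taken from Waku; in the real code each count would presumably be written to the `PeersByOrigin` gauge. The point is only that a snapshot-based value can go down between polls, which a counter cannot represent.

```go
package main

import "fmt"

// peer is a hypothetical, simplified view of one peer-store entry.
type peer struct {
	ID        string
	Origin    string
	Connected bool
}

// connectedByOrigin counts the currently connected peers per origin.
// Disconnected peers are ignored, so the result can shrink between polls.
func connectedByOrigin(snapshot []peer) map[string]int {
	counts := make(map[string]int)
	for _, p := range snapshot {
		if p.Connected {
			counts[p.Origin]++
		}
	}
	return counts
}

func main() {
	firstPoll := []peer{
		{ID: "a", Origin: "discv5", Connected: true},
		{ID: "b", Origin: "discv5", Connected: true},
		{ID: "c", Origin: "static", Connected: true},
	}
	// In the second poll, peer "b" has disconnected.
	secondPoll := []peer{
		{ID: "a", Origin: "discv5", Connected: true},
		{ID: "b", Origin: "discv5", Connected: false},
		{ID: "c", Origin: "static", Connected: true},
	}
	fmt.Println(connectedByOrigin(firstPoll)["discv5"])
	fmt.Println(connectedByOrigin(secondPoll)["discv5"])
}
```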

@@ -23,6 +23,9 @@ type InitializeApplication struct {
LogEnabled bool `json:"logEnabled"`
LogLevel string `json:"logLevel"`
APILoggingEnabled bool `json:"apiLoggingEnabled"`

WakuMetricsEnabled bool `json:"wakuMetricsEnabled"`
Collaborator

It's not only about waku metrics, right? This already includes libp2p metrics, and we could add more status-go stuff there later.

Maybe make it just metrics and metrics-port?

Collaborator

Same applies to desktop CLI options

@@ -23,6 +23,9 @@ type InitializeApplication struct {
LogEnabled bool `json:"logEnabled"`
LogLevel string `json:"logLevel"`
APILoggingEnabled bool `json:"apiLoggingEnabled"`

WakuMetricsEnabled bool `json:"wakuMetricsEnabled"`
WakuMetricsPort int `json:"wakuMetricsServerPort"`
Collaborator

I think a full address (<host>:<port>) option would be better.

In theory you can run status-backend in the cloud (e.g. for a community control node) and connect to metrics remotely; then we'd need to listen on 0.0.0.0. The same is needed when running in Docker.
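
A minimal sketch of what a full-address option could look like, assuming a hypothetical `resolveListenAddress` helper that accepts an optional `<host>:<port>` string and falls back to a port-only setting. The `127.0.0.1` default is an assumption made for illustration, not what the PR does.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// resolveListenAddress is a hypothetical helper. If address is set, it must
// be a valid "<host>:<port>" pair and is used as-is, so "0.0.0.0:9305" would
// expose metrics on all interfaces. Otherwise the port-only setting is bound
// to loopback (an assumed default).
func resolveListenAddress(address string, port int) (string, error) {
	if address == "" {
		return net.JoinHostPort("127.0.0.1", strconv.Itoa(port)), nil
	}
	if _, _, err := net.SplitHostPort(address); err != nil {
		return "", fmt.Errorf("invalid metrics address %q: %w", address, err)
	}
	return address, nil
}

func main() {
	for _, addr := range []string{"", "0.0.0.0:9305", "no-port"} {
		resolved, err := resolveListenAddress(addr, 9305)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(resolved)
	}
}
```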

Collaborator

Would be nice to include a README with some basic usage instructions, referencing https://github.com/waku-org/status-metrics.
Also, not sure why this repo is not public?
Also, wouldn't it be better to store that docker compose directly in status-go? Why store it as a separate repo?

Contributor Author

The repo can be made public

I think it makes more sense to keep it separate so if anyone wants to run metrics against a node they can just clone those files and not the entire status-go repository. The other reason is to make it easy for people to push changes to the dashboards or make their own forks.

Collaborator

@igor-sirotin igor-sirotin Jan 28, 2025

Alright, makes sense, let's keep it separate then 👍
But for sure we need to make it public then.

And perhaps move it to the status-im org, as it's for status-go? 🤔 (not a big deal though)

Contributor Author

it might make sense to have a fork in status-im in case waku/status team members use different dashboards

Comment on lines 8 to 14
MessagesSentTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "statusgo_waku_messages_sent_total",
Help: "Frequency of Waku messages sent by this node",
},
[]string{"publish_method"},
)
Collaborator

Is it possible to display this in Grafana not as a constantly growing value, but as a derivative, i.e. the number of messages sent in a given time period?

Contributor Author

Yes, Grafana provides all sorts of aggregate and derivative functions that can be applied to the metric to visualize it.
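
With a Prometheus data source, this kind of view is typically built with PromQL functions such as `rate()` or `increase()` over the counter, for example `rate(statusgo_waku_messages_sent_total[5m])`. The sketch below only illustrates the underlying arithmetic on hypothetical scraped samples; unlike PromQL it does not extrapolate to the window boundaries, and the `sample` type is invented for the example.

```go
package main

import "fmt"

// sample is a hypothetical scraped data point of a cumulative counter.
type sample struct {
	Seconds float64 // time offset of the scrape, in seconds
	Value   float64 // cumulative counter value at that time
}

// increase returns the total growth of the counter across the samples.
// A drop in value is treated as a counter reset (process restart), in which
// case the new value itself counts as the growth since the reset.
func increase(samples []sample) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i].Value - samples[i-1].Value
		if delta < 0 {
			delta = samples[i].Value
		}
		total += delta
	}
	return total
}

// perSecondRate divides the increase by the time span covered by the samples.
func perSecondRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	elapsed := samples[len(samples)-1].Seconds - samples[0].Seconds
	if elapsed <= 0 {
		return 0
	}
	return increase(samples) / elapsed
}

func main() {
	steady := []sample{{Seconds: 0, Value: 10}, {Seconds: 60, Value: 70}}
	withReset := []sample{{Seconds: 0, Value: 10}, {Seconds: 30, Value: 40}, {Seconds: 60, Value: 5}}
	fmt.Println(perSecondRate(steady))
	fmt.Println(increase(withReset))
}
```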

Collaborator

Same question here, is it actually wakumetrics? Not just any metrics?

Contributor Author

The metrics in this file are specifically related to waku

wakuv2/waku.go Outdated
@@ -1176,6 +1182,7 @@ func (w *Waku) Start() error {
if err != nil {
w.logger.Error("OnNewEnvelopes error", zap.Error(err))
}
w.logger.Info("Got a missing message!")
Collaborator

I think this is Debug or Warning level?

Contributor Author

I did not mean to add this, probably a mistake when rebasing. Removed.

wakuv2/waku.go Outdated
@@ -1176,6 +1182,7 @@ func (w *Waku) Start() error {
if err != nil {
w.logger.Error("OnNewEnvelopes error", zap.Error(err))
}
w.logger.Info("Got a missing message!")
Collaborator

There's a convention in Go to begin log messages with a lowercase letter.

Suggested change
- w.logger.Info("Got a missing message!")
+ w.logger.Info("got a missing message!")

v2protocol "github.com/waku-org/go-waku/waku/v2/protocol"
)

type ReceivedMessages struct {
Collaborator

I suppose Prometheus marshals these types to JSON?
Then we should explicitly define the JSON tags for each of these structs' fields.

Contributor Author

These are just used for the function arguments in PushReceivedMetrics; the data pushed to Prometheus is a numerical value and string labels. Each metric defines its own keys for the labels.

)

var (
MessagesSentTotal = prometheus.NewCounterVec(
Collaborator

It doesn't look good to have so many global (and public!) variables.
They don't seem to be used outside this package, so they should at least not be exported.

In fact, the reason for them to be global is that they're used in a free function:

func RegisterMetrics() error {
collectors := []prometheus.Collector{
MessagesSentTotal,

But this one is only called from the Client class.
So we could move all these global variables to the Client?

Contributor Author

replaced with non-global variables

mobile/status.go Outdated
@@ -55,6 +56,10 @@ import (
"github.com/status-im/status-go/signal"
)

var (
metricsServer *metrics.Server
Collaborator

Please check if it's possible to encapsulate this variable into GethStatusBackend to avoid adding another global variable.

Contributor Author

moved metrics.Server to GethStatusBackend

Replace telemetry with local metrics using prometheus client.
Add parameters to InitializeApplication for enabling waku metrics
over prometheus and specifying which port to use.
@adklempner adklempner force-pushed the feat/replace-telemetry-prometheus branch from 5340c57 to 2032df2 January 28, 2025 00:53
4 participants