feat(validator): add support to validate essential metrics produced by Kepler #1834
base: main
🤖 SeineSailor Here's a concise summary of the pull request changes: Summary: This pull request enhances the
Impact: These changes expand the validator's capabilities for handling and validating Kepler metrics, but do not affect exported function signatures or global data structures. The external interface and behavior are modified, requiring configuration file updates. Observations/Suggestions:
827aefd to 564fa4c
validations:
  # absolute power comparison
  - name: Total - absolute
Can we also validate these invariants within the same dev version?
- kepler_node_<pkg|core|uncore|dram|other>{dev} = sum of ( process_<pkg|core|uncore|dram|other>{dev} )
- kepler_node_<pkg|core|..> = node_exporter_rapl_<pkg|core...>
- sum( kepler_process_bpf_cpu ) = node_exporter_cpu_time
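As a rough illustration of the first invariant, the two sides could be expressed as a pair of PromQL strings built in Python. This is only a sketch: the helper name `invariant_queries`, the `_joules_total` metric suffix, and the use of a `job` label to stand for `{dev}` are assumptions for illustration, not taken from this PR.

```python
def invariant_queries(component: str, rate_interval: str = "20s", job: str = "dev") -> tuple[str, str]:
    """Build the node-level and summed process-level queries for one component.

    Assumptions (hypothetical, not from the PR): metrics are named
    kepler_<node|process>_<component>_joules_total and the dev build is
    selected with a `job` label.
    """
    selector = f'{{job="{job}"}}[{rate_interval}]'
    node_query = f"sum(rate(kepler_node_{component}_joules_total{selector}))"
    process_query = f"sum(rate(kepler_process_{component}_joules_total{selector}))"
    return node_query, process_query


node_q, process_q = invariant_queries("package")
print(node_q)
print(process_q)
```

If the invariant holds, both queries should return approximately the same power over the same time range.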
For kepler_node_<pkg|core|dram...>{dev} = sum of (process_<pkg|core|dram....>{dev}),
do you mean the MAE of
sum(rate(kepler_node_<pkg|core|dram>{dev}[20s])) and sum(rate(process_<pkg|core|dram>{dev}[20s]))?
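For reference, the MAE (mean absolute error) between two aligned series can be sketched as below. The function name and the sample values are made up for illustration; the actual validator may compute this differently.

```python
def mean_absolute_error(expected: list[float], actual: list[float]) -> float:
    """Mean of |expected[i] - actual[i]| over two equally long, time-aligned series."""
    if len(expected) != len(actual):
        raise ValueError("series must have the same number of samples")
    if not expected:
        raise ValueError("series must not be empty")
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)


# Hypothetical samples of the two sum(rate(...)) results, in watts.
node_power = [10.0, 12.0, 11.0]
process_power_sum = [9.5, 12.5, 11.0]

mae = mean_absolute_error(node_power, process_power_sum)
print(f"MAE: {mae:.3f} W")
```

An MAE close to zero would indicate that the node-level metric and the sum of the process-level metrics agree.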
33d2963 to de1649f
de1649f to 5fa8028
5fa8028 to 3c994bb
metal: metal # Job name for metal metrics, default is metal

url: http://localhost:9090 # Prometheus server URL
rate_interval: 60s # Rate interval for Promql, default is 20s, typically 4 x $scrape_interval
Explicitly using a rate interval of 60s because:
Prometheus scrape interval = 3s
Data points for a 12s interval (i.e. 4 x scrape interval) = 12/3 = 4 data points
Data points for a 60s interval = 60/3 = 20 data points
With 20 data points, we get a smoother and more reliable rate estimate. When comparing two sum(rate(...)) expressions,
a stable rate reduces the variability in the MAE calculation, leading to more accurate assessments.
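The arithmetic above can be checked with a small sketch. The helper name `samples_per_window` is hypothetical, and the count is approximate, since the exact number of samples inside a window depends on scrape timing.

```python
def samples_per_window(rate_interval_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scraped samples that fall inside one rate window."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return rate_interval_s // scrape_interval_s


scrape_interval = 3  # seconds, as configured in this PR

for window in (4 * scrape_interval, 60):
    count = samples_per_window(window, scrape_interval)
    print(f"{window}s window -> about {count} data points")
```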
28889fe to fe32d16
@@ -1,5 +1,5 @@
 global:
-  scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
+  scrape_interval: 3s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
Check why the scrape interval was changed, and update the comment accordingly.
Scraping every 3 seconds rather than every 5 seconds collects significantly more data points over a typical time window.
fe32d16 to 990c844
Here is a sample CI run, for reference, showing what this would look like once we have this merged: https://github.com/sustainable-computing-io/kepler-metal-ci/actions/runs/12366281744/job/34512777104
My idea is to use the Equinix runners on demand on PRs. Reviewers or authors can add a comment in the PR, something like
…y Kepler

This commit introduces functionality to validate essential metrics produced by Kepler

The following comparisons are included:
- Node Exporter Comparison
  - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`
- Kepler Process Comparison
  - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to `kepler_process_<package|core|dram|platform|other|uncore>{dev}`
- Kepler Node Comparison
  - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against `kepler_node_<package|core|dram|platform|other|uncore>{dev}`

Additionally, the following changes are made to existing functionality:
- Adds a new `metric_validations.yaml` file which includes promql queries for comparisons along with threshold values
- Update the existing `stressor.sh` script to now support few more parameters to make it more flexible
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor is finished
  - repeats: number of times to repeat the stressor. Since for regression test we don't want to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file which includes the configuration for the regression test

Signed-off-by: vprashar2929 <[email protected]>
990c844 to b06242b
This commit introduces functionality to validate essential metrics produced by Kepler

The following comparisons are included:
- Node Exporter Comparison
  - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`
- Kepler Process Comparison
  - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to `kepler_process_<package|core|dram|platform|other|uncore>{dev}`
- Kepler Node Comparison
  - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against `kepler_node_<package|core|dram|platform|other|uncore>{dev}`

Additionally, the following changes are made to existing functionality:
- Adds a new `metric_validations.yaml` file which includes promql queries for comparisons along with threshold values
- Update the existing `stressor.sh` script to now support few more parameters to make it more flexible
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor is finished
  - repeats: number of times to repeat the stressor. Since for regression test we don't want to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file which includes the configuration for the regression test