Aperf: Add support for analytics #241

janaknat · 2024-12-09T19:09:19Z

Add base support for Aperf Analytics. The stats are computed in the Aperf Rust backend. The rules in the JavaScript frontend then act on these values. A basic implementation for the keys in 'SystemInfo' and 'CPU Utilization' shows how the rules can operate using the API provided.

Attached is a report comparing two runs. The output of the rules is shown in the 'SystemInfo' tab on the landing page.

build_metrics.tar.gz

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

wash-amzn · 2024-12-11T14:31:27Z

src/lib.rs

@@ -180,6 +187,7 @@ impl PerformanceData {
        let meta_data_handle = fs::OpenOptions::new()
            .create(true)
            .write(true)
+            .truncate(true)


Given it's an error when any output directory already exists, in what scenario is this code encountering an existing file?

Good catch. Removed. It was a suggestion from a different version of clippy.

wash-amzn · 2024-12-11T14:33:50Z

src/utils.rs

+    pub p50: f64,
+    pub mean: f64,


mean and p50 are virtually interchangeable. I'd dump p50 and add p90.

Okay. Changing it to p90.

wash-amzn · 2024-12-11T16:58:57Z

src/html_files/analytics.ts

+
+class Finding {
+    text: string;
+    status: string;


Why is the type 'string' instead of 'Status'?

Yeah. Good catch. It's simpler to have the Status everywhere.
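
A minimal sketch of the typed version agreed on above. The enum members are the ones shown in the analytics.ts diff later in this review; the constructor shape is an assumption inferred from the `new Finding(text, status)` calls in cpu_utilization.ts.

```typescript
// Status values as shown in the analytics.ts diff in this review.
enum Status {
    Good = '✅',
    NotGood = '❌',
}

// Assumed constructor shape, inferred from the `new Finding(text, status)` calls.
class Finding {
    text: string;
    status: Status;

    constructor(text: string, status: Status) {
        this.text = text;
        this.status = status;
    }
}

const finding = new Finding('CPU utilization differs by 3%.', Status.Good);
console.log(`${finding.status} ${finding.text}`);
```

With `status` typed as `Status`, the compiler rejects an arbitrary string in that position.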

wash-amzn · 2024-12-11T17:00:19Z

src/html_files/analytics.ts

+function is_unique_map(values_map) {
+    return new Set([...values_map.values()]).size == 1;
+}
+
+function is_unique_array(values_array) {
+    return new Set(values_array).size == 1;
+}


These are odd definitions of 'unique'. It suggests that somewhere else you're using the wrong data type if you have to call into something to de-duplicate and check the size.

Yeah. We're using arrays everywhere now. We were using arrays and maps. This is cleaner.

wash-amzn · 2024-12-11T17:10:50Z

src/html_files/cpu_utilization.ts

+    rules: [
+        {
+            name: "User",
+            func: function (ruleOpts: RuleOpts) {
+                let system_util = get_data_key(ruleOpts.data_type, "System");
+                let findings = [];
+                let init_key = ruleOpts.runs[0];
+                let init_total_util: number = ruleOpts.per_run_data.get(init_key) + system_util.get(init_key);
+                for (const [key, value] of ruleOpts.per_run_data) {
+                    if (key == init_key) {
+                        continue;
+                    }
+                    let run_total_util: number = value + system_util.get(key);
+                    let cpu_diff = Math.ceil(Math.abs(run_total_util - init_total_util));
+                    findings.push(new Finding(
+                        `Average CPU Utilization difference between ${init_key} and ${key} is ${cpu_diff}%.`,
+                        cpu_diff > 10 ? Status.NotGood : Status.Good,
+                    ));
+                }
+                return findings;
+            },
+            good: "",
+            bad: "",
+        },
+        {
+            name: "idle",
+            func: function (ruleOpts: RuleOpts) {
+                let findings = [];
+                let init_key = ruleOpts.runs[0];
+                for (const [key, value] of ruleOpts.per_run_data) {
+                    if (key == init_key) {
+                        continue;
+                    }
+                    let idle_diff = Math.abs(ruleOpts.per_run_data.get(key) - ruleOpts.per_run_data.get(init_key));
+                    if (idle_diff > 10) {
+                        findings.push(new Finding(
+                            `Difference in Average 'Idle time' between ${key} and ${init_key} is ${idle_diff}.`,
+                        ));
+                    }
+                }
+                return findings;
+            },
+            good: "",
+            bad: "",
+        }
+    ],


So that rules don't start turning into copy-and-paste-fests of each other, start refactoring this now so that something as simple as comparing aggregates across runs is not duplicated in multiple rules' 'func's.

I'm splitting it up into 3 types of rules.

1. Runs == 1: rules that point things out. These rules are implemented in single_run_rules[].

2. Runs > 1:
a. Set base run = Runs[0]. Iterate over the other runs and pass the base run and the other run to a function. This avoids repeating the for loop in every rule. These rules are implemented in per_run_rules[].
b. Pass all the details to a different set of functions that need all of them. These rules are implemented in all_run_rules[].
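
Case (a) above could be sketched roughly as follows. The names `PerRunRule` and `apply_per_run_rule` are hypothetical, as are the run names and values; only the base-run convention (`Runs[0]`) and the 10-point threshold come from this thread.

```typescript
enum Status { Good = '✅', NotGood = '❌' }

class Finding {
    constructor(public text: string, public status: Status) {}
}

// Hypothetical signature: a rule that compares one run against the base run.
type PerRunRule = (baseRun: string, baseValue: number,
                   otherRun: string, otherValue: number) => Finding;

// Hypothetical driver: runs[0] is the base; every other run is compared to it once,
// so individual rules no longer carry their own loop.
function apply_per_run_rule(
    runs: string[],
    values: { [run: string]: number },
    rule: PerRunRule,
): Finding[] {
    const findings: Finding[] = [];
    const base = runs[0];
    for (let i = 1; i < runs.length; i++) {
        findings.push(rule(base, values[base], runs[i], values[runs[i]]));
    }
    return findings;
}

// Example rule mirroring the 10-point threshold from the diff above.
const utilDiffRule: PerRunRule = (baseRun, baseValue, otherRun, otherValue) => {
    const diff = Math.ceil(Math.abs(otherValue - baseValue));
    return new Finding(
        `Average CPU Utilization difference between ${baseRun} and ${otherRun} is ${diff}%.`,
        diff > 10 ? Status.NotGood : Status.Good,
    );
};

const perRunFindings = apply_per_run_rule(
    ['run_a', 'run_b', 'run_c'],
    { run_a: 40, run_b: 45.2, run_c: 62 },
    utilDiffRule,
);
```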

wash-amzn · 2024-12-11T17:12:29Z

src/html_files/system_info.ts

+            func: function (ruleOpts: RuleOpts) {
+                return is_unique_map(ruleOpts.per_run_data);
+            },


The cpu utilization rules are returning arrays of Findings, and this is returning boolean. Lock this function signature down or it will be a nightmare to maintain having to support a wide variety of return types.

Okay. I'm changing it so that only Findings are returned.

janaknat · 2024-12-17T23:07:54Z

Attached a newer version of the report. The SUT Config and Findings are now clubbed together in a single tab.

cmp.tar.gz

geoffreyblake · 2025-01-03T20:51:15Z

src/utils.rs

+    pub mean: f64,
+}
+
+impl Stats {


Will saving only P99, P90 and mean be flexible enough for people writing rules in JavaScript? Could this be a histogram instead, which is lighter weight to traverse than the full data set but still lets you reconstitute any stat you want?

Which stat do you anticipate we would need to have? We can calculate any stat in the backend and have it brought to the UI. The other approach is to have all the data points brought to the front-end and let the user decide which works for them. That will be a rework of the way we are doing things.

I anticipate this could be any stat: trimmed mean, different percentiles, etc. The frontend already has all the data present, correct?

Right now all aggregations are computed at report generation time. The frontend has the data (required in order to graph the data), but it is not computing these aggregates itself.

I'd recommend mirroring p99 and p90 with p01 and p10. I expect there will be some counters you'd like to see when they fall off.

Yeah. This structure can be updated as needed with any stats we need in the future.
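
For illustration only: if a rule ever needed a stat that is not in `Stats`, one option raised in this thread is to derive it in the frontend from the raw data points. The helpers below are hypothetical and not part of this PR. `percentile` uses the nearest-rank definition, which is one of several common conventions.

```typescript
// Nearest-rank percentile over a sorted copy of the samples; p is in (0, 100].
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) {
        return NaN;
    }
    const sorted = samples.slice().sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
}

// Mean after dropping `fraction` of the samples from each end (0 <= fraction < 0.5).
function trimmed_mean(samples: number[], fraction: number): number {
    const sorted = samples.slice().sort((a, b) => a - b);
    const drop = Math.floor(sorted.length * fraction);
    const kept = sorted.slice(drop, sorted.length - drop);
    if (kept.length === 0) {
        return NaN;
    }
    return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}

const samples = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10];
console.log(percentile(samples, 10), percentile(samples, 90), trimmed_mean(samples, 0.1));
```

The low percentiles (p01, p10) suggested above fall out of the same helper as the high ones.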

lrbison · 2025-01-07T16:37:04Z

src/html_files/analytics.ts

+enum Status {
+    Good = '✅',
+    NotGood = '❌',
+}


I suspect that soon you are going to discover there is frequently a grey area between good and ~~evil~~ NotGood.

Yeah. We can add an 'Ehhh' when we need it.

Can we add an item for "Mediocre" sooner rather than later? I don't want us to be handling tickets with NotGood cpu at 49%.

When we add a rule for 'Mediocre' we can add that item. I don't want to add code which isn't being explicitly used.

lrbison · 2025-01-07T16:42:41Z

src/utils.rs

+    pub mean: f64,
+}
+
+impl Stats {


I'd recommend mirroring p99 and p90 with p01 and p10. I expect there will be some counters you'd like to see when they fall off.

lrbison · 2025-01-07T16:46:45Z

src/html_files/cpu_utilization.ts

+let cpu_utilization_rules = {
+    data_type: "cpu_utilization",
+    pretty_name: "CPU Utilization",
+    single_run_rules: [


How long before these are config files?

Is the scenario you see where users can provide their own rules?

No, I'm just thinking about ease of adding/subtracting rules. For these simple ones it's basically just "metric" and "threshold" and maybe "reduction function". Seems kind of silly to make a new function for each of them. If we had a more reliable way to look up the data it might be easier to break these into some sort of config structure.

On the other hand more complex rules probably deserve their own function.

Config file probably isn't right (especially as I'm realizing this is typescript not rust), but maintaining many different very similar functions is going to become painful.
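
One way to picture the "metric, threshold, reduction function" idea without config files is a table of plain objects evaluated by a single function. This is a sketch under assumptions: `ThresholdRule`, `evaluate`, the metric names and the thresholds are all hypothetical and do not appear in the PR.

```typescript
enum Status { Good = '✅', NotGood = '❌' }

class Finding {
    constructor(public text: string, public status: Status) {}
}

// Hypothetical declarative description of a simple rule.
interface ThresholdRule {
    metric: string;                         // key to look up in the data
    threshold: number;                      // reduced values above this are flagged
    reduce: (samples: number[]) => number;  // the "reduction function"
}

const mean_of = (samples: number[]): number =>
    samples.reduce((sum, v) => sum + v, 0) / samples.length;
const max_of = (samples: number[]): number => Math.max(...samples);

// One shared evaluator replaces many near-identical rule functions.
function evaluate(rule: ThresholdRule, data: { [metric: string]: number[] }): Finding {
    const value = rule.reduce(data[rule.metric]);
    return new Finding(
        `${rule.metric}: ${value.toFixed(1)} (threshold ${rule.threshold})`,
        value > rule.threshold ? Status.NotGood : Status.Good,
    );
}

// Adding or removing a simple rule is now a one-line change to this table.
const thresholdRules: ThresholdRule[] = [
    { metric: 'user', threshold: 80, reduce: mean_of },
    { metric: 'iowait', threshold: 20, reduce: max_of },
];

const sampleData = { user: [50, 60, 70], iowait: [5, 35, 10] };
const tableFindings = thresholdRules.map((rule) => evaluate(rule, sampleData));
```

More complex rules would still get their own functions, as noted above.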

geoffreyblake · 2025-01-17T22:28:52Z

src/html_files/cpu_utilization.ts

+            },
+        }
+    ],
+    per_run_rules: [


There is still quite a bit of copy-paste here. It could make more sense to have 1 rule with multiple functions that implement the logic for each way of comparing the run(s) to each other. Then the rule running engine just iterates over the functions in each rule to generate the report.

I've updated it to have a per_run, all_run and a single_run attached to one rule.

geoffreyblake · 2025-01-17T22:43:24Z

src/html_files/utils.ts

+        return data['String'];
+    } else if ('Stats' in data) {
+        if (comparator == 'mean') {
+            return data['Stats']['mean'];


This is going to keep growing if we add more default aggregations. We can leverage this TypeScript syntax to return a default if the key is not found:
data.get('Stats').get('p99.9') ?? undefined

Makes sense. I'll update all the aggregations to use this.

geoffreyblake · 2025-01-21T23:08:21Z

src/html_files/cpu_utilization.ts

+    rules: [
+        {
+            name: "User",
+            single_run_rule: function* (opts): Generator<Finding, void, any> {


Just thinking a bit more on this, all_run_rule and single_run_rule are not much different. If I'm looking at CPU utilization, I'd be just as interested knowing all have low utilization in addition to just the base run. We can eliminate the "single run" case and just apply the rules that look at all runs independently.

The aim is to logically separate 3 use cases. Consider 3 runs A, B and C. all_run_rule lets us compare the data across all runs. per_run_rule lets us compare A with B and A with C. single_run_rule lets us write rules for each run on its own. These apply when a report is generated for a single run and no comparison is possible, and they also apply when there are multiple runs. Ex: a config in KConfig is known to cause performance regressions. We want that flagged in both cases. The split is a logical separation that helps implementers.

Additionally, without the split, each single_run_rule would need an if check to make sure the other comparison run exists. Having 3 separate rule calls lets each one make basic assumptions.
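
The three use cases could be sketched as one rule object with optional functions and a small engine that decides which ones to call. Everything here is an assumption for illustration: the names `AnalyticsRule` and `run_rule`, the 90% threshold, and the use of plain arrays rather than the generator functions shown in the diff.

```typescript
enum Status { Good = '✅', NotGood = '❌' }

class Finding {
    constructor(public text: string, public status: Status) {}
}

type RunData = { [run: string]: number };

// Hypothetical rule shape: each function is optional and covers one use case.
interface AnalyticsRule {
    name: string;
    single_run_rule?: (run: string, value: number) => Finding[];
    per_run_rule?: (base: string, baseValue: number,
                    other: string, otherValue: number) => Finding[];
    all_run_rule?: (runs: string[], data: RunData) => Finding[];
}

function run_rule(rule: AnalyticsRule, runs: string[], data: RunData): Finding[] {
    let findings: Finding[] = [];
    const single = rule.single_run_rule;
    const perRun = rule.per_run_rule;
    const allRun = rule.all_run_rule;

    // Applied to every run on its own, even when the report has a single run.
    if (single) {
        for (let i = 0; i < runs.length; i++) {
            findings = findings.concat(single(runs[i], data[runs[i]]));
        }
    }
    // Comparison rules are only called when a second run exists,
    // so they need no existence checks of their own.
    if (runs.length > 1) {
        if (perRun) {
            for (let i = 1; i < runs.length; i++) {
                findings = findings.concat(
                    perRun(runs[0], data[runs[0]], runs[i], data[runs[i]]));
            }
        }
        if (allRun) {
            findings = findings.concat(allRun(runs, data));
        }
    }
    return findings;
}

const exampleRule: AnalyticsRule = {
    name: 'User',
    // Hypothetical threshold: flag any run above 90% user CPU.
    single_run_rule: (run, value) =>
        value > 90 ? [new Finding(`${run}: user CPU is ${value}%.`, Status.NotGood)] : [],
    per_run_rule: (base, baseValue, other, otherValue) => {
        const diff = Math.ceil(Math.abs(otherValue - baseValue));
        return [new Finding(`${base} vs ${other}: ${diff}% difference.`,
            diff > 10 ? Status.NotGood : Status.Good)];
    },
};

const oneRun = run_rule(exampleRule, ['A'], { A: 95 });
const threeRuns = run_rule(exampleRule, ['A', 'B', 'C'], { A: 95, B: 50, C: 92 });
```

With a single run only `single_run_rule` fires; with three runs the same rule also yields the A/B and A/C comparisons.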

geoffreyblake · 2025-01-21T23:09:44Z

src/html_files/utils.ts

+        return data['String'] ?? undefined;
+    } else if ('Stats' in data) {
+        if (comparator == 'mean') {
+            return data['Stats']['mean'] ?? undefined;


I think you can simply do return data['Stats'][comparator] ?? undefined

Ah. Got it. Changing.
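
A sketch of what the simplified lookup might look like. The `'String'` and `'Stats'` keys and the `comparator` parameter come from the diff above; the type definitions and the function name `get_value` are assumptions for illustration.

```typescript
// Assumed shape of one value sent from the backend: a string or a record of aggregates.
type StatsRecord = { [aggregate: string]: number };
type DataValue = { String?: string; Stats?: StatsRecord };

// Hypothetical helper: index by the aggregate's name instead of branching per name,
// so new aggregations need no new code here.
function get_value(data: DataValue, comparator: string): string | number | undefined {
    if ('String' in data) {
        return data['String'] ?? undefined;
    } else if ('Stats' in data) {
        return data['Stats']?.[comparator] ?? undefined;
    }
    return undefined;
}

const stats: DataValue = { Stats: { mean: 12.5, p99: 40, p90: 31 } };
console.log(get_value(stats, 'mean'));   // known aggregate
console.log(get_value(stats, 'p99.9'));  // aggregate the backend did not send
```

Indexing a missing key already yields `undefined` in TypeScript; the `?? undefined` additionally maps a `null` value to `undefined`, so callers only have one "not found" case to handle.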

janaknat · 2025-01-24T02:08:15Z

@geoffreyblake Updated to include comments describing each type of rule.

geoffreyblake · 2025-01-25T02:12:40Z

Thanks. LGTM.

geoffreyblake

LGTM.

janaknat requested a review from a team as a code owner December 9, 2024 19:09

janaknat force-pushed the analytics branch from ff1d212 to a28e778 Compare December 10, 2024 16:36

wash-amzn reviewed Dec 11, 2024

janaknat added 2 commits December 13, 2024 19:10

Aperf: Add infrastructure for Analytics

89575cb

This is in preparation for adding rules for SystemInfo and CPU Utilization.

Analytics: Form data for 2 types

dd767ed

Form the metrics which will be used by the front-end for analytics work. Generate it for SystemInfo and CPU Utilization.

janaknat force-pushed the analytics branch from a28e778 to a1e3163 Compare December 17, 2024 23:04

geoffreyblake reviewed Jan 3, 2025

lrbison reviewed Jan 7, 2025

janaknat force-pushed the analytics branch from a1e3163 to 32b4390 Compare January 17, 2025 17:41

geoffreyblake reviewed Jan 17, 2025

janaknat force-pushed the analytics branch from 32b4390 to 82ec660 Compare January 21, 2025 16:29

geoffreyblake reviewed Jan 21, 2025

janaknat force-pushed the analytics branch from 82ec660 to d00be0a Compare January 22, 2025 03:46

Analytics: Add the front-end for analytics

73b293a

janaknat force-pushed the analytics branch from d00be0a to 73b293a Compare January 24, 2025 02:07

geoffreyblake approved these changes Jan 25, 2025

janaknat merged commit 202c756 into aws:main Jan 27, 2025
6 checks passed
