Sandersaarond/mysql model metrics #98

Merged
44 commits merged into main from sandersaarond/mysql-model-metrics on Oct 23, 2024

Conversation

@SandersAaronD (Collaborator) commented Sep 27, 2024

WIP

This is meant to transition writing and reading of model metrics from Loki to MySQL. It currently has a writer implemented; the reader function exists but is not yet attached to an endpoint.

TODO for this:

  • Make sure the reader function works and serves an endpoint.
  • Have the UI query that endpoint instead of Loki.
  • Switch the Python client to write to this (the format was changed slightly to allow for debouncing, so that we don't end up with too many HTTP connections open at once for frequent logging jobs; see the sketch after this description).

Unclear if this can be done as part of this PR, but I have a dangling concern about authentication: it isn't clear to me whether tenantID will always be populated correctly in production, since our local setup just defaults it to 0, which makes this hard to test locally. Either we need to find a way to test this locally, or we won't know about it until we deploy.
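As a rough illustration of the write path described above, here is a minimal sketch of how a debounced batch might be persisted as one row per point. The function name, the Step column, and the github.com/google/uuid and gorm.io/gorm packages are my assumptions for the sketch, not code from this PR; it relies on the ModelMetricsSeries and ModelMetrics types discussed later in the review.

package metrics

import (
	"github.com/google/uuid"
	"gorm.io/gorm"
)

// writeMetricsBatch is a hypothetical sketch, not this PR's handler: it takes
// a debounced batch of series and writes one MySQL row per (metric, step)
// point through GORM.
func writeMetricsBatch(db *gorm.DB, stackID uint64, processID uuid.UUID, batch []ModelMetricsSeries) error {
	for _, series := range batch {
		for _, point := range series.Points {
			row := ModelMetrics{
				StackID:     stackID,
				ProcessID:   processID,
				MetricName:  series.MetricName,
				StepName:    series.StepName,
				Step:        point.Step,           // assumed column, one row per point
				MetricValue: point.Value.String(), // stored as a string to keep arbitrary precision
			}
			if err := db.Create(&row).Error; err != nil {
				return err
			}
		}
	}
	return nil
}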

@SandersAaronD force-pushed the sandersaarond/mysql-model-metrics branch from 53d8df6 to e068d1e on October 1, 2024 15:41
@SandersAaronD marked this pull request as draft October 2, 2024 20:31
@annanay25 (Contributor) left a comment:

Left an initial round of comments, nice work so far!

StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
ProcessID uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey;foreignKey:ProcessID;references:ID"` // Foreign key
MetricName string `json:"metric_name" gorm:"size:32;not null;primaryKey"`
StepName string `json:"step_name" gorm:"size:32;not null;primaryKey"`
Contributor:

It's not entirely obvious what StepName is and why we need it. We should add comments explaining the fields.

@SandersAaronD (author):

Added a block comment explaining this structure as a whole, because it seemed too interconnected to meaningfully document line by line. Hopefully it is clear enough.
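For reference, the GORM tags on those fields already encode part of the structure. Below is a hedged sketch of the kind of field-level commentary that could accompany them; the notes describe what the GORM tags themselves do, not this PR's actual block comment, and the struct name and uuid package are assumptions.

// Sketch only: an annotated copy of the key fields, not the PR's block comment.
type ModelMetricsKey struct {
	// Every field tagged gorm:"primaryKey" joins one composite primary key,
	// so a row is identified by the full combination of these fields.
	StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
	// type:char(36) stores the UUID in its textual form.
	ProcessID uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey"`
	// size:32 maps these to VARCHAR(32) columns in MySQL.
	MetricName string `json:"metric_name" gorm:"size:32;not null;primaryKey"`
	StepName   string `json:"step_name" gorm:"size:32;not null;primaryKey"`
}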

Additional resolved review threads: ai-training-api/model/model_metrics.go (outdated), ai-training-api/app/model_metrics_test.go.
)

type ModelMetrics struct {
StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
Contributor:

Can we use TenantID as that's the consistent key we are using in other structs? Or can we update the other structs to use StackID as a uint64?

@SandersAaronD (author):

I am fine doing it either way as long as it's consistent. I think uint64 (or really, any fixed-size field) gives us nicer performance, so I am definitely sticking with that. I will probably push a change later to swap everything to StackID; it seems easier than having two different names for something that is always 1:1.

Additional resolved review thread: ai-training-api/app/model_metrics.go (outdated).
Comment on lines 19 to 27
// Incoming format is an array of these
type ModelMetricsSeries struct {
	MetricName string `json:"metric_name"`
	StepName   string `json:"step_name"`
	Points     []struct {
		Step  uint32      `json:"step"`
		Value json.Number `json:"value"`
	} `json:"points"`
}
Contributor:

Why are we storing metrics as individual rows (one row per step and value) in MySQL? Why not store an array like in this struct? Is it not performant? Curious whether any benchmarking was done to choose the former.

Contributor:

Agreed. It would also prevent us from compressing the points in the future, which could mean a lot more disk and network usage. Do we ever not pull back all of the points at once?

@SandersAaronD (author):

I'm not 100% sure whether it's more performant to store a bunch of points in one row. Generally the advice is one datum = one row, and the SQL engine is pretty well optimized for that.

Right now we pull back all data at once. As a scaling concern we can also step through the data and only grab every Nth point for some N, and that shouldn't be a bad query pattern... but it's hard to be sure until you try it.

If there's a cleaner way to store and query this, I'd prefer to switch to it later.

One note: because people can log arbitrary-precision Python numbers (and decimals) and storage needs to be exact, the value goes into and comes out of the database as a string, which means it won't compress especially well.
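For what it's worth, the "every Nth point" pattern mentioned above could look roughly like the sketch below against the one-row-per-point layout. The MOD-based filter, the function shape, and the step column are my assumptions, not anything implemented in this PR.

// Hypothetical downsampling read, not this PR's query: fetch roughly every
// nth step of one metric for one process. Column names follow the struct
// above and assume a step column on the row.
func readDownsampled(db *gorm.DB, processID uuid.UUID, metricName string, n int) ([]ModelMetrics, error) {
	var points []ModelMetrics
	err := db.
		Where("process_id = ? AND metric_name = ?", processID, metricName).
		Where("MOD(step, ?) = 0", n).
		Order("step").
		Find(&points).Error
	return points, err
}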

MetricValue string `json:"metric_value" gorm:"size:64;not null"`

Process Process `gorm:"foreignKey:ProcessID;references:ID"` // Relationship definition
}
Contributor:

Should we also track a timestamp for each step? I feel that might be helpful for tracking time between training runs, performance tracking, etc.

@SandersAaronD (author):

I think this is a good idea, but I would want to move it to a separate issue.
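If that separate issue goes ahead, the change would presumably be one extra column on the metrics row. A minimal hedged sketch, assuming the standard time package; the field name and tags are my guesses, not part of this PR:

// Possible shape for the follow-up issue, not part of this PR: one extra
// column recording when each point was logged, so time between steps and
// between runs could be compared later.
type timestampedPointSketch struct {
	// ... the existing ModelMetrics fields would stay as they are ...
	Timestamp time.Time `json:"timestamp" gorm:"not null;index"`
}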

@SandersAaronD marked this pull request as ready for review October 18, 2024 14:50
@SandersAaronD (author):

I think this is ready for an actual review. I have these TODOs that I'd like to finish before merging:

  1. Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.
  2. Spin off separate issues for things that are definitely good ideas but that seem likely to bog down this already rather heavy PR. These are:
  • Centralize boilerplate for testing, especially around DB mocking/setup
  • Add timestamp storage for each incoming datapoint

@csmarchbanks (Contributor):

> Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.

I would say to be more consistent with other Grafana services (Mimir, Loki, Tempo) we should have it be a string. The number just happens to be how we identify instances in Grafana Cloud.

Spinning off separate issues for followup seems like a good idea.

@SandersAaronD (author):

> Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.
>
> I would say to be more consistent with other Grafana services (Mimir, Loki, Tempo) we should have it be a string. The number just happens to be how we identify instances in Grafana Cloud.
>
> Spinning off separate issues for followup seems like a good idea.

Separate issues are spun off.

I was actually going off of "StackID as an integer" because that's what gcom gives back, but that appears to be the anomaly here. I'll change this to TenantID as a string; it might take a little fiddling because the code around types here is coupled to this choice.

@csmarchbanks (Contributor) left a comment:

A few small comments, might need some help fully setting up the dev experience to test everything out.

Additional resolved review threads: ai-training-api/app/model_metrics.go.
@annanay25 (Contributor) left a comment:

This is looking good. I am approving to unblock but do consider going through and addressing the remaining comments/questions on this PR!

@@ -1,42 +1,48 @@
from typing import Dict, Union, Optional
Contributor:

Just checking whether logging the process IDs (and in the current format) was intentional:
[screenshot omitted]

@SandersAaronD (author):

It wasn't, but I am somewhat inclined to leave it in. I think I will push a separate issue/PR to figure out what we want to log from the Python exporter.

@@ -1,10 +1,14 @@
import React, { useEffect, useRef } from 'react';
Contributor:

How can I see other metadata about the processes? For example: which one was test/train/val, etc.?
[screenshot omitted]

Additional resolved review thread: ai-training-api/app/model_metrics.go.
timeRange: tmpTimeRange,
}, [rows, getModelMetrics]);

function makePanelFromData(panelData: any) {
Contributor:

Testing from the Jupyter notebook where two metrics are being logged, loss and accuracy: how do I chart both of them? It seems like just loss is hardcoded for rendering in the charts.

@SandersAaronD (author):

There was a bug with this that I think I have fixed now. (I was assuming a sort order; when I sort by more than one column, MySQL doesn't return the order I expect, but when I sort by only one column it does, for reasons I don't understand.)
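For reference, the two orderings being described would look roughly like the sketch below in GORM; this is illustrative only, not the PR's code, and the function shape and column names are my assumptions.

// Illustrative sketch of the two orderings discussed, not this PR's code.
// The expectation is rows grouped by metric and ordered by step; only the
// single-column version came back in the expected order.
func readOrdered(db *gorm.DB, processID uuid.UUID, multiColumn bool) ([]ModelMetrics, error) {
	var points []ModelMetrics
	q := db.Where("process_id = ?", processID)
	if multiColumn {
		q = q.Order("metric_name").Order("step")
	} else {
		q = q.Order("step")
	}
	err := q.Find(&points).Error
	return points, err
}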

@SandersAaronD merged commit 562b5d9 into main on Oct 23, 2024
4 checks passed