Sandersaarond/mysql model metrics #98
Conversation
53d8df6 to e068d1e
Left an initial round of comments, nice work so far!
StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
ProcessID uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey;foreignKey:ProcessID;references:ID"` // Foreign key
MetricName string `json:"metric_name" gorm:"size:32;not null;primaryKey"`
StepName string `json:"step_name" gorm:"size:32;not null;primaryKey"`
It's not entirely obvious what StepName is and why we need it. We should add comments explaining the fields.
Added a block comment explaining this structure as a whole, because it seemed too interconnected to meaningfully document line by line. Hopefully it is clear enough.
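For illustration, here is a sketch of the kind of block comment described, written over the fields shown above. The reading of StepName and of the composite key is an assumption made for this example, not text taken from the PR.

// ModelMetrics is one logged value for a training-run metric (assumed reading).
// Rows are keyed by (StackID, ProcessID, MetricName, StepName): a single
// process can log several named metrics, and StepName is assumed to identify
// which step counter the metric is logged against (e.g. epoch vs. batch).
type ModelMetrics struct {
	StackID    uint64    `json:"stack_id" gorm:"not null;primaryKey"`
	ProcessID  uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey;foreignKey:ProcessID;references:ID"` // Foreign key
	MetricName string    `json:"metric_name" gorm:"size:32;not null;primaryKey"`
	StepName   string    `json:"step_name" gorm:"size:32;not null;primaryKey"`
}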
)

type ModelMetrics struct {
	StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
Can we use TenantID as that's the consistent key we are using in other structs? Or can we update the other structs to use StackID as a uint64?
I think I am fine doing it either way as long as it's consistent. I think uint64 (or really, any fixed-size field) gives us nicer performance, so I'd definitely stick with that. I'll probably push something later to swap everything to StackID; that seems easier than having two different names for something that is always 1:1.
ai-training-api/app/model_metrics.go
Outdated
// Incoming format is an array of these
type ModelMetricsSeries struct {
	MetricName string `json:"metric_name"`
	StepName   string `json:"step_name"`
	Points     []struct {
		Step  uint32      `json:"step"`
		Value json.Number `json:"value"`
	} `json:"points"`
}
Why are we storing metrics as individual rows (one row each for a step and value) in MySQL? Why not store an array like in this struct? Is it not performant? Curious whether some benchmarking was done to choose the former.
Agreed; it would also prevent us from compressing the points in the future, which could end up using a lot more disk and network. Do we ever not pull back all of the points at once?
Not 100% sure whether it's more performant to store a bunch of points in one row. The general advice is one piece of data per row, and the SQL engine is well optimized for that.
Right now we pull back all data at once. If scaling becomes a concern, we could also step through the data and only grab every Nth point for some N; that shouldn't be a bad query pattern ... but it's hard to be sure until you try it.
If there's a cleaner way to store and query this, I'd prefer to switch to that later.
Oh, a note: because people can log arbitrary-precision Python numbers (and decimals), and storage needs to be exact, the value goes into and comes out of the database as a string, which means it won't compress very well.
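To make the every-Nth-point idea concrete, here is a rough sketch of what such a query could look like with GORM. It assumes the table has a numeric step column (one row per point, as described above), that uuid comes from github.com/google/uuid, and that a gorm.DB handle is available; the function name and filter are illustrative, not code from this PR.

import (
	"github.com/google/uuid"
	"gorm.io/gorm"
)

// sampledPoints returns every nth point of one metric for one process,
// ordered by step. Hypothetical helper, not part of this PR.
func sampledPoints(db *gorm.DB, processID uuid.UUID, metricName string, n int) ([]ModelMetrics, error) {
	var points []ModelMetrics
	err := db.
		Where("process_id = ? AND metric_name = ?", processID, metricName).
		Where("step % ? = 0", n). // keep only every nth step
		Order("step ASC").
		Find(&points).Error
	return points, err
}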
MetricValue string `json:"metric_value" gorm:"size:64;not null"`

Process Process `gorm:"foreignKey:ProcessID;references:ID"` // Relationship definition
}
Should we also track a timestamp for each step? I feel that might be helpful for tracking time between training runs, performance tracking, etc.
I think this is a good idea, but I would want to move it to a separate issue.
DO NOT MERGE
I think this is ready for an actual review. I have these TODOs that I'd like to address before merging:
I would say that, to be more consistent with other Grafana services (Mimir, Loki, Tempo), we should have it be a string. The number just happens to be how we identify instances in Grafana Cloud. Spinning off separate issues for follow-up seems like a good idea.
Separate issues are spun off. I was actually going off of "StackID as an integer" because that's what gcom gives back, but that appears to be the anomaly here. I'll switch this to TenantID as a string; it might take a little fiddling because the surrounding code is coupled to this type choice.
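As a rough illustration of the swap being described (the actual column type and size chosen in the follow-up may differ), the key field would change from a numeric StackID to a string TenantID:

type ModelMetrics struct {
	// TenantID replaces StackID; stored as a string to match the other structs
	// and other Grafana services. Size 64 is an assumption for this sketch.
	TenantID string `json:"tenant_id" gorm:"size:64;not null;primaryKey"`
	// ... remaining fields as above ...
}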
A few small comments, might need some help fully setting up the dev experience to test everything out.
This is looking good. I am approving to unblock but do consider going through and addressing the remaining comments/questions on this PR!
@@ -1,42 +1,48 @@
from typing import Dict, Union, Optional |
It wasn't, but I'm inclined to leave it in. I think I'll push a separate issue/PR to figure out what we want to log from the Python exporter.
@@ -1,10 +1,14 @@
import React, { useEffect, useRef } from 'react'; |
timeRange: tmpTimeRange,
}, [rows, getModelMetrics]);

function makePanelFromData(panelData: any) {
Testing from the Jupyter notebook where two metrics are being logged: loss and accuracy. How do I chart both of them? It seems like just loss is hardcoded for rendering in the charts.
There was a bug with this that I think I have fixed now. (I was assuming a sort order; MySQL doesn't return the order I expect when I sort by more than one column, but it does when I sort by only one column, for reasons I still don't understand.)
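One way the fix described here could look (an assumption for illustration, not the PR's actual code) is to read each metric's series separately and rely only on a single-column sort by step, so rendering never depends on multi-column ordering:

import (
	"github.com/google/uuid"
	"gorm.io/gorm"
)

// readMetricSeries fetches each requested metric on its own, sorted only by
// step, and returns them keyed by metric name. Hypothetical helper.
func readMetricSeries(db *gorm.DB, processID uuid.UUID, metricNames []string) (map[string][]ModelMetrics, error) {
	series := make(map[string][]ModelMetrics, len(metricNames))
	for _, name := range metricNames {
		var rows []ModelMetrics
		if err := db.
			Where("process_id = ? AND metric_name = ?", processID, name).
			Order("step ASC"). // single-column sort per metric
			Find(&rows).Error; err != nil {
			return nil, err
		}
		series[name] = rows
	}
	return series, nil
}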
WIP
Meant to transition writing and reading of model metrics from Loki to MySQL. Currently a writer is implemented, and the reader function is there but not yet attached to an endpoint (a rough sketch of that wiring is included at the end of this section).
TODO for this:
Unclear if this can be done as part of this PR, but I have a dangling concern about authentication: it isn't clear to me whether tenantID will always be populated correctly in production, since our local setup just defaults it to 0, which makes this hard to test locally. Either we need to find a way to test this locally, or we won't know about it until we deploy.
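Since the description mentions the reader function exists but is not yet attached to an endpoint, here is a minimal sketch of what that wiring could look like with the standard library router; the path, handler name, and function signature are assumptions, not this repository's actual routing code.

import "net/http"

// registerModelMetricsRoutes attaches a (hypothetical) reader handler to an
// HTTP route. Requires Go 1.22+ for method/path patterns in ServeMux.
func registerModelMetricsRoutes(mux *http.ServeMux, getModelMetrics http.HandlerFunc) {
	mux.HandleFunc("GET /process/{id}/model-metrics", getModelMetrics)
}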