Sandersaarond/mysql model metrics #98

Merged
44 commits merged into main from sandersaarond/mysql-model-metrics on Oct 23, 2024

Conversation

@SandersAaronD (Collaborator) commented Sep 27, 2024

WIP

This is meant to transition writing and reading of model metrics from Loki to MySQL. It currently has a writer implemented; the reader function exists but is not yet attached to an endpoint.

TODO for this:

  • Make sure the reader function works and serves an endpoint.
  • Have the UI query that endpoint instead of Loki.
  • Switch the Python client to write to this (the format was changed slightly to allow for debouncing, so that we don't end up with too many HTTP connections open at once for frequent logging jobs; see the sketch after this description).

Unclear if this can be done as part of this PR, but I have a dangling concern about authentication: it isn't clear to me whether tenantID will always be populated correctly in production, since our local setup just defaults it to 0, which makes this hard to test locally. Either we need to find a way to test this locally, or we won't know about it until we deploy.
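As a rough illustration of the write path described above, here is a minimal sketch of how a debounced batch might be persisted as one row per point. The function name, the Step column, and the github.com/google/uuid and gorm.io/gorm packages are my assumptions for the sketch, not code from this PR; it relies on the ModelMetricsSeries and ModelMetrics types discussed later in the review.

package metrics

import (
	"github.com/google/uuid"
	"gorm.io/gorm"
)

// writeMetricsBatch is a hypothetical sketch, not this PR's handler: it takes
// a debounced batch of series and writes one MySQL row per (metric, step)
// point through GORM.
func writeMetricsBatch(db *gorm.DB, stackID uint64, processID uuid.UUID, batch []ModelMetricsSeries) error {
	for _, series := range batch {
		for _, point := range series.Points {
			row := ModelMetrics{
				StackID:     stackID,
				ProcessID:   processID,
				MetricName:  series.MetricName,
				StepName:    series.StepName,
				Step:        point.Step,           // assumed column, one row per point
				MetricValue: point.Value.String(), // stored as a string to keep arbitrary precision
			}
			if err := db.Create(&row).Error; err != nil {
				return err
			}
		}
	}
	return nil
}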

@SandersAaronD force-pushed the sandersaarond/mysql-model-metrics branch from 53d8df6 to e068d1e on October 1, 2024 15:41
@SandersAaronD marked this pull request as draft October 2, 2024 20:31
@annanay25 (Contributor) left a comment:

Left an initial round of comments, nice work so far!

StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
ProcessID uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey;foreignKey:ProcessID;references:ID"` // Foreign key
MetricName string `json:"metric_name" gorm:"size:32;not null;primaryKey"`
StepName string `json:"step_name" gorm:"size:32;not null;primaryKey"`
Contributor:

It's not entirely obvious what StepName is and why we need it. We should add comments explaining the fields.

@SandersAaronD (author):

Added a block comment explaining this structure as a whole, because it seemed too interconnected to meaningfully document line by line. Hopefully it is clear enough.
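For reference, the GORM tags on those fields already encode part of the structure. Below is a hedged sketch of the kind of field-level commentary that could accompany them; the notes describe what the GORM tags themselves do, not this PR's actual block comment, and the struct name and uuid package are assumptions.

// Sketch only: an annotated copy of the key fields, not the PR's block comment.
type ModelMetricsKey struct {
	// Every field tagged gorm:"primaryKey" joins one composite primary key,
	// so a row is identified by the full combination of these fields.
	StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
	// type:char(36) stores the UUID in its textual form.
	ProcessID uuid.UUID `json:"process_id" gorm:"type:char(36);not null;primaryKey"`
	// size:32 maps these to VARCHAR(32) columns in MySQL.
	MetricName string `json:"metric_name" gorm:"size:32;not null;primaryKey"`
	StepName   string `json:"step_name" gorm:"size:32;not null;primaryKey"`
}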

Additional resolved review threads: ai-training-api/model/model_metrics.go (outdated), ai-training-api/app/model_metrics_test.go.
)

type ModelMetrics struct {
StackID uint64 `json:"stack_id" gorm:"not null;primaryKey"`
Contributor:

Can we use TenantID as that's the consistent key we are using in other structs? Or can we update the other structs to use StackID as a uint64?

@SandersAaronD (author):

I am fine doing it either way as long as it's consistent. I think uint64 (or really, any fixed-size field) gives us nicer performance, so I am definitely sticking with that. I will probably push a change later to swap everything to StackID; it seems easier than having two different names for something that is always 1:1.

Additional resolved review thread: ai-training-api/app/model_metrics.go (outdated).
Comment on lines 19 to 27
// Incoming format is an array of these
type ModelMetricsSeries struct {
	MetricName string `json:"metric_name"`
	StepName   string `json:"step_name"`
	Points     []struct {
		Step  uint32      `json:"step"`
		Value json.Number `json:"value"`
	} `json:"points"`
}
Contributor:

Why are we storing metrics as individual rows (one row per step and value) in MySQL? Why not store an array like in this struct? Is it not performant? Curious whether any benchmarking was done to choose the former.

Contributor:

Agreed. It would also prevent us from compressing the points in the future, which could mean a lot more disk and network usage. Do we ever not pull back all of the points at once?

@SandersAaronD (author):

I'm not 100% sure whether it's more performant to store a bunch of points in one row. Generally the advice is one datum = one row, and the SQL engine is pretty well optimized for that.

Right now we pull back all data at once. As a scaling concern we can also step through the data and only grab every Nth point for some N, and that shouldn't be a bad query pattern... but it's hard to be sure until you try it.

If there's a cleaner way to store and query this, I'd prefer to switch to it later.

One note: because people can log arbitrary-precision Python numbers (and decimals) and storage needs to be exact, the value goes into and comes out of the database as a string, which means it won't compress especially well.
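For what it's worth, the "every Nth point" pattern mentioned above could look roughly like the sketch below against the one-row-per-point layout. The MOD-based filter, the function shape, and the step column are my assumptions, not anything implemented in this PR.

// Hypothetical downsampling read, not this PR's query: fetch roughly every
// nth step of one metric for one process. Column names follow the struct
// above and assume a step column on the row.
func readDownsampled(db *gorm.DB, processID uuid.UUID, metricName string, n int) ([]ModelMetrics, error) {
	var points []ModelMetrics
	err := db.
		Where("process_id = ? AND metric_name = ?", processID, metricName).
		Where("MOD(step, ?) = 0", n).
		Order("step").
		Find(&points).Error
	return points, err
}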

MetricValue string `json:"metric_value" gorm:"size:64;not null"`

Process Process `gorm:"foreignKey:ProcessID;references:ID"` // Relationship definition
}
Contributor:

Should we also track a timestamp for each step? I feel that might be helpful for tracking time between training runs, performance tracking, etc.

@SandersAaronD (author):

I think this is a good idea, but I would want to move it to a separate issue.
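If that separate issue goes ahead, the change would presumably be one extra column on the metrics row. A minimal hedged sketch, assuming the standard time package; the field name and tags are my guesses, not part of this PR:

// Possible shape for the follow-up issue, not part of this PR: one extra
// column recording when each point was logged, so time between steps and
// between runs could be compared later.
type timestampedPointSketch struct {
	// ... the existing ModelMetrics fields would stay as they are ...
	Timestamp time.Time `json:"timestamp" gorm:"not null;index"`
}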

@SandersAaronD marked this pull request as ready for review October 18, 2024 14:50
@SandersAaronD (author):

I think this is ready for an actual review. I have these TODOs that I'd like to finish before merging:

  1. Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.
  2. Spin off separate issues for things that are definitely good ideas but that seem likely to bog down this already rather heavy PR. These are:
  • Centralize boilerplate for testing, especially around DB mocking/setup
  • Add timestamp storage for each incoming datapoint

@csmarchbanks (Contributor):

> Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.

I would say to be more consistent with other Grafana services (Mimir, Loki, Tempo) we should have it be a string. The number just happens to be how we identify instances in Grafana Cloud.

Spinning off separate issues for followup seems like a good idea.

@SandersAaronD (author):

> Standardize references to TenantID or StackID, I think I will favor StackID and try to make it a uint64 in all cases.
>
> I would say to be more consistent with other Grafana services (Mimir, Loki, Tempo) we should have it be a string. The number just happens to be how we identify instances in Grafana Cloud.
>
> Spinning off separate issues for followup seems like a good idea.

Separate issues are spun off.

I was actually going off of "StackID as an integer" because that's what gcom gives back, but that appears to be the anomaly here. I'll change this to TenantID as a string; it might take a little fiddling because the code around types here is coupled to this choice.

@csmarchbanks (Contributor) left a comment:

A few small comments, might need some help fully setting up the dev experience to test everything out.

Additional resolved review threads: ai-training-api/app/model_metrics.go.
@annanay25 (Contributor) left a comment:

This is looking good. I am approving to unblock but do consider going through and addressing the remaining comments/questions on this PR!

@@ -1,42 +1,48 @@
from typing import Dict, Union, Optional
Contributor:

Just checking whether logging the process IDs (and in the current format) was intentional:
[screenshot omitted]

@SandersAaronD (author):

It wasn't, but I am somewhat inclined to leave it in. I think I will push a separate issue/PR to figure out what we want to log from the Python exporter.

@@ -1,10 +1,14 @@
import React, { useEffect, useRef } from 'react';
Contributor:

How can I see other metadata about the processes? For example: which one was test/train/val, etc.?
[screenshot omitted]

Additional resolved review thread: ai-training-api/app/model_metrics.go.
timeRange: tmpTimeRange,
}, [rows, getModelMetrics]);

function makePanelFromData(panelData: any) {
Contributor:

Testing from the Jupyter notebook where two metrics are being logged, loss and accuracy: how do I chart both of them? It seems like just loss is hardcoded for rendering in the charts.

@SandersAaronD (author):

There was a bug with this that I think I have fixed now. (I was assuming a sort order; when I sort by more than one column, MySQL doesn't return the order I expect, but when I sort by only one column it does, for reasons I don't understand.)
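For reference, the two orderings being described would look roughly like the sketch below in GORM; this is illustrative only, not the PR's code, and the function shape and column names are my assumptions.

// Illustrative sketch of the two orderings discussed, not this PR's code.
// The expectation is rows grouped by metric and ordered by step; only the
// single-column version came back in the expected order.
func readOrdered(db *gorm.DB, processID uuid.UUID, multiColumn bool) ([]ModelMetrics, error) {
	var points []ModelMetrics
	q := db.Where("process_id = ?", processID)
	if multiColumn {
		q = q.Order("metric_name").Order("step")
	} else {
		q = q.Order("step")
	}
	err := q.Find(&points).Error
	return points, err
}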

@SandersAaronD merged commit 562b5d9 into main on Oct 23, 2024
4 checks passed