Slurm Collector: Handle job state / missing timestamps #811
First part of the problem
Currently, the collector naively takes a list of job states from the config file and tries to collect all corresponding jobs. But not all job states make sense in this context. For example, Pending makes no sense, since a pending job has not started yet and has no usage to account for. I think we should document a list of job states that can sensibly be used with this collector.
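One way to enforce such a documented list would be to validate the configured states when the collector starts. Below is a minimal Rust sketch; SUPPORTED_JOB_STATES, validate_job_states, and the particular states in the list are hypothetical illustrations, not the collector's actual code or a vetted list.

```rust
// Hypothetical allow-list of sacct job states for which accounting plausibly
// makes sense. Illustrative only; the real list still has to be decided.
const SUPPORTED_JOB_STATES: &[&str] = &[
    "COMPLETED",
    "FAILED",
    "TIMEOUT",
    "OUT_OF_MEMORY",
    "NODE_FAIL",
    "PREEMPTED",
    // Needs extra care: a cancelled job may never have started.
    "CANCELLED",
];

/// Returns Ok(()) if every configured state is supported, otherwise the
/// (upper-cased) states that are not, so the collector can refuse to start.
fn validate_job_states(configured: &[&str]) -> Result<(), Vec<String>> {
    let rejected: Vec<String> = configured
        .iter()
        .map(|state| state.to_ascii_uppercase())
        .filter(|state| !SUPPORTED_JOB_STATES.contains(&state.as_str()))
        .collect();
    if rejected.is_empty() {
        Ok(())
    } else {
        Err(rejected)
    }
}

fn main() {
    // "Pending" is rejected: a pending job has no usage to account for yet.
    match validate_job_states(&["Completed", "Pending"]) {
        Ok(()) => println!("all configured job states are supported"),
        Err(bad) => eprintln!("unsupported job states in config: {:?}", bad),
    }
}
```

Failing at startup with a clear message is arguably friendlier than silently collecting nothing for a state like Pending.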
Second part
There are job states that are a little more involved, like Cancelled. You might want to account for cancelled jobs if they were cancelled after running for a few days. On the other hand, there is no guarantee that a cancelled job was ever started. In that case, start_time is Unknown, and tokenization of the sacct output will fail at AUDITOR/collectors/slurm/src/sacctcaller.rs, line 206 (commit 256395c).
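A minimal sketch of a time-field parser that tolerates Unknown instead of failing, assuming sacct prints times as YYYY-MM-DDTHH:MM:SS. SacctTime and parse_sacct_time are hypothetical names, and this is not the code in sacctcaller.rs; a real implementation would more likely use a date/time crate such as chrono.

```rust
/// Timestamp as printed by sacct, assumed to look like "2023-05-04T13:37:00"
/// (no time zone handling in this sketch).
#[derive(Debug, PartialEq)]
struct SacctTime {
    year: u32,
    month: u32,
    day: u32,
    hour: u32,
    minute: u32,
    second: u32,
}

/// Parses one sacct time field. "Unknown" (e.g. the start time of a job that
/// never started) yields Ok(None) instead of an error; any other malformed
/// value is still reported as Err so real problems are not hidden.
fn parse_sacct_time(field: &str) -> Result<Option<SacctTime>, String> {
    let field = field.trim();
    if field == "Unknown" {
        return Ok(None);
    }
    let malformed = || format!("malformed sacct timestamp: {:?}", field);
    let (date, time) = field.split_once('T').ok_or_else(malformed)?;
    let d: Vec<&str> = date.split('-').collect();
    let t: Vec<&str> = time.split(':').collect();
    if d.len() != 3 || t.len() != 3 {
        return Err(malformed());
    }
    let num = |s: &str| s.parse::<u32>().map_err(|_| malformed());
    Ok(Some(SacctTime {
        year: num(d[0])?,
        month: num(d[1])?,
        day: num(d[2])?,
        hour: num(t[0])?,
        minute: num(t[1])?,
        second: num(t[2])?,
    }))
}

fn main() {
    for field in ["2023-05-04T13:37:00", "Unknown"] {
        println!("{} -> {:?}", field, parse_sacct_time(field));
    }
}
```

Returning Option pushes the decision about what a missing timestamp means up to the caller, which knows the job state.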
I believe we need to define which fields may be missing for which job states, and have the collector ignore such entries (like a Cancelled job with no start_time) instead of crashing. I haven't thought the full list through yet (https://slurm.schedmd.com/sacct.html#SECTION_JOB-STATE-CODES); Cancelled might be the only problem.
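The per-state rules could be expressed as an explicit decision per record, so that a Cancelled job without a start time is skipped rather than crashing the collector. This Rust sketch assumes the sacct line has already been split into fields, with Unknown mapped to None; SacctRecord, Decision, and decide are hypothetical, and the assumption that sacct may print the state as "CANCELLED by <uid>" should be checked against the Slurm version in use.

```rust
/// Hypothetical, simplified view of one tokenized sacct line. A time field
/// is None when sacct printed "Unknown".
#[derive(Debug)]
struct SacctRecord {
    job_id: String,
    state: String, // e.g. "COMPLETED", or (assumed) "CANCELLED by 1000"
    start_time: Option<String>,
    end_time: Option<String>,
}

/// What the collector should do with one record.
#[derive(Debug, PartialEq)]
enum Decision {
    Account,
    Skip(String),
    Error(String),
}

/// Encodes which missing fields are acceptable for which states.
/// Assumes the configured states were already validated.
fn decide(record: &SacctRecord) -> Decision {
    // Compare only the first word, in case sacct appends details ("by <uid>").
    let state = record.state.split_whitespace().next().unwrap_or("");
    match (state, &record.start_time, &record.end_time) {
        // Started and finished: usage can be accounted for.
        (_, Some(_), Some(_)) => Decision::Account,
        // Cancelled before it ever started: nothing to account for.
        ("CANCELLED", None, _) => Decision::Skip(format!(
            "job {} was cancelled before it started",
            record.job_id
        )),
        // Anything else is unexpected; report it instead of panicking.
        _ => Decision::Error(format!(
            "job {} in state {} has missing timestamps",
            record.job_id, state
        )),
    }
}

fn main() {
    let records = [
        SacctRecord {
            job_id: "1".into(),
            state: "COMPLETED".into(),
            start_time: Some("2023-05-04T10:00:00".into()),
            end_time: Some("2023-05-04T12:00:00".into()),
        },
        SacctRecord {
            job_id: "2".into(),
            state: "CANCELLED by 1000".into(),
            start_time: None,
            end_time: Some("2023-05-04T11:00:00".into()),
        },
    ];
    for record in &records {
        match decide(record) {
            Decision::Account => println!("accounting job {}", record.job_id),
            Decision::Skip(reason) => println!("skipping: {}", reason),
            Decision::Error(reason) => eprintln!("error: {}", reason),
        }
    }
}
```

Keeping Skip separate from Error means a genuinely malformed line is still surfaced, while the expected case of a never-started cancelled job is simply logged and ignored.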