[WIP] plugin: add new jobtap plugin to track job usage across associations' jobs #770
Problem
As mentioned by @vsoch in #650, there is currently no way to limit an association's ability to run jobs on a certain instance type. flux-accounting already tracks job usage in its SQLite database (calculated as the product of the number of nodes allocated to a job and its duration), but there is no way to enforce a limit on that usage with flux-accounting's `mf_priority` jobtap plugin.
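Written out, the usage calculation described above is a simple product. The symbols here are notation introduced only for illustration, and the unit (node-seconds) is an assumption based on Flux job durations being expressed in seconds:

```math
\text{usage}_{\text{job}} = n_{\text{nodes}} \times t_{\text{duration}}
```

For example, under that assumption a 4-node job that runs for 3600 seconds would account for $4 \times 3600 = 14400$ node-seconds.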
This PR begins to lay the groundwork for this "usage" limit enforcement by adding a new jobtap plugin (called `compute_hours_limits`) which, for now, just tracks an association's current usage across all of their running jobs (by calculating each job's anticipated usage as the job's `nnodes` multiplied by its requested duration) and computes the job's actual usage when the job completes. The workflow looks something like this:

1. The user submits a job and specifies its size and duration (or a default duration is set on the job):

   `$ flux submit -N4 -S duration=3600 my_job`

2. The plugin takes these resource specifications and calculates an expected usage by multiplying the two numbers together.
3. When the job transitions to `RUN`, the `expected_usage` is added to the association's `current_usage` attribute.
4. When the job completes, the job's actual usage is calculated and added to the association's `total_usage` attribute, and the expected usage from the job is subtracted from the association's `current_usage` attribute.

The associations' `total_usage` attributes can be reset to `0.0` by sending a `"clear"` RPC to the plugin.

To avoid making the plugin code a lot to review, I've only added tracking of an association's job usage in this PR. If this looks like the right way to track an association's usage (and eventually enforce a limit), I can submit follow-up PRs to add that functionality.
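The bookkeeping in the workflow above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the plugin's implementation: the `Association` class, its method names, the `running` dictionary, and the use of node-seconds are all assumptions made for this example, and the formula for actual usage (`nnodes` multiplied by elapsed time) is likewise assumed. Only `expected_usage`, `current_usage`, and `total_usage` mirror names used in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Hypothetical stand-in for the per-association state the plugin keeps."""

    current_usage: float = 0.0  # expected usage of jobs that are currently running
    total_usage: float = 0.0    # actual usage of jobs that have completed
    # job id -> (nnodes, expected usage); a detail of this sketch only
    running: dict = field(default_factory=dict)

    def job_run(self, jobid, nnodes, duration):
        """Job entered RUN: add its expected usage to current_usage."""
        expected_usage = nnodes * duration
        self.running[jobid] = (nnodes, expected_usage)
        self.current_usage += expected_usage

    def job_complete(self, jobid, elapsed):
        """Job finished: record actual usage and release the expected usage."""
        nnodes, expected_usage = self.running.pop(jobid)
        # Assumption: actual usage is nnodes multiplied by the elapsed run time.
        self.total_usage += nnodes * elapsed
        self.current_usage -= expected_usage

    def clear(self):
        """Mimic the effect of the "clear" RPC: reset total_usage to 0.0."""
        self.total_usage = 0.0


assoc = Association()
assoc.job_run(jobid=1, nnodes=4, duration=3600)  # flux submit -N4 -S duration=3600
print(assoc.current_usage)                       # 14400.0 while the job is running
assoc.job_complete(jobid=1, elapsed=1800)        # job ended after 30 minutes
print(assoc.current_usage, assoc.total_usage)    # 0.0 7200.0
assoc.clear()
print(assoc.total_usage)                         # 0.0
```

Keeping the expected usage alongside each running job is what lets the completion step subtract exactly what was added at `RUN`, even when the job ends earlier than its requested duration.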
I've added some basic tests to showcase how the plugin tracks job usage for a given association by submitting one or multiple jobs and ensuring the `current_usage` and `total_usage` values are calculated correctly.

still some things to take care of

- `Job` object after it has transitioned to `job.state.inactive`
- `total_usage` value so it doesn't just increase indefinitely