Commit 0b25de9

updating resource estimation
1 parent e2adfc2 commit 0b25de9

7 files changed (+243 -69 lines)

_config.yml (+1 -1)

@@ -73,7 +73,7 @@ sched:
   interactive: "srun"
   info: "sinfo"
   comment: "#SBATCH"
-  hist: "sacct -u $USER"
+  hist: "sacct"
   hist_filter: ""
 
 episode_order:

_episodes/16-transferring-files.md (+2 -2)

@@ -227,12 +227,12 @@ There are two sets of buttons in the File Explorer.
 * Under the three vertical dots menu next to each filename:
 {% include figure.html url="" max-width="50%"
 file="/fig/file_explorer_btn1.png"
-alt="Connect to cluster" caption="" %}
+alt="OnDemand File Explorer buttons next to filename" caption="" %}
 Those buttons allow you to View, Edit, Rename, Download, or Delete a file.
 * At the top of the window, on the right side:
 {% include figure.html url="" max-width="50%"
 file="/fig/file_explorer_btn2.png"
-alt="Connect to cluster" caption="" %}
+alt="OnDemand File Explorer buttons in top right menu" caption="" %}
 
 | Button | Function |
 | ------ | -------- |

_episodes/19-resources.md (+239 -65)

@@ -20,8 +20,8 @@ might matter.
 
 ## Estimating Required Resources Using the Scheduler
 
-Although we covered requesting resources from the scheduler earlier with the
-π code, how do we know what type of resources the software will need in
+Although we covered requesting resources from the scheduler earlier,
+how do we know what type of resources the software will need in
 the first place, and its demand for each? In general, unless the software
 documentation or user testimonials provide some idea, we won't know how much
 memory or compute time a program will need.
@@ -34,89 +34,263 @@ memory or compute time a program will need.
 > written up guidance for getting the most out of it.
 {: .callout}
 
-A convenient way of figuring out the resources required for a job to run
-successfully is to submit a test job, and then ask the scheduler about its
-impact using `{{ site.sched.hist }}`. You can use this knowledge to set up the
-next job with a closer estimate of its load on the system. A good general rule
-is to ask the scheduler for 20% to 30% more time and memory than you expect the
-job to need. This ensures that minor fluctuations in run time or memory use
-will not result in your job being cancelled by the scheduler. Keep in mind that
-if you ask for too much, your job may not run even though enough resources are
-available, because the scheduler will be waiting for other people's jobs to
-finish and free up the resources needed to match what you asked for.
+Why estimate resources accurately instead of just asking for the maximum
+CPUs/GPUs/RAM/time?
 
-## Stats
+* The more you can fine-tune your requests, the fewer resources you will need
+to request and thus the less time your jobs will wait in the queue.
+* The less your [Fair Share] score will be impacted.
+* Less time until you have your research results!
 
-Since we already submitted `amdahl` to run on the cluster, we can query the
-scheduler to see how long our job took and what resources were used. We will
-use `{{ site.sched.hist }}` to get statistics about `parallel-job.sh`.
+> ## Will my code run faster if I use more than 1 CPU/GPU?
+>
+> Only if your code can use > 1 CPU/GPU. Please read your code's documentation!
+> Look for the software's flags/options for CPUs/threads/cores and match these to
+> sbatch parameters (-c or -n).
+{: .callout}
+
+## Method 1: `ssh` to compute node and monitor performance with `htop`
+
+This method can be done before you scale up and run your code with an `sbatch` script.
+
+1. `srun --pty bash`
+
+2. Load modules and run your code in the background
+
+```
+[SUNetID@wheat01:~]$ python3 mycode.py > /dev/null 2>&1 &
+[SUNetID@wheat01:~]$ htop -u $USER
+```
+{: .language-bash}
+
+You will see how many CPUs, threads and how much RAM your code is using in real time
+by running the htop or top command on the compute node as your code runs in the
+background.
+
+More info: [htop], [top]
+
+_Note: `> /dev/null 2>&1 &` will redirect all code output away from the terminal and keep the command-line prompt available for you to run htop/top._
+
+### `htop` example on a compute node: showing all 4 requested CPUs used
 
 ```
-{{ site.remote.prompt }} {{ site.sched.hist }}
+{{ site.remote.prompt }} srun -c 4 --pty bash
+[SUNetID@wheat01:~]$ ml load matlab
+[SUNetID@wheat01:~]$ matlab -batch "pfor" > /dev/null 2>&1 &
+[SUNetID@wheat01:~]$ htop
 ```
 {: .language-bash}
 
-{% include {{ site.snippets }}/resources/account-history.snip %}
+{% include figure.html url="" max-width="65%"
+file="/fig/htop.png"
+alt="htop with matlab" caption="" %}
+
+## Method 2: Use `seff` to look at resource usage after a job completes
+
+`seff` displays statistics related to the efficiency of resource usage by a
+completed job. These are approximations based on SLURM's job sampling rate of one
+sample every 5 minutes.
 
-This shows all the jobs we ran today (note that there are multiple entries per
-job).
-To get info about a specific job (for example, 347087), we change command
-slightly.
+### `seff <jobid>`
 
 ```
-{{ site.remote.prompt }} {{ site.sched.hist }} {{ site.sched.flag.histdetail }} 347087
+{{ site.remote.prompt }} seff 66594168
 ```
 {: .language-bash}
 
-It will show a lot of info; in fact, every single piece of info collected on
-your job by the scheduler will show up here. It may be useful to redirect this
-information to `less` to make it easier to view (use the left and right arrow
-keys to scroll through fields).
+```
+Job ID: 66594168
+Cluster: sherlock
+User/Group: mpiercy/<PI_SUNETID>
+State: COMPLETED (exit code 0)
+Nodes: 1
+Cores per node: 12
+CPU Utilized: 00:02:31
+CPU Efficiency: 20.97% of 00:12:00 core-walltime
+Job Wall-clock time: 00:01:00
+Memory Utilized: 5.79 GB
+Memory Efficiency: 12.35% of 46.88 GB
+```
+{: .output}
+
+- Look at **CPU Efficiency** and compare it to the number of CPUs you requested.
+For example, if you requested 4 CPUs and **CPU Efficiency** was 20-25%, you probably only
+needed to request 1 CPU.
+
+- The same estimation can be made for the memory request by looking at **Memory Efficiency**.
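A sketch for applying this efficiency check in bulk, assuming `seff` is available on the cluster and using an arbitrary one-week window:

```
# Print a seff summary for every job you ran in the last week
for jobid in $(sacct -u "$USER" -S now-7days -X -n -o jobid); do
    seff "$jobid"
    echo "--------"
done
```

Jobs whose CPU or memory efficiency sits well below 100% are the ones worth resubmitting with smaller requests.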
+
+### Example 1: Job 43498042
 
 ```
-{{ site.remote.prompt }} {{ site.sched.hist }} {{ site.sched.flag.histdetail }} 347087 | less -S
+{{ site.remote.prompt }} seff 43498042
 ```
 {: .language-bash}
 
-> ## Discussion
->
-> This view can help compare the amount of time requested and actually
-> used, duration of residence in the queue before launching, and memory
-> footprint on the compute node(s).
->
-> How accurate were our estimates?
-{: .discussion}
-
-## Improving Resource Requests
-
-From the job history, we see that `amdahl` jobs finished executing in
-at most a few minutes, once dispatched. The time estimate we provided
-in the job script was far too long! This makes it harder for the
-queuing system to accurately estimate when resources will become free
-for other jobs. Practically, this means that the queuing system waits
-to dispatch our `amdahl` job until the full requested time slot opens,
-instead of "sneaking it in" a much shorter window where the job could
-actually finish. Specifying the expected runtime in the submission
-script more accurately will help alleviate cluster congestion and may
-get your job dispatched earlier.
-
-> ## Narrow the Time Estimate
+```
+Job ID: 43498042
+Cluster: sherlock
+User/Group: mpiercy/<PI_SUNETID>
+State: TIMEOUT (exit code 0)
+Nodes: 1
+Cores per node: 2
+CPU Utilized: 00:58:51
+CPU Efficiency: 96.37% of 01:01:04 core-walltime
+Job Wall-clock time: 00:30:32
+Memory Utilized: 3.63 GB
+Memory Efficiency: 90.84% of 4.00 GB
+```
+{: .output}
+
+So, this job ran on all CPUs for the entire time requested, 30 minutes.
+Thus core-walltime is 2 x 30 minutes, or 1 hour. 96.37%, pretty efficient!
+Note that `seff` is for completed jobs only.
+
+### Example 2 (over-requested resources): Job 43507209
+
+```
+{{ site.remote.prompt }} seff 43507209
+```
+{: .language-bash}
+
+```
+Job ID: 43507209
+Cluster: sherlock
+User/Group: mpiercy/ruthm
+State: TIMEOUT (exit code 0)
+Nodes: 1
+Cores per node: 2
+CPU Utilized: 00:29:15
+CPU Efficiency: 48.11% of 01:00:48 core-walltime
+Job Wall-clock time: 00:30:24
+Memory Utilized: 2.65 GB
+Memory Efficiency: 66.17% of 4.00 GB
+```
+{: .output}
+
+Because 2 CPUs were requested for 30 minutes (Job Wall-clock time) but only one was used
+by the code (CPU Utilized), we get a CPU Efficiency of 48.11%, basically 50%.
+2 CPUs were requested for 30 minutes each, so we see 1 hour total core-walltime
+requested but only 30 minutes was used.
+
+So in this case there was no logical reason to request 2 CPUs; 1 would have been sufficient.
+
+The memory was sufficiently utilized and we did not get an out of memory error,
+so we probably don't need to request any extra.
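A sketch of what a trimmed resubmission of this job might look like; the script name is a placeholder, and the longer time limit only reflects the TIMEOUT state shown above:

```
#!/bin/bash
#SBATCH -c 1             # seff showed only ~1 core's worth of CPU time was used
#SBATCH --mem=4G         # 2.65 GB used; the existing 4 GB request is a reasonable cushion
#SBATCH --time=00:40:00  # the job timed out at 30 minutes, so allow a bit longer
python3 mycode.py        # placeholder for the real command
```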
+
+## Method 3: Use `sacct` to look at resource usage after a job completes
+
+We can also use [sacct] to estimate a job's resource requirements. `sacct` provides
+much more detail about our job than `seff` does, and we can customize the output
+with the `-o` flag.
+
+### `sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j <jobid>`
+
+Let's compare two jobs, `20292` and `426651`.
+
+```
+{{ site.remote.prompt }} sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j 20292
+```
+{: .language-bash}
+
+```
+    ReqMem     MaxRSS     AveRSS    Elapsed  AllocCPUS
+---------- ---------- ---------- ---------- ----------
+       4Gn                          00:08:53          1
+       4Gn      3552K      5976K   00:08:57          1
+       4Gn    921256K    921256K   00:08:49          1
+```
+{: .output}
+
+**Here, the job only used .92 GB but 4 GB was requested. So, about 3 GB was needlessly
+requested, and the job waited in the queue longer than needed before it ran.**
+Note that SLURM only samples a job's resources every few minutes, so this is an average.
+Jobs with a MaxRSS close to ReqMem can still get an out-of-memory (OOM) error and
+die. When this happens, request more memory in your sbatch with `--mem=`.
+
+- **ReqMem** = memory that you asked for from SLURM. If it has type Mn, it is MB per node;
+if Mc, it is MB per core
+- **MaxRSS** = maximum amount of memory used at any time by any process in that job.
+This applies directly for serial jobs. For parallel jobs you need to multiply by the
+number of cores
+- **AveRSS** = the average memory used per process (or core). To get the total memory
+used, multiply this by the number of cores used
+- **Elapsed** = time it took to run your job
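MaxRSS and AveRSS above are reported in KB; a sketch of the same query with values scaled to GB, assuming a Slurm release whose sacct supports the `--units` option:

```
# Same query as above, but with memory values converted to GB
sacct -j 20292 --units=G -o reqmem,maxrss,averss,elapsed,alloccpu
```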
+
+```
+{{ site.remote.prompt }} sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j 426651
+```
+{: .language-bash}
+
+```
+    ReqMem     MaxRSS     AveRSS    Elapsed  AllocCPUS
+---------- ---------- ---------- ---------- ----------
+       4Gn                          00:08:53          1
+       4Gn      3552K      5976K   00:08:57          1
+       4Gn   2921256K   2921256K   00:08:49          1
+```
+{: .output}
+
+Here, the job came close to hitting the requested memory of 4 GB: 2.92 GB was used.
+This was a pretty accurate request.
+
+> ## sacct accuracy and sampling rate
 >
-> Edit `parallel_job.sh` to set a better time estimate. How close can
-> you get?
+> sacct memory values are based on a sampling of the application's memory at a specific
+> time.
 >
-> Hint: use `{{ site.sched.flag.time }}`.
+> Remember that sacct results for memory usage (MaxVMSize, AveRSS, MaxRSS) are often not
+> accurate for Out Of Memory (OOM) jobs.
 >
-> > ## Solution
-> >
-> > The following line tells {{ site.sched.name }} that our job should
-> > finish within 2 minutes:
-> >
-> > ```
-> > {{ site.sched.comment }} {{ site.sched.flag.time }}{% if site.sched.name == "Slurm" %} {% else %}={% endif %}00:02:00
-> > ```
-> > {: .language-bash}
-> {: .solution}
-{: .challenge}
+> This is because the job is often terminated before the next sacct sampling, and also
+> before it reaches its full memory allocation.
+{: .callout}
+
+### Example 1: Job 43498042
+
+Here we will use `sacct` to look at resource usage for job 43498042 (the same job that
+is in Example 1 of the `seff` section).
+
+```
+{{ site.remote.prompt }} sacct --format=JobID,state,elapsed,MaxRss,AveRSS,MaxVMSize,TotalCPU,ReqCPUS,ReqMem -j 43498042
+```
+{: .language-bash}
+
+{% include figure.html url="" max-width="100%"
+file="/fig/sacct-ex1.png"
+alt="sacct output for example 1" caption="" %}
+
+So, sacct shows that this job used 3.8 GB of memory (MaxRSS and AveRSS) and 4 GB was
+requested, an accurate request!
+
+**MaxVMSize**, the virtual memory size, is usually not representative of the memory used by
+your application: it aggregates memory ranges from shared libraries, memory
+that has been allocated but not used, swap memory, and so on. It's usually much
+greater than what your job's application actually used. **MaxRSS** and **AveRSS** are
+much better metrics.
+
+There are times when a very large **MaxVMSize** does indeed indicate that insufficient
+memory was requested.
+
+### Example 2: Job 43507209
+
+Here we will use `sacct` to look at resource usage for job 43507209 (the same job that
+is in Example 2 of the `seff` section).
+
+```
+{{ site.remote.prompt }} sacct --format=JobID,state,elapsed,MaxRss,AveRSS,MaxVMSize,TotalCPU,ReqCPUS,ReqMem -j 43507209
+```
+{: .language-bash}
+
+{% include figure.html url="" max-width="100%"
+file="/fig/sacct-ex2.png"
+alt="sacct output for example 2" caption="" %}
+
+So **MaxRSS**, the maximum resident set size of all tasks in the job, was 2.77 GB
+and **AveRSS** was also 2.77 GB. 4 GB was requested, so this was a pretty accurate request.
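The figures above are screenshots; a sketch of pulling the same fields as plain text, using sacct's standard `--parsable2` option for '|'-delimited output:

```
# Pipe-delimited output is easier to grep or paste into a spreadsheet
sacct -j 43507209 --parsable2 --format=JobID,State,Elapsed,MaxRSS,AveRSS,MaxVMSize,TotalCPU,ReqCPUS,ReqMem
```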
 
 {% include links.md %}
+
+[Fair Share]: https://slurm.schedmd.com/fair_tree.html#enduser
+[htop]: https://hisham.hm/htop/
+[top]: https://en.wikipedia.org/wiki/Top_(software)
+[sacct]: https://slurm.schedmd.com/sacct.html

_includes/snippets_library/SRC_FarmShare_slurm/_config_options.yml (+1 -1)

@@ -50,7 +50,7 @@ sched:
   interactive: "srun"
   info: "sinfo"
   comment: "#SBATCH"
-  hist: "sacct -u $USER"
+  hist: "sacct"
   hist_filter: ""
 
 episode_order:

fig/htop.png (353 KB)

fig/sacct-ex1.png (72.8 KB)

fig/sacct-ex2.png (73.5 KB)
