updating resource estimation
cagancayco committed Oct 18, 2024
1 parent e2adfc2 commit 0b25de9
Showing 7 changed files with 243 additions and 69 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -73,7 +73,7 @@ sched:
interactive: "srun"
info: "sinfo"
comment: "#SBATCH"
hist: "sacct -u $USER"
hist: "sacct"
hist_filter: ""

episode_order:
4 changes: 2 additions & 2 deletions _episodes/16-transferring-files.md
@@ -227,12 +227,12 @@ There are two sets of buttons in the File Explorer.
* Under the three vertical dots menu next to each filename:
{% include figure.html url="" max-width="50%"
file="/fig/file_explorer_btn1.png"
alt="Connect to cluster" caption="" %}
alt="OnDemand File Explorer buttons next to filename" caption="" %}
Those buttons allow you to View, Edit, Rename, Download, or Delete a file.
* At the top of the window, on the right side:
{% include figure.html url="" max-width="50%"
file="/fig/file_explorer_btn2.png"
alt="Connect to cluster" caption="" %}
alt="OnDemand File Explorer buttons in top right menu" caption="" %}

| Button | Function |
| ------ | -------- |
304 changes: 239 additions & 65 deletions _episodes/19-resources.md
@@ -20,8 +20,8 @@ might matter.

## Estimating Required Resources Using the Scheduler

Although we covered requesting resources from the scheduler earlier with the
π code, how do we know what type of resources the software will need in
Although we covered requesting resources from the scheduler earlier,
how do we know what type of resources the software will need in
the first place, and its demand for each? In general, unless the software
documentation or user testimonials provide some idea, we won't know how much
memory or compute time a program will need.
@@ -34,89 +34,263 @@ memory or compute time a program will need.
> written up guidance for getting the most out of it.
{: .callout}

A convenient way of figuring out the resources required for a job to run
successfully is to submit a test job, and then ask the scheduler about its
impact using `{{ site.sched.hist }}`. You can use this knowledge to set up the
next job with a closer estimate of its load on the system. A good general rule
is to ask the scheduler for 20% to 30% more time and memory than you expect the
job to need. This ensures that minor fluctuations in run time or memory use
will not result in your job being cancelled by the scheduler. Keep in mind that
if you ask for too much, your job may not run even though enough resources are
available, because the scheduler will be waiting for other people's jobs to
finish and free up the resources needed to match what you asked for.
Why estimate resources accurately when you can just ask for the maximum
CPUs/GPUs/RAM/time?

## Stats
* The more you can fine-tune your requests, the fewer resources you will need
  to request, and thus the less time your jobs will wait in the queue.
* The less your [Fair Share] score will be impacted.
* The sooner you get your research results!

Since we already submitted `amdahl` to run on the cluster, we can query the
scheduler to see how long our job took and what resources were used. We will
use `{{ site.sched.hist }}` to get statistics about `parallel-job.sh`.
> ## Will my code run faster if I use more than 1 CPU/GPU?
>
> Only if your code can use more than one CPU/GPU. Please read your code's documentation!
> Look for the software's flags/options for CPUs/threads/cores and match these to
> the corresponding `sbatch` parameters (`-c` or `-n`), as in the sketch after this callout.
{: .callout}
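
For instance, a minimal submission sketch (assuming a hypothetical `mycode.py` that
accepts a `--threads` option; check your own software's documentation for the
equivalent flag):

```
#!/bin/bash
#SBATCH -c 4                  # ask Slurm for 4 CPU cores on one node
#SBATCH --time=00:10:00
#SBATCH --mem=4G

# Match the program's own thread option to the cores Slurm granted us
python3 mycode.py --threads "$SLURM_CPUS_PER_TASK"
```
{: .language-bash}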

## Method 1: `ssh` to compute node and monitor performance with `htop`

This method can be done before you scale up and run your code with an `sbatch` script.

1. Start an interactive session: `srun --pty bash`

2. Load modules and run your code in the background:

```
[SUNetID@wheat01:~]$ python3 mycode.py > /dev/null 2>&1 &
[SUNetID@wheat01:~]$ htop -u $USER
```
{: .language-bash}
You will see how many CPUs and threads your code is using, and how much RAM,
in real time by running the `htop` or `top` command on the compute node while
your code runs in the background.
More info: [htop], [top]

_Note: `> /dev/null 2>&1 &` redirects all code output away from the terminal and
keeps the command-line prompt available for you to run `htop`/`top`._
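
When you are done profiling, you can stop the background test process before ending
the interactive session (a generic bash sketch, not specific to this lesson's scripts):

```
[SUNetID@wheat01:~]$ jobs      # list background jobs started in this shell
[SUNetID@wheat01:~]$ kill %1   # stop background job number 1
[SUNetID@wheat01:~]$ exit      # leave the interactive session
```
{: .language-bash}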
### `htop` example on a compute node: showing all 4 requested CPUs used
```
{{ site.remote.prompt }} {{ site.sched.hist }}
{{ site.remote.prompt }} srun -c 4 --pty bash
[SUNetID@wheat01:~]$ ml load matlab
[SUNetID@wheat01:~]$ matlab -batch "pfor" > /dev/null 2>&1&
[SUNetID@wheat01:~]$ htop
```
{: .language-bash}
{% include {{ site.snippets }}/resources/account-history.snip %}
{% include figure.html url="" max-width="65%"
file="/fig/htop.png"
alt="htop with matlab" caption="" %}
## Method 2: Use `seff` to look at resource usage after a job completes

`seff` displays statistics related to the efficiency of resource usage by a
completed job. These are approximations based on SLURM's job sampling rate of one
sample every 5 minutes.
This shows all the jobs we ran today (note that there are multiple entries per
job).
To get info about a specific job (for example, 347087), we change the command
slightly.
### `seff <jobid>`
```
{{ site.remote.prompt }} {{ site.sched.hist }} {{ site.sched.flag.histdetail }} 347087
{{ site.remote.prompt }} seff 66594168
```
{: .language-bash}
It will show a lot of info; in fact, every single piece of info collected on
your job by the scheduler will show up here. It may be useful to redirect this
information to `less` to make it easier to view (use the left and right arrow
keys to scroll through fields).
```
Job ID: 66594168
Cluster: sherlock
User/Group: mpiercy/<PI_SUNETID>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 12
CPU Utilized: 00:02:31
CPU Efficiency: 20.97% of 00:12:00 core-walltime
Job Wall-clock time: 00:01:00
Memory Utilized: 5.79 GB
Memory Efficiency: 12.35% of 46.88 GB
```
{: .output}
* Look at **CPU Efficiency** and compare it to the number of CPUs you requested
  (see the worked example after this list).

  For example, if you requested 4 CPUs and **CPU Efficiency** was 20-25%, you probably
  only needed to request 1 CPU.
* The same estimate can be made for the memory request by looking at **Memory Efficiency**.

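
As a worked check using the `seff` output above (back-of-the-envelope arithmetic, not a
Slurm command):

```
core-walltime  = 12 cores x 00:01:00 wall-clock  = 00:12:00
CPU Efficiency = 00:02:31 / 00:12:00             = 20.97%
12 cores x 0.21 = ~2.5, so roughly 2-3 cores would have been enough
```
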
### Example 1: Job 43498042
```
{{ site.remote.prompt }} {{ site.sched.hist }} {{ site.sched.flag.histdetail }} 347087 | less -S
{{ site.remote.prompt }} seff 43498042
```
{: .language-bash}
> ## Discussion
>
> This view can help compare the amount of time requested and actually
> used, duration of residence in the queue before launching, and memory
> footprint on the compute node(s).
>
> How accurate were our estimates?
{: .discussion}

## Improving Resource Requests

From the job history, we see that `amdahl` jobs finished executing in
at most a few minutes, once dispatched. The time estimate we provided
in the job script was far too long! This makes it harder for the
queuing system to accurately estimate when resources will become free
for other jobs. Practically, this means that the queuing system waits
to dispatch our `amdahl` job until the full requested time slot opens,
instead of "sneaking it in" a much shorter window where the job could
actually finish. Specifying the expected runtime in the submission
script more accurately will help alleviate cluster congestion and may
get your job dispatched earlier.

> ## Narrow the Time Estimate
```
Job ID: 43498042
Cluster: sherlock
User/Group: mpiercy/<PI_SUNETID>
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:58:51
CPU Efficiency: 96.37% of 01:01:04 core-walltime
Job Wall-clock time: 00:30:32
Memory Utilized: 3.63 GB
Memory Efficiency: 90.84% of 4.00 GB
```
{: .output}
So this job kept both requested CPUs busy for the entire time requested, 30 minutes.
Thus core-walltime is 2 x 30 minutes, or 1 hour, and a CPU Efficiency of 96.37% is
pretty efficient!
Note that `seff` is for completed jobs only.
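
For jobs that are still running, Slurm's `sstat` command reports similar live statistics
(a sketch; the available fields can vary by Slurm version, see `man sstat` on your cluster):

```
{{ site.remote.prompt }} sstat -j <jobid> --format=JobID,MaxRSS,AveRSS,AveCPU
```
{: .language-bash}
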
### Example 2 (over-requested resources): Job 43507209
```
{{ site.remote.prompt }} seff 43507209
```
{: .language-bash}
```
Job ID: 43507209
Cluster: sherlock
User/Group: mpiercy/ruthm
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:29:15
CPU Efficiency: 48.11% of 01:00:48 core-walltime
Job Wall-clock time: 00:30:24
Memory Utilized: 2.65 GB
Memory Efficiency: 66.17% of 4.00 GB
```
{: .output}
Because 2 CPUs were requested for 30 minutes (Job Wall-clock time) but only one was used
by the code (CPU Utilized), we get a CPU Efficiency of 48.11%, basically 50%.
2 CPUs were requested for 30 minutes each, so we see 1 hour of total core-walltime
requested but only 30 minutes of it was used.
So in this case there was no logical reason to request 2 CPUs; 1 would have been sufficient.
The memory was sufficiently utilized and we did not get an out-of-memory error,
so we probably don't need to request any extra.
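
Based on those numbers, a resubmission sketch might look like the following (hypothetical
script contents; your module names and batch command will differ):

```
#!/bin/bash
#SBATCH -c 1                 # only one CPU was actually kept busy
#SBATCH --mem=4G             # memory was well utilized, so keep the same request
#SBATCH --time=01:00:00      # the job hit its 30-minute limit (TIMEOUT), so allow more time

ml load matlab
matlab -batch "mycode"       # hypothetical MATLAB batch script name
```
{: .language-bash}
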
## Method 3: Use `sacct` to look at resource usage after a job completes
We can also use [sacct] to estimate a job's resource requirements. `sacct` provides
much more detail about our job than `seff` does, and we can customize the output
with the `-o` flag.
### `sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j <jobid>`
Let's compare two jobs, `20292` and `426651`.
```
{{ site.remote.prompt }} sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j 20292
```
{: .language-bash}
```
ReqMem MaxRSS AveRSS Elapsed AllocCPUS
---------- ---------- ---------- ---------- ----------
4Gn 00:08:53 1
4Gn 3552K 5976K 00:08:57 1
4Gn 921256K 921256K 00:08:49 1
```
{: .output}
**Here, the job only used 0.92 GB but 4 GB was requested. So about 3 GB was needlessly
requested, and the job waited in the queue longer than needed before it ran.**

Note that SLURM only samples a job's resources every few minutes, so this is an average.
Jobs with a MaxRSS close to ReqMem can still get an out-of-memory (OOM) error and
die. When this happens, request more memory in your sbatch script with `--mem=`.
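
For example, the memory request directive in a job script might look like this (the value
is illustrative; size it to your own job's MaxRSS plus some headroom):

```
#SBATCH --mem=8G    # per-node memory request; increase it after an OOM failure
```
{: .language-bash}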
* **ReqMem** = memory that you requested from SLURM. If it has type `Mn`, it is per node
  in MB; if `Mc`, then it is MB per core.
* **MaxRSS** = maximum amount of memory used at any time by any process in that job.
  This applies directly for serial jobs. For parallel jobs you need to multiply it by the
  number of cores (values are reported in KB by default; see the tip after this list).
* **AveRSS** = the average memory used per process (or core). To get the total memory
  used, multiply this by the number of cores used.
* **Elapsed** = the time it took to run your job.
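
If the raw kilobyte values are hard to read, `sacct` can report memory in other units
(assuming your Slurm version supports the `--units` flag):

```
{{ site.remote.prompt }} sacct --units=G -o reqmem,maxrss,averss,elapsed,alloccpu -j 20292
```
{: .language-bash}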
```
{{ site.remote.prompt }} sacct -o reqmem,maxrss,averss,elapsed,alloccpu -j 426651
```
{: .language-bash}
```
ReqMem MaxRSS AveRSS Elapsed AllocCPUS
---------- ---------- ---------- ---------- ----------
4Gn 00:08:53 1
4Gn 3552K 5976K 00:08:57 1
4Gn 2921256K 2921256K 00:08:49 1
```
{: .output}
Here the job came close to hitting the requested memory: 2.92 GB of the requested 4 GB
was used. This was a pretty accurate request.
> ## sacct accuracy and sampling rate
>
> Edit `parallel_job.sh` to set a better time estimate. How close can
> you get?
> sacct memory values are based on a sampling of the application's memory at a specific
> time.
>
> Hint: use `{{ site.sched.flag.time }}`.
> Remember that sacct results for memory usage (MaxVMSize, AveRSS, MaxRSS) are often not
> accurate for Out Of Memory (OOM) jobs.
>
> > ## Solution
> >
> > The following line tells {{ site.sched.name }} that our job should
> > finish within 2 minutes:
> >
> > ```
> > {{ site.sched.comment }} {{ site.sched.flag.time }}{% if site.sched.name == "Slurm" %} {% else %}={% endif %}00:02:00
> > ```
> > {: .language-bash}
> {: .solution}
{: .challenge}
> This is because the job is often terminated before the next sacct sample is taken, and
> before it reaches its full memory allocation. One way to check whether a job was in
> fact killed for exceeding its memory limit is shown after this callout.
{: .callout}
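
To confirm whether a job actually died from lack of memory, check its state (a sketch;
Slurm reports `OUT_OF_MEMORY` for jobs killed by the memory limit):

```
{{ site.remote.prompt }} sacct -j <jobid> -o JobID,JobName,State,ExitCode,MaxRSS
```
{: .language-bash}
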
### Example 1: Job 43498042
Here we will use `sacct` to look at resource usage for job 43498042 (the same job that
is in Example 1 of the `seff` section).
```
{{ site.remote.prompt }} sacct --format=JobID,state,elapsed,MaxRss,AveRSS,MaxVMSize,TotalCPU,ReqCPUS,ReqMem -j 43498042
```
{: .language-bash}
{% include figure.html url="" max-width="100%"
file="/fig/sacct-ex1.png"
alt="sacct output for example 1" caption="" %}
So, sacct shows that this job used 3.8 GB of memory (MaxRSS and AveRSS) and 4 GB was
requested: an accurate request!
**MaxVMSize**, the virtual memory size, is usually not representative of the memory used
by your application: it aggregates potential memory ranges from shared libraries, memory
that has been allocated but not used, swap memory, and so on. It is usually much
greater than what your job's application actually used. **MaxRSS** and **AveRSS** are
much better metrics.
There are times when a very large **MaxVMSize** does indeed indicate that insufficient
memory was requested.
### Example 2: Job 43507209
Here we will use `sacct` to look at resource usage for job 43507209 (the same job that
is in Example 2 of the `seff` section).
```
{{ site.remote.prompt }} sacct --format=JobID,state,elapsed,MaxRss,AveRSS,MaxVMSize,TotalCPU,ReqCPUS,ReqMem -j 43507209
```
{: .language-bash}
{% include figure.html url="" max-width="100%"
file="/fig/sacct-ex2.png"
alt="sacct output for example 2" caption="" %}
So **MaxRSS**, the maximum resident set size of all tasks in the job, was 2.77 GB,
and **AveRSS** was also 2.77 GB. 4 GB was requested, so this was a pretty accurate request.
{% include links.md %}
[Fair Share]: https://slurm.schedmd.com/fair_tree.html#enduser
[htop]: https://hisham.hm/htop/
[top]: https://en.wikipedia.org/wiki/Top_(software)
[sacct]: https://slurm.schedmd.com/sacct.html
@@ -50,7 +50,7 @@ sched:
interactive: "srun"
info: "sinfo"
comment: "#SBATCH"
hist: "sacct -u $USER"
hist: "sacct"
hist_filter: ""

episode_order:
Binary file added fig/htop.png
Binary file added fig/sacct-ex1.png
Binary file added fig/sacct-ex2.png
