Conduct a model run review and brainstorm LiveOcean performance improvements #82

Michael-Lalime opened this issue Apr 19, 2024 · 0 comments

Performance was slower than expected during the 1-year run. In a 2-day test run, each model day took about 38 minutes to complete, but during the longer run this slowed to about 60 minutes per model day. I was running on 4 hpc6a.48xlarge instances.
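For reference, a minimal sketch of how the per-model-day slowdown could be quantified from a run log is below; the log path and the "<model_day> finished at <timestamp>" line format are assumptions for illustration, not the actual LiveOcean driver output.

```python
# Hedged sketch: estimate wall-clock minutes per model day from consecutive
# "finished" timestamps in a run log, to spot the ~38 -> 60 min/day drift.
# The log file name and line format are placeholders, not LiveOcean's real log.
from datetime import datetime

def minutes_per_day(log_path):
    prev_t = None
    with open(log_path) as f:
        for line in f:
            day, sep, stamp = line.strip().partition(" finished at ")
            if not sep:
                continue  # skip lines that are not completion records
            t = datetime.fromisoformat(stamp)
            if prev_t is not None:
                yield day, (t - prev_t).total_seconds() / 60.0
            prev_t = t

if __name__ == "__main__":
    for day, minutes in minutes_per_day("forecast.log"):
        flag = "  <-- slow" if minutes > 50 else ""
        print(f"{day}: {minutes:5.1f} min{flag}")
```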

We met with AWS during the technical meeting on Friday, 4/19/2024, and they had several ideas. Here are the meeting notes:

===============================================================================
April 19, 2024

AWS folks joined us (Rayette Abdulah-Toles, Aaron Bucher, John Kolman, Austin Park, Dev Jodhrun, Matt Dowling)
Discussing the LiveOcean slowdown from 40 min/model day to 60 min/model day.
Rayette suggests trying an HPC7a instance.
There were some questions about moving from the VM instance type to a .metal instance type - Aaron B. says it's not worth it.
Aaron asked about I/O, but our impression is that all the data was loaded at the start. Data is on the EFS shared disk. I/O should be stable day to day.
Do we need a second adaptor for each instance - nope, that won't help.
AWS reviewed machines and discussed the best configurations (naming convention: ...i is Intel, ...a is AMD):
7a does allow a second adaptor - test the 96xlarge and 48xlarge types
c7i, r7iz (high-frequency processor) - test the highest in each family
r7iz will have the highest clock speed
Aaron will look at our region (us-east-2b) and see what better instances might be available.
Michael screen-shared the benchmark testing spreadsheet and explained what we did.
Zach and Aaron discussed the different configurations and data transfer latencies:
EBS vs. EFS burst credits
Aaron was not sure why we are seeing an abrupt slowdown when processing the 1-year run (a quick CloudWatch check on burst credits is sketched below):
It doesn't make sense that it would suddenly take that much longer (see above - I/O should be stable day to day)
Michael reviewed the run - it would jump from about 44 to 50+ minutes per model day
Memory-to-bandwidth scaling?
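One way to test the burst-credit theory is to pull the EFS BurstCreditBalance metric from CloudWatch and line it up against the days when the run slowed. This is a hedged sketch assuming boto3; the file system ID is a placeholder, and the region is taken from the us-east-2 note above.

```python
# Hedged sketch: check whether EFS burst credits drained around the time the
# run slowed. Assumes boto3 credentials are configured; the file system ID
# below is a placeholder, not the project's actual EFS ID.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-2")
resp = cw.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,                  # hourly samples over the last two weeks
    Statistics=["Minimum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Minimum"]:.3e} bytes')
```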

Status of LiveOcean run
Try some new benchmarks and do a follow-up with AWS to review (see To Do below)
AWS suggestions (see details above):
Parallel processing - run a bunch of jobs and see what works fastest.
Try HPC7a, r7iz, c7i (see details above)
Change up the storage (EFS may not be optimal in all cases; Lustre could be an option):
Run 6a but change the storage type to Lustre
Focus on testing just the largest instance type in each family (a job-submission sketch follows these notes)
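To make the instance-family comparison concrete, one possible setup is to submit the same 2-day benchmark to one queue per candidate family. This assumes a Slurm-based cluster (e.g. AWS ParallelCluster); all queue names and the driver script name below are placeholders, not the project's actual configuration.

```python
# Hedged sketch: submit the same 2-day LiveOcean benchmark to each candidate
# queue so the largest instance type in each family (plus a Lustre-backed
# hpc6a queue) can be compared directly. Queue and script names are placeholders.
import subprocess

QUEUES = ["hpc7a-96xl", "hpc7a-48xl", "c7i-48xl", "r7iz-32xl", "hpc6a-lustre"]
BENCHMARK = "run_liveocean_2day.sh"   # placeholder benchmark driver script

for queue in QUEUES:
    cmd = [
        "sbatch",
        "--partition", queue,
        "--job-name", f"lo_bench_{queue}",
        "--output", f"bench_{queue}_%j.out",
        BENCHMARK,
    ]
    print("submitting:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```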
Next steps
12 years in one year chunks via Parker’s description
NODD application
Status of CORA
has the test data arrived yet?
plan for east coast test run
documentation
Status of the CO-OPS/OCS space
Possible OCS ?ADCIRC? run - convo with Saeed and Co.
Data sources
Code

To Do:
Follow up with the AWS group in 1 week (Tiffany added them to the weekly working session, so they can join whenever!)

===============================================================================

@KatherinePowell-NOAA KatherinePowell-NOAA changed the title LiveOcean performance Conduct a model run review and brainstorm LiveOcean performance improvements Aug 16, 2024