Cobalt adaptor - 'total_cpu_count' job description parameter may be irrelevant/unneeded #620

itomaldonado · 2017-02-18T16:17:56Z

Creating an issue more for discussion than anything else, since I'd like to get some input from you @andre-merzky ~

Currently the Cobalt adaptor has these three job description parameters to control the number of nodes, CPUs, and MPI ranks:

# Example: 16 ranks per node, 8192 ranks in total, which equates to 8192 CPUs (1 per rank)
jd.processes_per_host   = 16
jd.number_of_processes  = 8192
jd.total_cpu_count      = 8192 # 16 cores per node == 512 nodes

Here, we want to run a task on 512 nodes with 16 ranks per node and 1 rank per core. Cobalt (specifically Mira) has a static cores-per-node count of 16.

The original idea was to use the jd.total_cpu_count parameter because you can have multiple ranks per core (up to 4 ranks per core, which means up to 64 ranks per node), so it would make node calculations easy. After some thought, though, I believe we can rely on jd.processes_per_host and jd.number_of_processes alone, since the number of nodes can be derived from these two parameters (with some assumptions).

Here is what I mean:

Example 1: Single rank per node

Ranks-per-node: 1
Total Number of Ranks: 512

Here we have one rank per node and a total of 512 ranks, so we will use 512 nodes. Since the number of cores per node is static in Mira (16 cores per node, as per the documentation), we can assume we need a total of 8192 cores.

We get that:
cores = 16 * (jd.number_of_processes / jd.processes_per_host)

Example 2: Single rank per core

Ranks-per-node: 16
Total Number of Ranks: 8192

As above, with the slight difference that each rank now uses one core (instead of one rank using all the cores in a node). The calculation uses the same formula.

cores = 16 * ( 8192 / 16 ) = 8192

Example 3: Multiple ranks per core

Ranks-per-node: 64
Total Number of Ranks: 32,768

This example is also similar, but now we have 4 ranks per core (this is the upper limit in Mira). The calculations use the same formula.

cores = 16 * ( 32,768 / 64 ) = 8192
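
As a quick check, Examples 1 to 3 can be reproduced with a few lines of Python. This is only a sketch: `CORES_PER_NODE` and `cores_needed` are hypothetical names, not part of the adaptor, and the plain integer division assumes the rank count is an exact multiple of ranks-per-node.

```python
CORES_PER_NODE = 16  # static on Mira, per the documentation cited in this issue


def cores_needed(number_of_processes, processes_per_host):
    """Cores for a job whose ranks fill every node exactly (hypothetical helper)."""
    nodes = number_of_processes // processes_per_host
    return CORES_PER_NODE * nodes


# (ranks-per-node, total ranks) for Examples 1, 2 and 3
for ppn, total in [(1, 512), (16, 8192), (64, 32768)]:
    print(ppn, total, cores_needed(total, ppn))
```

All three cases come out at 8192 cores, i.e. 512 nodes.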

Example 4: Multiple ranks per core, node is not fully utilized

Ranks-per-node: 64
Total Number of Ranks: 32,705

This example is different: the total number of ranks does not fill the last node, which breaks our previous formula when the division is truncated to an integer.

cores = 16 * ( 32,705 / 64 ) = 16 * 511 = 8176 # should be 16 * 512 = 8192 !!!

This is where the documentation helps again: in Mira you cannot request partial nodes, so when only part of a node is needed we round up to a whole node with the following generalized formula.

nodes = ceiling[ 32,705 / 64 ] = 512
cores = 16 * nodes = 8192 # Correct!

or specifically for Python 2.7~

import math
# float() avoids Python 2 integer truncation before the ceiling is taken
nodes = int( math.ceil( float( 32705 ) / float( 64 ) ) ) # = 512
cores = 16 * nodes # = 8192
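
An integer-only variant avoids the float conversions and behaves the same on Python 2.7 and Python 3. A minimal sketch, assuming hypothetical helper names (`ceil_div`, `nodes_and_cores`) that do not exist in the adaptor:

```python
CORES_PER_NODE = 16  # assumption: Mira's static cores-per-node count


def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def nodes_and_cores(number_of_processes, processes_per_host):
    """Return (nodes, cores), rounding partial nodes up to whole nodes."""
    nodes = ceil_div(number_of_processes, processes_per_host)
    return nodes, CORES_PER_NODE * nodes


print(nodes_and_cores(32705, 64))  # (512, 8192)
```

Negating before and after the floor division keeps everything in integer arithmetic, so very large rank counts are not exposed to float rounding.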

Example 5: for completeness' sake. Multiple ranks per node, node is not fully utilized

Ranks-per-node: 2
Total Number of Ranks: 1,023
cores = 16 * ( ceiling[1,023 / 2] ) = 16 * 512 = 8192 # Correct!

Ranks-per-node: 4
Total Number of Ranks: 2,046
cores = 16 * ( ceiling[2,046 / 4] ) = 16 * 512 = 8192 # Correct!

Ranks-per-node: 8
Total Number of Ranks: 4,091
cores = 16 * ( ceiling[4,091 / 8] ) = 16 * 512 = 8192 # Correct!

Ranks-per-node: 16
Total Number of Ranks: 8,177
cores = 16 * ( ceiling[8,177 / 16] ) = 16 * 512 = 8192 # Correct!

Ranks-per-node: 32
Total Number of Ranks: 16,353
cores = 16 * ( ceiling[16,353 / 32] ) = 16 * 512 = 8192 # Correct!

In short

We can drop the parameter jd.total_cpu_count and calculate it with the other two parameters since the following are reliable assumptions for Mira Blue Gene/Q:

Cores per node: 16
Max ranks per node: 64
- Acceptable values for ranks per node: 1, 2, 4, 8, 16, 32, 64
Nodes cannot be partially scheduled
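
These assumptions could be checked before any node count is derived. A sketch only: the constant and function names below are hypothetical, and the values are simply the ones listed above.

```python
CORES_PER_NODE = 16
VALID_RANKS_PER_NODE = (1, 2, 4, 8, 16, 32, 64)  # max is 64 (4 ranks per core)


def check_ranks_per_node(processes_per_host):
    """Return the value unchanged if it is one of the listed ranks-per-node values."""
    if processes_per_host not in VALID_RANKS_PER_NODE:
        raise ValueError(
            "processes_per_host must be one of %s, got %r"
            % (VALID_RANKS_PER_NODE, processes_per_host)
        )
    return processes_per_host
```

Rejecting unsupported values early means the rounding step only ever sees a divisor from the listed set.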

We can use the following general formula to calculate the number of nodes, and from it the number of cores:
nodes = ceiling[ total-number-of-ranks / ranks-per-node ]
cores = 16 * nodes

Finally, we can also provide defaults for the other two parameters (as Mira does).

# Defaults
jd.processes_per_host   = 1
jd.number_of_processes  = ( jd.processes_per_host )
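
Put together, the defaults and the rounding rule might look like the sketch below. `resolve_job_size` is a hypothetical function, not the adaptor's API; `None` stands in for a parameter that was left unset in the job description.

```python
CORES_PER_NODE = 16  # assumption: Mira's static cores-per-node count


def resolve_job_size(processes_per_host=None, number_of_processes=None):
    """Apply the proposed defaults, then derive nodes and cores (hypothetical)."""
    if processes_per_host is None:
        processes_per_host = 1
    if number_of_processes is None:
        number_of_processes = processes_per_host
    # Integer ceiling division: partial nodes are rounded up to whole nodes.
    nodes = -(-number_of_processes // processes_per_host)
    return {
        "processes_per_host": processes_per_host,
        "number_of_processes": number_of_processes,
        "nodes": nodes,
        "cores": CORES_PER_NODE * nodes,
    }


print(resolve_job_size())            # one rank on one node
print(resolve_job_size(64, 32705))   # Example 4
```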

Source: Mira Documentation

iparask · 2019-03-21T15:36:50Z

This is also an interesting issue and remains open

iparask · 2019-03-26T22:08:08Z

Check all adaptors and make sure they take the ceiling when dividing the total number of CPUs by the CPUs per node.
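
For illustration, the difference between truncating and rounding up can be sketched as follows. `nodes_for` is a hypothetical helper, not code taken from any adaptor.

```python
def nodes_for(total_cpu_count, cpus_per_node):
    """Whole nodes needed for total_cpu_count CPUs, rounding up (hypothetical)."""
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    return -(-total_cpu_count // cpus_per_node)


# 17 CPUs on 16-CPU nodes: plain integer division under-allocates.
print(17 // 16)           # 1
print(nodes_for(17, 16))  # 2
```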

andre-merzky assigned andre-merzky and itomaldonado Feb 18, 2017

andre-merzky added the question label Feb 18, 2017

andre-merzky added this to the Backburner milestone Feb 18, 2017

andre-merzky added the comp:cobalt label Feb 18, 2017

andre-merzky added type:question comp:cobalt and removed question labels Mar 4, 2018

andre-merzky removed their assignment Apr 3, 2019
