Cobalt adaptor - 'total_cpu_count' job description parameter may be irrelevant/unneeded #620

Open
itomaldonado opened this issue Feb 18, 2017 · 2 comments

Comments

itomaldonado commented Feb 18, 2017

Creating an issue more for discussion than anything else, since I'd like to get some input from you @andre-merzky ~

Currently the Cobalt adaptor has these three job description parameters to control the number of nodes, CPUs, and MPI ranks:

# Example: 16 ranks per node, 8192 ranks in total, equating to 8192 CPUs (1 per rank)
jd.processes_per_host   = 16
jd.number_of_processes  = 8192
jd.total_cpu_count      = 8192 # 16 cores per node == 512 nodes

Here, we want to run a task on 512 nodes at 16 ranks per node and 1 rank per core. Cobalt (specifically Mira) has a static cores-per-node count of 16.

The original idea was to use the jd.total_cpu_count parameter because you can have multiple ranks per core (up to 4 ranks per core, which means up to 64 ranks per node), so it would make node calculations easy. After some thought, though, I believe we can rely on jd.processes_per_host and jd.number_of_processes alone, since the number of nodes can be derived from these two parameters (with some assumptions~).

Here is what I mean:

Example 1: Single rank per node

  • Ranks-per-node: 1
  • Total Number of Ranks: 512

Here, since we have only one rank per node and a total of 512 ranks, we will use 512 nodes. Since the number of cores per node is static on Mira (16 cores per node, as per the documentation), we can assume we need a total of 8192 cores.

We get that:
cores = 16 * (jd.number_of_processes / jd.processes_per_host)

Example 2: Single rank per core

  • Ranks-per-node: 16
  • Total Number of Ranks: 8192

Similar to the above, with the slight difference that each rank now uses one core (instead of one rank using all cores in a node); the calculation uses the same formula.

cores = 16 * ( 8192 / 16 ) = 8192

Example 3: Multiple ranks per core

  • Ranks-per-node: 64
  • Total Number of Ranks: 32,768

This example is also similar, but now we have 4 ranks per core (this is the upper limit in Mira). The calculations use the same formula.

cores = 16 * ( 32,768 / 64 ) = 8192
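The three fully packed cases above can be checked with a small sketch. The function name `cores_exact` and the constant `CORES_PER_NODE` are made up for illustration and are not part of the adaptor; the sketch assumes the ranks divide evenly into nodes, which holds for Examples 1 to 3.

```python
CORES_PER_NODE = 16  # static on Mira, as stated in this issue


def cores_exact(number_of_processes, processes_per_host):
    """Core count when the ranks fill every node exactly (no remainder).

    Hypothetical helper for illustration only.
    """
    if number_of_processes % processes_per_host != 0:
        raise ValueError("ranks do not fill the nodes evenly")
    return CORES_PER_NODE * (number_of_processes // processes_per_host)


# (total ranks, ranks per node) for Examples 1, 2 and 3
examples = [(512, 1), (8192, 16), (32768, 64)]
results = [cores_exact(total, per_node) for total, per_node in examples]
print(results)
```

All three cases map to the same 512-node allocation, which is why the core count is identical even though the ranks per core differ.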

Example 4: Multiple ranks per core, node is not fully utilized

  • Ranks-per-node: 64
  • Total Number of Ranks: 32,705

This example is different: the total number of ranks does not fill every node completely, which breaks our previous formula when the division is truncated (as integer division does in Python 2.7).

cores = 16 * (32,705 / 64) = 16 * 511 = 8176 # should be 16 * 512 = 8192 !!!

This is where the documentation helps again: on Mira you cannot request partial nodes, so in cases where only part of a node is needed, we can use the following generalized formula.

cores = 16 * ( ceiling[32,705 / 64] ) = 16 * 512 = 8192 # Correct!

or specifically for Python 2.7~

import math
cores = int( 16 * math.ceil( float( 32705 ) / float( 64 ) ) ) # = 8192 (512 nodes)
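As an alternative sketch, the ceiling can be computed with integer arithmetic only, which avoids the float conversion and behaves the same in Python 2.7 and Python 3. The helper name `ceil_div` is an assumption for illustration, not an existing adaptor function.

```python
def ceil_div(numerator, denominator):
    """Ceiling of numerator / denominator using integer arithmetic only.

    Floor division rounds toward negative infinity, so negating the
    numerator, flooring, and negating again rounds up instead of down.
    """
    return -(-numerator // denominator)


CORES_PER_NODE = 16  # static on Mira, as stated in this issue

nodes = ceil_div(32705, 64)     # partially filled last node still counts
cores = CORES_PER_NODE * nodes
print(nodes, cores)
```

Staying in integers also sidesteps any float rounding concerns for very large rank counts.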

Example 5: For completeness' sake: multiple ranks per node, node is not fully utilized

  • Ranks-per-node: 2
  • Total Number of Ranks: 1,023
    cores = 16 * ( ceiling[1,023 / 2] ) = 16 * 512 = 8192 # Correct!

  • Ranks-per-node: 4
  • Total Number of Ranks: 2,046
    cores = 16 * ( ceiling[2,046 / 4] ) = 16 * 512 = 8192 # Correct!

  • Ranks-per-node: 8
  • Total Number of Ranks: 4,091
    cores = 16 * ( ceiling[4,091 / 8] ) = 16 * 512 = 8192 # Correct!

  • Ranks-per-node: 16
  • Total Number of Ranks: 8,177
    cores = 16 * ( ceiling[8,177 / 16] ) = 16 * 512 = 8192 # Correct!

  • Ranks-per-node: 32
  • Total Number of Ranks: 16,353
    cores = 16 * ( ceiling[16,353 / 32] ) = 16 * 512 = 8192 # Correct!
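The cases in Example 5 can be tabulated in one loop to compare truncating division with the ceiling. This is only a verification sketch; the variable names are illustrative.

```python
import math

CORES_PER_NODE = 16  # static on Mira, as stated in this issue

# (ranks per node, total ranks) taken from the list above
cases = [(2, 1023), (4, 2046), (8, 4091), (16, 8177), (32, 16353)]

rows = []
for ranks_per_node, total_ranks in cases:
    floor_nodes = total_ranks // ranks_per_node  # truncating division
    ceil_nodes = int(math.ceil(float(total_ranks) / ranks_per_node))
    rows.append((ranks_per_node, total_ranks, floor_nodes, ceil_nodes,
                 CORES_PER_NODE * ceil_nodes))

for row in rows:
    print("rpn=%d ranks=%d floor_nodes=%d ceil_nodes=%d cores=%d" % row)
```

In every case the truncating division undercounts by one node, while the ceiling gives the full allocation.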

In short

We can drop the parameter jd.total_cpu_count and calculate it with the other two parameters since the following are reliable assumptions for Mira Blue Gene/Q:

  • Cores per node: 16
  • Max ranks per node: 64
    • Acceptable values for ranks per node: 1, 2, 4, 8, 16, 32, 64
  • Nodes cannot be partially scheduled

We can use the following general formula to calculate the number of cores (the ceiling term is the number of nodes):
cores = 16 * ( ceiling[ total-number-of-ranks / ranks-per-node ] )
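A minimal sketch of how the assumptions and the formula could be combined follows. The function `mira_resources` and both constants are hypothetical names introduced here; this is not the Cobalt adaptor's actual code, only one way the derivation might look.

```python
CORES_PER_NODE = 16                             # assumption listed above
VALID_RANKS_PER_NODE = (1, 2, 4, 8, 16, 32, 64)  # assumption listed above


def mira_resources(number_of_processes, processes_per_host):
    """Return (nodes, cores) derived from the two remaining parameters.

    Hypothetical helper: rejects rank layouts outside the listed values
    and rounds up to whole nodes, since nodes cannot be partially scheduled.
    """
    if processes_per_host not in VALID_RANKS_PER_NODE:
        raise ValueError("unsupported ranks per node: %r" % (processes_per_host,))
    if number_of_processes < 1:
        raise ValueError("need at least one rank")
    nodes = -(-number_of_processes // processes_per_host)  # integer ceiling
    return nodes, CORES_PER_NODE * nodes


print(mira_resources(32705, 64))
```

Validating the ranks-per-node value up front keeps an invalid layout from silently producing a plausible-looking node count.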

Finally, we can also provide defaults for the other two parameters (as Mira does).

# Defaults
jd.processes_per_host   = 1
jd.number_of_processes  = ( jd.processes_per_host )
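The defaulting rule above could be sketched as follows, assuming an unset parameter is represented as `None`. The function `apply_defaults` is a hypothetical name for illustration; how the adaptor actually detects unset attributes is not shown in this issue.

```python
def apply_defaults(processes_per_host=None, number_of_processes=None):
    """Fill in the proposed defaults for unset parameters.

    Hypothetical helper: one rank per node by default, and by default
    just enough ranks to fill a single node at that layout.
    """
    if processes_per_host is None:
        processes_per_host = 1
    if number_of_processes is None:
        number_of_processes = processes_per_host
    return processes_per_host, number_of_processes


print(apply_defaults())
print(apply_defaults(processes_per_host=16))
```

With these defaults, a job description that sets neither parameter resolves to a single rank on a single node.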

Source: Mira Documentation


iparask commented Mar 21, 2019

This is also an interesting issue and remains open


iparask commented Mar 26, 2019

Check all adaptors and make sure they take the ceiling when dividing the total number of CPUs by the CPUs per node.

andre-merzky removed their assignment Apr 3, 2019