
Parallelization #17

Open
Stefan-Endres opened this issue Mar 21, 2018 · 10 comments

Comments

@Stefan-Endres
Owner

Overview

There are many sub-routines in shgo that are low-hanging fruit for parallelization, most importantly the mapping of the objective function over the sampling points (in cases where the problem itself can be parallelized). In addition, we can adapt the sampling size of an iteration to the computational resources available.

We need to be careful about which dependencies and code-structure changes we introduce, for several reasons. First, we already depend on scipy.optimize.minimize and scipy.spatial.Delaunay. Second, our ambition to include shgo in scipy.optimize means it should ideally keep the same dependencies and structure. Finally, we want to minimize reliance on the maintenance of other packages, which can lead to issues such as the ones we had with multiprocessing_on_dill in tgo.

My suggestion is to use numba to avoid needing to change the code structure at all. We can do this using tricks such as the one used in poliastro:
https://github.com/poliastro/poliastro/blob/0.6.x/src/poliastro/jit.py
https://github.com/poliastro/poliastro/blob/master/setup.py#L40

which simply maps the decorator to the dependency if it is installed, and otherwise does nothing. We could obviously use the same trick for other libraries and methods by redefining range functions etc. in our code; a minimal sketch of the idea follows below.
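
A minimal sketch of this trick (the name maybe_jit is purely illustrative and not part of shgo or poliastro): if numba is importable the decorator compiles the function, otherwise it is a transparent no-op, so the surrounding code structure does not change.

try:
    from numba import njit as maybe_jit
except ImportError:
    def maybe_jit(func=None, **kwargs):
        # numba not installed: behave as a transparent decorator
        if func is None:
            return lambda f: f
        return func

@maybe_jit
def toy_objective(x):
    # simple loop that numba can compile when available
    total = 0.0
    for xi in x:
        total += xi * xi
    return total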

While it is possible that SciPy will eventually include numba as a dependency, based on discussions on the scipy-dev mailing list this will not happen in the near future:
https://mail.python.org/pipermail/scipy-dev/2018-March/022576.html

However, for now we should be able to maintain numba as an optional dependency, as described in the rest of this post. My idea is to provide two main modes of parallelization, CPU-based and GPU-based. Using numba for GPU parallelization would therefore be ideal since it avoids extra dependencies. Finally, numba gives us access to LLVM, which can also be used in the sampling generation.

GPU

The user's GPU architecture can be detected or specified, and we can then map it to our own decorator.

Nvidia

We can use @numba.cuda.jit. I propose an early test of the objective function so that the user can be warned if this fails; a rough sketch follows the links below.
https://devblogs.nvidia.com/seven-things-numba/
https://numba.pydata.org/numba-doc/latest/cuda/index.html
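
A rough sketch of such an early test (cuda_objective_or_warn is a hypothetical helper, not existing shgo code): check that a CUDA device is present and eagerly compile the objective with an explicit signature, so the user is warned immediately if GPU evaluation cannot work.

import warnings

def cuda_objective_or_warn(objective):
    """Return a compiled CUDA device function for objective, or None with a warning."""
    try:
        from numba import cuda
    except ImportError:
        warnings.warn("numba is not installed; GPU evaluation disabled")
        return None
    if not cuda.is_available():
        warnings.warn("no usable CUDA device found; GPU evaluation disabled")
        return None
    try:
        # an explicit signature forces compilation now rather than at first call
        return cuda.jit("float64(float64[:])", device=True)(objective)
    except Exception as err:
        warnings.warn(f"objective failed to compile for CUDA: {err}")
        return None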

AMD

We can use @hsa.jit(device=True)
https://numba.pydata.org/numba-doc/latest/hsa/overview.html

CPU

Our options for CPU parallelization are numba's parallel features
http://numba.pydata.org/numba-doc/dev/user/parallel.html
or multiprocessing_on_dill etc.

However, it appears that parallelization with numba isn't as simple as just adding @jit(parallel=True):
https://stackoverflow.com/questions/45610292/how-to-make-numba-jit-use-all-cpu-cores-parallelize-numba-jit

So we should also do a few tests on non-trivial functions to see whether it is worth implementing; a minimal prange sketch is below.
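
For reference, a minimal sketch of numba's explicit parallel loop over a batch of sample points (the objective here is just a placeholder; whether this pays off for shgo needs benchmarking first):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def eval_pool(points):
    # points has shape (n_samples, dim); rows are evaluated across threads
    out = np.empty(points.shape[0])
    for i in prange(points.shape[0]):
        x = points[i]
        out[i] = x[0] ** 2 + x[1] ** 2  # placeholder objective
    return out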

@microprediction

Not sure if the status of this topic has already changed much, but I have a pretty expensive function (2–10 minutes per evaluation), so I'd be interested in helping devise something.

@fcela

fcela commented Oct 26, 2020

Same here. My main interests are computational and algorithmic changes to better address situations where (1) the objective function is very expensive but heavily vectorized [i.e. the cost of evaluating a single point is very similar to the cost of evaluating a large set of points at the same time]; and (2) we have large numbers of computing nodes at our disposal that can work in parallel.

@microprediction

microprediction commented Oct 26, 2020 via email

@Stefan-Endres
Owner Author

Hi @microprediction @fcela.

In the most recent update (7e83bb8) I've added the workers argument for shgo to allow for basic parallelization.

I would greatly appreciate any feedback and/or error reports on using the argument. Currently I have only tested it with very simple Python objective functions, and I suspect there might be issues such as pickling errors with more complex functions. Since all the unit tests are passing I have also uploaded it to PyPI for a more convenient install, but I would like to test the implementation more before expanding the code and documentation for downstream repositories.

Minimum working example:

from shgo import shgo
import numpy as np
import time

# Toy problem
def f(x):
    time.sleep(0.1)
    return x[0] ** 2 + x[1] ** 2

bounds = np.array([[0, 1],]*2)

ts = time.time()
res = shgo(f, bounds, n=50, iters=2)
print(f'Total time serial: {time.time()- ts}')
print('-')
print(f'res = {res}')
ts = time.time()
res = shgo(f, bounds, n=50, iters=2, workers=8)
print('=')
print(f'Total time par: {time.time()- ts}')
print('-')
print(f'res = {res}')

CLI output:

Total time serial: 10.341249465942383
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 103
     nit: 2
   nlfev: 3
   nlhev: 0
   nljev: 1
 success: True
    tnev: 103
       x: array([0., 0.])
      xl: array([[0., 0.]])
=
Total time par: 1.9465992450714111
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 103
     nit: 2
   nlfev: 3
   nlhev: 0
   nljev: 1
 success: True
    tnev: 103
       x: array([0., 0.])
      xl: array([[0., 0.]])

Relevant code snippet (uses the multiprocessing library):

def pproc_fpool_nog(self):
    # TODO: Ensure that .f is not already computed? (it shouldn't be addable
    #       to the self.fpool if it is).
    self.wfield.func
    fpool_l = []
    for v in self.fpool:
        fpool_l.append(v.x_a)
    F = self.pool.map(self.wfield.func, fpool_l)
    for va, f in zip(fpool_l, F):
        vt = tuple(va)
        self[vt].f = f  # set vertex object attribute v.f = f
        self.nfev += 1
    # Clean the pool
    self.fpool = set()

The parallelization occurs while evaluating the objective function during the sampling stage; during the local minimisation step serial evaluations are still used. In the future I would like to add parallelization there that provides each core with a starting point plus the chosen local minimisation function, ideally using only the standard library and SciPy dependencies. A rough sketch of what I have in mind is below.
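
A rough sketch of that idea (this is not shgo's current implementation; f, run_local and the starting points are placeholders), using only multiprocessing and scipy.optimize.minimize:

from functools import partial
from multiprocessing import Pool

import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] ** 2 + x[1] ** 2

def run_local(x0, fun, bounds):
    # each worker runs one local minimisation from its own starting point
    return minimize(fun, x0, bounds=bounds, method='SLSQP')

if __name__ == '__main__':
    bounds = [(0, 1), (0, 1)]
    starts = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]  # e.g. the minimiser pool
    with Pool(2) as pool:
        results = pool.map(partial(run_local, fun=f, bounds=bounds), starts)
    best = min(results, key=lambda r: r.fun)
    print(best.x, best.fun)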

@microprediction

microprediction commented Oct 27, 2020 via email

@fcela

fcela commented Oct 28, 2020 via email

@microprediction

microprediction commented Oct 28, 2020 via email

@fcela

fcela commented Nov 2, 2020

Ray parallelization appears to work without any problem -- just replacing import multiprocessing as mp with import ray.util.multiprocessing as mp in shgo/_shgo_lib/_vertex.py, as shown below.
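
For reference, the change amounts to (in shgo/_shgo_lib/_vertex.py):

# import multiprocessing as mp          # original import
import ray.util.multiprocessing as mp   # drop-in replacement with the same Pool API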

This is what I get for the minimal example above, multiprocessing vs ray.

Multiprocessing

Total time serial: 10.44602346420288
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])
=
Total time par: 2.0377724170684814
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])

Ray

Total time serial: 10.438774108886719
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])
2020-11-02 14:55:04,351	INFO services.py:1166 -- View the Ray dashboard at http://127.0.0.1:8265
=
Total time par: 3.972384452819824
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])

Of course, for a problem this small ray's parallelization overhead makes it overkill.

@microprediction

microprediction commented Nov 9, 2020

I created this example https://github.com/microprediction/humpday/blob/main/Embarrassingly_SHGO.ipynb

Seems smooth enough, but still a toy example I suppose.

Taking a look at your code now to see what your concern might be re: pickle, but at least for my use case the objective function (or "pre-objective" as I've called it) is probably going to ssh off somewhere and shell out.

Perhaps...

  • my case is too simple, since I am content to have my "delegator processes" on a single machine, and they boss around processes elsewhere.
  • I should/could change embarrassingly to use ray
  • I'm missing something

@lcontinuum

Hello,
Allow me to post a benchmark result here, referring to

In the future I would like to add parallelization here that provides each core with a starting point plus the chosen local minimisation function, ideally only using the standard library and scipy dependencies.

Please delete my post or point me to a better discussion platform if this is not the right place.

I use shgo to find the minimum of the eggholder function, but I put some additional meaningless computation into the eggholder function to burn about 1 second of computation time per call. In a first run I set workers=1, with timing

Number of function evaluations: 1099
elapsed time: 1103.4 seconds

and in a second run I set workers=16. I'm using SciPy 1.14.1 on a 16-core Linux box.

Number of function evaluations: 1099
elapsed time: 983.0 seconds

Not a very big difference, 1103.4 vs 983.0 seconds in this case. I plotted the CPU usage for both cases, and it shows that the 16 workers are only active for a short time during the sampling phase in the first few seconds. Afterwards, the CPU usage drops to approx. 6.25%, which is 1 out of 16 cores. This confirms

The parallelization occurs while evaluating the functions during the sampling stage. In the future I would like to add parallelization here that provides each core with a starting point plus the chosen local minimisation function, ideally only using the standard library and scipy dependencies.

[Figure: CPU usage over time for the workers=16 run]

Are there plans to extend the parallel calls to the local minimizations from the different starting points? That would be really great. The computation time for my real objective function (numerical solution of a PDE with MPI-based software) is more in the range of several minutes to several tens of minutes, but I would have access to more machines/cores.

I like the fact that shgo also returns the local minima, as they have real physical meaning in my case.

My Python code

import time
import numpy as np
import scipy.optimize as opt


def eggholder(x, delay):
    time_start = time.time()
    fun = -(x[1] + 47.0) * np.sin(np.sqrt(abs(x[0] / 2.0 + (x[1] + 47.0)))) - x[0] * np.sin(np.sqrt(abs(x[0] - (x[1] + 47.0))))

    if delay > 0:
        # do some meaningless computation to burn CPU cycles
        a = np.random.rand(1000000, 1)
        sum = 0.0
        while time.time() - time_start < delay:
            sum += np.sum(np.sin(a))

    return fun


def optim(delay):
    start_time = time.time()
    bounds = [(-512, 512), (-512, 512)]

    # Only COBYLA, COBYQA, SLSQP, and trust-constr local minimize methods currently support constraint arguments.
    result = opt.shgo(
        eggholder,
        bounds,
        args=(delay,),
        iters=1,
        minimizer_kwargs={"method": "COBYQA"},
        sampling_method="sobol",
        workers=16,
    )

    print("result.x:", result.x)
    print("result.fun:", result.fun)

    print("Number of function evaluations:", result.nfev)
    print("elaspsed time:", time.time() - start_time, "seconds")


if __name__ == "__main__":
    # sar -u 1 1300 > cpu_usage.txt
    delay = 1.0  # delay in seconds
    optim(delay)

I use sar -u 1 1300 > cpu_usage.txt in a second terminal to measure the CPU usage every second.
