partr thread support for openblas #43984

ViralBShah · 2019-08-04T19:29:36Z

ViralBShah
Aug 4, 2019
Maintainer

Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.

exec_blas is called by all the routines. The code pattern followed is setting up the work queue and calling exec_blas to do all the work through an openmp pragma.
The exception is lapack routines, which also use the exec_blas_async functions.
The openmp backend doesn’t seem to implement the async and thus I believe that it will not multi-thread the lapack calls.
Windows has its own threading backend

The easiest way may be to modify the openmp threading backend, which seems amenable to something like the fftw partr backend. To start with, we should ignore lapack threading. We could probably just implement an exec_blas_async fallback that calls exec_blas (and make exec_blas_async_wait a no-op).

All of this should work on windows too, although the going through the openmp build route may need some work on the makefiles.

The patch to FFTW should be indicative of something similar to be done for the openblas build.

ChrisRackauckas · 2019-08-16T05:07:53Z

ChrisRackauckas
Aug 16, 2019
Collaborator

We now have algorithms in DifferentialEquations.jl which utilize simultaneous implicit methods to enhance the parallelizability of small stiff ODEs and DAEs (i.e. <= 20 ODEs). Right now we'll just document that the user should probably set the BLAS threads to 1, but once this PR is in this algorithm can serve as a very good test case / showcase of why PARTR mixed into BLAS is useful.

SciML/OrdinaryDiffEq.jl#872

0 replies

ViralBShah · 2019-08-16T06:12:38Z

ViralBShah
Aug 16, 2019
Maintainer Author

This is a fairly straightforward project for someone who doesn't mind diving in and seeing how it was done in FFTW. I will certainly try it out if nobody gives it a shot in a few weeks.

0 replies

stevengj · 2019-08-16T14:36:41Z

stevengj
Aug 16, 2019
Collaborator

In the long run, it would be good if partr had a documented C API for spawn/wait, which would give us a lot more flexibility in integrating it with external libraries like this.

0 replies

nalimilan · 2019-08-20T07:55:25Z

nalimilan
Aug 20, 2019
Collaborator

Do you think this something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager perspective.

0 replies

vchuravy · 2019-08-20T08:23:00Z

vchuravy
Aug 20, 2019
Maintainer

Do you think this something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager perspective.

Yes, we probably have to work with OpenBLAS upstream

0 replies

stevengj · 2019-09-05T14:34:08Z

stevengj
Sep 5, 2019
Collaborator

I'm also implementing the FFTW strategy of a pluggable threading backend for Blosc (Blosc/c-blosc2#81).

I think we can make a strong argument to upstream developers that their libraries should use this kind of strategy where possible, because it allows easy composability not only with Julia's partr, but also with Intel's TBB and other threading schedulers. It also seems possible to do this with minimal patches in cases where they have already implemented their own threading.

0 replies

stevengj · 2019-09-12T17:24:47Z

stevengj
Sep 12, 2019
Collaborator

I think it's attractive to implement this as a runtime option, in addition to existing threading options rather than instead of them, as I did for FFTW and Blosc. That is, we add a single if statement to the existing exec_blas functions:

exec_blas(num, queue) {
    if (threads_callback) {
        // pass work to the callback function
        return;
    }
    // parallelize normally
}

This has three advantages:

You can install a single library on your system, and it can be used both by programs that need a custom threading backend and programs that don't.
If OpenBLAS is using things like pthread or win32 mutex locks to make access to shared resources thread-safe, those will continue to work.
We don't need to add a new configuration option to OpenBLAS … the optional plug-in backend will be used automatically.

0 replies

stevengj · 2019-09-12T17:33:51Z

stevengj
Sep 12, 2019
Collaborator

Regarding the exec_blas_async and exec_blas_async_wait, my hope is that the LAPACK code that calls this could be refactored. My understanding is that it looks something like:

exec_blas_async(queue);
// do some other work
exec_blas_async_wait(queue);

I'm not sure why the "other work" can't simply be added to the queue of parallel tasks, and let the runtime worry about load-balancing.

0 replies

stevengj · 2019-09-12T19:26:22Z

stevengj
Sep 12, 2019
Collaborator

I posted a very early draft of the requisite changes at OpenMathLib/OpenBLAS#2255

0 replies

stevengj · 2019-09-12T20:22:05Z

stevengj
Sep 12, 2019
Collaborator

Actually, I thought of an even easier way to implement exec_blas_async: the Julia callback can just spawn the tasks and return. The parallel tasks can set pthread mutex values to indicate that they are complete, just as they do now, and exec_blas_async_wait can wait on those mutexes as it does not, without modification to it or the LAPACK source code.

0 replies

KristofferC · 2019-11-28T16:07:13Z

KristofferC
Nov 28, 2019
Maintainer Sponsor

Removing milestone since this certainly wasn't release blocking for 1.3 and neither will be for 1.4 or 1.x.

0 replies

AzamatB · 2019-11-28T22:38:53Z

AzamatB
Nov 28, 2019

I'm confused. I thought that now that we switched to a time-based release schedule with 1.x releases, nothing is release-blocking, so should then all the remaining issues be removed from 1.4 milestone as well?

0 replies

chriscoey · 2019-12-17T08:03:05Z

chriscoey
Dec 17, 2019

friendly bump on this one. new AMD processors have a ton of threads but I can't take much advantage of PARTR until it works nicely with OpenBLAS since my loops all have various LAPACK calls in them (and I also have standalone LAPACK calls outside of loops that ought to still use all threads)

0 replies

ViralBShah · 2020-04-30T23:04:20Z

ViralBShah
Apr 30, 2020
Maintainer Author

Increasingly, a lot of libraries in Yggdrasil BB are using openmp, and many of them call BLAS. I suspect that we are increasingly going to see multi-threading clashes between julia threads, pthreaded libraries (openblas), and openmp. The fewer of these we can use the better! I also learnt that if MKL enters the picture, it is yet another library - tbb.

0 replies

ViralBShah · 2020-04-30T23:04:43Z

ViralBShah
Apr 30, 2020
Maintainer Author

cc @kpamnany

0 replies

vtjnash · 2022-01-30T06:32:48Z

vtjnash
Jan 30, 2022
Maintainer Sponsor

It was described to me that this thread pool is actually only relevant for a small number of LAPACK functions, so we could probably reimplement them in a julia better and faster than trying to integrate with the existing system in BLAS. @Keno is that accurate?

0 replies

ViralBShah · 2022-01-30T16:24:46Z

ViralBShah
Jan 30, 2022
Maintainer Author

I have a multi-threaded LU factorization and linear solve here: https://github.com/ViralBShah/HPL.jl/blob/master/src/hpl_shared.jl

It is reasonable on performance, and may be the better way to do multi-threading.

0 replies

ChrisRackauckas · 2022-01-30T16:28:19Z

ChrisRackauckas
Jan 30, 2022
Collaborator

I have a multi-threaded LU factorization and linear solve here: https://github.com/ViralBShah/HPL.jl/blob/master/src/hpl_shared.jl

https://github.com/YingboMa/RecursiveFactorization.jl is multi-threaded and already outperforms BLAS, both OpenBLAS and MKL. SciML has been defaulting to this for over a year with great success. However, achieving that performance relied on using Polyester.jl and thus opting out of the composable multithreading. Adding the non-polyester threads option to that and calling it a day would be a fitting end to the story at least for the LU case.

2 replies

jpsamaroo Jan 30, 2022
Collaborator

Once #42302 is ready to go, we should be able to trivially allow users to swap out which threadpool BLAS operations use for one which is running Polyester threads, so composability wouldn't even need to be lost in the short term (assuming threadpools make it into the same release).

tkf Jan 30, 2022

I don't think supporting multiple thread pools improves the composability of throughput-oriented parallel programs. I think #42302 is a good step toward handling latency-oriented sub-programs well. But we need to centralize the pool for throughput-oriented computation for composability; not the other way around.

Adding the non-polyester threads option to that and calling it a day would be a fitting end to the story at least for the LU case.

This sounds like the most promising direction to me.

stevengj · 2024-03-25T13:59:44Z

stevengj
Mar 25, 2024
Collaborator

OpenMathLib/OpenBLAS#4577 is a new PR to allow pluggable thread backends into OpenBLAS (currently they tried TBB), which hopefully will make it easy to add partr support. Would be good for some Julia folks to take a look.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

partr thread support for openblas #43984

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 19 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

partr thread support for openblas #43984

ViralBShah Aug 4, 2019 Maintainer

Replies: 19 comments · 2 replies

ChrisRackauckas Aug 16, 2019 Collaborator

ViralBShah Aug 16, 2019 Maintainer Author

stevengj Aug 16, 2019 Collaborator

nalimilan Aug 20, 2019 Collaborator

vchuravy Aug 20, 2019 Maintainer

stevengj Sep 5, 2019 Collaborator

stevengj Sep 12, 2019 Collaborator

stevengj Sep 12, 2019 Collaborator

stevengj Sep 12, 2019 Collaborator

stevengj Sep 12, 2019 Collaborator

KristofferC Nov 28, 2019 Maintainer Sponsor

AzamatB Nov 28, 2019

chriscoey Dec 17, 2019

ViralBShah Apr 30, 2020 Maintainer Author

ViralBShah Apr 30, 2020 Maintainer Author

vtjnash Jan 30, 2022 Maintainer Sponsor

ViralBShah Jan 30, 2022 Maintainer Author

ChrisRackauckas Jan 30, 2022 Collaborator

jpsamaroo Jan 30, 2022 Collaborator

tkf Jan 30, 2022

stevengj Mar 25, 2024 Collaborator

ViralBShah
Aug 4, 2019
Maintainer

Replies: 19 comments 2 replies

ChrisRackauckas
Aug 16, 2019
Collaborator

ViralBShah
Aug 16, 2019
Maintainer Author

stevengj
Aug 16, 2019
Collaborator

nalimilan
Aug 20, 2019
Collaborator

vchuravy
Aug 20, 2019
Maintainer

stevengj
Sep 5, 2019
Collaborator

stevengj
Sep 12, 2019
Collaborator

stevengj
Sep 12, 2019
Collaborator

stevengj
Sep 12, 2019
Collaborator

stevengj
Sep 12, 2019
Collaborator

KristofferC
Nov 28, 2019
Maintainer Sponsor

AzamatB
Nov 28, 2019

chriscoey
Dec 17, 2019

ViralBShah
Apr 30, 2020
Maintainer Author

ViralBShah
Apr 30, 2020
Maintainer Author

vtjnash
Jan 30, 2022
Maintainer Sponsor

ViralBShah
Jan 30, 2022
Maintainer Author

ChrisRackauckas
Jan 30, 2022
Collaborator

jpsamaroo Jan 30, 2022
Collaborator

stevengj
Mar 25, 2024
Collaborator