Merge pull request #1201 from lattice/release/1.1.x

Release/1.1.x

maddyscientist authored Oct 28, 2021
2 parents b6b9da7 + 4ad5fcf commit d2320dd

Showing 536 changed files with 74,045 additions and 53,948 deletions.
1 change: 1 addition & 0 deletions .clang-format
@@ -1,6 +1,7 @@
---
BasedOnStyle: Webkit
IndentWidth: 2
AccessModifierOffset: -2
AlignAfterOpenBracket: Align
AlignTrailingComments: true
AllowShortBlocksOnASingleLine: true
1,185 changes: 614 additions & 571 deletions CMakeLists.txt

Large diffs are not rendered by default.

39 changes: 34 additions & 5 deletions LICENSE
@@ -1,5 +1,5 @@

Copyright (c) 2009-2017, QUDA Developers
Copyright (c) 2009-2019, QUDA Developers

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -240,7 +240,36 @@ following license:

Additional Notices

QUDA utilizes Maxim Milakov's int_fastdiv library for fast run-time
integer division. This is distributed under the Apache License,
Version 2.0. See declaration at top of int_fastdiv.h for license
specifics.

QUDA uses CLI11 for command-line parsing. The CLI11.hpp file is provided
under the following license:

CLI11 1.8 Copyright (c) 2017-2019 University of Cincinnati, developed by Henry
Schreiner under NSF AWARD 1414736. All rights reserved.

Redistribution and use in source and binary forms of CLI11, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
90 changes: 90 additions & 0 deletions NEWS
@@ -1,3 +1,93 @@
Version 1.1.0 - October 2021

- Add support for NVSHMEM communication for the Dslash operators, for
significantly improved strong scaling. See
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more
details.

- Addition of the MSPCG preconditioned CG solver for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method
for more details.

- Addition of the Exact One Flavor Algorithm (EOFA) for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA)
for more details.

- Addition of a fully GPU native Implicitly Restarted Arnoldi
eigensolver (as opposed to partially relying on ARPACK). See
https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver
for more details.
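
A minimal sketch of driving this eigensolver through the usual
parameter-struct interface follows; the enum and field names are
assumptions based on quda's public headers, not taken from this entry.

    #include <quda.h>

    // Hedged sketch: request the GPU-native IR Arnoldi eigensolver.
    void arnoldi_example(void **host_evecs, double _Complex *host_evals)
    {
      QudaEigParam eig_param = newQudaEigParam();
      eig_param.eig_type = QUDA_EIG_IR_ARNOLDI; // Arnoldi instead of TRLM; no ARPACK
      // ... set the operator via eig_param.invert_param, tolerances, sizes ...
      eigensolveQuda(host_evecs, host_evals, &eig_param);
    }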

- Significantly reduced latency for reduction kernels through the use
of heterogeneous atomics. Requires CUDA 11.0+.
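
The mechanism in a standalone sketch (illustrative only, not QUDA's
actual reduction code; assumes CUDA 11.0+ and a GPU with system-scope
atomic support): a cuda::atomic flag in pinned host memory lets the
host poll for kernel completion instead of paying the latency of a
full device synchronization.

    #include <cuda/atomic>
    #include <cstdio>
    #include <new>

    using system_flag = cuda::atomic<int, cuda::thread_scope_system>;

    __global__ void reduce_and_signal(system_flag *done)
    {
      // ... the actual reduction would run here ...
      if (blockIdx.x == 0 && threadIdx.x == 0)
        done->store(1, cuda::memory_order_release); // publish completion
    }

    int main()
    {
      system_flag *done;
      cudaMallocHost(&done, sizeof(system_flag)); // pinned, device-visible via UVA
      new (done) system_flag(0);
      reduce_and_signal<<<1, 128>>>(done);
      while (done->load(cuda::memory_order_acquire) == 0) {
      } // spin on the flag rather than calling cudaDeviceSynchronize()
      printf("reduction signalled\n");
      cudaFreeHost(done);
      return 0;
    }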

- Addition of support for a split-grid multi-RHS solver. See
https://github.com/lattice/quda/wiki/Split-Grid for more details.

- Continued work on enhancing and refining the staggered multigrid
algorithm. The MILC interface can now drive the staggered multigrid
solver.

- Multigrid setup can now use tensor cores on Volta, Turing and Ampere
GPUs to accelerate the calculation. Enable with the
`QudaMultigridParam::use_mma` parameter.
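
A hedged sketch of enabling this, where only use_mma is taken from
this entry and the surrounding calls are the usual quda.h multigrid
interface:

    #include <quda.h>

    void multigrid_mma_example()
    {
      QudaInvertParam mg_inv_param = newQudaInvertParam();
      QudaMultigridParam mg_param = newQudaMultigridParam();
      mg_param.invert_param = &mg_inv_param;
      // ... set n_level, block sizes, smoother and coarse-solver options ...
      mg_param.use_mma = QUDA_BOOLEAN_TRUE; // tensor cores in the setup phase
      void *mg_preconditioner = newMultigridQuda(&mg_param);
      // ... pass mg_preconditioner to the outer solver's preconditioner field ...
      destroyMultigridQuda(mg_preconditioner);
    }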

- Improved support of managed memory through the addition of a
prefetch API. This can dramatically improve the performance of the
multigrid setup when oversubscribing the memory.
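
For illustration, a sketch of the underlying CUDA call that such a
prefetch API wraps (not QUDA's internal code):

    #include <cuda_runtime.h>

    void prefetch_example()
    {
      void *buf = nullptr;
      const size_t bytes = size_t(1) << 30; // 1 GiB, illustrative
      cudaMallocManaged(&buf, bytes);
      int device = 0;
      cudaGetDevice(&device);
      // Migrate the pages to the GPU before kernels touch them, avoiding
      // per-page fault overhead when memory is oversubscribed.
      cudaMemPrefetchAsync(buf, bytes, device, 0 /* stream */);
      cudaDeviceSynchronize();
      cudaFree(buf);
    }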

- Improved the performance of using MILC RHMC with QUDA.

- Add support for a new internal data order FLOAT8. This is the
default data order for nSpin=4 half and quarter precision fields,
though the prior FLOAT4 order can be enabled with the cmake option
QUDA_FLOAT8=OFF.

- Removal of the singularity from the reconstruct-8 and reconstruct-9
compressed gauge field ordering. This enables support for free
fields with these orderings.

- The clover parameter convention has been codified: one can either
1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw
separately, and QUDA will infer the necessary clover coefficient, or
2.) pass an explicit value of QudaInvertParam::clover_coeff
(e.g. CHROMA's use case) and that will override the above inference.
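
In code, the two conventions look roughly as follows (a sketch; field
names are as quoted above, values illustrative):

    #include <quda.h>

    void clover_convention_example()
    {
      QudaInvertParam inv_param = newQudaInvertParam();

      // Convention 1: supply kappa and csw; QUDA infers the clover coefficient.
      inv_param.kappa = 0.12;
      inv_param.csw = 1.0;

      // Convention 2: set the coefficient explicitly (e.g. CHROMA's use
      // case), overriding the inference above.
      // inv_param.clover_coeff = 0.12;
    }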

- QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH
and QUDA_FAST_COMPILE_REDUCE) which enable much faster build times
for development at the expense of reduced performance.

- Add support for compiling QUDA using clang as both the host and
device compiler.

- While the bulk of the work associated with making QUDA portable to
different architectures will form the soul of QUDA 2.0, some of the
initial refactoring associated with this has been applied.

- Significant cleanup of the tests directory to reduce boilerplate.

- General improvements to the cmake build system using modern cmake
features. We now require cmake 3.15.

- Extended the ctest list to include some optional benchmarks.

- Fix a long-standing issue seen on multi-node systems with Kepler
GPUs and dual-socket Intel CPUs.

- Improved ASAN integration: SANITIZE builds now work out of the box
with no need to set the ASAN_OPTIONS environment variable.

- Add support for the extended QIO branch (now required for MILC).

- Bump QMP version to 2.5.3.

- Updated to Eigen 3.3.9.

- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/24?closed=1

Version 1.0.0 - 10 January 2020

- Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
102 changes: 62 additions & 40 deletions README.md
@@ -1,4 +1,4 @@
# QUDA 1.0.0
# QUDA 1.1.0

## Overview

@@ -29,21 +29,27 @@ QUDA includes an implementation of adaptive multigrid for the Wilson,
clover-improved, twisted-mass and twisted-clover fermion actions. We
note however that this is undergoing continued evolution and
improvement, and we highly recommend that users of adaptive multigrid use the
latest develop branch. More details can be found at
https://github.com/lattice/quda/wiki/Multigrid-Solver.
latest develop branch. More details can be found
[here](https://github.com/lattice/quda/wiki/Multigrid-Solver).

Support for eigen-vector deflation solvers is also included through
the Thick Restarted Lanczos Method (TRLM). For more details we refer
the user to the wiki
(https://github.com/lattice/quda/wiki/Deflated-Solvers).
the Thick Restarted Lanczos Method (TRLM), and we offer an Implicitly
Restarted Arnoldi method for computing the spectra of non-Hermitian
operators. For more details we refer the user to the wiki:
[QUDA's eigensolvers](https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers)
[Deflating coarse grid solves in Multigrid](https://github.com/lattice/quda/wiki/Multigrid-Solver#multigrid-inverter--lanczos)

## Software Compatibility:

The library has been tested under Linux (CentOS 7 and Ubuntu 18.04)
using releases 7.5 through 10.2 of the CUDA toolkit. Earlier versions
using releases 10.1 through 11.4 of the CUDA toolkit. Earlier versions
of the CUDA toolkit will not work, and we highly recommend the use of
10.x. QUDA has been tested in conjunction with x86-64, IBM
POWER8/POWER9 and ARM CPUs. CMake 3.11 or greater to required to build QUDA.
11.x. QUDA has been tested in conjunction with x86-64, IBM
POWER8/POWER9 and ARM CPUs. Both GCC and Clang host compilers are
supported, with the minimum recommended versions being 7.x and 6, respectively.
CMake 3.15 or greater is required to build QUDA.

See also Known Issues below.

@@ -59,25 +65,25 @@ capability" of your card, either from NVIDIA's documentation or by
running the deviceQuery example in the CUDA SDK, and pass the
appropriate value to the `QUDA_GPU_ARCH` variable in cmake.

QUDA 1.0.0, supports devices of compute capability 3.0 or greater.
While QUDA is no longer supported on the older Fermi architecture, it
may continue to work (assuming the user disables the use of textures
(QUDA_TEX=OFF).
QUDA 1.1.0 supports devices of compute capability 3.0 or greater.
QUDA is no longer supported on the older Tesla (1.x) and Fermi (2.x)
architectures.

See also "Known Issues" below.


## Installation:

The recommended method for compiling QUDA is to use cmake, and build
QUDA in a separate directory from the source directory. For
instructions on how to build QUDA using cmake see this page
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake. Note that
this requires cmake version 3.11 or later. You can obtain cmake from
https://cmake.org/download/. On Linux the binary tar.gz archives unpack
into a cmake directory and usually run fine from that directory.
It is recommended to build QUDA in a separate directory from the
source directory. For instructions on how to build QUDA using cmake
see this page
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake. Note
that this requires cmake version 3.15 or later. You can obtain cmake
from https://cmake.org/download/. On Linux the binary tar.gz archives
unpack into a cmake directory and usually run fine from that
directory.

The basic steps for building cmake are:
The basic steps for building with cmake are:

1. Create a build dir, outside of the quda source directory.
2. In your build-dir run `cmake <path-to-quda-src>`
@@ -94,16 +100,26 @@ or specify e.g. -DQUDA_GPU_ARCH=sm_60 for a Pascal GPU in step 2.

### Multi-GPU support

QUDA supports using multiple GPUs through MPI and QMP.
To enable multi-GPU support either set `QUDA_MPI` or `QUDA_QMP` to ON when configuring QUDA through cmake.
QUDA supports using multiple GPUs through MPI and QMP, together with
the optional use of NVSHMEM GPU-initiated communication for improved
strong scaling of the Dirac operators. To enable multi-GPU support
either set `QUDA_MPI` or `QUDA_QMP` to ON when configuring QUDA
through cmake.

Note that in any case cmake will automatically try to detect your MPI installation. If you need to specify a particular MPI please set `MPI_C_COMPILER` and `MPI_CXX_COMPILER` in cmake.
See also https://cmake.org/cmake/help/v3.9/module/FindMPI.html for more help.
Note that in any case cmake will automatically try to detect your MPI
installation. If you need to specify a particular MPI please set
`MPI_C_COMPILER` and `MPI_CXX_COMPILER` in cmake. See also
https://cmake.org/cmake/help/v3.9/module/FindMPI.html for more help.

For QMP please set `QUDA_QMP_HOME` to the installation directory of QMP.

For more details see https://github.com/lattice/quda/wiki/Multi-GPU-Support

To enable NVSHMEM support set `QUDA_NVSHMEM` to ON, and set the
location of the local NVSHMEM installation with `QUDA_NVSHMEM_HOME`.
For more details see
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM

### External dependencies

The eigen-vector solvers (eigCG and incremental eigCG) by default will
@@ -113,7 +129,7 @@ details). MAGMA is available from
http://icl.cs.utk.edu/magma/index.html. MAGMA is enabled using the
cmake option `QUDA_MAGMA=ON`.

Version 1.0.0 of QUDA includes interface for the external (P)ARPACK
Version 1.1.0 of QUDA includes an interface for the external (P)ARPACK
library for eigenvector computing. (P)ARPACK is available, e.g., from
https://github.com/opencollab/arpack-ng. (P)ARPACK is enabled using
CMake option `QUDA_ARPACK=ON`. Note that with a multi-GPU option, the
@@ -168,7 +184,7 @@ communication and exterior update).
## Using the Library:

Include the header file include/quda.h in your application, link against
lib/libquda.a, and study tests/invert_test.cpp (for Wilson, clover,
lib/libquda.so, and study tests/invert_test.cpp (for Wilson, clover,
twisted-mass, or domain wall fermions) or
tests/staggered_invert_test.cpp (for asqtad/HISQ fermions) for examples
of the solver interface. The various solver options are enumerated in
@@ -188,7 +204,7 @@ used on all GPUs and binary reproducibility.
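
A minimal sketch of this flow (parameter setup, host field layout, and
error handling are elided; the buffer arguments are assumed to be
allocated and filled by the caller):

```cpp
#include <quda.h>

void solve_example(void *gauge, void *spinor_out, void *spinor_in)
{
  initQuda(0); // initialize QUDA on device 0

  QudaGaugeParam gauge_param = newQudaGaugeParam();
  QudaInvertParam inv_param = newQudaInvertParam();
  // ... set lattice dimensions, precisions, dslash and solver types ...

  loadGaugeQuda(gauge, &gauge_param);            // upload the gauge field
  invertQuda(spinor_out, spinor_in, &inv_param); // solve M x = b

  endQuda();
}
```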

## Getting Help:

Please visit http://lattice.github.com/quda for contact information. Bug
Please visit http://lattice.github.io/quda for contact information. Bug
reports are especially welcome.


@@ -209,7 +225,7 @@ Performance Computing, Networking, Storage and Analysis (SC), 2011

When taking advantage of adaptive multigrid, please also cite:

M. A. Clark, A. Strelchenko, M. Cheng, A. Gambhir, and R. Brower,
M. A. Clark, B. Joo, A. Strelchenko, M. Cheng, A. Gambhir, and R. Brower,
"Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained
Parallelization," International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), 2016
@@ -220,10 +236,14 @@ When taking advantage of block CG, please also cite:
M. A. Clark, A. Strelchenko, A. Vaquero, M. Wagner, and E. Weinberg,
"Pushing Memory Bandwidth Limitations Through Efficient
Implementations of Block-Krylov Space Solvers on GPUs,"
To be published in Comput. Phys. Commun. (2018) [arXiv:1710.09745 [hep-lat]].
Comput. Phys. Commun. 233 (2018), 29-40 [arXiv:1710.09745 [hep-lat]].

When taking advantage of the Möbius MSPCG solver, please also cite:

Several other papers that might be of interest are listed at
http://lattice.github.com/quda .
Jiqun Tu, M. A. Clark, Chulwoo Jung, Robert Mawhinney, "Solving DWF
Dirac Equation Using Multi-splitting Preconditioned Conjugate Gradient
with Tensor Cores on NVIDIA GPUs," published in the Platform for
Advanced Scientific Computing (PASC21) [arXiv:2104.05615 [hep-lat]].


## Authors:
@@ -237,27 +257,29 @@ http://lattice.github.com/quda .
* Kate Clark (NVIDIA)
* Michael Cheng (Boston University)
* Carleton DeTar (Utah University)
* Justin Foley (NIH)
* Joel Giedt (Rensselaer Polytechnic Institute)
* Justin Foley (NIH)
* Arjun Gambhir (William and Mary)
* Joel Giedt (Rensselaer Polytechnic Institute)
* Steven Gottlieb (Indiana University)
* Kyriakos Hadjiyiannakou (Cyprus)
* Dean Howarth (Boston University)
* Balint Joo (Jefferson Laboratory)
* Dean Howarth (Lawrence Livermore Lab, Lawrence Berkeley Lab)
* Balint Joo (OLCF, Oak Ridge National Laboratory, formerly Jefferson Lab)
* Hyung-Jin Kim (Samsung Advanced Institute of Technology)
* Bartek Kostrzewa (Bonn)
* Eloy Romero (William and Mary)
* James Osborn (Argonne National Laboratory)
* Claudio Rebbi (Boston University)
* Guochun Shi (NCSA)
* Eloy Romero (William and Mary)
* Hauke Sandmeyer (Bielefeld)
* Mario Schröck (INFN)
* Guochun Shi (NCSA)
* Alexei Strelchenko (Fermi National Accelerator Laboratory)
* Jiqun Tu (Columbia)
* Jiqun Tu (NVIDIA)
* Alejandro Vaquero (Utah University)
* Mathias Wagner (NVIDIA)
* Andre Walker-Loud (Lawrence Berkeley Laboratory)
* Evan Weinberg (NVIDIA)
* Frank Winter (Jlab)
* Frank Winter (Jefferson Lab)
* Yi-Bo Yang (Chinese Academy of Sciences)


Portions of this software were developed at the Innovative Systems Lab,
1 change: 1 addition & 0 deletions cmake/FindCUDALibs.cmake
@@ -183,4 +183,5 @@ if(NOT CUDA_VERSION VERSION_LESS "7.0")
endif()

find_cuda_helper_libs(cuda)
set(CUDA_cuda_driver_LIBRARY ${CUDA_cuda_LIBRARY})
find_cuda_helper_libs(nvToolsExt)