Merge pull request #1201 from lattice/release/1.1.x

Release/1.1.x

maddyscientist authored Oct 28, 2021
2 parents b6b9da7 + 4ad5fcf commit d2320dd

Showing 536 changed files with 74,045 additions and 53,948 deletions.
1 change: 1 addition & 0 deletions .clang-format
@@ -1,6 +1,7 @@
---
BasedOnStyle: Webkit
IndentWidth: 2
AccessModifierOffset: -2
AlignAfterOpenBracket: Align
AlignTrailingComments: true
AllowShortBlocksOnASingleLine: true
1,185 changes: 614 additions & 571 deletions CMakeLists.txt

Large diffs are not rendered by default.

39 changes: 34 additions & 5 deletions LICENSE
@@ -1,5 +1,5 @@

Copyright (c) 2009-2017, QUDA Developers
Copyright (c) 2009-2019, QUDA Developers

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -240,7 +240,36 @@ following license:

Additional Notices

QUDA utilizes Maxim Milakov's int_fastdiv library for fast run-time
integer division. This is distributed under the Apache License,
Version 2.0. See declaration at top of int_fastdiv.h for license
specifics.

QUDA uses CLI11 for command-line parsing. The CLI11.hpp file is provided
under the following license:

CLI11 1.8 Copyright (c) 2017-2019 University of Cincinnati, developed by Henry
Schreiner under NSF AWARD 1414736. All rights reserved.

Redistribution and use in source and binary forms of CLI11, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
90 changes: 90 additions & 0 deletions NEWS
@@ -1,3 +1,93 @@
Version 1.1.0 - October 2021

- Add support for NVSHMEM communication for the Dslash operators, for
significantly improved strong scaling. See
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more
details.

- Addition of the MSPCG preconditioned CG solver for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method
for more details.

- Addition of the Exact One Flavor Algorithm (EOFA) for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA)
for more details.

- Addition of a fully GPU native Implicitly Restarted Arnoldi
eigensolver (as opposed to partially relying on ARPACK). See
https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver
for more details.
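
A minimal sketch of driving this eigensolver through the usual
parameter-struct interface follows; the enum and field names are
assumptions based on quda's public headers, not taken from this entry.

    #include <quda.h>

    // Hedged sketch: request the GPU-native IR Arnoldi eigensolver.
    void arnoldi_example(void **host_evecs, double _Complex *host_evals)
    {
      QudaEigParam eig_param = newQudaEigParam();
      eig_param.eig_type = QUDA_EIG_IR_ARNOLDI; // Arnoldi instead of TRLM; no ARPACK
      // ... set the operator via eig_param.invert_param, tolerances, sizes ...
      eigensolveQuda(host_evecs, host_evals, &eig_param);
    }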

- Significantly reduced latency for reduction kernels through the use
of heterogeneous atomics. Requires CUDA 11.0+.
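
The mechanism in a standalone sketch (illustrative only, not QUDA's
actual reduction code; assumes CUDA 11.0+ and a GPU with system-scope
atomic support): a cuda::atomic flag in pinned host memory lets the
host poll for kernel completion instead of paying the latency of a
full device synchronization.

    #include <cuda/atomic>
    #include <cstdio>
    #include <new>

    using system_flag = cuda::atomic<int, cuda::thread_scope_system>;

    __global__ void reduce_and_signal(system_flag *done)
    {
      // ... the actual reduction would run here ...
      if (blockIdx.x == 0 && threadIdx.x == 0)
        done->store(1, cuda::memory_order_release); // publish completion
    }

    int main()
    {
      system_flag *done;
      cudaMallocHost(&done, sizeof(system_flag)); // pinned, device-visible via UVA
      new (done) system_flag(0);
      reduce_and_signal<<<1, 128>>>(done);
      while (done->load(cuda::memory_order_acquire) == 0) {
      } // spin on the flag rather than calling cudaDeviceSynchronize()
      printf("reduction signalled\n");
      cudaFreeHost(done);
      return 0;
    }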

- Addition of support for a split-grid multi-RHS solver. See
https://github.com/lattice/quda/wiki/Split-Grid for more details.

- Continued work on enhancing and refining the staggered multigrid
algorithm. The MILC interface can now drive the staggered multigrid
solver.

- Multigrid setup can now use tensor cores on Volta, Turing and Ampere
GPUs to accelerate the calculation. Enable with the
`QudaMultigridParam::use_mma` parameter.
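
A hedged sketch of enabling this, where only use_mma is taken from
this entry and the surrounding calls are the usual quda.h multigrid
interface:

    #include <quda.h>

    void multigrid_mma_example()
    {
      QudaInvertParam mg_inv_param = newQudaInvertParam();
      QudaMultigridParam mg_param = newQudaMultigridParam();
      mg_param.invert_param = &mg_inv_param;
      // ... set n_level, block sizes, smoother and coarse-solver options ...
      mg_param.use_mma = QUDA_BOOLEAN_TRUE; // tensor cores in the setup phase
      void *mg_preconditioner = newMultigridQuda(&mg_param);
      // ... pass mg_preconditioner to the outer solver's preconditioner field ...
      destroyMultigridQuda(mg_preconditioner);
    }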

- Improved support of managed memory through the addition of a
prefetch API. This can dramatically improve the performance of the
multigrid setup when oversubscribing the memory.
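
For illustration, a sketch of the underlying CUDA call that such a
prefetch API wraps (not QUDA's internal code):

    #include <cuda_runtime.h>

    void prefetch_example()
    {
      void *buf = nullptr;
      const size_t bytes = size_t(1) << 30; // 1 GiB, illustrative
      cudaMallocManaged(&buf, bytes);
      int device = 0;
      cudaGetDevice(&device);
      // Migrate the pages to the GPU before kernels touch them, avoiding
      // per-page fault overhead when memory is oversubscribed.
      cudaMemPrefetchAsync(buf, bytes, device, 0 /* stream */);
      cudaDeviceSynchronize();
      cudaFree(buf);
    }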

- Improved the performance of using MILC RHMC with QUDA.

- Add support for a new internal data order FLOAT8. This is the
default data order for nSpin=4 half and quarter precision fields,
though the prior FLOAT4 order can be enabled with the cmake option
QUDA_FLOAT8=OFF.

- Removal of the singularity from the reconstruct-8 and reconstruct-9
compressed gauge field ordering. This enables support for free
fields with these orderings.

- The clover parameter convention has been codified: one can either
1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw
separately, and QUDA will infer the necessary clover coefficient, or
2.) pass an explicit value of QudaInvertParam::clover_coeff
(e.g. CHROMA's use case) and that will override the above inference.
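
In code, the two conventions look roughly as follows (a sketch; field
names are as quoted above, values illustrative):

    #include <quda.h>

    void clover_convention_example()
    {
      QudaInvertParam inv_param = newQudaInvertParam();

      // Convention 1: supply kappa and csw; QUDA infers the clover coefficient.
      inv_param.kappa = 0.12;
      inv_param.csw = 1.0;

      // Convention 2: set the coefficient explicitly (e.g. CHROMA's use
      // case), overriding the inference above.
      // inv_param.clover_coeff = 0.12;
    }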

- QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH
and QUDA_FAST_COMPILE_REDUCE) which enable much faster build times
for development at the expense of reduced performance.

- Add support for compiling QUDA using clang as both the host and
device compiler.

- While the bulk of the work associated with making QUDA portable to
different architectures will form the soul of QUDA 2.0, some of the
initial refactoring associated with this has been applied.

- Significant cleanup of the tests directory to reduce boilerplate.

- General improvements to the cmake build system using modern cmake
features. We now require cmake 3.15.

- Extended the ctest list to include some optional benchmarks.

- Fix a long-standing issue seen on multi-node systems with Kepler
GPUs and dual-socket Intel CPUs.

- Improved ASAN integration: SANITIZE builds now work out of the box
with no need to set the ASAN_OPTIONS environment variable.

- Add support for the extended QIO branch (now required for MILC).

- Bump QMP version to 2.5.3.

- Updated to Eigen 3.3.9.

- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/24?closed=1

Version 1.0.0 - 10 January 2020

- Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
102 changes: 62 additions & 40 deletions README.md
@@ -1,4 +1,4 @@
# QUDA 1.0.0
# QUDA 1.1.0

## Overview

@@ -29,21 +29,27 @@ QUDA includes an implementation of adaptive multigrid for the Wilson,
clover-improved, twisted-mass and twisted-clover fermion actions. We
note however that this is undergoing continued evolution and
improvement, and we highly recommend that users of adaptive multigrid use the
latest develop branch. More details can be found at
https://github.com/lattice/quda/wiki/Multigrid-Solver.
latest develop branch. More details can be found
[here](https://github.com/lattice/quda/wiki/Multigrid-Solver).

Support for eigen-vector deflation solvers is also included through
the Thick Restarted Lanczos Method (TRLM). For more details we refer
the user to the wiki
(https://github.com/lattice/quda/wiki/Deflated-Solvers).
the Thick Restarted Lanczos Method (TRLM), and we offer an Implicitly
Restarted Arnoldi method for computing the spectra of non-Hermitian
operators. For more details we refer the user to the wiki:
[QUDA's eigensolvers](https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers)
[Deflating coarse grid solves in Multigrid](https://github.com/lattice/quda/wiki/Multigrid-Solver#multigrid-inverter--lanczos)

## Software Compatibility:

The library has been tested under Linux (CentOS 7 and Ubuntu 18.04)
using releases 7.5 through 10.2 of the CUDA toolkit. Earlier versions
using releases 10.1 through 11.4 of the CUDA toolkit. Earlier versions
of the CUDA toolkit will not work, and we highly recommend the use of
10.x. QUDA has been tested in conjunction with x86-64, IBM
POWER8/POWER9 and ARM CPUs. CMake 3.11 or greater to required to build QUDA.
11.x. QUDA has been tested in conjunction with x86-64, IBM
POWER8/POWER9 and ARM CPUs. Both GCC and Clang host compilers are
supported, with the minimum recommended versions being 7.x and 6, respectively.
CMake 3.15 or greater is required to build QUDA.

See also Known Issues below.

@@ -59,25 +65,25 @@ capability" of your card, either from NVIDIA's documentation or by
running the deviceQuery example in the CUDA SDK, and pass the
appropriate value to the `QUDA_GPU_ARCH` variable in cmake.

QUDA 1.0.0, supports devices of compute capability 3.0 or greater.
While QUDA is no longer supported on the older Fermi architecture, it
may continue to work (assuming the user disables the use of textures
(QUDA_TEX=OFF).
QUDA 1.1.0 supports devices of compute capability 3.0 or greater.
QUDA is no longer supported on the older Tesla (1.x) and Fermi (2.x)
architectures.

See also "Known Issues" below.


## Installation:

The recommended method for compiling QUDA is to use cmake, and build
QUDA in a separate directory from the source directory. For
instructions on how to build QUDA using cmake see this page
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake. Note that
this requires cmake version 3.11 or later. You can obtain cmake from
https://cmake.org/download/. On Linux the binary tar.gz archives unpack
into a cmake directory and usually run fine from that directory.
It is recommended to build QUDA in a separate directory from the
source directory. For instructions on how to build QUDA using cmake
see this page
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake. Note
that this requires cmake version 3.15 or later. You can obtain cmake
from https://cmake.org/download/. On Linux the binary tar.gz archives
unpack into a cmake directory and usually run fine from that
directory.

The basic steps for building cmake are:
The basic steps for building with cmake are:

1. Create a build dir, outside of the quda source directory.
2. In your build-dir run `cmake <path-to-quda-src>`
@@ -94,16 +100,26 @@ or specify e.g. -DQUDA_GPU_ARCH=sm_60 for a Pascal GPU in step 2.

### Multi-GPU support

QUDA supports using multiple GPUs through MPI and QMP.
To enable multi-GPU support either set `QUDA_MPI` or `QUDA_QMP` to ON when configuring QUDA through cmake.
QUDA supports using multiple GPUs through MPI and QMP, together with
the optional use of NVSHMEM GPU-initiated communication for improved
strong scaling of the Dirac operators. To enable multi-GPU support
either set `QUDA_MPI` or `QUDA_QMP` to ON when configuring QUDA
through cmake.

Note that in any case cmake will automatically try to detect your MPI installation. If you need to specify a particular MPI please set `MPI_C_COMPILER` and `MPI_CXX_COMPILER` in cmake.
See also https://cmake.org/cmake/help/v3.9/module/FindMPI.html for more help.
Note that in any case cmake will automatically try to detect your MPI
installation. If you need to specify a particular MPI please set
`MPI_C_COMPILER` and `MPI_CXX_COMPILER` in cmake. See also
https://cmake.org/cmake/help/v3.9/module/FindMPI.html for more help.

For QMP please set `QUDA_QMP_HOME` to the installation directory of QMP.

For more details see https://github.com/lattice/quda/wiki/Multi-GPU-Support

To enable NVSHMEM support set `QUDA_NVSHMEM` to ON, and set the
location of the local NVSHMEM installation with `QUDA_NVSHMEM_HOME`.
For more details see
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM

### External dependencies

The eigen-vector solvers (eigCG and incremental eigCG) by default will
@@ -113,7 +129,7 @@ details). MAGMA is available from
http://icl.cs.utk.edu/magma/index.html. MAGMA is enabled using the
cmake option `QUDA_MAGMA=ON`.

Version 1.0.0 of QUDA includes interface for the external (P)ARPACK
Version 1.1.0 of QUDA includes an interface for the external (P)ARPACK
library for eigenvector computing. (P)ARPACK is available, e.g., from
https://github.com/opencollab/arpack-ng. (P)ARPACK is enabled using
CMake option `QUDA_ARPACK=ON`. Note that with a multi-GPU option, the
@@ -168,7 +184,7 @@ communication and exterior update).
## Using the Library:

Include the header file include/quda.h in your application, link against
lib/libquda.a, and study tests/invert_test.cpp (for Wilson, clover,
lib/libquda.so, and study tests/invert_test.cpp (for Wilson, clover,
twisted-mass, or domain wall fermions) or
tests/staggered_invert_test.cpp (for asqtad/HISQ fermions) for examples
of the solver interface. The various solver options are enumerated in
@@ -188,7 +204,7 @@ used on all GPUs and binary reproducibility.
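
A minimal sketch of this flow (parameter setup, host field layout, and
error handling are elided; the buffer arguments are assumed to be
allocated and filled by the caller):

```cpp
#include <quda.h>

void solve_example(void *gauge, void *spinor_out, void *spinor_in)
{
  initQuda(0); // initialize QUDA on device 0

  QudaGaugeParam gauge_param = newQudaGaugeParam();
  QudaInvertParam inv_param = newQudaInvertParam();
  // ... set lattice dimensions, precisions, dslash and solver types ...

  loadGaugeQuda(gauge, &gauge_param);            // upload the gauge field
  invertQuda(spinor_out, spinor_in, &inv_param); // solve M x = b

  endQuda();
}
```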

## Getting Help:

Please visit http://lattice.github.com/quda for contact information. Bug
Please visit http://lattice.github.io/quda for contact information. Bug
reports are especially welcome.


@@ -209,7 +225,7 @@ Performance Computing, Networking, Storage and Analysis (SC), 2011

When taking advantage of adaptive multigrid, please also cite:

M. A. Clark, A. Strelchenko, M. Cheng, A. Gambhir, and R. Brower,
M. A. Clark, B. Joo, A. Strelchenko, M. Cheng, A. Gambhir, and R. Brower,
"Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained
Parallelization," International Conference for High Performance
Computing, Networking, Storage and Analysis (SC), 2016
@@ -220,10 +236,14 @@ When taking advantage of block CG, please also cite:
M. A. Clark, A. Strelchenko, A. Vaquero, M. Wagner, and E. Weinberg,
"Pushing Memory Bandwidth Limitations Through Efficient
Implementations of Block-Krylov Space Solvers on GPUs,"
To be published in Comput. Phys. Commun. (2018) [arXiv:1710.09745 [hep-lat]].
Comput. Phys. Commun. 233 (2018), 29-40 [arXiv:1710.09745 [hep-lat]].

When taking advantage of the Möbius MSPCG solver, please also cite:

Several other papers that might be of interest are listed at
http://lattice.github.com/quda .
Jiqun Tu, M. A. Clark, Chulwoo Jung, Robert Mawhinney, "Solving DWF
Dirac Equation Using Multi-splitting Preconditioned Conjugate Gradient
with Tensor Cores on NVIDIA GPUs," published in the Platform for
Advanced Scientific Computing (PASC21) [arXiv:2104.05615 [hep-lat]].


## Authors:
@@ -237,27 +257,29 @@ http://lattice.github.com/quda .
* Kate Clark (NVIDIA)
* Michael Cheng (Boston University)
* Carleton DeTar (Utah University)
* Justin Foley (NIH)
* Joel Giedt (Rensselaer Polytechnic Institute)
* Justin Foley (NIH)
* Arjun Gambhir (William and Mary)
* Joel Giedt (Rensselaer Polytechnic Institute)
* Steven Gottlieb (Indiana University)
* Kyriakos Hadjiyiannakou (Cyprus)
* Dean Howarth (Boston University)
* Balint Joo (Jefferson Laboratory)
* Dean Howarth (Lawrence Livermore Lab, Lawrence Berkeley Lab)
* Balint Joo (OLCF, Oak Ridge National Laboratory, formerly Jefferson Lab)
* Hyung-Jin Kim (Samsung Advanced Institute of Technology)
* Bartek Kostrzewa (Bonn)
* Eloy Romero (William and Mary)
* James Osborn (Argonne National Laboratory)
* Claudio Rebbi (Boston University)
* Guochun Shi (NCSA)
* Eloy Romero (William and Mary)
* Hauke Sandmeyer (Bielefeld)
* Mario Schröck (INFN)
* Guochun Shi (NCSA)
* Alexei Strelchenko (Fermi National Accelerator Laboratory)
* Jiqun Tu (Columbia)
* Jiqun Tu (NVIDIA)
* Alejandro Vaquero (Utah University)
* Mathias Wagner (NVIDIA)
* Andre Walker-Loud (Lawrence Berkeley Laboratory)
* Evan Weinberg (NVIDIA)
* Frank Winter (Jlab)
* Frank Winter (Jefferson Lab)
* Yi-Bo Yang (Chinese Academy of Sciences)


Portions of this software were developed at the Innovative Systems Lab,
1 change: 1 addition & 0 deletions cmake/FindCUDALibs.cmake
@@ -183,4 +183,5 @@ if(NOT CUDA_VERSION VERSION_LESS "7.0")
endif()

find_cuda_helper_libs(cuda)
set(CUDA_cuda_driver_LIBRARY ${CUDA_cuda_LIBRARY})
find_cuda_helper_libs(nvToolsExt)