Joshua s brown/documentation performance regression (#561)

* Add more documentation * Add Documentation * Add change log comment Co-authored-by: Jonas Lippuner <[email protected]>
parthenon-hpc-lab · Aug 6, 2021 · ed34340 · ed34340
1 parent fa70d97
commit ed34340
Show file tree

Hide file tree

Showing 2 changed files with 159 additions and 16 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -30,8 +30,7 @@
 - [[PR 524]](https://github.com/lanl/parthenon/pull/524) Enforce `Metadata` flags constraints and add new `Metadata::WithFluxes` flag. Note: `Metadata::Independent` will be set automatically unless `Metadata::Derived` is set
 - [[PR 517]](https://github.com/lanl/parthenon/pull/517) Remove optional `dims` argument from `MeshBlockData::Add` and use the shape from the `Metadata` instead
 - [[PR 492]](https://github.com/lanl/parthenon/pull/492) Modify advection example to have an arbitrary number of dense variables and to disable fill derived for profiling.
-- [[PR 486]](https://github.com/lanl/parthenon/pull/486) Unify HDF5 output and restart file writing,
-    add HDF5 compression support, add support for sparse fields in HDF5 output/restart files, add and rename some metadata in HDF5 files.
+- [[PR 486]](https://github.com/lanl/parthenon/pull/486) Unify HDF5 output and restart file writing, add HDF5 compression support, add support for sparse fields in HDF5 output/restart files, add and rename some metadata in HDF5 files.
 - [[PR 522]](https://github.com/lanl/parthenon/pull/522) Corrected ordering of `OutputDatasetNames` to match `ComponentNames`
 
 ### Fixed (not changing behavior/API/variables/...)
@@ -53,16 +52,17 @@
 - [[PR 490]](https://github.com/lanl/parthenon/pull/490) Adjust block size in OverlappingSpace instance tests to remain within Cuda/HIP limits
 - [[PR 488]](https://github.com/lanl/parthenon/pull/488) Update GitLab Dockerfile to use HDF5 version 1.10.7
 - [[PR 510]](https://github.com/lanl/parthenon/pull/510) Fix calling noexistant logger in python performance regression app
-- [[PR 527]](https://github.com/lanl/parthenon/pull/527) Fix problem with ci when rebase is used.
+- [[PR 527]](https://github.com/lanl/parthenon/pull/527) Fix problem with CI when rebase is used.
 - [[PR 518]](https://github.com/lanl/parthenon/pull/518) Added MPI performance regression tests to CI performance app
-- [[PR 530]](https://github.com/lanl/parthenon/pull/530) Fixed issue with ci plotting the oldest 5 commit metrics for each test, also cleaned up legend formatting.
+- [[PR 530]](https://github.com/lanl/parthenon/pull/530) Fixed issue with CI plotting the oldest 5 commit metrics for each test, also cleaned up legend formatting.
 - [[PR 536]](https://github.com/lanl/parthenon/pull/536) Updated to latest Kokkos release.
 - [[PR 520]](https://github.com/lanl/parthenon/pull/520) Add black python formatter to github actions
 - [[PR 519]](https://github.com/lanl/parthenon/pull/519) Add checksum to bash uploader script to verify file is trusted
 - [[PR 549]](https://github.com/lanl/parthenon/pull/549) Add deep-code badge.
 - [[PR 554]](https://github.com/lanl/parthenon/pull/554) Small fix to documentation related to python parthenon tools README
-- [[PR 555](https://github.com/lanl/parthenon/pull/555) Added documentation for darwin ci and scripts
+- [[PR 555](https://github.com/lanl/parthenon/pull/555) Added documentation for darwin CI and scripts
 - [[PR 560]](https://github.com/lanl/parthenon/pull/560) Rename `png_files_to_upload` to more generic `figure_files_to_upload`
+- [[PR 561]](https://github.com/lanl/parthenon/pull/561) Adding documentation to help with adding new performance regression tests.
 
 ### Removed (removing behavior/API/varaibles/...)
 - [[PR 498]](https://github.com/lanl/parthenon/pull/498) Cleanup unused user hooks and variables
@@ -97,7 +97,7 @@ Date: 03/30/2021
 ### Infrastructure (changes irrelevant to downstream codes)
 - [[PR 436]](https://github.com/lanl/parthenon/pull/436) Update Summit build doc and machine file
 - [[PR 435]](https://github.com/lanl/parthenon/pull/435) Fix ctest logic for parsing number of ranks in MPI tests
-- [[PR 407]](https://github.com/lanl/parthenon/pull/407) More cleanup, removed old bash scripts for ci.
+- [[PR 407]](https://github.com/lanl/parthenon/pull/407) More cleanup, removed old bash scripts for CI.
 - [[PR 428]](https://github.com/lanl/parthenon/pull/428) Triad Copyright 2021
 - [[PR 413]](https://github.com/lanl/parthenon/pull/413) LANL Snow machine configuration
 - [[PR 390]](https://github.com/lanl/parthenon/pull/390) Resolve `@PAR_ROOT@` to parthenon root rather than the location of the current source directory
@@ -135,11 +135,11 @@ Date: 01/19/2021
 ### Infrastructure (changes irrelevant to downstream codes)
 - [[PR 392]](https://github.com/lanl/parthenon/pull/392) Fix C++ linting for when parthenon is a submodule
 - [[PR 335]](https://github.com/lanl/parthenon/pull/335) New machine configuration file for LANL's Darwin cluster
-- [[PR 200]](https://github.com/lanl/parthenon/pull/200) Adds support for running ci on power9 nodes.
-- [[PR 347]](https://github.com/lanl/parthenon/pull/347) Speed up darwin ci by using pre installed spack packages from project space
-- [[PR 368]](https://github.com/lanl/parthenon/pull/368) Fixes false positive in ci.
-- [[PR 369]](https://github.com/lanl/parthenon/pull/369) Initializes submodules when running on darwin ci.
-- [[PR 382]](https://github.com/lanl/parthenon/pull/382) Adds output on fail for fast ci implementation on Darwin.
+- [[PR 200]](https://github.com/lanl/parthenon/pull/200) Adds support for running CI on POWER9 nodes.
+- [[PR 347]](https://github.com/lanl/parthenon/pull/347) Speed up darwin CI by using pre installed spack packages from project space
+- [[PR 368]](https://github.com/lanl/parthenon/pull/368) Fixes false positive in CI.
+- [[PR 369]](https://github.com/lanl/parthenon/pull/369) Initializes submodules when running on darwin CI.
+- [[PR 382]](https://github.com/lanl/parthenon/pull/382) Adds output on fail for fast CI implementation on Darwin.
 - [[PR 362]](https://github.com/lanl/parthenon/pull/362) Small fix to clean regression tests output folder on reruns
 - [[PR 403]](https://github.com/lanl/parthenon/pull/403) Cleanup Codacy warnings
 - [[PR 377]](https://github.com/lanl/parthenon/pull/377) New machine configuration file for LLNL's RZAnsel cluster

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -20,6 +20,7 @@ and revision any time as necessary.
             - [Creating a Regression Test](#creating-a-regression-test)
             - [Integrating the Regression Test with CMake](#integrating-the-regression-test-with-cmake)
             - [Running ctest](#running-ctest)
+        * [Adding Performance Regression Tests](#adding-performance-regression-tests)
         * [Including Tests in Code Coverage Report](#including-tests-in-code-coverage-report)
         * [Creating and Uploading Coverage Report](#creating-and-uploading-coverage-report)
 
@@ -200,16 +201,16 @@ NVIDIA V100 (Volta) GPUs (power9 architecture). Tests run on these systems are
 primarily aimed at measuring the performance of this specific architecture.
 Compilation and testing details can be found by looking in the
 [.gitlab-ci-darwin.yml](.gitlab-ci-darwin.yml) file *and* the /scripts/darwin
-folder. In summary, the ci is built in release mode, with OpenMP, MPI, HDF5 and
+folder. In summary, the CI is built in release mode, with OpenMP, MPI, HDF5 and
 Cuda enabled. All tests are run on a single node with access to two Volta
 GPUs. In addition, the regression tests are run in parallel with two mpi
 processors each of which have access to their own Volta gpu. The following
-tests are run with this ci: unit, regression, performance. A final note,
+tests are run with this CI: unit, regression, performance. A final note,
 this CI has been chosen to also check for performance regressions. The CI
-uses a github application located in /scripts/python. After a successful run
+uses a GitHub application located in /scripts/python. After a successful run
 of the CI a link to the performance metrics will appear as part of the parthenon
 metrics status check in the pr next to the commit the metrics were recorded for.
-All data from the regression tests are recorded in the parthenon wiki in a json file.
+All data from the regression tests are recorded in the parthenon wiki in a JSON file.
 
 ### Adding Tests
 
@@ -343,7 +344,7 @@ files are located in parthenon/scripts/python/packages/parthenon_performance_app
 additional tests metrics changes will need to be made to the scripts located in the folder:
 parthenon/scripts/python/packages/parthenon_performance_app.  In general, the app
 works by taking performance metrics that are ouput to a file when a regression test is executed.
-This output is read by the app and compared with metrics that are stored in the wiki (json format).
+This output is read by the app and compared with metrics that are stored in the wiki (JSON format).
 The metrics are then plotted in a png file which is also uploaded to the wiki. Finally, a markdown
 page is created with links to the images and is uploaded to the wiki.
 
@@ -373,6 +374,148 @@ regression    =   1.44 sec*proc (1 test)
 Total Test time (real) =   1.47 sec
 ```
 
+#### Adding Performance Regression Tests
+
+Regression tests can also be added for the purposes of testing the performance.
+Adding performance tests does require understanding a bit about how the CI
+infrastructure is setup. Currently, the performance tests are setup to run on
+dedicated Power9 nodes that are provided by LANL, this is to ensure consistency
+in the result. However, because the performance tests are running on LANL
+machines it means that it is not straightforward to make the results viewable
+to non LANL contributors. To make the results of the regression tests viewable
+on GitHub a custom GitHub application was developed to be used with Parthenon
+which we will call the Parthenon Performance Metrics App (PPMA).
+
+The PPMA is series of python scripts located in the parthenon repository in
+./scripts/python/packages/parthenon_performance_app. The PPMA has the following
+responsibilities:
+
+1. Submit a status check associated with the commit being analyzed to the
+   Parthenon repository
+2. Read in performance metrics from regression tests in a build directory
+3. Compare the current metrics with metrics stored in the Parthenon Wiki, the
+   files in the Wiki should contain performance metrics from previous runs.
+4. Generate images showing the performance numbers.
+5. Add images to an orphan branch in the Parthenon repository (branch name is
+   figures).
+6. Add new performance metrics to the appropriate performance JSON file stored
+   in the Wiki. These JSON files are labeled with the following format:
+   `performance_metrics_current-branch.json`, this means each branch should
+   have its own JSON performance file.
+7. Create or overwrite a wiki page that contains links to the performance
+   figures located in the parthenon wiki repository.
+8. Upload the wiki page the images and performance metrics to the Parthenon
+   repository. The wiki page will be named with the following format:
+   current-branch_target-branch.md
+   E.g. If you were merging hotfix branch into develop the wiki page would be
+   named: hotfix_develop.md, each PR should have it's own Wiki page. For performance
+   tests that are not run as part of a wiki such as the develop branch is scheduled
+   to run daily the current-branch and target-branch are the same: develop_develop.md.
+9. Submit a status check indicating that the performance metrics are ready to
+   be viewed with a link to the wiki page.
+
+In order to add a performance test, let's call it `TEST`, the following tasks must be completed.
+
+1. The test must be inside the folder `./tst/regression/test_suites/TEST_performance`.
+   NOTE: the test folder must have `performance` in the name because
+   regexp matching of the folder name is used to figure out which tests to run
+   from the CI. To be explicit you can see what command is run by looking in the
+   file `./scripts/darwin/build_fast.sh`, there is a line
+
+   ```Bash
+       ctest --output-on-failure -R performance -E "SpaceInstances"
+   ```
+
+2. When the test is run e.g. by calling ctest the test must output the
+   performance metrics in a file called `TEST_results.txt` which will appear in the
+   build directory.
+3. The main PPMA file located in
+   `./scripts/python/packages/parthenon_performance_app/bin/` must be edited. The
+   PPMA script works by looping over the output directories of the regression
+   tests, the loop exists in the main PPMA file and should have a logic block
+   similar to what is shown below:
+
+```python
+for test_dir in all_dirs:
+    if not isinstance(test_dir, str):
+        test_dir = str(test_dir)
+    if (test_dir == "advection_performance"
+        or test_dir == "advection_performance_mpi"):
+
+        figure_url, figure_file, _ = self._createFigureURLPathAndName(
+                    test_dir, current_branch, target_branch
+                )
+
+        # do something...
+
+        if create_figures:
+            figure_files_to_upload.append(figure_file)
+            figure_urls.append(figure_url)
+```
+
+Above you can see a branch for `advection_performance` and
+`advection_performance_mpi`, because the logic is the same for both tests there is
+a single if statement. You will likely need to add a separate branch for any
+additional tests e.g. `dir_foo_performance`.
+
+The `_createFigureURLPathAndName` is a convenience method for generating a figure URL
+and a figure_file name. The `figure_url` will be of the form https://github.com/lanl/parthenon/blob/figures/figure_file.
+Using this method is not a requirement. For instance, if you had multiple different performance figures that you
+wanted to generate for a single test you could do, for example:
+
+```python
+
+if (test_dir == "dir_foo_performance"):
+    jpeg_fig_name = "awesome.jpeg"
+    figure_url1 = "https://github.com/lanl/parthenon/blob/figures/" + jpeg_fig_name
+
+    png_fig_name = "awesome.png"
+    figure_url2 = "https://github.com/lanl/parthenon/blob/figures/" + png_fig_name
+
+    png_fig_name2 = "awesome2.png"
+    figure_url3 = "https://github.com/lanl/parthenon/blob/figures/" + png_fig_name2
+
+    # do something... need to actually generate the figures and read in performance metrics
+
+    if create_figures:
+        figure_files_to_upload.append(jpeg_fig_name)
+        figure_urls.append(figure_url1)
+        figure_files_to_upload.append(png_fig_name)
+        figure_urls.append(figure_url2)
+        figure_files_to_upload.append(png_fig_name2)
+        figure_urls.append(figure_url3)
+```
+
+4. At this point we know the app knows the URLs of the figures and we know the
+   figure names, however we have to actually generate the figures, this
+   involves creating an analyzer, you can take a look at the AdvectionAnalyzer as
+   an example
+   ./scripts/python/packages/parthenon_performance_app/parthenon_performance_app/parthenon_performance_advection_analyzer.py.
+   In general, the Analyzer has the following responsibilities:
+
+        a. It must be able
+           to read in results from the regression tests.
+
+        b. It must be able to package the results in a json format.
+
+        c. It must be able to read previous performance metrics results from the json
+           files stored in the wiki. A convenience script is provided to help with this in
+           the form of the parthenon_performance_json_parser.py, if you are adding new
+           metrics the parser will need to be updated in order to correctly read in the
+           new data.
+
+        d. It must be able to create the figures displaying performance. This is also
+           handled by a separate script the parthenon_performance_plotter.py, it may be
+           approprite to adjust it as needed for new functionality or add a separate
+           plotting script.
+
+A few closing points, the advection performance tests are currently setup so that
+if a user creates a pr to another branch the figures generated will contain the
+performance metrics of the two branches and plot them. In the case that the performance
+metrics are not being run as part of a pr, for example the performance of the development
+branch is checked on a daily basis then the performance from the last 5 commits that
+branches metrics file (stored on the wiki, performance_metrics_develop.json) will be plotted.
+
 #### Including Tests in Code Coverage Report
 
 Unit tests that are not performance tests (Do not have the [performance] tag) are