Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for DSS on NVIDIA GPUs and only CPUs (New) #1609

Merged
merged 44 commits into from
Dec 3, 2024

Conversation

motjuste
Copy link
Contributor

@motjuste motjuste commented Nov 20, 2024

Description

Changes to tests jobs

  • Individual test jobs that used to check for available GPUs have been replaced by using the graphics_card resource, and enable skipping respective tests when the relevant GPUs are not available.
  • Tests have been added to check setting up DSS for using CUDA with NVIDIA GPUs, and then some simple tests are run to see if PyTorch and Tensorflow can actually use the GPUs.
  • Tests have been added to run only on the CPU. These tests will be run on all machines irrespective of available GPUs.
  • Shell scripts have been refactored to have more re-usable functions.
    • This was partly done in lieu of investigating if these could have been implemented as Python scripts instead, but now they are far more compact than the Python scripts would have been. Hence, re-implementation in Python has been dropped for now.
  • Tests verifying Intel GPU were updated since they had a bug where they started counting NVIDIA GPUs too as Intel GPUs. The tests are less precise now, i.e. now testing for a minimum expected value for GPU counts and capacity to be available, instead of previous tests looking for exact counts.

Changes to the test plan

  • Changes related to resource.pxu and additional NVIDIA tests, as explained above.
  • Unused test-plans for individually testing ITEX or IPEX have been removed since they get tested in the main test plan any way.

Changes to the snap

  • The command to trigger the tests from the checkbox-dss snap produced with the provider has been changed from validate-intel-gpu to validate-with-gpu.
  • The command to install all the dependencies for the running the tests called install-deps has been refactored, and now accepts specifying version of the main snaps to be installed, which currently include DSS itself, Microk8s, and kubectl.
  • These are backwards-incompatible change to the snap and hence its version has been bumped from 2.0 to 3.0, and changes have been made to the relevant snapcraft.yaml and to the README.

Changes to the relevant GitHub workflow

  • The GitHub workflow for running DSS has been refactored to now need a single job definition that can be used for all the values from the test matrix.
  • An NVIDIA DGX machine has been added as a target machine, representing a machine that does not have any Intel GPUs, instead, only NVIDIA GPUs.
  • Multiple Microk8s versions have been added to the test-matrix.

Resolved issues

Documentation

No changes to Checkbox's documentation.

Tests

These DSS validations need to be run on machines from Testflinger. See a recent run of the workflow here (the relevant one for this PR is https://github.com/canonical/checkbox/actions/runs/12056842710).

@motjuste motjuste changed the title Checkbox 1586 dss add nvidia Add tests for DSS on NVIDIA GPUs and only CPUs (New) Nov 20, 2024
@motjuste motjuste marked this pull request as ready for review November 20, 2024 15:08
@motjuste motjuste requested a review from a team as a code owner November 20, 2024 15:08
@motjuste motjuste requested a review from pieqq November 20, 2024 15:09
@motjuste motjuste force-pushed the CHECKBOX-1586-dss-add-nvidia branch 4 times, most recently from aa6301f to 968e272 Compare November 27, 2024 18:07
@motjuste
Copy link
Contributor Author

motjuste commented Nov 28, 2024

Updated the PR with refactored scripts and decided against re-implementing them in Python. Furthermore, since last week, I have further lumped in work for CHECKBOX-1668 enabling customisation of Microk8s version in the install-deps script in order to lump together changes that require bumping changes to the version of the Snap.

Please see the updated description of the PR for more details.

Copy link
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this big contribution! As usual, the very clear description and git commit messages help a lot with the review, along with the tests showing a successful run in TF.

Two things:

  1. I would refrain from removing .sh extension to Shell scripts. It's much easier to see what kind of file it is by looking at the extension when the scripts are in the bin/ directory
  2. There is already a graphics_card resource in Checkbox that should help with checking if there is at least one Intel/NVIDIA GPU available in the system. Check my inline comment for more info on how to use it.

@motjuste motjuste force-pushed the CHECKBOX-1586-dss-add-nvidia branch 4 times, most recently from e0b2928 to 47e7f56 Compare December 2, 2024 16:35
@motjuste
Copy link
Contributor Author

motjuste commented Dec 2, 2024

The relevant workflow run in Testflinger for the latest commit accommodating the requested changes: https://github.com/canonical/checkbox/actions/runs/12122214006

@motjuste
Copy link
Contributor Author

motjuste commented Dec 3, 2024

The relevant workflow run in Testflinger for the latest commit accommodating the requested changes: https://github.com/canonical/checkbox/actions/runs/12122214006

You will see that all jobs testing latest/edge of DSS fail here unfortunately. There was a release on this risk level yesterday for the DSS snap and it seems to have some bug (Issue reported here). Since these are not failures of the validation suite, I believe then this PR is ready.

@motjuste
Copy link
Contributor Author

motjuste commented Dec 3, 2024

The latest commit is a minor change to the README and does not impact the code, so I propose not to re-run the validations in Testflinger.

@motjuste motjuste requested a review from pieqq December 3, 2024 09:18
For the moment we lump it together in the validate-intel-gpu launcher...
more refactoring coming
This is covered by checking that DSS's status says 'MLFlow deployment: Ready'.
The way the removed test was implemented assumed position of the service's name
in the output and made it flaky, especially when re-running the tests.
Since many tests here depend on some resources to be available, specifically:
GPUs from Intel or NVIDIA, not all tests are expected to pass on a given machine
and hence we should not waste our time too much retrying these tests.
the tests fail on re-runs because they start counting nvidia gpus too
one redundant test job has been removed since the new test-case now implicitly
tests importing itex as well
one redundant test job has been removed since the new test-case now implicitly
tests importing ipex as well
There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA
GPUs too under its label once NVIDIA's plugin is enabled.  The tests are now
updated to check for matching the minimum slot count instead of an exact one.
It helps to know which script is being run
the previous approach was checking for driver, but
that does not work for NVIDIA GPUs because we don't
install their drivers on the machine (the drivers
are installed in the k8s operator).
@motjuste motjuste force-pushed the CHECKBOX-1586-dss-add-nvidia branch from e81eeca to 1dc30bd Compare December 3, 2024 12:10
Copy link
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the modifications! LGTM :)

@pieqq pieqq merged commit 355ecf0 into main Dec 3, 2024
7 checks passed
@pieqq pieqq deleted the CHECKBOX-1586-dss-add-nvidia branch December 3, 2024 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants