Add tests for DSS on NVIDIA GPUs and only CPUs (New) #1609
Updated the PR with refactored scripts; I decided against re-implementing them in Python. In addition, since last week I have folded in the work for CHECKBOX-1668, which enables customisation of the Microk8s version. Please see the updated description of the PR for more details.
Thanks for this big contribution! As usual, the very clear description and git commit messages help a lot with the review, along with the tests showing a successful run in TF.
Two things:
- I would refrain from removing the `.sh` extension from shell scripts. It's much easier to see what kind of file it is by looking at the extension when the scripts are in the `bin/` directory.
- There is already a `graphics_card` resource in Checkbox that should help with checking if there is at least one Intel/NVIDIA GPU available in the system. Check my inline comment for more info on how to use it.
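To illustrate the second point, here is a rough Python sketch that scans resource-style records for a GPU from a given vendor. The record layout (blank-line-separated `key: value` blocks), the `vendor` field name, the sample values, and both helper names are assumptions made for this example; they are not a confirmed description of what the `graphics_card` resource emits.

```python
def parse_records(text):
    """Split blank-line-separated 'key: value' blocks into dicts."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def has_gpu_from(vendor, records):
    """Return True if any record's (assumed) vendor field mentions the vendor."""
    needle = vendor.lower()
    return any(needle in rec.get("vendor", "").lower() for rec in records)


# Hypothetical sample data, for illustration only.
SAMPLE = """\
vendor: Intel Corporation
product: hypothetical integrated GPU

vendor: NVIDIA Corporation
product: hypothetical discrete GPU
"""

records = parse_records(SAMPLE)
print(has_gpu_from("NVIDIA", records))
print(has_gpu_from("AMD", records))
```

In a job definition, a check of this kind would more likely be written as a `requires:` expression on the resource, so that a job is skipped instead of failing when no matching GPU is present.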
Inline comments (outdated, resolved) on:
- contrib/checkbox-dss-validation/checkbox-provider-dss/bin/check_cuda
- contrib/checkbox-dss-validation/checkbox-provider-dss/bin/check_dss
- contrib/checkbox-dss-validation/checkbox-provider-dss/units/resource.pxu
The relevant workflow run in Testflinger for the latest commit accommodating the requested changes: https://github.com/canonical/checkbox/actions/runs/12122214006
Unfortunately, all jobs testing latest/edge of DSS fail in this run. A new revision of the DSS snap was released on this risk level yesterday, and it seems to have a bug (issue reported here). Since these are not failures of the validation suite, I believe this PR is ready.
The latest commit is a minor change to the README and does not impact the code, so I propose not to re-run the validations in Testflinger.
For the moment we lump it together in the validate-intel-gpu launcher... more refactoring coming
This is covered by checking that DSS's status says 'MLFlow deployment: Ready'. The way the removed test was implemented assumed position of the service's name in the output and made it flaky, especially when re-running the tests.
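A position-independent check of this kind could look roughly like the Python sketch below. Only the string 'MLFlow deployment: Ready' comes from the commit message above; the function name and the sample status text are hypothetical, and the sample is not the real output of DSS.

```python
def component_is_ready(status_output, component="MLFlow deployment"):
    """Return True if any line of the status text reports '<component>: Ready'.

    The whole output is searched, so the result does not depend on where
    the component appears in the listing.
    """
    expected = f"{component}: Ready"
    return any(expected in line for line in status_output.splitlines())


# Hypothetical status text, for illustration only.
first_run = "MLFlow deployment: Ready\nSome other component: Ready\n"
re_run = "Some other component: Ready\nMLFlow deployment: Ready\n"
broken = "Some other component: Ready\nMLFlow deployment: Not ready\n"

print(component_is_ready(first_run))
print(component_is_ready(re_run))
print(component_is_ready(broken))
```

Because the match is on the status line itself, a re-run that lists components in a different order gives the same result.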
Many tests here depend on specific resources being available, namely GPUs from Intel or NVIDIA, so not all tests are expected to pass on a given machine. We should therefore not spend much time retrying them.
The tests fail on re-runs because the Intel GPU plugin starts counting NVIDIA GPUs too
one redundant test job has been removed since the new test-case now implicitly tests importing itex as well
one redundant test job has been removed since the new test-case now implicitly tests importing ipex as well
There seems to be a bug in the Intel GPU plugin where it starts counting NVIDIA GPUs too under its label once NVIDIA's plugin is enabled. The tests are now updated to check for matching the minimum slot count instead of an exact one.
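The difference between the two comparisons can be sketched in Python as follows. The function names and the example counts are hypothetical, and the PR's own scripts are shell, so this is an illustration of the idea and not the actual change.

```python
def has_exact_slots(reported, expected):
    """Old-style check: passes only when the count matches exactly."""
    return int(reported) == expected


def has_minimum_slots(reported, minimum):
    """New-style check: passes when at least `minimum` slots are reported."""
    return int(reported) >= minimum


# Hypothetical scenario: one Intel GPU is expected, but the label reports 2
# slots once another vendor's GPUs are counted under it as well.
reported_after_rerun = "2"

print(has_exact_slots(reported_after_rerun, 1))
print(has_minimum_slots(reported_after_rerun, 1))
```

The trade-off is that a minimum check no longer detects over-counting, which seems acceptable here since the over-count is attributed to the plugin and not to the validation suite.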
It helps to know which script is being run
The previous approach checked for the driver, but that does not work for NVIDIA GPUs because we don't install their drivers on the machine (the drivers are installed in the k8s operator).
Thanks for the modifications! LGTM :)
Description
Changes to tests jobs
- [...] `graphics_card` resource, and enable skipping respective tests when the relevant GPUs are not available.
Changes to the test plan
- [...] `resource.pxu` and additional NVIDIA tests, as explained above.
Changes to the snap
- [...] `checkbox-dss` snap produced with the provider has been changed from `validate-intel-gpu` to `validate-with-gpu`.
- `install-deps` has been refactored, and now accepts specifying version of the main snaps to be installed, which currently include DSS itself, Microk8s, and `kubectl`.
- [...] `2.0` to `3.0`, and changes have been made to the relevant `snapcraft.yaml` and to the README.
Changes to the relevant GitHub workflow
Resolved issues
Documentation
No changes to Checkbox's documentation.
Tests
These DSS validations need to be run on machines from Testflinger. See a recent run of the workflow here (the relevant one for this PR is https://github.com/canonical/checkbox/actions/runs/12056842710).