[ci] Update CUDA versions for CI #6539
@shiyu1994 Hi! May I kindly ask you to update the NVIDIA drivers on the host machine where the CUDA CI jobs are executed? That would allow us to run tests against the most recent CUDA version.
Refer to #6520 for the context of this PR. Some related external links: …
Based on https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers, I think we want R535 (the latest long-term support release).
Agreed. Based on my personal experience, the R530 driver doesn't support CUDA 12.5.
Gentle ping to @shiyu1994 for a fresh NVIDIA driver installation.
Can confirm that R535 is enough to run containers with CUDA 12.5.
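For anyone who wants to reproduce that check locally, here is a minimal sketch (it assumes the NVIDIA Container Toolkit is already configured on the host, and the image tag is only an example):

```shell
# Check the host driver version (an R535+ driver is needed for CUDA 12.5 containers)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Smoke test: run nvidia-smi inside a CUDA 12.5 runtime container
docker run --rm --gpus all nvidia/cuda:12.5.0-runtime-ubuntu22.04 nvidia-smi
```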
I'll try to contact @shiyu1994 in the maintainer Slack.
@jameslamb Did you succeed? 👼
No, I haven't been able to reach @shiyu1994 in the last 2 months. @shiyu1994, since I do see you're active here (#6623), could you please help us with this? I also sent another message in the private maintainer chat on a separate topic.
Just learned that the CUDA Forward Compatibility feature is available only for data center cards (e.g. Tesla A100) and not for consumer ones (e.g. RTX 4090).
For example, on a consumer RTX 4090 card with the R535 driver you'll get an error.
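For reference, a quick way to see which CUDA runtime the installed driver supports natively is the "CUDA Version" field printed at the top of the nvidia-smi output (a generic check, not tied to any particular card):

```shell
# The "CUDA Version" field in the header is the newest CUDA runtime
# the installed driver supports without forward compatibility
nvidia-smi | head -n 4
```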
Sorry, I cannot log in to my Slack account, since it is registered with a @qq.com email. I will update the CUDA version of the CI agent.
Thank you!!
Thanks a lot!
@StrikerRUS Done upgrading the NVIDIA driver to 535. Please ping me if there's anything else I need to do.
@shiyu1994 Thank you very much!
The CUDA CI jobs are still failing with an error, so I'll try to trigger a manual Docker restart.
It helped! 🎉
I think this PR is ready for review.
Excellent! I checked the logs and everything looks good to me. I support merging this.
I'm glad the option to restart docker manually was helpful!
One thing that surprised me in the logs... the wheels are only 11.1 MB compressed?
checking './dist/lightgbm-4.5.0.99-py3-none-linux_x86_64.whl'
----- package inspection summary -----
file size
* compressed size: 11.1M
* uncompressed size: 23.0M
* compression space saving: 51.8%
If that's true, then maybe we should consider compiling CUDA support into the wheels we distribute on PyPI. We could do something like XGBoost does, supporting just 1 major version of CUDA at a time on PyPI (ref: dmlc/xgboost#10807).
Anyway, just thinking out loud... it absolutely should not block this PR, just maybe something to think about for the future.
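As a quick sanity check, those size numbers can be reproduced locally with just `unzip` (the wheel filename below is taken from the CI log above):

```shell
# Compressed size of the wheel on disk
ls -lh ./dist/lightgbm-4.5.0.99-py3-none-linux_x86_64.whl

# Total uncompressed size of its contents (reported on the last line of the listing)
unzip -l ./dist/lightgbm-4.5.0.99-py3-none-linux_x86_64.whl | tail -n 1
```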
I like your idea about publishing a CUDA-enabled version on PyPI! But maybe we should wait for #6138, where we'll get NCCL as a new dependency; that PR currently contains the following diff:
- --max-allowed-size-uncompressed '100M' \
+ --max-allowed-size-uncompressed '500M' \
Yes, good point! We could also maybe explore what …
Hmmm...
The same problem was reported in another project 2 days ago: All-Hands-AI/OpenHands#4153.
Still not certain what the root cause is, but @StrikerRUS I found that switching from …
Sorry, this was accidentally closed because of language I used in the description of #6663. I've reopened it and updated it to the latest …
Fixed #6520.