Can nvidiagpubeat be made to also export the process running on each card? #29
Comments
@musiczhzhao Yes, it can. I have a piece of code for it. I will try to integrate it into nvidiagpubeat. |
@deepujain Thank you! 👍 |
Hi @deepujain, How are things going? Just checking whether there is any update, and whether any help is needed. Best |
The changes are ready. I lost access to my GPU cluster, hence testing the changes has become a challenge and created a dependency. Here is a sample The
The
|
@musiczhzhao I made the changes to nvidiagpubeat to support process detail information, and made it generic along the way. Please test and share the results here (including a few sample events) for query-compute-apps (active GPU process details). Because it is generic, it can now support all types of queries. I have tested only --query-gpu and --query-compute-apps. If you plan to use other options, let me know and you can help me with testing.
https://github.com/eBay/nvidiagpubeat#sample-event has details. |
Hi @deepujain, Thank you! I will test it and get back to you ASAP. 👍 Best |
Hello @deepujain, Happy weekend! I have briefly tested the new version and can confirm that it exports the application name and GPU memory usage of each application when --query-compute-apps is used.

One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both kinds of documents are exported. I tried enabling both in the configuration file, and only the latter one took effect. For example, with the following in the configuration, it seems to export only the compute app metrics: ## --query-gpu will provide information about GPU.

Another question: we find it useful to have the full command line of the app. For example, if a Python script is launched with python, nvidia-smi currently shows the app as just python, without the actual script name and arguments. Searching online, we found that people generally first get the PID of the application and then get the full command from the ps command. (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username) Can we have this built in, so that events carry the cmd just as metricbeat does? Best, |
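The PID-then-command-line approach described in the comment above could be sketched roughly as follows. This is a minimal illustration, not nvidiagpubeat's actual code: the `computeApp` type and the functions `parseComputeApps`, `decodeCmdline`, and `readCmdline` are hypothetical names. It assumes the output of `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`, assumes that process names contain no commas and that the memory column is numeric, and reads `/proc/<pid>/cmdline` (Linux only) rather than shelling out to `ps`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// computeApp is a hypothetical record for one row of nvidia-smi output.
type computeApp struct {
	PID        int
	Name       string
	UsedMemMiB int
	Cmdline    string
}

// parseComputeApps parses rows such as "1234, python, 1024".
// Assumption: exactly three comma-separated fields per row.
func parseComputeApps(out string) ([]computeApp, error) {
	var apps []computeApp
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected row %q", line)
		}
		pid, err := strconv.Atoi(strings.TrimSpace(fields[0]))
		if err != nil {
			return nil, fmt.Errorf("bad pid in %q: %w", line, err)
		}
		mem, err := strconv.Atoi(strings.TrimSpace(fields[2]))
		if err != nil {
			return nil, fmt.Errorf("bad memory value in %q: %w", line, err)
		}
		apps = append(apps, computeApp{
			PID:        pid,
			Name:       strings.TrimSpace(fields[1]),
			UsedMemMiB: mem,
		})
	}
	return apps, nil
}

// decodeCmdline turns the NUL-separated bytes of /proc/<pid>/cmdline
// into a single space-joined command line.
func decodeCmdline(raw []byte) string {
	isNUL := func(r rune) bool { return r == 0 }
	return strings.Join(strings.FieldsFunc(string(raw), isNUL), " ")
}

// readCmdline returns the full command line of pid, or "" if it cannot
// be read (process already gone, insufficient permissions, non-Linux host).
func readCmdline(pid int) string {
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
	if err != nil {
		return ""
	}
	return decodeCmdline(raw)
}

func main() {
	// Sample text standing in for real nvidia-smi output.
	sample := "1234, python, 1024\n5678, /usr/bin/trainer, 2048\n"
	apps, err := parseComputeApps(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i := range apps {
		apps[i].Cmdline = readCmdline(apps[i].PID)
		fmt.Printf("%+v\n", apps[i])
	}
}
```

A process can exit between the nvidia-smi call and the `/proc` read, so the sketch treats a missing command line as an empty string rather than an error.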
Hello Zhao, Thank you for testing it out. Could you please raise separate GitHub issues for each new feature request?
Cheers |
Hi @deepujain, I did a bit more testing, which took some time. Another issue we found is that the new version seems to assume that there is only one app running on each GPU card, or that nvidia-smi returns only 4 processes if there are 4 GPU cards on a machine. Otherwise it crashes with the following error message.
The code allocating the event is in line 71 of nvidia/gpu.go: I will attach the sample events in a separate post. Best, |
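The code at line 71 of nvidia/gpu.go is not reproduced in this thread, so the following is only a sketch of one way to avoid this kind of crash, under the assumption that it comes from sizing the event slice by the number of GPU cards. Here the slice grows with the number of rows nvidia-smi actually returns. The `event` type and `buildEvents` function are hypothetical stand-ins, not the beat's real types.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a hypothetical stand-in for the beat's event map.
type event map[string]string

// buildEvents creates one event per output row, so the number of events
// follows the number of rows rather than the number of GPU cards.
func buildEvents(headers []string, out string) []event {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	events := make([]event, 0, len(lines)) // capacity hint only; append grows it
	for _, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue
		}
		values := strings.Split(line, ",")
		ev := make(event, len(headers))
		for i, h := range headers {
			if i < len(values) { // tolerate short rows instead of panicking
				ev[h] = strings.TrimSpace(values[i])
			}
		}
		events = append(events, ev)
	}
	return events
}

func main() {
	headers := []string{"pid", "name", "used_memory"}
	// Six processes, as might happen on a machine with four GPU cards.
	out := "11, python, 100\n12, python, 200\n13, python, 300\n" +
		"14, python, 400\n15, python, 500\n16, python, 600\n"
	for _, ev := range buildEvents(headers, out) {
		fmt.Println(ev["pid"], ev["name"], ev["used_memory"])
	}
}
```

Using `append` with a zero-length slice means a host with more processes than cards, or with no GPU processes at all, yields a correspondingly longer or empty result instead of an out-of-range index.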
Since nvidiagpubeat is based on nvidia-smi, and nvidia-smi can list the processes that are currently using the GPU cards, nvidiagpubeat should in theory be able to export the process info as metrics. Please correct me if I am wrong.
I am interested to know whether there is any plan to do this. It would be very helpful for identifying the GPU resource usage of processes and the efficiency of the code.
All the best.