
AMD: parse the architecture as supplied by gcnArchName #11244

Open
@Haus1 wants to merge 1 commit into master from amd-rework-version
Conversation

@Haus1 commented Jan 14, 2025

The value provided by minor is truncated for AMD, so parse the value returned by gcnArchName instead for an accurate ID.

We can also use the common value for GCN4, gfx800, to avoid missing compatible devices.

This is a follow-up to #11209 and will change the behavior of CDNA3, CDNA, VEGA and GCN4 as they should now be recognized as expected. Of those I only have access to a GCN4 device for testing.
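
For context, a minimal sketch of the approach described above (not the actual patch): read gcnArchName from hipGetDeviceProperties() and parse the numeric part of the gfx prefix instead of relying on the truncated minor field. The helper name is made up, and a plain decimal parse like this misses letter steppings, which comes up later in the thread.

```cpp
// Minimal sketch, not the actual patch: read gcnArchName via the HIP runtime
// and parse the numeric part of the "gfx" prefix instead of props.minor,
// which is truncated on AMD. parse_gfx_arch is a hypothetical helper.
#include <hip/hip_runtime.h>
#include <cstdio>

static int parse_gfx_arch(int device) {
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, device) != hipSuccess) {
        return -1;
    }
    int arch = 0;
    // gcnArchName looks like "gfx906" or "gfx1030" (plus optional feature suffixes).
    if (sscanf(props.gcnArchName, "gfx%d", &arch) != 1) {
        return -1; // could not parse; caller can fall back to props.major/props.minor
    }
    return arch;   // e.g. 906 for gfx906
}
```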

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 14, 2025
@JohannesGaessler (Collaborator) commented:

I don't know at all whether this is the correct way to do it. @IMbackK your input would be appreciated.

@IMbackK (Contributor) commented Jan 16, 2025

Yes, this is more correct; the current code misses the arch step part.
But I have never seen a device report gfx800. rocBLAS only supports and checks for gfx803, which is reported by Fiji and all Polaris variants, and the only other variant I am aware of is gfx802, which is not supported by rocBLAS (or any ROCm component).

Thus, NAK on the change to the gfx8 define.
I will try this PR out on my devices (I have access to gfx803, gfx900, gfx906, gfx908 and gfx1030).

@IMbackK (Contributor) commented Jan 16, 2025

There's also a snag in this PR regarding gfx90a: it reports 9.1 as major/minor, but its gcnArchName is gfx90a, which this PR won't parse correctly. The same goes for others like gfx90c.

So the current code is not correct, but this PR has too many issues to serve as an improvement as-is.
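
To illustrate the pitfall with a sketch (again, not the code in this PR): the trailing stepping character in IDs such as gfx90a or gfx90c is not a decimal digit, so a plain numeric parse stops early. Treating the last character as a hexadecimal digit handles those cases:

```cpp
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <string>

// Sketch: parse "gfx<major><minor><stepping>" where the stepping may be a
// letter, e.g. "gfx90a" -> major 9, minor 0, stepping 10. Not the PR's code.
static bool parse_gcn_arch_name(const char * name, int & major, int & minor, int & stepping) {
    if (std::strncmp(name, "gfx", 3) != 0) {
        return false;
    }
    const char * p   = name + 3;
    const size_t len = std::strcspn(p, ":");   // ignore ":xnack+/-" and ":sramecc+/-" suffixes
    if (len < 3) {
        return false;
    }
    const char s = (char) std::tolower((unsigned char) p[len - 1]);  // stepping: 0-9 or a-f
    const char m = p[len - 2];                                       // minor: 0-9
    if (!std::isxdigit((unsigned char) s) || !std::isdigit((unsigned char) m)) {
        return false;
    }
    stepping = std::isdigit((unsigned char) s) ? s - '0' : s - 'a' + 10;
    minor    = m - '0';
    char * end = nullptr;
    const std::string maj(p, len - 2);                               // "9", "10", "11", ...
    major = (int) std::strtol(maj.c_str(), &end, 10);
    return end != maj.c_str() && *end == '\0' && major > 0;
}
```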

@Haus1 (Author) commented Jan 16, 2025

It appears this returns the full target ID as defined in https://github.com/ROCm/clr/blob/amd-staging/rocclr/device/device.cpp around line 125. This will need to be expanded upon in order to parse out the xnack status and to handle the addition of generics.

If it were possible to retrieve the version stepping directly, that would be preferable to parsing it out of a string. Would the xnack status be of any use here, or can it just be ignored?
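
For reference, a full target ID of that form looks like gfx90a:sramecc+:xnack-, so splitting off the base architecture and the xnack flag could be sketched as follows (the struct and function names are made up for illustration, not part of this PR):

```cpp
#include <string>

// Sketch: split a target ID such as "gfx90a:sramecc+:xnack-" into the base
// architecture and an optional xnack flag. Names are hypothetical.
struct amd_target_id {
    std::string base;       // e.g. "gfx90a"
    int         xnack = 0;  // +1 for xnack+, -1 for xnack-, 0 if unspecified
};

static amd_target_id split_target_id(const std::string & id) {
    amd_target_id t;
    t.base = id.substr(0, id.find(':'));  // substr(0, npos) keeps the whole string
    if (id.find(":xnack+") != std::string::npos) { t.xnack = +1; }
    if (id.find(":xnack-") != std::string::npos) { t.xnack = -1; }
    return t;
}
```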

@IMbackK (Contributor) commented Jan 16, 2025

xnack can be ignored since we don't use hipMallocManaged-allocated memory. Outside of the user recompiling the whole ROCm stack with non-default flags, only gfx942 and gfx90a can end up in xnack+ mode.

@Haus1 (Author) commented Jan 16, 2025

Yeah, they certainly don't make enabling xnack easy. On Linux, the kernel module also needs to be patched to prevent it from rejecting the device.

@Haus1 force-pushed the amd-rework-version branch from 7bd1195 to 468296f on January 17, 2025 at 22:47
@github-actions bot added the script (Script related), testing (Everything test related), and python (python script changes) labels on Jan 17, 2025
@Haus1 force-pushed the amd-rework-version branch from 468296f to 9620bce on January 17, 2025 at 22:54
@Haus1 (Author) commented Jan 17, 2025

This will now work with all the IDs AMD has in staging and will gracefully fall back to the old way if it fails. Please let me know if I've missed anything.

Would it be better to submit backend changes like this to ggml first?
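
A rough sketch of the fallback structure described above, reusing the parser sketched earlier in the thread; the exact encoding of the returned value is an assumption for illustration, not what the PR actually does:

```cpp
// Sketch: prefer the value parsed from gcnArchName and fall back to the
// (truncated) major/minor pair if parsing fails. The 0x100*major + 0x10*minor
// + stepping encoding is assumed for illustration only.
static int amd_arch_id(const hipDeviceProp_t & props) {
    int major = 0, minor = 0, stepping = 0;
    if (parse_gcn_arch_name(props.gcnArchName, major, minor, stepping)) {
        return major * 0x100 + minor * 0x10 + stepping;   // e.g. gfx90a -> 0x90a
    }
    return props.major * 0x100 + props.minor * 0x10;      // old behaviour, minor truncated
}
```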

The value provided by minor is truncated for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.

We can also use the common value for GCN4, gfx800, to avoid missing compatible devices.
@Haus1 force-pushed the amd-rework-version branch from 9620bce to f77ea24 on January 18, 2025 at 21:05