Rationalize compute capability arguments in makefiles #3

Open
wants to merge 1 commit into base: main

@KarlLudwig3485 KarlLudwig3485 commented Oct 15, 2024

The extra {x.y | x >= 6 and y > 0} lines only serve to increase compile times and executable size.
Also, current CUDA still supports CC 5.x, so that should still be included.

I would propose to keep the old CC lines in a commented out state, with the default being for the current CUDA version.

The old ones still work, as I have tested CUDA 5.0 builds (32 and 64-bit) with CC 1.1 on a Windows XP laptop with R304 drivers and a Tesla GPU.

Removes superfluous {x.y | y > 0} args, adds comments to CC 6+ lines,
and removes trailing space on CC 3.0 line.

Also uncomments CC 5.0 line in win64 and linux makefiles,
as current CUDA 12.6 still supports CC 5.x (Maxwell).
@brubsby
Collaborator

brubsby commented Oct 23, 2024

I couldn't quite determine who was responsible for adding the "two numbered" compute capabilities to the makefiles, but the fact that CC 3.5 seemed to "unlock" some functionality that enabled a speedup made me think I shouldn't delete all of the CC x.y (y != 0) lines, as I didn't have a good way of checking that they don't give speedups. So I don't really want to remove them if the only downside is a slightly larger binary. However, I do want to add CC 5.0; I was just tricked into thinking it wasn't supported because someone else had commented 5.0 out in a distribution.

Is that reasonable?

@KarlLudwig3485
Author

The reason it builds a separate kernel for CC 3.5 instead of using the one for CC 3.0 is documented in the Makefile's comments, albeit very lightly.

# NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code 
# NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc

The only reason this "unlocks" functionality is because of an #if statement in my_intrinsics.h.

Adding more CC x.y arguments would only improve performance if code was written to take advantage of any newer features that they support.
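
A minimal sketch of the kind of architecture guard being described, assuming a hypothetical helper named `shift_left_hi`; the actual code in my_intrinsics.h is not reproduced here and may differ. The `__CUDA_ARCH__ >= 350` threshold is an assumption chosen to match the compute_35 line, and `__funnelshift_l` is the CUDA funnel-shift intrinsic. When built with a plain C++ compiler, only the portable path is compiled.

```cpp
#include <cstdint>

// Let the same source build with nvcc or with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not taken from mfaktc): returns the upper 32 bits
// of the 64-bit value (hi:lo) shifted left by n bits, with n taken mod 32.
HOST_DEVICE inline std::uint32_t shift_left_hi(std::uint32_t lo,
                                               std::uint32_t hi,
                                               std::uint32_t n)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    // Device code built for a target with funnel shift: one instruction.
    return __funnelshift_l(lo, hi, n);
#else
    // Portable fallback for older targets and for host code.
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((wide << (n & 31u)) >> 32);
#endif
}
```

With a guard like this, the fast path is only emitted for targets that satisfy the `#if`, which is why a separate `--generate-code` line for that architecture changes the generated kernel. A target whose extra features are not referenced by any such guard would compile the same source path as its predecessor.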

That is something I would be interested in seeing, but I have zero experience programming CUDA, so that's something I still need to research.

TODO:

  • Familiarise myself with mfaktc source code
  • Learn CUDA
  • Make enormous performance improvements
  • ???
  • Profit

@brubsby
Collaborator

brubsby commented Oct 24, 2024

> The only reason this "unlocks" functionality is because of an #if statement in my_intrinsics.h.

I wasn't aware of this bit of code before, thank you.

> Adding more CC x.y arguments would only improve performance if code was written to take advantage of any newer features that they support.

It's unclear if the CUDA compiler also takes advantage of the features of the "minor" CC version behind the scenes. This software is meant to be performance optimized, so even the possibility of a performance improvement outweighs the cost of a slightly larger binary, imo.

You're more than welcome to compile a smaller binary with just the CC for the cards you're using, if binary size is that important to you.

@KarlLudwig3485
Author

> It's unclear if the CUDA compiler also takes advantage of the features of the "minor" CC version behind the scenes.

I admit, I don't really have anything to back up my statement, except a vague "vibe" (to use a neologism) I get from the comments on the CC1.1 to 5.0 args.

> You're more than welcome to compile a smaller binary with just the CC for the cards you're using, if binary size is that important to you.

My motivation isn't necessarily binary size, but rather for the makefile to look "pretty". This is entirely irrational, of course.

I can compare CC 6.0 with CC 6.1 on my laptop's GPU, so I'll see if it makes any difference.

@KarlLudwig3485
Author

KarlLudwig3485 commented Oct 25, 2024

> I can compare CC 6.0 with CC 6.1 on my laptop's GPU, so I'll see if it makes any difference.

M174241147 TF76-77
CC 6.0 - 173.016 GHz-d/day
CC 6.1 - 173.175 GHz-d/day

NVIDIA GeForce GTX 1060 Mobile with a 24W power limit.
It was the same config, same assignment, with nothing else running, averaged over several hours.

This is only anecdotal evidence, of course, but in this instance there was no meaningful performance difference.
I might rerun this test with a different assignment, and see if the result is similar.
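
To put a number on "no meaningful difference", the two throughput figures can be compared as a percentage. This is only a small worked calculation; the function name `relative_diff_percent` is made up for illustration.

```cpp
// Percentage change of `candidate` relative to `baseline`.
inline double relative_diff_percent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}
```

For the run above, `relative_diff_percent(173.016, 173.175)` comes out at roughly 0.09%, which is small enough that it could plausibly be run-to-run noise rather than an effect of the build target.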
