Rationalize compute capability arguments in makefiles #3

KarlLudwig3485 · 2024-10-15T20:46:59Z

The extra {x.y | x >= 6 and y > 0} lines only serve to increase compile times and executable size.
Also, current CUDA still supports CC 5.x, so that should still be included.

I propose keeping the old CC lines in a commented-out state, with the default being for the current CUDA version.

The old ones still work, as I have tested CUDA 5.0 builds (32 and 64-bit) with CC 1.1 on a Windows XP laptop with R304 drivers and a Tesla GPU.

Removes superfluous {x.y | y > 0} args, adds comments to CC 6+ lines, and removes trailing space on CC 3.0 line. Also uncomments CC 5.0 line in win64 and linux makefiles, as current CUDA 12.6 still supports CC 5.x (Maxwell).

brubsby · 2024-10-23T17:00:26Z

I couldn't quite determine who was responsible for adding the "two numbered" compute capabilities to the makefiles, but the fact that CC 3.5 seemed to "unlock" some functionality that enabled a speedup made me think I shouldn't delete all of the CC x.y (y != 0) lines, as I didn't have a good way of checking that they don't give speedups. So I don't really want to remove them if the only downside is a slightly larger binary. However, I do want to add CC 5.0; I was just tricked into thinking it wasn't supported because someone else had commented 5.0 out in a distribution.

Is that reasonable?

KarlLudwig3485 · 2024-10-23T18:37:46Z

The reason it builds a separate kernel for CC 3.5 instead of using the one for CC 3.0 is documented in the Makefile's comments, albeit very lightly.

# NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code 
# NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc

The only reason this "unlocks" functionality is because of an #if statement in my_intrinsics.h.

Adding more CC x.y arguments would only improve performance if code was written to take advantage of any newer features that they support.

That is something I would be interested in seeing, but I have zero experience programming CUDA, so that's something I still need to research.

TODO:

Familiarise myself with mfaktc source code
Learn CUDA
Make enormous performance improvements
???
Profit
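
A minimal sketch of the kind of guard described above, for readers who have not seen one. This is not the actual content of my_intrinsics.h: the helper name `shl_hi32` and the `HOST_DEVICE` macro are hypothetical. It relies only on the `__CUDA_ARCH__` macro (350 when nvcc compiles device code for sm_35) and the CUDA `__funnelshift_l` intrinsic. Compiled as plain C, or for an architecture below 3.5, the portable fallback is used instead.

```c
#include <stdint.h>

/* Hypothetical macro: expands to nothing when this file is not compiled by nvcc. */
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

/*
 * Hypothetical helper, NOT copied from mfaktc.
 * Treats hi:lo as one 64-bit value, shifts it left by (n mod 32) bits
 * and returns the upper 32 bits of the result.
 */
HOST_DEVICE static inline uint32_t shl_hi32(uint32_t lo, uint32_t hi, uint32_t n)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    /* Device code for CC 3.5 or newer: a single funnel-shift instruction. */
    return __funnelshift_l(lo, hi, n);
#else
    /* Host code and older device code: two shifts and an OR. */
    n &= 31u;
    if (n == 0u) {
        return hi; /* avoid the undefined shift by 32 below */
    }
    return (hi << n) | (lo >> (32u - n));
#endif
}
```

With a guard like this, a `--generate-code arch=compute_35,code=sm_35` line matters only because the source contains a branch keyed on that architecture. An extra x.y target that no `#if` refers to compiles the same source path as its x.0 neighbour, although the compiler may still schedule or select instructions differently.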

brubsby · 2024-10-24T15:35:58Z

> The only reason this "unlocks" functionality is because of an #if statement in my_intrinsics.h.

I wasn't aware of this bit of code before, thank you.

> Adding more CC x.y arguments would only improve performance if code was written to take advantage of any newer features that they support.

It's unclear if the CUDA compiler also takes advantage of the features of the "minor" CC version behind the scenes. This software is meant to be performance optimized, so even the possibility of a performance improvement outweighs the cost of a slightly larger binary, imo.

You're more than welcome to compile a smaller binary with just the CC for the cards you're using, if binary size is that important to you.

KarlLudwig3485 · 2024-10-24T17:07:25Z

> It's unclear if the CUDA compiler also takes advantage of the features of the "minor" CC version behind the scenes.

I admit, I don't really have anything to back up my statement, except a vague "vibe" (to use a neologism) I get from the comments on the CC1.1 to 5.0 args.

> You're more than welcome to compile a smaller binary with just the CC for the cards you're using, if binary size is that important to you.

My motivation isn't necessarily binary size, but rather for the makefile to look "pretty". This is entirely irrational, of course.

I can compare CC 6.0 with CC 6.1 on my laptop's GPU, so I'll see if it makes any difference.

KarlLudwig3485 · 2024-10-25T17:58:22Z

> I can compare CC 6.0 with CC 6.1 on my laptop's GPU, so I'll see if it makes any difference.

M174241147 TF76-77
CC 6.0 - 173.016 GHz-d/day
CC 6.1 - 173.175 GHz-d/day

NVIDIA GeForce GTX 1060 Mobile with a 24W power limit.
It was the same config, same assignment, with nothing else running, averaged over several hours.

This is only anecdotal evidence, of course, but in this instance there was no meaningful performance difference.
I might rerun this test with a different assignment, and see if the result is similar.
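
To put the two throughput figures above on a common scale, the relative difference can be computed as in the sketch below. The helper name `relative_gain_percent` is hypothetical and not part of mfaktc; it only restates the arithmetic.

```c
/*
 * Hypothetical helper for illustration only.
 * Returns how much faster `candidate` is than `baseline`, in percent.
 * Both arguments are throughput values in the same unit (here GHz-d/day).
 */
static double relative_gain_percent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}
```

Calling `relative_gain_percent(173.016, 173.175)` gives roughly 0.09, i.e. the CC 6.1 build was about 0.09% faster in this single run, which is small enough to be plausibly explained by run-to-run variation.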

Rationalize compute capability arguments in makefiles

8f65d2e

Removes superfluous {x.y | y > 0} args, adds comments to CC 6+ lines, and removes trailing space on CC 3.0 line. Also uncomments CC 5.0 line in win64 and linux makefiles, as current CUDA 12.6 still supports CC 5.x (Maxwell).
