
Graphblast with sms >= 70 #9

Open
jesunsahariar opened this issue Apr 23, 2020 · 6 comments

Comments

@jesunsahariar

Hello,

Thank you for hosting Graphblast on a public repo to help the research community.

I was wondering whether there is any plan to get GraphBLAST working on the latest SM architectures. I am finding the moderngpu version leveraged by GraphBLAST somewhat challenging to get working on newer SMs. I tried putting some patches into the moderngpu version currently used by GraphBLAST, in particular for the synchronization primitives (mostly the shuffles and ballots suggested by @neoblizz in the mgpu repo), but I am encountering hangs in algorithms such as BFS with medium-sized matrices.

I would really appreciate any insight. Thanks in advance!

@neoblizz
Member

neoblizz commented Apr 23, 2020

Just a note, @YuxinxinChen might have looked into this in the past.

@jesunsahariar
Author

Replacing the intrinsics.cuh file here: https://github.com/ctcyang/moderngpu/blob/9e491c383e935c2cbc0279350640dad3febb8b9d/include/device/intrinsics.cuh
with the intrinsics.cuh here: https://github.com/moderngpu/moderngpu/blob/5029d38cab83492d8091cce5902c077ab3ca72a9/include/device/intrinsics.cuh
might solve the problem.
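
For context, a minimal sketch of what such a patch changes (illustrative only, not the actual moderngpu code): the pre-Volta warp intrinsics take no mask, while the `_sync` variants required on sm_70+ take an explicit mask of participating threads. The wrapper names below are hypothetical.

```cuda
// Illustrative sketch only -- not the actual moderngpu intrinsics.cuh.
// Pre-Volta warp intrinsics assumed implicit warp-synchronous execution:
//   unsigned b = __ballot(pred);
//   int      y = __shfl_up(x, delta);
// From CUDA 9 onward (and required on sm_70+), the _sync variants take an
// explicit mask of the lanes participating in the operation:

__device__ int shfl_up_step(int x, unsigned delta) {
    unsigned mask = __activemask();          // lanes currently active in this warp
    return __shfl_up_sync(mask, x, delta);   // value from lane (laneid - delta)
}

__device__ unsigned warp_vote(int pred) {
    // Bitmask with one bit set per participating lane where pred != 0.
    return __ballot_sync(__activemask(), pred);
}
```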

That's pretty much what I did with @neoblizz's patches for moderngpu. Were you able to run any GraphBLAST graph kernel on a reasonably sized matrix with sm>=70 once you replaced the intrinsics.cuh file? If so, could you please let me know which kernels you were able to run, and perhaps the inputs you used? Thanks in advance for your reply.

@neoblizz
Member

@jsfiroz it may be helpful to see what specific error you got. :)

@jesunsahariar
Author

Hi,

Apologies for the late response. Here are the details of the modifications I made and the problem I am currently encountering:

I modified the following moderngpu files so they compile and run on devices with sm>=70 (the changes are mostly related to ballot and shfl):

include/device/ctascan.cuh 
include/device/ctasegscan.cuh 
include/device/intrinsics.cuh

I also made some changes to the CMake file, since I needed relocatable device code to be generated here:

Next, I compiled and ran gbfs on the delaunay_n10 matrix (downloaded from https://sparse.tamu.edu/DIMACS10/delaunay_n10) with the following command:

./bin/gbfs --timing 0 --earlyexit 1 --mxvmode 0 --struconly 1 --niter 1 --opreuse 1  --debug 1 graphblast/data/mydata/delaunay_n10/delaunay_n10.mtx 

The program hangs, presumably in one of the branches here:

I am running on a GeForce RTX 2080 Ti GPU.

Any feedback would be greatly appreciated. Please let me know if I can provide any additional information. Thanks in advance!

Partial output follows; the program hangs in one of the aforementioned branches (perhaps something related to the ReduceByKey operation in mgpu?):

===Begin assign===
Input: 1
Executing assignDense
Mask: 1
Accum:0
SCMP: 0
Repl: 0
Tran: 0
mask_ind:
[0]:0
mask_val:
[0]:1
w_val:
[0]:1 [1]:0 [2]:0 [3]:0 [4]:0 [5]:0 [6]:0 [7]:0 [8]:0 [9]:0 [10]:0 [11]:0 [12]:0 [13]:0 [14]:0 [15]:0 [16]:0 [17]:0 [18]:0 [19]:0 [20]:0 [21]:0 [22]:0 [23]:0 [24]:0 [25]:0 [26]:0 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:0 [33]:0 [34]:0 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
===End assign===
val:
[0]:1 [1]:0 [2]:0 [3]:0 [4]:0 [5]:0 [6]:0 [7]:0 [8]:0 [9]:0 [10]:0 [11]:0 [12]:0 [13]:0 [14]:0 [15]:0 [16]:0 [17]:0 [18]:0 [19]:0 [20]:0 [21]:0 [22]:0 [23]:0 [24]:0 [25]:0 [26]:0 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:0 [33]:0 [34]:0 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
===Begin vxm===
ind:
[0]:0
val:
[0]:1
Load balance mode: 2
Identity: 0
Sparse format: 0
Symmetric: 0
u_vec_type: 1
Executing Spmspv MERGE
In structure only mode
Mask: 1
Accum:0
SCMP: 0
Repl: 0
Tran: 1
NT: 128 NB: 1
d_temp_nvals:
[0]:9
d_scan:
[0]:0 [1]:9
u_nvals: 1
w_nvals: 9
SwapInd:
[0]:64 [1]:242 [2]:299 [3]:301 [4]:302 [5]:303 [6]:305 [7]:315 [8]:317
1 bytes required!
TempInd:
[0]:64 [1]:242 [2]:299 [3]:301 [4]:302 [5]:303 [6]:305 [7]:315 [8]:317

@ctcyang
Collaborator

ctcyang commented Apr 29, 2020

Thanks @neoblizz and @YuxinxinChen for your help!

With regards to your issue @jsfiroz, as far as I can tell, the problem is that moderngpu does not support the new _sync variants of CUDA 10.1 and up. If I try that dataset, it gets the d_scan result wrong, which is an output of mgpu::Scan (I tried with both mgpu::Scan and my modified mgpu::ScanPrealloc, and neither of them gets the right answer). My guess is that moderngpu relies on some assumptions regarding synchronization that are no longer met with the new _sync variants. As a result, some of its public methods, like Scan or ReduceByKey, do not give the right result.
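
A hedged illustration of the kind of assumption that can break (this is a generic pattern, not moderngpu's actual code): pre-Volta libraries often relied on all 32 lanes of a warp executing in lockstep, which Volta's independent thread scheduling no longer guarantees, so shared-memory warp scans need explicit warp barriers.

```cuda
// Illustrative only. Classic warp-synchronous shared-memory inclusive scan,
// which assumed lockstep execution within a warp (safe on sm_3x/5x/6x):
__device__ int old_warp_scan(volatile int* s, int lane) {
    for (int d = 1; d < 32; d <<= 1)
        if (lane >= d) s[lane] += s[lane - d];  // no barrier: relies on lockstep
    return s[lane];
}

// On sm_70+, lanes may make independent progress, so each read/write step
// needs an explicit warp barrier (__syncwarp) to stay correct:
__device__ int new_warp_scan(volatile int* s, int lane) {
    for (int d = 1; d < 32; d <<= 1) {
        int t = (lane >= d) ? s[lane - d] : 0;
        __syncwarp();       // all lanes finish reading before anyone writes
        if (lane >= d) s[lane] += t;
        __syncwarp();       // all writes visible before the next iteration
    }
    return s[lane];
}
```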

The easiest solution I can think of is compiling with the architecture set to sm_70 and setting the environment variable CUDA_HOME=/usr/local/cuda-10.0 (or lower). With that, I get the correct solution:
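
A sketch of that workaround (the CUDA install path and the `ARCH` variable name are illustrative; check your own install location and the project's build files for the actual knob):

```shell
# Point the build at CUDA 10.0 (path is illustrative):
export CUDA_HOME=/usr/local/cuda-10.0
export PATH="$CUDA_HOME/bin:$PATH"        # make sure the matching nvcc is found

# Rebuild with the target architecture set to sm_70, e.g. via an nvcc
# -gencode flag in the Makefile/CMake configuration (variable name assumed):
#   ARCH="-gencode=arch=compute_70,code=sm_70"
make clean && make -j
```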

./bin/gbfs --timing 0 --earlyexit 1 --mxvmode 0 --struconly 1 --niter 1 --opreuse 1 --debug 0 /data/gunrock_dataset/large/delaunay_n10/delaunay_n10.mtx
Undirected due to mtx: 1
Undirected due to cmd: 0
Undirected: 1
Remove self-loop: 1
Reading /data/gunrock_dataset/large/delaunay_n10/.delaunay_n10.mtx.ud.nosl.bin
Allocate 1025
Allocate 7334
Allocate 7334
CPU BFS finished in 0.032187 msec. Search depth is: 18

CORRECT
cpu, 0.0460148,
warmup, 2.14911, 0
tight, 1.70672
vxm, 1.97005

CORRECT
