-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
propagate() is 20x slower than built-in sparse matmul #106
Comments
Also compared with PyG 2.0.1, which shows good performance with a single node feature An interesting observation here is that, the PyG SpMM scales almost linearly with the size (# of rows) of (Recall that DGL performs well on both spmv and spmm cases, and is still the best reference) Reproducer: import numpy as np
import scipy.sparse as sp
import torch
from torch_geometric.data import Data
from torch_geometric.nn import MessagePassing
n = 1024
np.random.seed(0)
A = sp.random(n, n, density=0.01, format='coo')
b = np.random.rand(1, n)
B = np.random.rand(100, n)
edge_index = torch.tensor([A.row, A.col], dtype=torch.long)
data = Data(edge_index=edge_index, A=torch.tensor(A.data),
B=torch.tensor(B), b=torch.tensor(b))
class SpMVLayer(MessagePassing):
def __init__(self):
super().__init__(aggr="add", node_dim=-1)
def forward(self, x, edge_index, A):
return self.propagate(edge_index, x=x, A=A)
def message(self, x_j, A):
m = x_j * A
return m
spmv = SpMVLayer()
BA_gnn = spmv(data.B, data.edge_index, data.A)
np.array_equal(BA_gnn.numpy(), B @ A) # True
%timeit spmv(data.B, data.edge_index, data.A) # ~10ms, 10x slower than SciPy
bA_gnn = spmv(data.b, data.edge_index, data.A)
np.array_equal(bA_gnn.numpy(), b @ A) # True
%timeit spmv(data.b, data.edge_index, data.A) # ~100us, as fast as SciPy |
Thanks, this benchmark is very useful. For this specific operation of the function spmm_copyxj_fused(g)
propagate(
copy_xj,
g, +; xj=g.ndata.B
)
end
function spmm_copyxj_unfused(g)
propagate(
(xi, xj, e) -> xj,
g, +; xj=g.ndata.B
)
end
Adj = map(x -> x > 0 ? 1 : 0, A)
@assert spmm_copyxj_unfused(g) ≈ B * Adj
@assert spmm_copyxj_fused(g) ≈ B * A # bug for weighted graphs fixed in #107
@btime spmm_copyxj_fused(g) # 268.614 μs (22 allocations: 1.13 MiB)
@btime spmm_copyxj_unfused(g) # 4.263 ms (52855 allocations: 12.23 MiB)
@btime B * Adj # 196.135 μs (2 allocations: 800.05 KiB) We can get more generic fusing integrating As to why the unfused operation based on gather/scatter are so slow and if can be improved, this has to be analyzed |
Thanks, this seems a very promising direction -- is there a dedicated issue to plan and track it? #19 ? |
yes we could open a separate issue. I'm not familiar yet with SuiteSparseGraphBLAS so I didn't think about a proper integration plan. The first step would just be a drop-in sparse matrix replacement as in #19 I guess |
I'm beginning to work on this area again. The primary issue is still missing rrules for the min/max reduction cases. But I'll make a PR without those by the end of the week. Those missing rrules are possible, and I'm tackling that issue from 3-4 directions. |
We don't necessarily have to takle all together. We can dispatch just on + at the beginning and use gather/scatter for the other reductions |
With #107 the example in the OP now gives good results function spmm_unfused(g)
propagate(
(xi, xj, e) -> e .* xj , # same as e_mul_xj
g, +; xj=g.ndata.B, e=g.edata.A
)
end
function spmm_fused(g)
propagate(
e_mul_xj,
g, +; xj=g.ndata.B, e=vec(g.edata.A)
)
end
@assert isequal(spmm_fused(g), B * A) # true
@assert isequal(spmm_unfused(g), B * A) # true
@btime spmm_fused(g) # 271.228 μs (33 allocations: 1.21 MiB)
@btime spmm_unfused(g) # 4.324 ms (52855 allocations: 12.23 MiB)
@btime B * A # 196.115 μs (2 allocations: 800.05 KiB) |
Confirmed the speed-up from fusion with #107. Thanks for such quick response.
Also notice that, the speed-up is gone with passing Nonetheless, the case of multiple edge features could have a large room for performance optimization. A simple reasoning is: we know that the special case 3D sparse tensor is outside of standard GraphBLAS, so I doubt if SuiteSparseGraphBLAS.jl works for such case? |
For multiple edge features, one could use |
No it doesn't. We would flatten when using GraphBLAS for that case.
That's what TACO generating Julia is for :), which is pretty close!
Yes I think this will work fine for now. I'll have a TACO.jl and associated packages to provide those eventually, hopefully in the next few months. We can take care of speeding up My plan right now: SuiteSparse:GraphBLAS where it's possible, TACO everywhere else. SS:GrB will be faster in most cases (often significantly), but TACO will provide backup. |
Also one other thing I forgot to mention. In the medium-term (probably 1-a few years) SuiteSparse:GraphBLAS will support fast user defined datatypes. This makes it pretty easy to add support for multidimensional edge features by just having a tuple/vector type. I will be exposing the current functionality to do so shortly in SS:GrB, although user defined functions/datatypes are currently quite slow since they use pointers for everything. So we will hopefully have two backends with different tradeoffs: TACO.jl and SS:GrB, both of which should support the full set of computations required with different speed tradeoffs depending on the computation. |
Just notice a promising new work Compiler Support for Sparse Tensor Computations in MLIR. Table 1-4 contain comparison to TACO. Worth considering. |
Ooh I'm glad that finally came out, Aart has been working on that dialect for a while. |
With the well-known graph-matrix duality (see GraphBLAS intro, Fig. 1), simple graph message passing kernels are equivalent to sparse matrix-vector multiplication (SpMV) or sparse matrix-matrix multiplication (SpMM). However, I notice that
propagate()
is more than 20x slower than the built-inA * B
for an equivalent operation. I did the same test with DGL, and did not observe such drastic slow down.To reproduce
Such performance gap can't be explained by storing the sparse matrix in COO (GNN libraries' default) vs CSR (SciPy default) vs CSC (Julia default). In the code below, changing scipy matrix format has minor effect on speed. Also, the speed of DGL and SciPy are similar.
Compare with DGL and SciPy
Effect of fusion
DGL's
update_all
fuses the message and reduction kernels. To mimic the two-stagepropagate
, and see if fusion causes such performance difference:The unfused version is only slightly slower. It cannot explain the 5ms vs 200us performance gap.
There must be other causes of inefficiency. I'd like to figure it out and bring the performance at least close to DGL. (I use DGL a lot, but there are certain projects that favor all-Julia implementation, and your package seems a good option with a clean syntax🙂 )
Package version
Jl:
Py:
The text was updated successfully, but these errors were encountered: