Description
Intro
I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.
Implementations
Below are "user defined function" implementations of kprod() (arbitrary inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported section here), I also wrote *_blocked versions of each, where I gather the idea is that blocked computation achieves better memory locality.
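To make the blocked idea concrete before the Stan code, here's a quick NumPy sketch (my own, not part of the Stan functions) of the same block structure, checked against NumPy's reference np.kron:

```python
import numpy as np

def kprod_blocked_np(A, B):
    # Each (i, j) block of the Kronecker product Z is the scalar A[i, j]
    # times all of B; the *_blocked functions assign it in one slice
    # rather than scalar-by-scalar writes scattered across Z.
    rA, cA = A.shape
    rB, cB = B.shape
    Z = np.empty((rA * rB, cA * cB))
    for i in range(rA):
        for j in range(cA):
            Z[i * rB:(i + 1) * rB, j * cB:(j + 1) * cB] = A[i, j] * B
    return Z

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 5.0], [6.0, 7.0]])
assert np.array_equal(kprod_blocked_np(A, B), np.kron(A, B))
```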
Here's the collection of functions:
matrix kprod(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  for (i in 1:n_A) {
    for (j in 1:n_A) {
      for (p in 1:n_B) {
        for (q in 1:n_B) {
          int row = (i - 1) * n_B + p ;
          int col = (j - 1) * n_B + q ;
          Z[row, col] = A[i, j] * B[p, q] ;
        }
      }
    }
  }
  return Z ;
}
matrix kprod_cov(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // Loop over the upper triangular part of Z (includes diagonal)
  for (row in 1:n_Z) {
    int i = ((row - 1) %/% n_B) + 1 ;
    int p = ((row - 1) % n_B) + 1 ;
    for (col in row:n_Z) {
      int j = ((col - 1) %/% n_B) + 1 ;
      int q = ((col - 1) % n_B) + 1 ;
      // Compute the Kronecker product element
      Z[row, col] = A[i, j] * B[p, q] ;
      // Fill the symmetric counterpart
      // (redundant when row == col, but probably not worth a separate
      // loop just to copy the off-diagonal)
      Z[col, row] = Z[row, col] ;
    }
  }
  return Z ;
}
matrix kprod_corr(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // Set diagonal elements to 1
  for (row in 1:n_Z) {
    Z[row, row] = 1.0 ;
  }
  // Loop over the upper triangular off-diagonal elements
  for (row in 1:(n_Z - 1)) {
    int i = ((row - 1) %/% n_B) + 1 ;
    int p = ((row - 1) % n_B) + 1 ;
    for (col in (row + 1):n_Z) {
      int j = ((col - 1) %/% n_B) + 1 ;
      int q = ((col - 1) % n_B) + 1 ;
      // Compute the product of corresponding elements
      Z[row, col] = A[i, j] * B[p, q] ;
      // Fill the symmetric counterpart
      Z[col, row] = Z[row, col] ;
    }
  }
  return Z ;
}
matrix kprod_blocked(matrix A, matrix B) {
  int r_A = rows(A) ;
  int c_A = cols(A) ;
  int r_B = rows(B) ;
  int c_B = cols(B) ;
  matrix[r_A * r_B, c_A * c_B] Z ;
  for (i in 1:r_A) {
    for (j in 1:c_A) {
      // Compute the starting row and column indices in Z for the current block
      int row_start = (i - 1) * r_B + 1 ;
      int col_start = (j - 1) * c_B + 1 ;
      // Assign the block to the appropriate position in Z
      Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
    }
  }
  return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // First loop: handle diagonal blocks (i == j)
  for (i in 1:n_A) {
    int row_start = (i - 1) * n_B + 1 ;
    int col_start = (i - 1) * n_B + 1 ;
    Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
  }
  // Second loop: handle off-diagonal blocks (i < j)
  for (i in 1:(n_A - 1)) {
    for (j in (i + 1):n_A) {
      matrix[n_B, n_B] block = A[i, j] * B ;
      int row_i = (i - 1) * n_B + 1 ;
      int col_j = (j - 1) * n_B + 1 ;
      int row_j = (j - 1) * n_B + 1 ;
      int col_i = (i - 1) * n_B + 1 ;
      // Assign block to position (i, j)
      Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
      // Assign symmetric block to position (j, i)
      Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
    }
  }
  return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // First loop: handle diagonal blocks (i == j); since A is a
  // correlation matrix, A[i, i] == 1 and each diagonal block is just B
  for (i in 1:n_A) {
    int row_start = (i - 1) * n_B + 1 ;
    int col_start = (i - 1) * n_B + 1 ;
    Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
  }
  // Second loop: handle off-diagonal blocks (i < j)
  for (i in 1:(n_A - 1)) {
    for (j in (i + 1):n_A) {
      matrix[n_B, n_B] block = A[i, j] * B ;
      int row_i = (i - 1) * n_B + 1 ;
      int col_j = (j - 1) * n_B + 1 ;
      int row_j = (j - 1) * n_B + 1 ;
      int col_i = (i - 1) * n_B + 1 ;
      // Assign block to position (i, j)
      Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
      // Assign symmetric block to position (j, i)
      Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
    }
  }
  return Z ;
}

Benchmarks
I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:
stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
  stan_threads = FALSE
  , STAN_CPP_OPTIMS = TRUE
  , STAN_NO_RANGE_CHECKS = TRUE
  , CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2-65 range as well as 127-129):

Benchmark take-aways
- So far there seems to be little benefit of the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
- The non-blocked kprod_cov performs better than kprod at the smallest input sizes, but eventually falls to roughly equal performance at larger input sizes.
- The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but reaches equal performance by moderate input sizes and improves monotonically (if asymptotically) thereafter, yielding large performance benefits at large input sizes.
- kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but the latter still seems to have an edge, especially under default compilation options.
- There are spikes of sudden performance change (relative to kprod) in many of the kprod alternatives at N = 64 and N = 128 (values of N = 63, 65, 127, and 129 are also present in the graph, showing that the spikes occur at 64 and 128 specifically). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike lower.
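One note on the ceiling for these savings (my own back-of-envelope count, not from the benchmarks): the symmetric variants still write every entry of Z and only skip the multiply on the lower triangle, so the arithmetic saving is bounded by roughly 2x:

```python
def multiply_counts(n_Z):
    # Full fill: one multiply per entry of the n_Z x n_Z result.
    full = n_Z * n_Z
    # Symmetric fill: multiplies only on the upper triangle
    # (diagonal included); the lower triangle is copied.
    upper = n_Z * (n_Z + 1) // 2
    return full, upper

full, upper = multiply_counts(64)
print(full, upper, full / upper)  # the ratio approaches 2 as n_Z grows
```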
Questions
I welcome any thoughts on this. Are these benchmarks even pertinent, or would a native implementation in Stan have performance characteristics that make benchmarking user-defined functions irrelevant? Also, given the minor performance gap seen here between kprod_blocked and kprod_cov_blocked, maybe it makes sense to start by simply using the existing-but-unsupported Eigen implementation, which is akin to kprod_blocked?