
Kronecker Products #1454

@mike-lawrence

Description

Intro

I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.
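To make the symmetries concrete: if A and B are both symmetric, then so is their Kronecker product, and if both have unit diagonals (correlation matrices), the product's diagonal is all ones as well — so roughly half the entries (and, in the correlation case, the whole diagonal) never need to be computed. A quick NumPy sketch of this, purely for illustration (not part of the Stan code below):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_corr(n, rng):
    """Build a random correlation matrix (symmetric, unit diagonal)."""
    X = rng.standard_normal((n, 2 * n))
    C = np.cov(X)                  # a covariance matrix
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)      # normalize to correlations

A = random_corr(3, rng)
B = random_corr(4, rng)
Z = np.kron(A, B)

# Symmetry of the inputs carries over to the Kronecker product ...
assert np.allclose(Z, Z.T)
# ... and unit diagonals do too: Z[k, k] = A[i, i] * B[p, p] = 1.
assert np.allclose(np.diag(Z), 1.0)
```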

Implementations

Below are "user defined function" implementations of kprod() (arbitrary inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported module here), I also wrote *_blocked versions of each; the idea, as I gather, is that blocked computation achieves better memory locality.

Here's the collection of functions:

matrix kprod(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	for (i in 1:n_A) {
		for (j in 1:n_A) {
			for (p in 1:n_B) {
				for (q in 1:n_B) {
					int row = (i - 1) * n_B + p ;
					int col = (j - 1) * n_B + q ;
					Z[row, col] = A[i, j] * B[p, q] ;
				}
			}
		}
	}
	return Z ;
}
matrix kprod_cov(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Loop over the upper triangular part of Z (includes diagonal)
	for (row in 1:n_Z) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in row:n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the Kronecker product element
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			// Note: this is redundant on the diagonal, but it's unclear whether a wholly separate loop just to copy the off-diagonal would pay off
			Z[col, row] = Z[row, col] ;
		}
	}
	return Z ;
}
matrix kprod_corr(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Set diagonal elements to 1
	for (row in 1:n_Z) {
		Z[row, row] = 1.0 ;
	}
	// Loop over the upper triangular off-diagonal elements
	for (row in 1:(n_Z - 1)) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in (row + 1):n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the product of corresponding elements
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			Z[col, row] = Z[row, col] ;
		}
	}
	return Z ;
}
matrix kprod_blocked(matrix A, matrix B) {
	int r_A = rows(A) ;
	int c_A = cols(A) ;
	int r_B = rows(B) ;
	int c_B = cols(B) ;
	matrix[r_A * r_B, c_A * c_B] Z ;
	for (i in 1:r_A) {
		for (j in 1:c_A) {
			// Compute the starting row and column indices in C for the current block
			int row_start = (i - 1) * r_B + 1 ;
			int col_start = (j - 1) * c_B + 1 ;
			// Assign the block to the appropriate position in C
			Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
		}
	}
	return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), row_j:(row_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), row_i:(row_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), row_j:(row_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), row_i:(row_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
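As a sanity check on the index arithmetic above (row = (i - 1) * n_B + p, with i and p recovered via integer division and modulus), here's a direct Python translation of kprod_cov's upper-triangle loop cross-checked against NumPy's np.kron. This is a verification sketch only, not part of the Stan implementation:

```python
import numpy as np

def kprod_cov_py(A, B):
    """Python translation of kprod_cov's upper-triangle loop (1-based, mirroring Stan)."""
    n_A, n_B = A.shape[0], B.shape[0]
    n_Z = n_A * n_B
    Z = np.empty((n_Z, n_Z))
    for row in range(1, n_Z + 1):
        i = (row - 1) // n_B + 1   # Stan's (row - 1) %/% n_B + 1
        p = (row - 1) % n_B + 1    # Stan's (row - 1) %  n_B + 1
        for col in range(row, n_Z + 1):
            j = (col - 1) // n_B + 1
            q = (col - 1) % n_B + 1
            # Compute the element and mirror it below the diagonal
            Z[row - 1, col - 1] = A[i - 1, j - 1] * B[p - 1, q - 1]
            Z[col - 1, row - 1] = Z[row - 1, col - 1]
    return Z

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A @ A.T   # symmetric "covariance" inputs
B = rng.standard_normal((4, 4)); B = B @ B.T
assert np.allclose(kprod_cov_py(A, B), np.kron(A, B))
```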

Benchmarks

I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:

stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as the baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2–65 range, as well as 127–129):

[Plot: relative performance of each implementation vs. input size, with kprod() as baseline]

Benchmark take-aways

  • So far there seems to be little benefit to the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
  • The non-blocked kprod_cov performs better than kprod at the smallest input sizes, but eventually falls to roughly equal performance at larger input sizes.
  • The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but reaches parity at moderate input sizes, and its advantage then grows monotonically (if asymptotically), yielding large performance benefits at large input sizes.
  • kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but there still seems to be a benefit to the latter, especially under default model compilation options.
  • There are spikes of sudden performance change (relative to kprod) in many of the kprod alternatives at N = 64 and N = 128 (note: N = 63, 65, 127, and 129 are also present in the graph, showing that the spikes occur at 64 and 128 specifically). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike to lower relative performance.

Questions

I welcome any thoughts on this. Are these benchmarks even pertinent, or would implementing these directly in Stan have performance implications that make benchmarking user-defined functions irrelevant? Also, given the minor performance gap seen here between kprod_blocked and kprod_cov_blocked, maybe it makes sense to start by simply using the existing-but-unsupported Eigen implementation, which is akin to kprod_blocked?
