Description
Intro
I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.
Implementations
Below are "user defined function" implementations of kprod() (arbitrary inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported section here), I also wrote *_blocked versions of each, where I gather the idea is that blocked computation achieves better memory locality.
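To make the blocked idea concrete before the Stan code, here's a quick NumPy sketch (my own, not part of the Stan functions) of the same block structure, checked against NumPy's reference np.kron:

```python
import numpy as np

def kprod_blocked_np(A, B):
    # Each (i, j) block of the Kronecker product Z is the scalar A[i, j]
    # times all of B; the *_blocked functions assign it in one slice
    # rather than scalar-by-scalar writes scattered across Z.
    rA, cA = A.shape
    rB, cB = B.shape
    Z = np.empty((rA * rB, cA * cB))
    for i in range(rA):
        for j in range(cA):
            Z[i * rB:(i + 1) * rB, j * cB:(j + 1) * cB] = A[i, j] * B
    return Z

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 5.0], [6.0, 7.0]])
assert np.array_equal(kprod_blocked_np(A, B), np.kron(A, B))
```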
Here's the collection of functions:
matrix kprod(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  for (i in 1:n_A) {
    for (j in 1:n_A) {
      for (p in 1:n_B) {
        for (q in 1:n_B) {
          int row = (i - 1) * n_B + p ;
          int col = (j - 1) * n_B + q ;
          Z[row, col] = A[i, j] * B[p, q] ;
        }
      }
    }
  }
  return Z ;
}
matrix kprod_cov(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // Loop over the upper triangular part of Z (includes diagonal)
  for (row in 1:n_Z) {
    int i = ((row - 1) %/% n_B) + 1 ;
    int p = ((row - 1) % n_B) + 1 ;
    for (col in row:n_Z) {
      int j = ((col - 1) %/% n_B) + 1 ;
      int q = ((col - 1) % n_B) + 1 ;
      // Compute the Kronecker product element
      Z[row, col] = A[i, j] * B[p, q] ;
      // Fill the symmetric counterpart
      // (redundant when row == col, but probably not worth a separate
      // loop just to copy the off-diagonal)
      Z[col, row] = Z[row, col] ;
    }
  }
  return Z ;
}
matrix kprod_corr(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // Set diagonal elements to 1
  for (row in 1:n_Z) {
    Z[row, row] = 1.0 ;
  }
  // Loop over the upper triangular off-diagonal elements
  for (row in 1:(n_Z - 1)) {
    int i = ((row - 1) %/% n_B) + 1 ;
    int p = ((row - 1) % n_B) + 1 ;
    for (col in (row + 1):n_Z) {
      int j = ((col - 1) %/% n_B) + 1 ;
      int q = ((col - 1) % n_B) + 1 ;
      // Compute the product of corresponding elements
      Z[row, col] = A[i, j] * B[p, q] ;
      // Fill the symmetric counterpart
      Z[col, row] = Z[row, col] ;
    }
  }
  return Z ;
}
matrix kprod_blocked(matrix A, matrix B) {
  int r_A = rows(A) ;
  int c_A = cols(A) ;
  int r_B = rows(B) ;
  int c_B = cols(B) ;
  matrix[r_A * r_B, c_A * c_B] Z ;
  for (i in 1:r_A) {
    for (j in 1:c_A) {
      // Compute the starting row and column indices in Z for the current block
      int row_start = (i - 1) * r_B + 1 ;
      int col_start = (j - 1) * c_B + 1 ;
      // Assign the block to the appropriate position in Z
      Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
    }
  }
  return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // First loop: handle diagonal blocks (i == j)
  for (i in 1:n_A) {
    int row_start = (i - 1) * n_B + 1 ;
    int col_start = (i - 1) * n_B + 1 ;
    Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
  }
  // Second loop: handle off-diagonal blocks (i < j)
  for (i in 1:(n_A - 1)) {
    for (j in (i + 1):n_A) {
      matrix[n_B, n_B] block = A[i, j] * B ;
      int row_i = (i - 1) * n_B + 1 ;
      int col_j = (j - 1) * n_B + 1 ;
      int row_j = (j - 1) * n_B + 1 ;
      int col_i = (i - 1) * n_B + 1 ;
      // Assign block to position (i, j)
      Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
      // Assign symmetric block to position (j, i)
      Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
    }
  }
  return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B) {
  int n_A = rows(A) ;
  int n_B = rows(B) ;
  int n_Z = n_A * n_B ;
  matrix[n_Z, n_Z] Z ;
  // First loop: handle diagonal blocks (i == j); since A is a
  // correlation matrix, A[i, i] == 1 and each diagonal block is just B
  for (i in 1:n_A) {
    int row_start = (i - 1) * n_B + 1 ;
    int col_start = (i - 1) * n_B + 1 ;
    Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
  }
  // Second loop: handle off-diagonal blocks (i < j)
  for (i in 1:(n_A - 1)) {
    for (j in (i + 1):n_A) {
      matrix[n_B, n_B] block = A[i, j] * B ;
      int row_i = (i - 1) * n_B + 1 ;
      int col_j = (j - 1) * n_B + 1 ;
      int row_j = (j - 1) * n_B + 1 ;
      int col_i = (i - 1) * n_B + 1 ;
      // Assign block to position (i, j)
      Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
      // Assign symmetric block to position (j, i)
      Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
    }
  }
  return Z ;
}

Benchmarks
I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:
stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
  stan_threads = FALSE
  , STAN_CPP_OPTIMS = TRUE
  , STAN_NO_RANGE_CHECKS = TRUE
  , CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2-65 range as well as 127-129):

Benchmark take-aways
- So far there seems to be little benefit of the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
- The non-blocked kprod_cov performs better than kprod at the smallest input sizes, but eventually falls to roughly equal performance at larger input sizes.
- The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but reaches equal performance by moderate input sizes and improves monotonically (if asymptotically) thereafter, yielding large performance benefits at large input sizes.
- kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but the latter still seems to have an edge, especially under default compilation options.
- There are spikes of sudden performance change (relative to kprod) in many of the kprod alternatives at N = 64 and N = 128 (values of N = 63, 65, 127, and 129 are also present in the graph, showing that the spikes occur at 64 and 128 specifically). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike lower.
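One note on the ceiling for these savings (my own back-of-envelope count, not from the benchmarks): the symmetric variants still write every entry of Z and only skip the multiply on the lower triangle, so the arithmetic saving is bounded by roughly 2x:

```python
def multiply_counts(n_Z):
    # Full fill: one multiply per entry of the n_Z x n_Z result.
    full = n_Z * n_Z
    # Symmetric fill: multiplies only on the upper triangle
    # (diagonal included); the lower triangle is copied.
    upper = n_Z * (n_Z + 1) // 2
    return full, upper

full, upper = multiply_counts(64)
print(full, upper, full / upper)  # the ratio approaches 2 as n_Z grows
```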
Questions
I welcome any thoughts on this. Are these benchmarks even pertinent, or would a native implementation in Stan have performance characteristics that make benchmarking user-defined functions irrelevant? Also, given the minor performance gap seen here between kprod_blocked and kprod_cov_blocked, maybe it makes sense to start by simply using the existing-but-unsupported Eigen implementation, which is akin to kprod_blocked?