Commit 1413c6f: update readme
Parent: 43e745e

1 file changed: README.md (174 additions, 155 deletions)
# BlockStructuredSolvers.jl

A high-performance Julia package for solving block-structured linear systems with a focus on block tridiagonal matrices.

[![Build Status](https://github.com/username/BlockStructuredSolvers.jl/workflows/CI/badge.svg)](https://github.com/username/BlockStructuredSolvers.jl/actions)
[![Coverage](https://codecov.io/gh/username/BlockStructuredSolvers.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/username/BlockStructuredSolvers.jl)

## Overview

BlockStructuredSolvers.jl provides specialized solvers for block-structured linear systems with a focus on block tridiagonal matrices. It offers multiple backends for different hardware architectures:

- CPU: Sequential solver for standard CPUs
- CUDA: High-performance solvers for NVIDIA GPUs
- ROCm: Solvers for AMD GPUs

Each backend supports both single (Float32) and double (Float64) precision computations, allowing you to balance speed and accuracy based on your application needs.
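Precision is selected when a solver is initialized, for example with the CPU constructor described under API Reference below:

```julia
data_f32 = initialize_cpu(N, n, Float32) # half the memory of Float64, typically faster
data_f64 = initialize_cpu(N, n, Float64) # higher accuracy
```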

## Features

- **Multiple Backends**: CPU, CUDA (NVIDIA), and ROCm (AMD) support
- **Precision Options**: Both Float32 and Float64 support
- **Algorithm Variants**:
  - Sequential solver: Traditional block tridiagonal algorithm
  - Batched solver: Optimized for parallel execution on GPUs
- **High Performance**: Optimized implementations for each backend
- **Simple Interface**: Consistent API across all backends

## Installation

To install the package, use Julia's package manager:

```julia
using Pkg
Pkg.add("BlockStructuredSolvers")
```
## Quick Start

Here's a simple example of how to use BlockStructuredSolvers to solve a block tridiagonal system:

```julia
using BlockStructuredSolvers
using LinearAlgebra
using Random
using CUDA # only needed for the GPU example below

# Problem size
N = 100 # Number of blocks
n = 32  # Size of each block

# Choose precision
T = Float64

# CPU solver
function solve_on_cpu()
    # Initialize the solver
    data = initialize_cpu(N, n, T)

    # Create and set up matrices
    A_tensor = zeros(T, n, n, N)
    B_tensor = zeros(T, n, n, N - 1)

    # Fill with your data (example: random positive definite diagonal blocks)
    for i in 1:N
        temp = randn(T, n, n)
        A_tensor[:, :, i] = temp * temp' + n * I
    end

    for i in 1:N-1
        B_tensor[:, :, i] = randn(T, n, n)
    end

    # Copy data to the solver
    for i in 1:N
        copyto!(data.A_list[i], A_tensor[:, :, i])
    end

    for i in 1:N-1
        copyto!(data.B_list[i], B_tensor[:, :, i])
    end

    # Create the right-hand side
    d_list = Vector{Matrix{T}}(undef, N)
    for i in 1:N
        d_list[i] = rand(T, n, 1)
    end

    # Factorize
    factorize!(data)

    # Solve; the solution overwrites d_list
    solve!(data, d_list)

    return d_list
end

# GPU solver using CUDA
function solve_on_gpu()
    # Initialize the batched solver
    data = initialize_batched(N, n, T, CuArray)

    # Build the matrices on the host
    A_tensor = zeros(T, n, n, N)
    B_tensor = zeros(T, n, n, N - 1)

    # Fill with your data (example: random positive definite diagonal blocks)
    for i in 1:N
        temp = randn(T, n, n)
        A_tensor[:, :, i] = temp * temp' + n * I
    end

    for i in 1:N-1
        B_tensor[:, :, i] = randn(T, n, n)
    end

    # Copy data to the solver (host-to-device copies)
    copyto!(data.A_tensor, A_tensor)
    copyto!(data.B_tensor, B_tensor)

    # Create the right-hand side and copy it to the solver
    d_tensor = rand(T, n, 1, N)
    copyto!(data.d_tensor, d_tensor)

    # Factorize
    CUDA.@sync factorize!(data)

    # Solve; the solution overwrites data.d_tensor
    CUDA.@sync solve!(data)

    return data.d_tensor
end
```
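To sanity-check a result, assemble the dense block tridiagonal matrix and measure the residual. This is a minimal sketch, assuming a symmetric layout (the `B` blocks on the sub-diagonal with their transposes mirrored above, as a Cholesky-based solver requires) and that you kept a copy of the right-hand side before the in-place solve; `dense_residual` is a hypothetical helper, not part of the package:

```julia
using LinearAlgebra

# Assemble the dense matrix from the block data and report the
# relative residual of a computed solution.
function dense_residual(A_tensor, B_tensor, b_list, x_list)
    n, _, N = size(A_tensor)
    M = zeros(eltype(A_tensor), N * n, N * n)
    for i in 1:N
        r = (i - 1) * n + 1 : i * n
        M[r, r] = A_tensor[:, :, i]
        if i < N
            s = i * n + 1 : (i + 1) * n
            M[s, r] = B_tensor[:, :, i]  # sub-diagonal block
            M[r, s] = B_tensor[:, :, i]' # mirrored super-diagonal block
        end
    end
    b = reduce(vcat, vec.(b_list))
    x = reduce(vcat, vec.(x_list))
    return norm(M * x - b) / norm(b)
end
```

For well-conditioned data the relative residual should sit near machine epsilon for the chosen `T`.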
## API Reference

### CPU Solver

```julia
# Initialize a CPU solver
data = initialize_cpu(N, n, T=Float64)

# Factorize the matrix
factorize!(data)

# Solve the system
solve!(data, d_list)
```
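The factorization can be reused for multiple right-hand sides, so one `factorize!` can serve several solves. A sketch, where `rhs_batches` is a hypothetical collection of `d_list`-style right-hand sides:

```julia
factorize!(data)          # factorize once
for d_list in rhs_batches
    solve!(data, d_list)  # each in-place solve reuses the factorization
end
```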
### CUDA Solver

```julia
using CUDA

# Initialize a batched CUDA solver
data = initialize_batched(N, n, T=Float64, M=CuArray)

# Initialize a sequential CUDA solver
data = initialize_seq(N, n, T=Float64, M=CuArray)

# Factorize the matrix
CUDA.@sync factorize!(data)

# Solve the system
CUDA.@sync solve!(data)
```

### ROCm Solver

```julia
using AMDGPU

# Initialize a batched ROCm solver
data = initialize_batched(N, n, T=Float64, M=ROCArray)

# Initialize a sequential ROCm solver
data = initialize_seq(N, n, T=Float64, M=ROCArray)

# Factorize the matrix
AMDGPU.@sync factorize!(data)

# Solve the system
AMDGPU.@sync solve!(data)
```
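Because the constructors take the array type as an argument, switching GPU backends is a one-line change. A hypothetical dispatch sketch, assuming both CUDA.jl and AMDGPU.jl are installed:

```julia
using CUDA, AMDGPU

# Pick whichever GPU stack is actually usable on this machine
M = CUDA.functional()   ? CuArray  :
    AMDGPU.functional() ? ROCArray :
    error("no supported GPU found")

data = initialize_batched(N, n, Float64, M)
```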
## Performance

BlockStructuredSolvers.jl is designed for high performance across different hardware architectures. Here are some key performance considerations:

- **GPU vs CPU**: For large systems, the GPU implementations can offer significant speedups over the CPU implementation.
- **Batched vs Sequential**: The batched solver is generally faster on GPUs for large problems.
- **Float32 vs Float64**: Single precision (Float32) is approximately twice as fast as double precision (Float64) but offers less numerical accuracy; a rough timing sketch follows below.

## Algorithm
The package implements a block Cholesky factorization for block tridiagonal matrices. For a block tridiagonal matrix with diagonal blocks A_i and off-diagonal blocks B_i, the algorithm:

1. Factorizes the diagonal blocks using Cholesky decomposition
2. Updates the remaining blocks using forward and backward substitution
3. Solves the system using the factorized blocks
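For reference, here is a compact dense-block sketch of this scheme in plain Julia. It is not the package's optimized implementation; it assumes a symmetric positive definite matrix with the `B[i]` blocks on the sub-diagonal and `B[i]'` mirrored above:

```julia
using LinearAlgebra

# A[i]: N diagonal blocks; B[i]: N-1 sub-diagonal blocks.
# Returns diagonal Cholesky factors L[i] and updated couplings C[i]
# of the block lower bidiagonal factor.
function block_cholesky(A, B)
    N = length(A)
    L = Vector{Any}(undef, N)
    C = Vector{Any}(undef, N - 1)
    S = copy(A[1])
    for i in 1:N
        L[i] = cholesky(Symmetric(S)).L
        if i < N
            C[i] = B[i] / L[i]'          # solve C[i] * L[i]' = B[i]
            S = A[i + 1] - C[i] * C[i]'  # Schur complement update
        end
    end
    return L, C
end

function block_solve(L, C, d)
    N = length(L)
    y = similar(d)  # forward substitution: L y = d
    y[1] = L[1] \ d[1]
    for i in 2:N
        y[i] = L[i] \ (d[i] - C[i - 1] * y[i - 1])
    end
    x = similar(y)  # backward substitution: L' x = y
    x[N] = L[N]' \ y[N]
    for i in N-1:-1:1
        x[i] = L[i]' \ (y[i] - C[i]' * x[i + 1])
    end
    return x
end
```

Steps 1 and 2 above correspond to the `block_cholesky` loop; step 3 is `block_solve`.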

For the batched implementation, the algorithm is reorganized to maximize parallel execution on GPUs.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This package is licensed under the MIT License; see the LICENSE file for details.

## Citation

If you use BlockStructuredSolvers.jl in your research, please cite:

```bibtex
@software{BlockStructuredSolvers,
  author = {Author},
  title  = {BlockStructuredSolvers.jl: A High-Performance Julia Package for Solving Block-Structured Linear Systems},
  year   = {2023},
  url    = {https://github.com/username/BlockStructuredSolvers.jl}
}
```
