# BlockStructuredSolvers.jl

A high-performance Julia package for solving block-structured linear systems, with a focus on block tridiagonal matrices.

[Build Status](https://github.com/username/BlockStructuredSolvers.jl/actions)
[Coverage](https://codecov.io/gh/username/BlockStructuredSolvers.jl)
## Overview

BlockStructuredSolvers.jl provides specialized solvers for block-structured linear systems, with a focus on block tridiagonal matrices. It offers multiple backends for different hardware architectures:

- CPU: sequential solver for standard CPUs
- CUDA: high-performance GPU solvers for NVIDIA GPUs
- ROCm: GPU solvers for AMD GPUs

Each backend supports both single (Float32) and double (Float64) precision, allowing you to balance speed and accuracy based on your application's needs.
## Features

- **Multiple Backends**: CPU, CUDA (NVIDIA), and ROCm (AMD) support
- **Precision Options**: Both Float32 and Float64 support
- **Algorithm Variants**:
  - Sequential solver: traditional block tridiagonal algorithm
  - Batched solver: optimized for parallel execution on GPUs
- **High Performance**: Optimized implementations for each backend
- **Simple Interface**: Consistent API across all backends
## Installation

To install the package, use Julia's package manager:

```julia
using Pkg
Pkg.add("BlockStructuredSolvers")
```

## Quick Start

Here's a simple example of how to use BlockStructuredSolvers to solve a block tridiagonal system:

```julia
using BlockStructuredSolvers
using LinearAlgebra
using Random

# Problem size
N = 100  # Number of blocks
n = 32   # Size of each block

# Choose precision
T = Float64

# CPU solver
function solve_on_cpu()
    # Initialize the solver
    data = initialize_cpu(N, n, T)

    # Create and set up matrices
    A_tensor = zeros(T, n, n, N)
    B_tensor = zeros(T, n, n, N - 1)

    # Fill with your data (example: random positive definite diagonal blocks)
    for i in 1:N
        temp = randn(T, n, n)
        A_tensor[:, :, i] = temp * temp' + n * I
    end

    for i in 1:N-1
        B_tensor[:, :, i] = randn(T, n, n)
    end

    # Copy data into the solver
    for i in 1:N
        copyto!(data.A_list[i], A_tensor[:, :, i])
    end

    for i in 1:N-1
        copyto!(data.B_list[i], B_tensor[:, :, i])
    end

    # Create the right-hand side
    d_list = Vector{Matrix{T}}(undef, N)
    for i in 1:N
        d_list[i] = rand(T, n, 1)
    end

    # Factorize
    factorize!(data)

    # Solve (the solution overwrites d_list)
    solve!(data, d_list)

    return d_list
end

# GPU solver using CUDA (requires a CUDA-capable GPU).
# Note: `using CUDA` must appear at top level, not inside a function.
using CUDA

function solve_on_gpu()
    # Initialize the batched solver
    data = initialize_batched(N, n, T, CuArray)

    # Build the blocks on the host (example: random positive definite diagonal blocks)
    A_host = zeros(T, n, n, N)
    B_host = zeros(T, n, n, N - 1)

    for i in 1:N
        temp = randn(T, n, n)
        A_host[:, :, i] = temp * temp' + n * I
    end

    for i in 1:N-1
        B_host[:, :, i] = randn(T, n, n)
    end

    # Copy data to the solver on the device
    copyto!(data.A_tensor, A_host)
    copyto!(data.B_tensor, B_host)

    # Create and copy the right-hand side
    d_host = rand(T, n, 1, N)
    copyto!(data.d_tensor, d_host)

    # Factorize
    CUDA.@sync factorize!(data)

    # Solve
    CUDA.@sync solve!(data)

    # Solution is now in data.d_list
    return data.d_list
end
```
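To sanity-check a result, you can assemble the dense block tridiagonal matrix and look at the residual. The helper below is a minimal sketch, not part of the package API; the names `rhs_list` (the original, un-overwritten right-hand-side blocks) and `x_list` (the solution blocks returned by `solve!`) are hypothetical, and it assumes each off-diagonal block sits on the sub-diagonal with its transpose above the diagonal.

```julia
using LinearAlgebra

# Sketch only: dense residual check for a block tridiagonal system.
# Assumes B_tensor[:, :, i] couples block i+1 to block i (sub-diagonal),
# with its transpose on the super-diagonal.
function residual_norm(A_tensor, B_tensor, rhs_list, x_list)
    n, _, N = size(A_tensor)
    M = zeros(eltype(A_tensor), N * n, N * n)
    for i in 1:N
        rows = (i - 1) * n + 1 : i * n
        M[rows, rows] = A_tensor[:, :, i]
    end
    for i in 1:N-1
        rows = i * n + 1 : (i + 1) * n
        cols = (i - 1) * n + 1 : i * n
        M[rows, cols] = B_tensor[:, :, i]       # sub-diagonal block
        M[cols, rows] = B_tensor[:, :, i]'      # super-diagonal block
    end
    b = reduce(vcat, vec.(rhs_list))
    x = reduce(vcat, vec.(x_list))
    return norm(M * x - b) / norm(b)
end
```

A relative residual near machine precision for the chosen element type indicates the solve succeeded.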
## API Reference

### CPU Solver

```julia
# Initialize a CPU solver (use Float32 or Float64 as the element type)
data = initialize_cpu(N, n, Float64)

# Factorize the matrix
factorize!(data)

# Solve the system (d_list is overwritten with the solution)
solve!(data, d_list)
```
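Because factorization and solution are separate in-place steps, the same factorization can presumably be reused for several right-hand sides. The snippet below is a sketch under that assumption (not documented behavior); `d_list_a` and `d_list_b` are hypothetical right-hand sides.

```julia
# Sketch: reuse one factorization for several right-hand sides
# (assumes solve! only reads the factors stored in `data`).
factorize!(data)

solve!(data, d_list_a)   # d_list_a is overwritten with its solution
solve!(data, d_list_b)   # d_list_b is overwritten with its solution
```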
### CUDA Solver

```julia
using CUDA

# Initialize a batched CUDA solver
data = initialize_batched(N, n, Float64, CuArray)

# Initialize a sequential CUDA solver
data = initialize_seq(N, n, Float64, CuArray)

# Factorize the matrix
CUDA.@sync factorize!(data)

# Solve the system
CUDA.@sync solve!(data)
```
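After `solve!`, the solution lives in GPU memory; copy it back when you need it on the host. A minimal sketch, assuming the batched solver exposes its solution blocks as `data.d_list` as in the Quick Start example:

```julia
# Sketch: copy the solution blocks from the GPU back to host memory.
x_host = [Array(d) for d in data.d_list]
```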
### ROCm Solver

```julia
using AMDGPU

# Initialize a batched ROCm solver
data = initialize_batched(N, n, Float64, ROCArray)

# Initialize a sequential ROCm solver
data = initialize_seq(N, n, Float64, ROCArray)

# Factorize the matrix
AMDGPU.@sync factorize!(data)

# Solve the system
AMDGPU.@sync solve!(data)
```
## Performance

BlockStructuredSolvers.jl is designed for high performance across different hardware architectures. Key performance considerations:

- **GPU vs CPU**: For large systems, the GPU backends can offer significant speedups over the CPU backend.
- **Batched vs Sequential**: The batched solver is generally faster on GPUs for large problems.
- **Float32 vs Float64**: Single precision can be roughly twice as fast as double precision but offers less numerical accuracy; see the timing sketch below.
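To see the precision tradeoff on your own hardware, you can time the same problem in both precisions. The loop below is a rough sketch using the CPU interface from the Quick Start; `fill_random_blocks!` is a hypothetical helper standing in for the block setup shown there.

```julia
using BlockStructuredSolvers

# Rough timing sketch: same problem, two precisions.
# `fill_random_blocks!` is a hypothetical stand-in for the setup in the Quick Start.
N, n = 100, 32
for T in (Float32, Float64)
    data = initialize_cpu(N, n, T)
    fill_random_blocks!(data, N, n, T)
    d_list = [rand(T, n, 1) for _ in 1:N]

    t = @elapsed begin
        factorize!(data)
        solve!(data, d_list)
    end
    println("$T: $(round(t * 1000, digits = 2)) ms")
end
```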
## Algorithm

The package implements a block Cholesky factorization for block tridiagonal matrices. For a symmetric positive definite block tridiagonal matrix with diagonal blocks A_i and off-diagonal blocks B_i, the algorithm (sketched below):

1. Factorizes each (Schur-complement-updated) diagonal block with a Cholesky decomposition
2. Forms the coupling blocks with triangular solves and applies the Schur-complement update to the next diagonal block
3. Solves the system by block forward and backward substitution with the computed factors
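For illustration, here is a compact dense sketch of that factorization and solve. It is not the package's optimized implementation, and it assumes each off-diagonal block B_i couples block i+1 to block i (sub-diagonal), with its transpose above the diagonal.

```julia
using LinearAlgebra

# Illustration only: dense block Cholesky for a symmetric positive definite
# block tridiagonal system. A[i] are diagonal blocks, B[i] sub-diagonal blocks,
# and d is a vector of right-hand-side blocks.
function block_cholesky_solve(A, B, d)
    N = length(A)
    T = eltype(A[1])
    L = Vector{LowerTriangular{T, Matrix{T}}}(undef, N)
    C = Vector{Matrix{T}}(undef, N - 1)

    S = copy(A[1])
    for i in 1:N
        L[i] = cholesky(Symmetric(S)).L          # factor the (updated) diagonal block
        if i < N
            C[i] = B[i] / L[i]'                  # coupling block C_i = B_i * L_i^{-T}
            S = A[i + 1] - C[i] * C[i]'          # Schur-complement update
        end
    end

    y = [copy(di) for di in d]                   # forward substitution: L y = d
    for i in 1:N
        i > 1 && (y[i] -= C[i - 1] * y[i - 1])
        y[i] = L[i] \ y[i]
    end

    x = copy(y)                                  # backward substitution: L' x = y
    for i in N:-1:1
        i < N && (x[i] -= C[i]' * x[i + 1])
        x[i] = L[i]' \ x[i]
    end
    return x
end
```

On the Quick Start data, this should agree with the package's solution up to rounding, provided the package uses the same sub-diagonal convention for B_i.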
For the batched implementation, the algorithm is reorganized to maximize parallel execution on GPUs.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This package is licensed under the MIT License; see the LICENSE file for details.
## Citation

If you use BlockStructuredSolvers.jl in your research, please cite:

```bibtex
@software{BlockStructuredSolvers,
  author = {Author},
  title  = {BlockStructuredSolvers.jl: A High-Performance Julia Package for Solving Block-Structured Linear Systems},
  year   = {2023},
  url    = {https://github.com/username/BlockStructuredSolvers.jl}
}
```