
Optimize sparse condensed Hessian kernels on GPU #399

Open
amontoison opened this issue Jan 10, 2025 · 13 comments
@amontoison (Member)

julia> CUDA.@profile madnlp(exa2; tol=tol)

EXIT: Optimal Solution Found (tol = 1.0e-07).
Profiler ran for 18.03 s, capturing 1527254 events.

Host-side activity: calling CUDA APIs took 1.8 s (10.00% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                                                   │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│   34.35% │      6.2 s │  7942 │ 780.08 µs ± 6403.01 (  0.48 ‥ 63485.38) │ cuStreamSynchronize                                    │
│    3.84% │  691.74 ms │   541 │   1.28 ms ± 17.79  (  0.01 ‥ 361.13)    │ cudaMemcpyAsync                                        │
│    0.83% │   149.6 ms │     2 │   74.8 ms ± 105.52 (  0.19 ‥ 149.42)    │ cudaFree                                               │
│    0.39% │   70.94 ms │ 13602 │   5.22 µs ± 1.28   (  3.34 ‥ 42.44)     │ cuLaunchKernel                                         │
│    0.18% │   32.92 ms │  2383 │  13.82 µs ± 1.56   ( 11.68 ‥ 31.47)     │ cuMemcpyDtoHAsync                                      │
│    0.18% │   32.88 ms │  6874 │   4.78 µs ± 1.31   (  3.58 ‥ 62.47)     │ cudaLaunchKernel                                       │
│    0.18% │   32.67 ms │  6696 │   4.88 µs ± 24.32  (  1.19 ‥ 608.44)    │ cuMemAllocFromPoolAsync                                │
│    0.10% │   17.59 ms │    58 │ 303.23 µs ± 366.74 ( 226.5 ‥ 3020.52)   │ cuMemcpyHtoDAsync                                      │
│    0.05% │    8.34 ms │   727 │  11.48 µs ± 1.9    (  8.11 ‥ 35.76)     │ cuMemcpyDtoDAsync                                      │
│    0.03% │    5.12 ms │   945 │   5.42 µs ± 2.45   (  2.38 ‥ 42.68)     │ cudaMemsetAsync                                        │
│    0.02% │    3.58 ms │  1508 │   2.37 µs ± 0.58   (  1.43 ‥ 8.34)      │ cuMemFreeAsync                                         │
│    0.01% │    1.78 ms │     8 │ 221.94 µs ± 106.48 ( 27.18 ‥ 408.17)    │ cudaMalloc                                             │
│    0.00% │  267.03 µs │  1564 │ 170.73 ns ± 187.04 (   0.0 ‥ 2145.77)   │ cudaGetLastError                                       │
│    0.00% │  209.09 µs │  1454 │ 143.81 ns ± 144.0  (   0.0 ‥ 715.26)    │ cuCtxPushCurrent                                       │
│    0.00% │  188.35 µs │    74 │   2.55 µs ± 0.59   (  1.91 ‥ 6.68)      │ cudaStreamSynchronize                                  │
│    0.00% │  179.77 µs │  1454 │ 123.64 ns ± 136.6  (   0.0 ‥ 476.84)    │ cuCtxPopCurrent                                        │
│    0.00% │  142.81 µs │  1454 │  98.22 ns ± 130.94 (   0.0 ‥ 476.84)    │ cuCtxGetDevice                                         │
│    0.00% │  128.03 µs │  1462 │  87.57 ns ± 128.1  (   0.0 ‥ 476.84)    │ cuDeviceGet                                            │
│    0.00% │   70.81 µs │     1 │                                         │ cuMemGetInfo                                           │
│    0.00% │   28.37 µs │     3 │   9.46 µs ± 5.6    (  3.81 ‥ 15.02)     │ cuMemsetD32Async                                       │
│    0.00% │    9.06 µs │     6 │   1.51 µs ± 0.73   (  0.72 ‥ 2.62)      │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │    1.91 µs │     1 │                                         │ cudaGetDevice                                          │
│    0.00% │    1.19 µs │     3 │ 397.36 ns ± 275.3  (238.42 ‥ 715.26)    │ cudaDeviceGetAttribute                                 │
│    0.00% │  715.26 ns │     2 │ 357.63 ns ± 168.59 (238.42 ‥ 476.84)    │ cuMemPoolGetAttribute                                  │
│    0.00% │  476.84 ns │     9 │  52.98 ns ± 105.13 (   0.0 ‥ 238.42)    │ cuDeviceGetCount                                       │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 10.16 s (56.33% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                                                                                                                                                                                                                                                                    ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   28.57% │     5.15 s │    68 │  75.76 ms ± 0.16   ( 75.43 ‥ 76.13)    │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int64 ⋯
│   15.40% │     2.78 s │   111 │  25.01 ms ± 25.31  (  0.04 ‥ 50.71)    │ gpu__transfer_to_csc_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int ⋯
│    1.89% │  340.67 ms │    68 │   5.01 ms ± 0.03   (  4.96 ‥ 5.18)     │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, int const*, int const*, int*, int const*, long const*, long const*, long const*, int const*, int const*, int*, int const*, int, int, int, int, int, int*, int*, int*, do ⋯
│    1.02% │  184.76 ms │   728 │ 253.79 µs ± 157.24 ( 18.12 ‥ 380.52)   │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, double const*, double*)                                                                                      ⋯
│    1.02% │  184.57 ms │     2 │  92.29 ms ± 129.26 (  0.89 ‥ 183.69)   │ void cudss::radix_sort_ker<int, int, 1, 20, 4, 0>(int, int const*, int*, int*, int*, int*, int, int)                                                                                                                                                                                                    ⋯
│    0.97% │  175.16 ms │    68 │   2.58 ms ± 0.0    (  2.56 ‥ 2.59)     │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, 4>(int, int, int, int, double*, double*, long const*, int const*, int const*, int*, int const*, long const*, long const*, long const*, int*, int const*, int*, int const*, int, int, int, int, int, int*, int*, int*, ⋯
│    0.82% │  148.62 ms │     1 │                                        │ void cudss::radix_sort_ker<long, int, 1, 20, 4, 1>(int, long const*, int*, int*, int*, int*, int, int)                                                                                                                                                                                                  ⋯
│    0.80% │  144.78 ms │   224 │ 646.33 µs ± 0.7    (644.68 ‥ 648.26)   │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const*, int, int, double*, double*, int const*, long const*, long const*, long const*, int const*, double*, long const*, int const*, int const*, long const*, int const*, int const*, int*, int, int, int, int const*, int,  ⋯
│    0.77% │  138.94 ms │     1 │                                        │ void cudss::map_ker<long, int, int, 128, 1, 2>(int, int const*, int const*, int const*, int const*, int const*, int const*, long const*, long const*, long const*, int const*, int const*, int const*, int*, int const*, int, int*, int*, long*, long*, int, int, int)                                  ⋯
│    0.72% │  130.68 ms │   224 │ 583.41 µs ± 2.69   (577.93 ‥ 592.47)   │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const*, int const*, int, int, double*, double*, int const*, long const*, long const*, long const*, int const*, double*, long const*, int const*, int const*, long const*, int const*, int const*, int*, int, int, int, int c ⋯
│    0.68% │  122.04 ms │     1 │                                        │ void cudss::nnz_per_col_ker<int, int, 1, 0>(int, int const*, int const*, int const*, int const*, int*, int*, int*, int*, int, int, int)                                                                                                                                                                 ⋯
│    0.62% │  111.69 ms │   224 │ 498.62 µs ± 1.93   (493.29 ‥ 503.54)   │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int const*, int, int, double*, int const*, long const*, long const*, int const*, double*, long const*, int const*, int const*, long const*, int const*, int*, int, int, int, int const*, int, int, int, int, int)             ⋯
│    0.57% │  103.17 ms │     1 │                                        │ void cudss::trans_columns_ker<int, 2, 128>(int, int const*, int const*, int*, int*, int)                                                                                                                                                                                                                ⋯
│    0.20% │   35.68 ms │     1 │                                        │ void cudss::trans_nnz_per_row_ker<int, 2, 128>(int, int const*, int const*, int*, int)                                                                                                                                                                                                                  ⋯
│    0.15% │    26.7 ms │     2 │  13.35 ms ± 18.84  (  0.03 ‥ 26.67)    │ gpu__set_coo_to_csc_map_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Int64, 1l, 1l>, Tuple<OneTo<In ⋯
│    0.14% │   25.71 ms │   224 │  114.8 µs ± 2.2    (112.06 ‥ 121.36)   │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int const*, int, int, double*, int const*, long const*, long const*, int const*, double*, long const*, int const*, int const*, long const*, int const*, int*, int, int, int, int const*, int, int, int, int, int)           ⋯
│    0.12% │   22.15 ms │     1 │                                        │ void cudss::adjncy_ker<int, int, 128, 2>(int, int const*, int const*, int*, int*, int*, int*, int)                                                                                                                                                                                                      ⋯
│    0.09% │   15.88 ms │    61 │ 260.38 µs ± 347.4  (  1.43 ‥ 2901.55)  │ [copy pageable to device memory]                                                                                                                                                                                                                                                                        ⋯
│    0.08% │   14.51 ms │  1400 │  10.37 µs ± 1.97   (  5.96 ‥ 12.87)    │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, int, int, int, int*)                                                                                                                                                   ⋯
│    0.08% │   14.43 ms │     1 │                                        │ void cudss::map_offsets_ker<long, int, int, 128, 1>(int, int const*, int const*, int const*, int const*, int const*, int const*, long const*, long const*, int const*, int const*, int const*, int*, int*, int, int, int, int)                                                                          ⋯
│    0.07% │   13.36 ms │    23 │ 580.73 µs ± 72.43  (398.64 ‥ 680.45)   │ comparator_small_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1l, 1l>, CuDeviceArray<Int64, 1l, 1l>>, Int32, Tuple<Int64, Int64>, Tuple<Int64, Int64>, Tuple<Int64, Int64>, _45, isless, Val<false>, Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1l, 1l>, CuDeviceArray<Int64, 1l, 1l>><1l>)            ⋯
│    0.07% │   13.18 ms │  1185 │  11.13 µs ± 7.12   (  5.25 ‥ 24.56)    │ [copy device to device memory]                                                                                                                                                                                                                                                                          ⋯
│    0.07% │   12.84 ms │   105 │ 122.26 µs ± 23.37  ( 58.41 ‥ 170.47)   │ comparator_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1l, 1l>, CuDeviceArray<Int64, 1l, 1l>>, Int32, Tuple<Int64, Int64>, Tuple<Int64, Int64>, _45, isless, Val<false>, Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1l, 1l>, CuDeviceArray<Int64, 1l, 1l>><1l>)                                       ⋯
│    0.07% │   11.76 ms │   672 │  17.49 µs ± 3.3    (  12.4 ‥ 21.22)    │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, double const*, double*, int*, double*)                                            ⋯
│    0.06% │   11.09 ms │    10 │   1.11 ms ± 3.2    (   0.0 ‥ 10.18)    │ void cudss::dependency_map_ker<int, int, 32>(int, int const*, int const*, int const*, int const*, int*, int*, int const*)                                                                                                                                                                               ⋯
@sshin23 (Member) commented Jan 10, 2025

Could you post the problem script as well?

@amontoison (Member, Author)

Yes, it's the content of this file with N=50000.

@amontoison (Member, Author) commented Jan 11, 2025

@jbcaillau, do you have a dense row in the Jacobian of the constraints in your goddard problem?
We solve the KKT systems with J'J on GPU in MadNLP, and I'm wondering whether we inadvertently form a dense matrix.
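For context on why a dense row matters: J'J accumulates one outer product per constraint row, so a single dense row fills the entire product. A minimal CPU-side sketch with SparseArrays (illustrative only, not the goddard model):

```julia
using SparseArrays

n = 1_000
# Banded Jacobian: constraint i touches x[i] and x[i+1] only.
J_banded = spdiagm(n, n, 0 => ones(n), 1 => ones(n - 1))
# The same Jacobian with a single dense row appended.
J_mixed = [J_banded; sparse(ones(1, n))]

nnz(J_banded' * J_banded)  # tridiagonal: 3n - 2 nonzeros
nnz(J_mixed' * J_mixed)    # n^2 nonzeros: the product is fully dense
```

So even one dense constraint row is enough to make the condensed system dense, regardless of how sparse the rest of J is.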

@jbcaillau

@amontoison what do you make of the first two gpu__transfer... calls below, which consume more time than the cuDSS linear algebra?

julia> CUDA.@profile madnlp(exa2; tol=tol)

EXIT: Optimal Solution Found (tol = 1.0e-07).
Profiler ran for 18.03 s, capturing 1527254 events.
[...]

Device-side activity: GPU was busy for 10.16 s (56.33% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                                                                                                                                                                                                                                                                    ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   28.57% │     5.15 s │    68 │  75.76 ms ± 0.16   ( 75.43 ‥ 76.13)    │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int64 ⋯
│   15.40% │     2.78 s │   111 │  25.01 ms ± 25.31  (  0.04 ‥ 50.71)    │ gpu__transfer_to_csc_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int ⋯
│    1.89% │  340.67 ms │    68 │   5.01 ms ± 0.03   (  4.96 ‥ 5.18)     │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, int const*, int const*, int*, int const*, long const*, long const*, long const*, int const*, int const*, int*, int const*, int, int, int, int, int, int*, int*, int*, do ⋯
│    1.02% │  184.76 ms │   728 │ 253.79 µs ± 157.24 ( 18.12 ‥ 380.52)   │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, double const*, double*)                                                                                      ⋯

@jbcaillau

@amontoison as should be clear from the constraint structure in goddard-exa2.jl, each Jacobian row should be very sparse. A typical constraint at row i involves only x[i], x[i+1], u[i], and u[i+1].

I am also writing a pure ADNLPModels version on which it will be easy to check the sparsity structure.

> @jbcaillau, do you have a dense row in the Jacobian of the constraints in your goddard problem? We solve the KKT systems with J'J on GPU in MadNLP, and I'm wondering whether we inadvertently form a dense matrix.

@sshin23 (Member) commented Jan 11, 2025

It might be good to check whether (1) a global-scope variable is causing the performance issue, and (2) CUDA.@profile itself is adding overhead. In my experience, for this type of problem the bottleneck is the symbolic factorization. I don't see anything in the problem that should cause a performance bottleneck.
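On point (1): the usual quick check is whether hot code reads non-const globals, since Julia cannot infer their type inside functions. A generic illustration (hypothetical names, not the goddard script itself):

```julia
# Non-const global: type-unstable inside any function that reads it.
scale = 2.0
f_slow(v) = sum(x * scale for x in v)  # `scale` is looked up dynamically

# Remedies: declare the global const, or pass it as an argument.
const SCALE = 2.0
f_const(v) = sum(x * SCALE for x in v)
f_arg(v, s) = sum(x * s for x in v)    # best: no global at all
```

All three return the same value; only the latter two let the compiler specialize on the element type.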

@amontoison (Member, Author) commented Jan 11, 2025

> @amontoison what do you make of the first two gpu__transfer... calls below, which consume more time than the cuDSS linear algebra?

julia> CUDA.@profile madnlp(exa2; tol=tol)

EXIT: Optimal Solution Found (tol = 1.0e-07).
Profiler ran for 18.03 s, capturing 1527254 events.
[...]

Device-side activity: GPU was busy for 10.16 s (56.33% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                                                                                                                                                                                                                                                                    ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   28.57% │     5.15 s │    68 │  75.76 ms ± 0.16   ( 75.43 ‥ 76.13)    │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int64 ⋯
│   15.40% │     2.78 s │   111 │  25.01 ms ± 25.31  (  0.04 ‥ 50.71)    │ gpu__transfer_to_csc_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, DynamicSize, DynamicSize, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float64, 1l, 1l>, Tuple<OneTo<Int ⋯
│    1.89% │  340.67 ms │    68 │   5.01 ms ± 0.03   (  4.96 ‥ 5.18)     │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, int const*, int const*, int*, int const*, long const*, long const*, long const*, int const*, int const*, int*, int const*, int, int, int, int, int, int*, int*, int*, do ⋯
│    1.02% │  184.76 ms │   728 │ 253.79 µs ± 157.24 ( 18.12 ‥ 380.52)   │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, double const*, double*)                                                                                      ⋯

It means that these two kernels in MadNLP.jl are not efficient and should be improved (if the slowdown is not related to global scope).
It's on my TODO list.
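For reference on what such a transfer does: the COO-ordered values are scattered into the condensed CSC storage through a precomputed index map, so the expected cost is roughly one (atomic) add per nonzero. A hedged KernelAbstractions sketch of that pattern (hypothetical names; not MadNLP's exact kernels):

```julia
using KernelAbstractions

# Scatter COO values into CSC storage: csc[cscmap[k]] += coo[k].
# Duplicate COO entries mapping to one CSC slot require an atomic add
# (or a sort-by-destination plus segmented reduction to avoid contention).
@kernel function transfer_to_csc_kernel!(csc, @Const(cscmap), @Const(coo))
    k = @index(Global, Linear)
    KernelAbstractions.@atomic csc[cscmap[k]] += coo[k]
end

function transfer!(csc, cscmap, coo; backend = get_backend(csc))
    fill!(csc, zero(eltype(csc)))
    transfer_to_csc_kernel!(backend)(csc, cscmap, coo; ndrange = length(coo))
    KernelAbstractions.synchronize(backend)
    return csc
end
```

If the real kernel instead has each thread walk a variable-length list of source entries per CSC slot, load imbalance or atomic contention could explain the ~75 ms per call seen in the profiles above.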

@jbcaillau commented Jan 11, 2025

@sshin23 @amontoison regarding point (1) below, I have changed the variables from global scope to constants in goddard-exa.jl, so there should not be any issue on this side. Below is the new run, still with N = 50_000.

> It might be good to check whether (1) a global-scope variable is causing the performance issue, and (2) CUDA.@profile itself is adding overhead. In my experience, for this type of problem the bottleneck is the symbolic factorization. I don't see anything in the problem that should cause a performance bottleneck.

julia> CUDA.@profile madnlp(exa2; tol=tol)
This is MadNLP version v0.8.5, running with cuDSS v0.4.0

Number of nonzeros in constraint Jacobian............:  1350017
Number of nonzeros in Lagrangian Hessian.............:  1300008

Total number of variables............................:   350008
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:   300007
Total number of inequality constraints...............:   150004
        inequality constraints with only lower bounds:    50002
   inequality constraints with lower and upper bounds:   100002
        inequality constraints with only upper bounds:        0

[...]

Number of Iterations....: 55

                                   (scaled)                 (unscaled)
Objective...............:  -1.0264249724384082e+00   -1.0264249724384082e+00
Dual infeasibility......:   9.1620584206984594e-13    9.1620584206984594e-13
Constraint violation....:   8.1824980736335345e-10    8.1824980736335345e-10
Complementarity.........:   9.0909098505047791e-09    9.0909098505047791e-09
Overall NLP error.......:   9.0909098505047791e-09    9.0909098505047791e-09

Number of objective function evaluations             = 56
Number of objective gradient evaluations             = 56
Number of constraint evaluations                     = 56
Number of constraint Jacobian evaluations            = 56
Number of Lagrangian Hessian evaluations             = 55
Total wall-clock secs in solver (w/o fun. eval./lin. alg.)  = 11.129
Total wall-clock secs in linear solver                      =  1.444
Total wall-clock secs in NLP function evaluations           =  0.231
Total wall-clock secs                                       = 12.803

EXIT: Optimal Solution Found (tol = 1.0e-07).
Profiler ran for 12.8 s, capturing 1265706 events.

Host-side activity: calling CUDA APIs took 2.39 s (18.69% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                   │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────────────────────────────────────────┤
│   66.94% │     8.57 s │  7479 │   1.15 ms ± 8.37   (   0.0 ‥ 76.55)   │ cuStreamSynchronize                                    │
│    5.79% │  741.08 ms │   386 │   1.92 ms ± 22.82  (  0.01 ‥ 401.2)   │ cudaMemcpyAsync                                        │
│    1.14% │  146.36 ms │    14 │  10.45 ms ± 38.78  (  0.02 ‥ 145.19)  │ cudaFree                                               │
│    1.05% │  134.48 ms │ 12206 │  11.02 µs ± 15.52  (  2.86 ‥ 1585.01) │ cuLaunchKernel                                         │
│    0.53% │   68.31 ms │    63 │   1.08 ms ± 1.04   (  0.71 ‥ 8.08)    │ cuMemcpyHtoDAsync                                      │
│    0.43% │   54.86 ms │  4714 │  11.64 µs ± 5.37   (  3.58 ‥ 57.46)   │ cudaLaunchKernel                                       │
│    0.39% │   49.56 ms │  5930 │   8.36 µs ± 20.97  (  1.19 ‥ 1075.74) │ cuMemAllocFromPoolAsync                                │
│    0.31% │   39.88 ms │  2228 │   17.9 µs ± 6.56   ( 10.73 ‥ 70.1)    │ cuMemcpyDtoHAsync                                      │
│    0.24% │   30.21 ms │ 10263 │   2.94 µs ± 1.94   (  1.43 ‥ 41.96)   │ cuMemFreeAsync                                         │
│    0.13% │   16.78 ms │   572 │  29.33 µs ± 15.69  (  7.63 ‥ 93.94)   │ cuMemcpyDtoDAsync                                      │
│    0.10% │    12.9 ms │   784 │  16.46 µs ± 13.86  (  2.15 ‥ 76.77)   │ cudaMemsetAsync                                        │
│    0.01% │    1.49 ms │     8 │ 186.15 µs ± 132.94 ( 38.15 ‥ 478.51)  │ cudaMalloc                                             │
│    0.01% │  710.49 µs │    73 │   9.73 µs ± 1.78   (  1.91 ‥ 11.44)   │ cudaStreamSynchronize                                  │
│    0.00% │  535.01 µs │  1102 │ 485.49 ns ± 468.02 (   0.0 ‥ 2384.19) │ cudaGetLastError                                       │
│    0.00% │  430.11 µs │  1144 │ 375.97 ns ± 314.69 (   0.0 ‥ 1668.93) │ cuCtxPushCurrent                                       │
│    0.00% │  283.72 µs │  1144 │ 248.01 ns ± 178.15 (   0.0 ‥ 953.67)  │ cuCtxPopCurrent                                        │
│    0.00% │   243.9 µs │  1144 │  213.2 ns ± 169.28 (   0.0 ‥ 715.26)  │ cuDeviceGet                                            │
│    0.00% │  240.09 µs │  1144 │ 209.87 ns ± 182.07 (   0.0 ‥ 953.67)  │ cuCtxGetDevice                                         │
│    0.00% │    84.4 µs │     1 │                                       │ cuMemGetInfo                                           │
│    0.00% │   53.41 µs │     3 │   17.8 µs ± 20.76  (  4.53 ‥ 41.72)   │ cuMemsetD32Async                                       │
│    0.00% │   12.87 µs │     6 │   2.15 µs ± 0.99   (  0.72 ‥ 3.34)    │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │    2.15 µs │     2 │   1.07 µs ± 0.84   (  0.48 ‥ 1.67)    │ cuMemPoolGetAttribute                                  │
│    0.00% │    1.91 µs │     3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaDeviceGetAttribute                                 │
│    0.00% │    1.67 µs │     1 │                                       │ cudaGetDevice                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 10.13 s (79.13% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                                              
├──────────┼────────────┼───────┼────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────
│   40.23% │     5.15 s │    67 │  76.89 ms ± 0.2    ( 76.53 ‥ 77.38)    │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cart ⋯
│   22.10% │     2.83 s │   111 │   25.5 ms ± 25.79  (  0.05 ‥ 51.91)    │ gpu__transfer_to_csc_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Ca ⋯
│    2.63% │  336.15 ms │    67 │   5.02 ms ± 0.03   (  4.96 ‥ 5.22)     │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>( ⋯
│    1.75% │  224.71 ms │     2 │ 112.35 ms ± 157.64 (  0.89 ‥ 223.82)   │ void cudss::radix_sort_ker<int, int, 1, 20, 4, 0>(int, int const*, int*, int*, in ⋯
│    1.35% │  173.45 ms │    67 │   2.59 ms ± 0.0    (  2.58 ‥ 2.6)      │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, ⋯
│    1.14% │  145.91 ms │     1 │                                        │ void cudss::radix_sort_ker<long, int, 1, 20, 4, 1>(int, long const*, int*, int*, ⋯
│    1.05% │  134.73 ms │     1 │                                        │ void cudss::map_ker<long, int, int, 128, 1, 2>(int, int const*, int const*, int c ⋯
│    1.01% │  128.83 ms │   497 │ 259.22 µs ± 155.13 ( 20.27 ‥ 382.42)   │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double ⋯
│    0.96% │  123.07 ms │     1 │                                        │ void cudss::nnz_per_col_ker<int, int, 1, 0>(int, int const*, int const*, int cons ⋯
│    0.80% │  102.91 ms │     1 │                                        │ void cudss::trans_columns_ker<int, 2, 128>(int, int const*, int const*, int*, int ⋯
│    0.74% │   95.35 ms │   147 │ 648.65 µs ± 0.62   (647.31 ‥ 650.64)   │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const* ⋯
│    0.68% │   87.66 ms │   147 │ 596.31 µs ± 3.41   (589.85 ‥ 605.11)   │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const* ⋯
│    0.57% │   72.64 ms │   147 │ 494.17 µs ± 1.89   (490.19 ‥ 499.25)   │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int con ⋯
│    0.44% │   56.56 ms │    66 │ 856.98 µs ± 1019.88 (  1.43 ‥ 7880.45) │ [copy pageable to device memory]
│    0.35% │    45.4 ms │     2 │   22.7 ms ± 32.05  (  0.04 ‥ 45.37)    │ gpu__set_coo_to_csc_map_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, ⋯
│    0.28% │   35.58 ms │     1 │                                        │ void cudss::trans_nnz_per_row_ker<int, 2, 128>(int, int const*, int const*, int*, ⋯
│    0.19% │   24.35 ms │    23 │   1.06 ms ± 0.15   (  0.67 ‥ 1.25)     │ comparator_small_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, CuDeviceA ⋯
│    0.18% │    22.6 ms │   105 │ 215.25 µs ± 46.07  ( 95.61 ‥ 312.09)   │ comparator_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, CuDeviceArray<I ⋯
│    0.17% │   22.08 ms │     1 │                                        │ void cudss::adjncy_ker<int, int, 128, 2>(int, int const*, int const*, int*, int*, ⋯
│    0.13% │   16.86 ms │   147 │ 114.69 µs ± 2.06   (112.53 ‥ 121.36)   │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int c ⋯
│    0.12% │    15.2 ms │     1 │                                        │ gpu__build_condensed_aug_symbolic_hess_kernel_(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.12% │   14.86 ms │     1 │                                        │ void cudss::map_offsets_ker<long, int, int, 128, 1>(int, int const*, int const*, ⋯
│    0.10% │   12.96 ms │   876 │  14.79 µs ± 9.87   (   6.2 ‥ 40.05)    │ [copy device to device memory]
│    0.10% │   12.72 ms │  2307 │   5.51 µs ± 159.21 (  1.43 ‥ 7647.28)  │ [copy device to pageable memory]
│    0.09% │    11.9 ms │   110 │ 108.21 µs ± 1.25   (104.43 ‥ 110.86)   │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│    0.09% │   10.97 ms │    42 │ 261.12 µs ± 35.6   (178.34 ‥ 303.03)   │ comparator_small_kernel(CuDeviceArray<Tuple<Tuple<Int32, Int32>, Int64>, 1, 1>, I ⋯
│    0.09% │   10.97 ms │    10 │    1.1 ms ± 3.17   (   0.0 ‥ 10.08)    │ void cudss::dependency_map_ker<int, int, 32>(int, int const*, int const*, int con ⋯
│    0.08% │    9.92 ms │     1 │                                        │ void cudss::xadj_ker<int, int, 128, 2>(int, int const*, int const*, int*, int*, i ⋯
│    0.08% │    9.92 ms │   938 │  10.57 µs ± 1.84   (   6.2 ‥ 12.87)    │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256 ⋯
│    0.07% │    8.64 ms │    63 │ 137.16 µs ± 17.18  ( 96.56 ‥ 161.41)   │ comparator_small_kernel(CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, Int32, Int32, I ⋯
│    0.07% │    8.43 ms │   441 │  19.11 µs ± 3.53   ( 13.83 ‥ 22.41)    │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, dou ⋯
│    0.06% │    7.75 ms │     1 │                                        │ void cudss::fwd_bwd_order_ker<long, int, 256>(int, int, int*, int*, long const*, ⋯
│    0.06% │    7.74 ms │   463 │  16.72 µs ± 3.96   (  6.68 ‥ 20.27)    │ partial_mapreduce_grid(norm, max, Float64, CartesianIndices<1, Tuple<OneTo<Int64> ⋯
│    0.06% │    7.45 ms │   551 │  13.52 µs ± 10.79  (  5.72 ‥ 31.71)    │ void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>)
│    0.06% │    7.24 ms │   132 │  54.86 µs ± 11.59  ( 25.75 ‥ 78.44)    │ comparator_kernel(CuDeviceArray<Tuple<Tuple<Int32, Int32>, Int64>, 1, 1>, Int32, ⋯
│    0.05% │    5.89 ms │   198 │  29.75 µs ± 6.54   ( 14.07 ‥ 42.68)    │ comparator_kernel(CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, Int32, Int32, Int32, ⋯
│    0.04% │    5.69 ms │    10 │  568.7 µs ± 1743.12 (  1.91 ‥ 5529.17) │ void cudss::csc_rows_ker<long, int, int, 256>(int, int, int const*, int const*, l ⋯
│    0.04% │    5.63 ms │     2 │   2.82 ms ± 0.01   (  2.81 ‥ 2.82)     │ void offsets_par_ker<int, int, int, 128, 1>(int, int*, int*, int*, int*, int)
│    0.04% │     5.1 ms │   294 │  17.33 µs ± 0.25   ( 16.69 ‥ 18.84)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.03% │    4.41 ms │    67 │  65.78 µs ± 0.31   ( 65.09 ‥ 66.76)    │ void cudss::copy_matrix_ker<long, double, int, 128>(int, int const*, long const*, ⋯
│    0.03% │    4.27 ms │   404 │  10.58 µs ± 1.68   (  8.58 ‥ 13.83)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.03% │    4.09 ms │   294 │   13.9 µs ± 2.44   (  9.54 ‥ 17.64)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.03% │    3.52 ms │   603 │   5.84 µs ± 2.56   (  1.43 ‥ 12.16)    │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic ⋯
│    0.03% │    3.43 ms │   147 │  23.31 µs ± 0.22   ( 22.89 ‥ 23.84)    │ void cudss::diag_ker<long, double, int, 256>(int, int, double*, long, double cons ⋯
│    0.02% │    3.12 ms │     1 │                                        │ void cudss::nnz_per_col_ker<int, int, 1, 1>(int, int const*, int const*, int cons ⋯
│    0.02% │    3.12 ms │     1 │                                        │ void offsets_par_ker<int, int, int, 128, 2>(int, int*, int*, int*, int*, int)
│    0.02% │    3.09 ms │     1 │                                        │ void offsets_par_ker<long, long, int, 128, 2>(long, long*, long*, int*, int*, int ⋯
│    0.02% │    3.07 ms │     1 │                                        │ void offsets_par_ker<long, long, long, 128, 2>(long, long*, long*, long*, int*, i ⋯
│    0.02% │    2.98 ms │   787 │   3.79 µs ± 7.07   (  0.95 ‥ 97.99)    │ [set device memory]
│    0.02% │    2.97 ms │   294 │  10.11 µs ± 0.29   (   9.3 ‥ 10.73)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.02% │     2.9 ms │   810 │   3.58 µs ± 0.25   (  2.86 ‥ 4.29)     │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<2, Tuple<OneTo<In ⋯
│    0.02% │    2.81 ms │     1 │                                        │ void offsets_par_ker<long, int, int, 128, 1>(long, long*, int*, int*, int*, int)
│    0.02% │    2.76 ms │   147 │  18.76 µs ± 0.27   ( 18.12 ‥ 19.79)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.02% │     2.5 ms │    18 │ 138.65 µs ± 124.72 ( 16.21 ‥ 307.08)   │ partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Bool, 1, 1>, Cart ⋯
│    0.02% │    2.46 ms │   147 │  16.76 µs ± 0.27   ( 16.21 ‥ 17.88)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│    0.02% │    2.27 ms │     1 │                                        │ void cudss::modify_update_ker<long, int, 128>(int, int, long const*, int*, int co ⋯
│    0.02% │    2.08 ms │     1 │                                        │ ⋯
gpu__build_condensed_aug_symbolic_jt_kernel_(CompilerMetadata<DynamicSize, Dynami 0.02%2.08 ms │   2269.19 µs ± 3.8    (   6.216.21)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.02%2.07 ms │   13415.47 µs ± 3.19   ( 11.6819.55)    │ gpu__transfer_hessian_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, C 0.02%2.05 ms │   3376.09 µs ± 2.49   (  5.2529.33)    │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<1, Tuple<OneTo<Int64>> 0.02%1.98 ms │   2248.85 µs ± 2.54   (  5.9611.92)    │ partial_mapreduce_grid(ComposedFunction<float, norm>, _, Float64, CartesianIndice 0.02%1.95 ms │   7912.47 µs ± 0.58   (  1.913.81)     │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::AlignedVectorScalarMu 0.01%1.91 ms │   3365.7 µs ± 4.91   (  3.5855.55)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.79 ms │   14712.17 µs ± 0.8    ( 10.7313.59)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.69 ms │   14811.41 µs ± 0.31   ( 10.9714.31)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.63 ms │   14711.12 µs ± 0.19   ( 10.7311.68)    │ gpu__diag_operation_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.62 ms │   11014.72 µs ± 0.94   ( 13.3515.97)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.61 ms │   11014.63 µs ± 0.92   ( 13.3516.21)    │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<1, Tuple<OneTo<Int6 0.01%1.6 ms │   4993.21 µs ± 0.19   (  2.623.81)     │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<2, Tuple<OneTo<Int6 0.01%1.53 ms │    5527.8 µs ± 0.18   ( 27.4228.13)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.52 ms │    5527.71 µs ± 0.23   ( 27.1828.37)    │ partial_mapreduce_grid(identity, _, Float64, 
CartesianIndices<1, Tuple<OneTo<Int6 0.01%1.45 ms │    5625.94 µs ± 0.3    ( 25.5126.46)    │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In 0.01%1.43 ms │   11812.09 µs ± 2.02   (  9.7815.02)    │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In 0.01%1.4 ms │   11012.75 µs ± 1.2    ( 10.4914.54)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.3 ms │   1478.82 µs ± 0.19   (  8.349.3)      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.29 ms │   11011.72 µs ± 0.45   ( 10.9712.64)    │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<1, Tuple<OneTo<Int6 0.01%1.23 ms │   11810.42 µs ± 1.46   (  8.5812.4)     │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In 0.01%1.22 ms │   11011.1 µs ± 0.25   ( 10.4911.68)    │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.01%1.2 ms │   11010.88 µs ± 0.65   (  9.7812.16)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.01%1.16 ms │   3413.4 µs ± 0.56   (  2.866.68)     │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<2, Tuple<OneTo<Int64>, 0.01%1.08 ms │   1149.46 µs ± 0.27   (  8.8210.49)    │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.01%1.07 ms │   1676.41 µs ± 0.86   (  5.488.11)     │ partial_mapreduce_grid(_136<promote_<Float64>>, add_sum, Float64, CartesianIndice 0.01%1.04 ms │    5518.89 µs ± 0.4    ( 18.3620.03)    │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.01%1.03 ms │   1129.23 µs ± 1.01   (  7.8711.68)    │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi 0.01%1.02 ms │     1 │                                        │ void cudss::radix_sort_ker<long, int, 1, 20, 4, 0>(int, long const*, int*, int*,  0.01%945.57 µs │   
1476.43 µs ± 0.17   (  5.966.91)     │ void cudss::perm_ker<double, int, int, 128, 1>(int, double*, double*, int*)       0.01%838.99 µs │   1475.71 µs ± 0.16   (  5.485.96)     │ void cudss::perm_ker<double, int, int, 128, 0>(int, double*, double*, int*)       0.01%803.47 µs │     1 │                                        │ void cudss::compute_hybrid_minimum_chunk_size_ker<long, double, int, 128, 1>(int, 0.01%783.21 µs │    5514.24 µs ± 0.61   ( 12.8715.97)    │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In 0.01%761.51 µs │   2203.46 µs ± 0.17   (  2.864.05)     │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<2, Tuple<OneTo<In 0.01%760.79 µs │    5713.35 µs ± 0.26   ( 12.8713.83)    │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.01%733.61 µs │    5513.34 µs ± 0.75   ( 11.6815.02)    │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In 0.01%655.65 µs │   1115.91 µs ± 1.45   (  4.297.87)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%625.37 µs │    679.33 µs ± 1.61   (  5.7210.49)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%615.36 µs │    5511.19 µs ± 1.5    (  8.8215.02)    │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In 0.00%606.54 µs │    5511.03 µs ± 0.26   ( 10.4911.68)    │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%598.91 µs │    5510.89 µs ± 1.11   (  9.7814.31)    │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In 0.00%582.93 µs │    3118.8 µs ± 23.52  (  1.9165.09)    │ aggregate_partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Int64,  0.00%579.83 µs │   1115.22 µs ± 0.34   (  4.775.96)     │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%557.18 µs │    5510.13 µs ± 0.19   (  
9.5410.49)    │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%556.71 µs │   1144.88 µs ± 0.22   (  4.536.2)      │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%544.07 µs │    559.89 µs ± 1.29   (  8.5813.11)    │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In 0.00%538.11 µs │   1673.22 µs ± 0.15   (  2.863.58)     │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<2, Tuple<OneT 0.00%518.32 µs │    559.42 µs ± 0.27   (  8.8210.01)    │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%505.45 µs │    578.87 µs ± 0.16   (  8.589.06)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%473.02 µs │    1531.53 µs ± 28.49  (  4.0578.2)     │ findall                                                                           0.00%411.51 µs │    805.14 µs ± 2.05   (  3.588.58)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%400.3 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%380.99 µs │    566.8 µs ± 0.21   (  6.447.87)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%372.65 µs │     1 │                                        │ void cudss::define_superpanel_ker<long, int, 256>(int, int, int, long const*, int 0.00%363.83 µs │    576.38 µs ± 0.42   (  5.727.63)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%337.12 µs │   1472.29 µs ± 0.16   (  1.912.62)     │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::UnalignedVectorScalar 0.00%332.83 µs │     1 │                                        │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%288.72 µs │    644.51 µs ± 0.46   (  4.056.91)     │ 
gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%282.53 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%263.93 µs │    574.63 µs ± 0.16   (  4.295.01)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%262.26 µs │     1 │                                        │ void cudss::count_dep_fwd_bwd_ker<long, int, 256>(int, int, int*, int*, long cons 0.00%241.04 µs │   1172.06 µs ± 2.76   (  1.1926.7)     │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic 0.00%232.22 µs │    574.07 µs ± 0.16   (  3.814.53)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%230.07 µs │    673.43 µs ± 0.12   (   3.13.58)     │ void cudss::independent_ker<long, double, int, double, 64, 1, 0, 0>(int, int, dou 0.00%215.77 µs │    573.79 µs ± 0.13   (  3.584.29)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%215.05 µs │    573.77 µs ± 0.14   (  3.584.05)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%214.34 µs │    573.76 µs ± 0.16   (  3.584.29)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%211.72 µs │    573.71 µs ± 0.14   (  3.584.05)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%206.71 µs │    573.63 µs ± 0.11   (  3.584.05)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%191.45 µs │    573.36 µs ± 0.1    (   3.13.58)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%190.02 µs │     1 │                                        │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%182.87 µs │     1 │                                        │ void 
cudss::updates_ker<long, int, 128>(int, int const*, int const*, long*, int c 0.00%167.85 µs │     1 │                                        │ void cudss::updates_offsets_ker<long, int, 128>(int, int const*, long const*, int 0.00%162.12 µs │    562.9 µs ± 0.14   (  2.623.1)      │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<1, Tuple<OneT 0.00%160.22 µs │    1312.32 µs ± 8.06   (  6.4437.67)    │ partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Int64, 1, 1>, Car 0.00%148.53 µs │    572.61 µs ± 0.16   (  2.382.86)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%148.06 µs │    572.6 µs ± 0.15   (  2.382.86)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%146.63 µs │    572.57 µs ± 0.14   (  2.382.86)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%143.05 µs │    572.51 µs ± 0.14   (  2.152.62)     │ gpu_compress_to_dense(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesian 0.00%135.9 µs │    572.38 µs ± 0.19   (  1.912.62)     │ gpu_kerg(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%133.04 µs │    314.29 µs ± 1.69   (  1.917.15)     │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi 0.00%124.22 µs │     1 │                                        │ void cudss::supernode_map_ker<int, 1>(int, int const*, int const*, int*, int, int 0.00%122.55 µs │    572.15 µs ± 0.18   (  1.912.38)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%115.63 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%111.1 µs │    561.98 µs ± 0.18   (  1.672.38)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%103.47 µs │    571.82 µs ± 0.16   (  1.432.15)     │ 
gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%103.47 µs │    571.82 µs ± 0.16   (  1.672.15)     │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%102.04 µs │    571.79 µs ± 0.16   (  1.432.15)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%101.33 µs │    571.78 µs ± 0.18   (  1.432.15)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%101.09 µs │    195.32 µs ± 2.87   (  1.9111.44)    │ scan                                                                              0.00%100.61 µs │    571.77 µs ± 0.13   (  1.672.15)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%100.14 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%95.61 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%87.5 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%87.26 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%87.02 µs │    312.81 µs ± 0.84   (  1.194.29)     │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic 0.00%82.73 µs │     420.68 µs ± 0.53   ( 20.0321.22)    │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi 0.00%82.49 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%77.25 µs │     238.62 µs ± 2.7    ( 36.7240.53)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%77.25 µs │     1 │                                        
│ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%76.77 µs │     238.39 µs ± 8.09   ( 32.6644.11)    │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%75.82 µs │     1 │                                        │ void cudss::supernode_map_offsets_ker<int, 1>(int, int const*, int const*, int*,  0.00%75.1 µs │    551.37 µs ± 0.16   (  1.191.67)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%73.19 µs │    551.33 µs ± 0.16   (  1.191.67)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%72.48 µs │    551.32 µs ± 0.15   (  1.191.67)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%72.24 µs │    551.31 µs ± 0.14   (  0.951.67)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%72.0 µs │    551.31 µs ± 0.15   (  1.191.67)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%70.81 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%66.52 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%63.42 µs │     231.71 µs ± 3.03   ( 29.5633.86)    │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%60.32 µs │    551.1 µs ± 0.17   (  0.951.43)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%60.08 µs │    551.09 µs ± 0.16   (  0.951.43)     │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%59.6 µs │    551.08 µs ± 0.16   (  0.951.43)     │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%59.6 µs │    551.08 µs ± 0.17   (  0.951.43)     │ 
gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T 0.00%52.45 µs │     1 │                                        │ gpu__set_con_scale_sparse_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, voi 0.00%50.07 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%50.07 µs │     1 │                                        │ gpu__set_colptr_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi 0.00%48.64 µs │     224.32 µs ± 0.67   ( 23.8424.8)     │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%48.4 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%47.92 µs │     223.96 µs ± 0.17   ( 23.8424.08)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%43.63 µs │     221.82 µs ± 7.92   ( 16.2127.42)    │ gpu__set_coo_to_colptr_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void,  0.00%40.53 µs │    123.38 µs ± 0.09   (  3.343.58)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%39.34 µs │     219.67 µs ± 0.17   ( 19.5519.79)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%38.86 µs │     1 │                                        │ void cudss::blocks_ker<long, int, 128>(int, int const*, long const*, int const*,  0.00%38.15 µs │     57.63 µs ± 2.68   (  4.7710.97)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%37.67 µs │     49.42 µs ± 2.64   (   6.212.64)    │ partial_mapreduce_grid(is_valid, _, Bool, CartesianIndices<1, Tuple<OneTo<Int64>> 0.00%35.05 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%33.86 µs │     1 │                                     
   │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%31.95 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%29.33 µs │     214.66 µs ± 3.88   ( 11.9217.4)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%29.09 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%28.13 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%27.18 µs │     1 │                                        │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%26.94 µs │     38.98 µs ± 0.28   (  8.829.3)      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%26.7 µs │     213.35 µs ± 0.0    ( 13.3513.35)    │ gpu_setindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%24.08 µs │     212.04 µs ± 0.84   ( 11.4412.64)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%23.84 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%21.7 µs │     45.42 µs ± 0.23   (  5.255.72)     │ partial_mapreduce_grid(identity, add_sum, Int64, CartesianIndices<1, Tuple<OneTo< 0.00%20.74 µs │     210.37 µs ± 0.17   ( 10.2510.49)    │ gpu_setindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%18.84 µs │     36.28 µs ± 0.14   (   6.26.44)     │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%18.12 µs │     1 │                                        │ void cudss::offsets_ker<int, 1>(int, int*)                                        0.00%17.88 µs │     28.94 µs ± 0.17   (  8.829.06)     
│ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%15.5 µs │     35.17 µs ± 0.96   (  4.296.2)      │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%15.02 µs │     27.51 µs ± 0.51   (  7.157.87)     │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic 0.00%14.54 µs │     34.85 µs ± 0.77   (  4.295.72)     │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%14.31 µs │     27.15 µs ± 1.01   (  6.447.87)     │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%14.07 µs │     1 │                                        │ gpu__force_lower_triangular_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, v 0.00%12.87 µs │     26.44 µs ± 0.34   (   6.26.68)     │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%12.64 µs │     34.21 µs ± 0.28   (  4.054.53)     │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%12.4 µs │     1 │                                        │ partial_mapreduce_grid(identity, _, Int64, CartesianIndices<1, Tuple<OneTo<Int64> 0.00%12.16 µs │     43.04 µs ± 0.23   (  2.863.34)     │ partial_mapreduce_grid(identity, add_sum, Int64, CartesianIndices<2, Tuple<OneTo< 0.00%11.44 µs │     1 │                                        │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%10.97 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%10.73 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%10.25 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%10.25 µs │     1 │                                        
│ void cudss::nnz_count_ker<long, int, 128>(int, int const*, int const*, long*, lon 0.00%8.34 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%7.87 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%6.91 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%6.44 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%6.2 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%6.2 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%6.2 µs │     1 │                                        │ partial_mapreduce_grid(identity, _, Int64, CartesianIndices<2, Tuple<OneTo<Int64> 0.00%6.2 µs │     1 │                                        │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn 0.00%5.48 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%5.48 µs │     1 │                                        │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices 0.00%5.01 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%4.29 µs │     1 │                                        │ void cudss::supernode_dependant_ker<int, 128>(int, int*, int*, int*, int*)        0.00%4.29 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%4.05 µs │     1 
│                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%3.58 µs │     1 │                                        │ void cudss::set_default_ker<int, 128>(int, int*)                                  0.00%3.58 µs │     1 │                                        │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car 0.00%3.1 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.86 µs │     1 │                                        │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.86 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.62 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.62 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.62 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.38 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.38 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.38 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%2.38 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu 0.00%1.91 µs │     1 │                                        │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, 
CartesianIndices<1, Tu 
└──────────┴────────────┴───────┴────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────

@sshin23
Member

sshin23 commented Jan 11, 2025

Now I think I see where the bottleneck is coming from. The variable tf appears in a large number of constraints. The kernels that are currently the bottleneck perform the compression from COO to CSC: for each CSC entry we run a serial for loop over the COO entries mapped to it. So far we hadn't hit a problem where a large number of uncompressed COO entries map to a single CSC entry.
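For illustration only, here is a Python sketch of that compression scheme (the names are hypothetical, not MadNLP's actual kernel). Each CSC entry is handled by one "thread"; the inner loop length equals the number of duplicate COO entries folded into that CSC entry, so a column like tf that appears in almost every constraint serializes one thread:

```python
def compress_coo_to_csc(coo_values, csc_to_coo_map):
    """Sum the COO values folded into each CSC entry.

    csc_to_coo_map[k] lists the indices of the uncompressed COO entries
    that map to CSC entry k (duplicates from repeated (i, j) coordinates).
    """
    csc_values = []
    for coo_indices in csc_to_coo_map:   # on the GPU: one thread per CSC entry
        acc = 0.0
        for idx in coo_indices:          # serial loop; length = #duplicates
            acc += coo_values[idx]
        csc_values.append(acc)
    return csc_values

# A variable appearing in many constraints produces ONE CSC entry whose
# index list is very long, so a single thread does most of the work:
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
mapping = [[0, 1, 2, 3], [4]]   # entry 0 absorbs four duplicate COO entries
print(compress_coo_to_csc(vals, mapping))   # [10.0, 5.0]
```

With this layout, the kernel's runtime is dominated by the single longest index list, which matches the hot `gpu__set_coo_to_csc_map_kernel_` entry in the profile above.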

@jbcaillau

@sshin23 good point. This is typical of problems with free final time (tf), with the effect that, contrary to what I answered to @amontoison, the column of derivatives w.r.t. tf contains a long vector collinear with ones. A possibility is to recast the problem as one with fixed final time, at the price of adding the trivial ODE $t_f' = 0$ and as many variables tf[i], i = 1 to N + 1.
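For the record, this recast is the standard time-rescaling trick; a sketch under my own notation, not necessarily the exact formulation used in these examples:

```latex
% Free final time: \dot{x}(t) = f(x(t), u(t)) on [0, t_f], with t_f a
% single decision variable -> one dense column of derivatives w.r.t. t_f.
% Rescale time s = t / t_f and promote t_f to a state with trivial dynamics:
%
%   \dot{x}(s) = t_f(s)\, f(x(s), u(s)), \qquad \dot{t_f}(s) = 0, \qquad s \in [0, 1].
%
% After discretization on N intervals this introduces variables tf[i],
% i = 1, ..., N + 1, linked by tf[i+1] = tf[i], so the single dense
% Jacobian column is replaced by a sparse, banded structure.
```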

@amontoison
Member Author

amontoison commented Jan 15, 2025

@jbcaillau @sshin23 @frapac
I tested the Goddard problem with our new solver MadNCL: the elapsed time is halved when we solve the KKT systems with the K2r formulation (augmented system) instead of K1s (normal equations).
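Schematically, the trade-off looks as follows (standard interior-point linear algebra; the exact K2r/K1s definitions are MadNCL internals and may differ):

```latex
% Augmented (K2-type) system: factorize a symmetric indefinite matrix,
% keeping the Jacobian J sparse as stored:
%
%   \begin{pmatrix} W + \Sigma + \delta_w I & J^\top \\ J & -\delta_c I \end{pmatrix}
%   \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
%   = - \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}
%
% Normal-equation (K1-type) system: eliminate \Delta x first and factorize
%
%   J \, (W + \Sigma + \delta_w I)^{-1} J^\top + \delta_c I.
%
% A dense column in J (here, the derivatives w.r.t. tf) makes this condensed
% matrix much denser, which would explain why K2r is faster on this problem.
```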

julia> CUDA.@profile MadNCL.solve!(solver)  # cuDSS + K2r
MadNCL algorithm

Total number of variables............................:      350008
Total number of constraints..........................:      450011

outer  inner     objective    inf_pr   inf_du    η        μ       ρ 
   0      0 -5.8505490e+01 1.02e-08 1.09e-09 1.00e-01 1.0e-01 1.00e+02
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
 50 -5.8505490e+01 1.84e+03 1.00e+00  -1.0 5.00e-01    -  1.00e+00 1.00e+00h  1
 51 -5.5166787e+01 1.11e-01 1.09e-01  -1.0 1.84e+03    -  9.79e-01 1.00e+00h  1
   1     51 -5.9272031e+01 5.50e-02 1.09e-01 2.00e-03 2.0e-02 1.00e+02
 51 -4.6956298e+01 1.11e-01 5.50e+00  -1.7 1.84e+03    -  9.79e-01 1.00e+00h  1
 52 -4.9640344e+01 9.55e-02 2.55e+00  -1.7 3.04e+01    -  2.41e-01 1.00e+00h  1
 53 -5.1953591e+01 2.77e-02 2.43e-01  -1.7 3.81e+00    -  7.63e-01 1.00e+00h  1
   2     53 -6.7131864e+01 3.11e-02 2.43e-01 2.00e-03 2.0e-02 1.00e+03
 53  1.2008474e+01 2.77e-02 2.80e+01  -1.7 3.81e+00    -  7.63e-01 1.00e+00h  1
 54 -1.5580249e+00 2.10e-02 3.59e+00  -1.7 5.33e+00    -  8.58e-01 8.72e-01h  1
 55  8.6557668e+01 4.09e-03 6.72e-02  -1.7 3.59e+01    -  9.21e-01 1.00e+00h  1
   3     55 -9.8649629e+01 1.57e-02 6.72e-02 2.00e-03 2.0e-02 1.00e+04
 55  1.6508346e+03 4.09e-03 1.41e+02  -1.7 3.59e+01    -  9.21e-01 1.00e+00h  1
 56  2.4071415e+02 3.74e-02 4.74e-02  -1.7 5.39e+01    -  1.00e+00 1.00e+00h  1
   4     56 -4.4719603e+01 3.84e-03 4.74e-02 2.00e-03 2.0e-02 1.00e+05
 56  2.7638501e+03 3.74e-02 3.45e+02  -1.7 5.39e+01    -  1.00e+00 1.00e+00h  1
 57  1.7232742e+02 3.58e-02 1.71e-01  -1.7 3.28e+01    -  1.00e+00 1.00e+00h  1
   5     57 -1.1885306e+01 1.15e-03 1.71e-01 4.00e-04 4.0e-03 1.00e+05
 57  5.3821207e+02 3.58e-02 1.15e+02  -2.4 3.28e+01    -  1.00e+00 1.00e+00h  1
 58  1.1329393e+02 3.42e-02 5.30e-02  -2.4 7.88e+00    -  9.94e-01 1.00e+00h  1
   6     58 -4.0055188e+00 3.22e-04 5.30e-02 8.00e-05 8.0e-04 1.00e+05
 58  1.4238671e+02 3.42e-02 3.22e+01  -3.1 7.88e+00    -  9.94e-01 1.00e+00h  1
 59  3.3451678e+01 3.76e-02 5.50e-02  -3.1 2.23e+00    -  8.96e-01 1.00e+00h  1
   7     59 -1.7762561e+00 1.06e-04 5.50e-02 8.00e-05 8.0e-04 1.00e+06
 59  4.3728795e+01 3.76e-02 9.52e+01  -3.1 2.23e+00    -  8.96e-01 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
 60  3.1460631e+01 1.66e-03 7.04e-01  -3.1 5.90e-01    -  1.00e+00 1.00e+00h  1
 61  3.3545820e+01 9.10e-05 1.95e-04  -3.1 2.39e-01    -  1.00e+00 1.00e+00h  1
   8     61 -1.6436231e+00 6.74e-05 1.95e-04 1.60e-05 1.6e-04 1.00e+06
 61  4.7776421e+01 9.10e-05 6.74e+01  -3.8 2.39e-01    -  1.00e+00 1.00e+00h  1
 62  6.4907327e+00 3.69e-02 6.74e-01  -3.8 7.72e-01    -  9.80e-01 1.00e+00h  1
 63  7.4583765e+00 2.79e-04 7.20e-02  -3.8 1.41e+00    -  1.00e+00 1.00e+00h  1
 64  7.3868438e+00 4.18e-06 9.48e-04  -3.8 2.60e-01    -  1.00e+00 1.00e+00h  1
   9     64 -1.1206517e+00 1.16e-05 9.48e-04 3.20e-06 3.2e-05 1.00e+06
 64  8.2356167e+00 4.18e-06 1.16e+01  -4.5 2.60e-01    -  1.00e+00 1.00e+00h  1
 65  7.6833037e-01 3.68e-03 5.86e-01  -4.5 8.87e-01    -  1.00e+00 1.00e+00h  1
 66  5.1726664e-01 4.57e-04 8.08e-03  -4.5 5.67e-02  -3.4 1.00e+00 1.00e+00h  1
 67  4.9360534e-01 8.48e-06 1.70e-03  -4.5 1.30e-01  -3.9 1.00e+00 1.00e+00h  1
  10     67 -1.0245727e+00 5.49e-06 1.70e-03 3.20e-06 3.2e-05 1.00e+07
 67  1.0188374e+00 8.48e-06 4.95e+01  -4.5 1.30e-01  -3.9 1.00e+00 1.00e+00h  1
 68 -5.9019994e-02 4.95e-03 7.09e+00  -4.5 2.49e+00    -  4.80e-01 1.00e+00H  1
 69 -1.6284289e-02 6.94e-04 1.66e-02  -4.5 1.12e+00    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
 70 -1.7971694e-02 1.09e-04 8.74e-04  -4.5 7.59e-01    -  1.00e+00 1.00e+00h  1
  11     70 -1.0248683e+00 7.54e-06 8.74e-04 3.20e-06 3.2e-05 1.00e+08
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
 70  1.3064956e+00 1.09e-04 6.79e+02  -4.5 7.59e-01    -  1.00e+00 1.00e+00h  1
 71 -2.2420547e-01 5.05e-03 2.93e+01  -4.5 4.97e-01    -  4.20e-01 1.00e+00h  1
 72 -6.2036534e-02 5.85e-04 1.15e-03  -4.5 1.05e-01    -  1.00e+00 1.00e+00h  1
  12     72 -1.0196324e+00 1.27e-06 1.15e-03 6.40e-07 6.4e-06 1.00e+08
 72  3.8344221e-01 5.85e-04 1.27e+02  -5.2 1.05e-01    -  1.00e+00 1.00e+00h  1
 73 -1.0994926e+00 7.86e-02 1.36e+00  -5.2 1.18e+00    -  7.82e-01 1.00e+00h  1
 74 -1.0290048e+00 7.50e-04 7.83e-02  -5.2 8.21e-01    -  1.00e+00 1.00e+00h  1
 75 -1.0506458e+00 3.24e-04 1.06e-02  -5.2 3.54e-01    -  1.00e+00 1.00e+00h  1
 76 -1.0508501e+00 1.89e-06 1.40e-05  -5.2 2.62e-02    -  1.00e+00 1.00e+00h  1
  13     76 -1.0063088e+00 3.52e-07 1.40e-05 1.16e-07 1.2e-06 1.00e+08
 76 -1.0142486e+00 1.89e-06 3.52e+01  -5.9 2.62e-02    -  1.00e+00 1.00e+00h  1
 77 -1.2114770e+00 2.09e-02 8.82e-02  -5.9 5.31e-01    -  8.86e-01 1.00e+00h  1
 78 -1.2541092e+00 6.25e-03 3.20e-02  -5.9 4.56e-01    -  1.00e+00 7.58e-01h  1
 79 -1.2627063e+00 9.16e-04 1.19e-02  -5.9 2.89e-01    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
 80 -1.2656895e+00 1.31e-04 8.35e-04  -5.9 1.52e-01    -  1.00e+00 1.00e+00h  1
 81 -1.2666296e+00 1.34e-05 9.43e-05  -5.9 4.21e-02    -  1.00e+00 1.00e+00h  1
 82 -1.2668871e+00 9.50e-07 7.85e-06  -5.9 1.86e-02    -  1.00e+00 1.00e+00h  1
  14     82 -1.0018805e+00 2.03e-07 7.85e-06 1.65e-08 1.6e-07 1.00e+08
 82 -1.2196642e+00 9.50e-07 2.03e+01  -6.8 1.86e-02    -  1.00e+00 1.00e+00h  1
 83 -1.2374161e+00 6.10e-04 1.31e+01  -6.8 5.37e-01    -  9.26e-01 3.55e-01h  1
 84 -1.2779401e+00 6.36e-03 3.47e-02  -6.8 9.97e-01    -  1.00e+00 1.00e+00h  1
 85 -1.2759068e+00 4.48e-04 5.78e-05  -6.8 3.97e-01    -  1.00e+00 1.00e+00h  1
 86 -1.2758056e+00 8.54e-06 4.20e-06  -6.8 7.20e-02    -  1.00e+00 1.00e+00h  1
 87 -1.2758043e+00 1.55e-09 1.08e-09  -6.8 1.58e-03    -  1.00e+00 1.00e+00h  1
  15     87 -1.0004278e+00 2.17e-07 1.08e-09 1.77e-09 1.8e-08 1.00e+08
Profiler ran for 4.12 s, capturing 867992 events.

Host-side activity: calling CUDA APIs took 396.24 ms (9.63% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                        │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────┤
│   27.88% │     1.15 s │  6350 │ 180.68 µs ± 2271.95 ( 0.48 ‥ 31003.48)  │ cuStreamSynchronize         │
│    1.24% │   50.96 ms │ 10296 │   4.95 µs ± 0.98   (  3.58 ‥ 31.71)     │ cuLaunchKernel              │
│    0.65% │   26.62 ms │  1987 │   13.4 µs ± 1.27   ( 11.44 ‥ 24.08)     │ cuMemcpyDtoHAsync           │
│    0.38% │   15.76 ms │  3356 │   4.69 µs ± 0.62   (  3.81 ‥ 11.44)     │ cudaLaunchKernel            │
│    0.36% │   14.91 ms │  4286 │   3.48 µs ± 1.33   (  1.43 ‥ 19.31)     │ cuMemAllocFromPoolAsync     │
│    0.32% │   13.32 ms │    55 │ 242.25 µs ± 42.7   (222.68 ‥ 383.38)    │ cuMemcpyHtoDAsync           │
│    0.16% │    6.53 ms │   652 │  10.02 µs ± 1.79   (  7.63 ‥ 28.85)     │ cuMemcpyDtoDAsync           │
│    0.10% │    4.16 ms │   331 │  12.58 µs ± 2.96   (  8.58 ‥ 23.6)      │ cudaMemcpyAsync             │
│    0.06% │    2.54 ms │   504 │   5.04 µs ± 2.05   (  2.38 ‥ 31.23)     │ cudaMemsetAsync             │
│    0.02% │    1.02 ms │   420 │   2.44 µs ± 0.59   (  1.43 ‥ 5.48)      │ cuMemFreeAsync              │
│    0.01% │  413.18 µs │  1276 │ 323.81 ns ± 183.02 (   0.0 ‥ 2384.19)   │ cudaStreamGetCaptureInfo_v2 │
│    0.01% │  373.36 µs │   108 │   3.46 µs ± 0.76   (  2.38 ‥ 7.15)      │ cudaFuncGetAttributes       │
│    0.01% │  360.73 µs │   147 │   2.45 µs ± 0.29   (  1.91 ‥ 3.34)      │ cudaStreamSynchronize       │
│    0.01% │  280.14 µs │   292 │ 959.39 ns ± 556.05 (238.42 ‥ 3814.7)    │ cudaEventRecord             │
│    0.00% │  205.76 µs │  1380 │  149.1 ns ± 172.67 (   0.0 ‥ 1430.51)   │ cudaGetLastError            │
│    0.00% │  158.07 µs │  1304 │ 121.22 ns ± 133.07 (   0.0 ‥ 476.84)    │ cuCtxPushCurrent            │
│    0.00% │  144.48 µs │  1304 │  110.8 ns ± 132.49 (   0.0 ‥ 476.84)    │ cuCtxPopCurrent             │
│    0.00% │  115.16 µs │  1304 │  88.31 ns ± 129.11 (   0.0 ‥ 476.84)    │ cuCtxGetDevice              │
│    0.00% │  104.43 µs │  1304 │  80.08 ns ± 123.01 (   0.0 ‥ 476.84)    │ cuDeviceGet                 │
│    0.00% │   70.57 µs │    80 │ 882.15 ns ± 664.03 (238.42 ‥ 3337.86)   │ cudaFuncSetAttribute        │
│    0.00% │   37.43 µs │    92 │ 406.87 ns ± 167.8  (238.42 ‥ 953.67)    │ cudaMalloc                  │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 2.62 s (63.56% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   21.10% │  868.11 ms │    40 │  21.7 ms ± 0.35   ( 21.13 ‥ 22.93)   │ void cudss::update_ker<long, double, int, double, 256, 1, 0, 0>(int, int, double*, double*, int const*, int const*, int*, int con
│    9.43% │  387.97 ms │    40 │   9.7 ms ± 0.02   (  9.67 ‥ 9.78)    │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, in
│    5.83% │   239.9 ms │    40 │   6.0 ms ± 0.05   (  5.85 ‥ 6.07)    │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, 4>(int, int, int, int, double*, double*, long c
│    5.60% │  230.33 ms │    92 │   2.5 ms ± 0.0    (   2.5 ‥ 2.51)    │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const*, int, int, double*, double*, int const*, long c
│    5.39% │  221.61 ms │    40 │  5.54 ms ± 0.0    (  5.53 ‥ 5.55)    │ void cudss::kernel<cudss::getrf_params_<double, 2, 256, 1, 64, 64, 68, 16, 1, 1>>(int, int, void*, int, void*, int, int, int, int
│    4.37% │  179.82 ms │    92 │  1.95 ms ± 0.01   (  1.94 ‥ 1.97)    │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int const*, int, int, double*, int const*, long const
│    3.67% │  150.89 ms │    92 │  1.64 ms ± 0.0    (  1.63 ‥ 1.65)    │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int const*, int, int, double*, int const*, long const*,
│    3.27% │  134.45 ms │    92 │  1.46 ms ± 0.03   (  1.41 ‥ 1.53)    │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const*, int const*, int, int, double*, double*, int co
│    0.76% │   31.46 ms │   131 │ 240.18 µs ± 178.03 ( 27.66 ‥ 430.35) │ gpu__transfer_to_map_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, Dy
│    0.64% │    26.3 ms │    92 │ 285.92 µs ± 0.96  (283.72 ‥ 288.25)  │ void trsv_lt_exec<double, 32u, 32u, 4u, true, false>(int, double const*, long, double*, long, int*)
│    0.47% │   19.32 ms │    92 │ 210.03 µs ± 0.83  (208.14 ‥ 212.67)  │ void trsv_ln_exec<double, 32u, 32u, 4u, true>(int, double const*, long, double*, long, int*)
│    0.29% │   12.04 ms │    55 │ 218.89 µs ± 27.14 (206.71 ‥ 309.71)  │ [copy pageable to device memory]
│    0.20% │    8.31 ms │   836 │  9.93 µs ± 4.32   (  5.48 ‥ 20.27)   │ [copy device to device memory]
│    0.12% │    4.82 ms │   236 │ 20.41 µs ± 6.66   (  12.4 ‥ 28.61)   │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, double, double, double, double, void>(cusparse::Ker
│    0.11% │    4.57 ms │   344 │ 13.28 µs ± 2.83   (  7.15 ‥ 18.36)   │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>
│    0.11% │    4.55 ms │   184 │ 24.74 µs ± 9.56   ( 14.54 ‥ 36.24)   │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int cons
│    0.11% │    4.34 ms │   924 │   4.7 µs ± 1.85   (  2.62 ‥ 10.73)   │ _34(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, Broadcasted<CuArrayStyle<1l, DeviceMemory>, Tuple<OneTo<Int64>>, identity, D
│    0.10% │    4.24 ms │   420 │  10.1 µs ± 2.05   (   6.2 ‥ 12.16)   │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, 
│    0.10% │    4.02 ms │    40 │ 100.39 µs ± 0.79  ( 98.94 ‥ 102.52)  │ void cudss::finalize_permute_ker<int, double, 256>(long, double*, long, int*, int*, int*, int*, int, int, int)
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
julia> CUDA.@profile MadNCL.solve!(solver2)  # cuDSS + K1s
MadNCL algorithm

Total number of variables............................:      350008
Total number of constraints..........................:      450011

outer  inner     objective    inf_pr   inf_du    η        μ       ρ 
   0      0 -5.8505490e+01 1.55e-09 1.08e-09 1.00e-01 1.0e-01 1.00e+02
178 -5.8505490e+01 1.84e+03 1.00e+00  -1.0 5.00e-01    -  1.00e+00 1.00e+00h  1
179 -5.5166787e+01 1.11e-01 1.09e-01  -1.0 1.84e+03    -  9.79e-01 1.00e+00h  1
   1    179 -5.9272031e+01 5.50e-02 1.09e-01 2.00e-03 2.0e-02 1.00e+02
179 -4.6956298e+01 1.11e-01 5.50e+00  -1.7 1.84e+03    -  9.79e-01 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
180 -4.9640344e+01 9.55e-02 2.55e+00  -1.7 3.04e+01    -  2.41e-01 1.00e+00h  1
181 -5.1953591e+01 2.77e-02 2.43e-01  -1.7 3.81e+00    -  7.63e-01 1.00e+00h  1
   2    181 -6.7131864e+01 3.11e-02 2.43e-01 2.00e-03 2.0e-02 1.00e+03
181  1.2008474e+01 2.77e-02 2.80e+01  -1.7 3.81e+00    -  7.63e-01 1.00e+00h  1
182 -1.5580249e+00 2.10e-02 3.59e+00  -1.7 5.33e+00    -  8.58e-01 8.72e-01h  1
183  8.6557668e+01 4.09e-03 6.72e-02  -1.7 3.59e+01    -  9.21e-01 1.00e+00h  1
   3    183 -9.8649629e+01 1.57e-02 6.72e-02 2.00e-03 2.0e-02 1.00e+04
183  1.6508346e+03 4.09e-03 1.41e+02  -1.7 3.59e+01    -  9.21e-01 1.00e+00h  1
184  2.4071412e+02 3.74e-02 4.74e-02  -1.7 5.39e+01    -  1.00e+00 1.00e+00h  1
   4    184 -4.4719599e+01 3.84e-03 4.74e-02 2.00e-03 2.0e-02 1.00e+05
184  2.7638497e+03 3.74e-02 3.45e+02  -1.7 5.39e+01    -  1.00e+00 1.00e+00h  1
185  1.7232743e+02 3.58e-02 1.71e-01  -1.7 3.28e+01    -  1.00e+00 1.00e+00h  1
   5    185 -1.1885306e+01 1.15e-03 1.71e-01 4.00e-04 4.0e-03 1.00e+05
185  5.3821209e+02 3.58e-02 1.15e+02  -2.4 3.28e+01    -  1.00e+00 1.00e+00h  1
186  1.1329393e+02 3.42e-02 5.30e-02  -2.4 7.88e+00    -  9.94e-01 1.00e+00h  1
   6    186 -4.0055187e+00 3.22e-04 5.30e-02 8.00e-05 8.0e-04 1.00e+05
186  1.4238670e+02 3.42e-02 3.22e+01  -3.1 7.88e+00    -  9.94e-01 1.00e+00h  1
187  3.3451678e+01 3.76e-02 5.50e-02  -3.1 2.23e+00    -  8.96e-01 1.00e+00h  1
   7    187 -1.7762561e+00 1.06e-04 5.50e-02 8.00e-05 8.0e-04 1.00e+06
187  4.3728795e+01 3.76e-02 9.52e+01  -3.1 2.23e+00    -  8.96e-01 1.00e+00h  1
188  3.1460631e+01 1.66e-03 7.04e-01  -3.1 5.90e-01    -  1.00e+00 1.00e+00h  1
189  3.3545820e+01 9.10e-05 1.95e-04  -3.1 2.39e-01    -  1.00e+00 1.00e+00h  1
   8    189 -1.6436231e+00 6.74e-05 1.95e-04 1.60e-05 1.6e-04 1.00e+06
189  4.7776421e+01 9.10e-05 6.74e+01  -3.8 2.39e-01    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
190  6.4907326e+00 3.69e-02 6.74e-01  -3.8 7.72e-01    -  9.80e-01 1.00e+00h  1
191  7.4583766e+00 2.79e-04 7.20e-02  -3.8 1.41e+00    -  1.00e+00 1.00e+00h  1
192  7.3868438e+00 4.18e-06 9.48e-04  -3.8 2.60e-01    -  1.00e+00 1.00e+00h  1
   9    192 -1.1206517e+00 1.16e-05 9.48e-04 3.20e-06 3.2e-05 1.00e+06
192  8.2356167e+00 4.18e-06 1.16e+01  -4.5 2.60e-01    -  1.00e+00 1.00e+00h  1
193  7.6833040e-01 3.68e-03 5.86e-01  -4.5 8.87e-01    -  1.00e+00 1.00e+00h  1
194  1.5744121e-01 2.44e-03 3.38e-01  -4.5 8.78e-01  -5.0 1.00e+00 1.00e+00h  1
195 -3.1471483e-01 1.39e-03 1.57e-01  -4.5 4.42e-01  -4.6 1.00e+00 1.00e+00h  1
196 -6.3479489e-01 4.24e-04 2.88e-02  -4.5 1.99e-01  -4.2 1.00e+00 1.00e+00h  1
197 -1.2323752e+00 3.12e-03 2.00e+00  -4.5 1.15e+00  -4.7 1.00e+00 5.00e-01h  2
198 -2.0016308e+00 3.30e-03 5.70e-02  -4.5 4.60e-01  -4.2 1.00e+00 1.00e+00h  1
199 -2.3625247e+00 4.59e-03 8.95e-01  -4.5 1.56e+02    -  1.13e-02 2.06e-02h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
200 -2.6550833e+00 4.33e-03 2.03e-01  -4.5 4.77e+00    -  4.13e-01 2.83e-01h  1
201 -2.8621516e+00 1.51e-03 1.06e-02  -4.5 6.39e-01    -  1.00e+00 1.00e+00h  1
202 -2.8558890e+00 6.83e-05 3.90e-04  -4.5 2.34e-01    -  1.00e+00 1.00e+00h  1
  10    202 -1.0279300e+00 3.80e-05 3.90e-04 3.20e-06 3.2e-05 1.00e+07
202  2.1027269e+01 6.83e-05 3.42e+02  -4.5 2.34e-01    -  1.00e+00 1.00e+00h  1
203  3.1390477e-01 2.86e-02 1.20e+00  -4.5 2.36e+00    -  8.01e-01 1.00e+00h  1
204  2.4291486e-01 2.40e-02 1.98e+00  -4.5 4.32e+01    -  6.58e-01 1.58e-01h  1
205 -2.4868202e-02 7.69e-03 1.51e+00  -4.5 2.24e+00    -  4.01e-01 1.00e+00h  1
206 -1.7954904e-02 2.28e-05 5.91e-03  -4.5 5.43e+00    -  1.00e+00 1.00e+00h  1
207 -1.7298746e-02 8.88e-07 1.89e-04  -4.5 9.90e-01    -  1.00e+00 1.00e+00h  1
  11    207 -1.0248676e+00 7.52e-06 1.89e-04 3.20e-06 3.2e-05 1.00e+08
207  1.3043705e+00 8.88e-07 6.77e+02  -4.5 9.90e-01    -  1.00e+00 1.00e+00h  1
208 -2.2420396e-01 5.05e-03 2.91e+01  -4.5 4.94e-01    -  4.22e-01 1.00e+00h  1
209 -6.2104087e-02 5.83e-04 1.13e-03  -4.5 1.01e-01    -  1.00e+00 1.00e+00h  1
  12    209 -1.0196317e+00 1.27e-06 1.13e-03 6.40e-07 6.4e-06 1.00e+08
209  3.8332369e-01 5.83e-04 1.27e+02  -5.2 1.01e-01    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
210 -1.0994861e+00 7.86e-02 1.36e+00  -5.2 1.18e+00    -  7.82e-01 1.00e+00h  1
211 -1.0290005e+00 7.50e-04 7.83e-02  -5.2 8.21e-01    -  1.00e+00 1.00e+00h  1
212 -1.0506425e+00 3.24e-04 1.06e-02  -5.2 3.54e-01    -  1.00e+00 1.00e+00h  1
213 -1.0508466e+00 1.90e-06 1.40e-05  -5.2 2.62e-02    -  1.00e+00 1.00e+00h  1
  13    213 -1.0063088e+00 3.52e-07 1.40e-05 1.16e-07 1.2e-06 1.00e+08
213 -1.0142439e+00 1.90e-06 3.52e+01  -5.9 2.62e-02    -  1.00e+00 1.00e+00h  1
214 -1.2114719e+00 2.09e-02 8.82e-02  -5.9 5.31e-01    -  8.86e-01 1.00e+00h  1
215 -1.2541043e+00 6.25e-03 3.20e-02  -5.9 4.56e-01    -  1.00e+00 7.58e-01h  1
216 -1.2627006e+00 9.16e-04 1.19e-02  -5.9 2.89e-01    -  1.00e+00 1.00e+00h  1
217 -1.2656838e+00 1.31e-04 8.35e-04  -5.9 1.52e-01    -  1.00e+00 1.00e+00h  1
218 -1.2666239e+00 1.34e-05 9.43e-05  -5.9 4.21e-02    -  1.00e+00 1.00e+00h  1
219 -1.2668814e+00 9.50e-07 7.85e-06  -5.9 1.86e-02    -  1.00e+00 1.00e+00h  1
  14    219 -1.0018806e+00 2.03e-07 7.85e-06 1.65e-08 1.6e-07 1.00e+08
219 -1.2196587e+00 9.50e-07 2.03e+01  -6.8 1.86e-02    -  1.00e+00 1.00e+00h  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
220 -1.2374110e+00 6.10e-04 1.31e+01  -6.8 5.37e-01    -  9.26e-01 3.55e-01h  1
221 -1.2779345e+00 6.36e-03 3.47e-02  -6.8 9.97e-01    -  1.00e+00 1.00e+00h  1
222 -1.2759012e+00 4.48e-04 5.78e-05  -6.8 4.44e-01    -  1.00e+00 1.00e+00h  1
223 -1.2757999e+00 1.02e-05 4.20e-06  -6.8 8.63e-02    -  1.00e+00 1.00e+00h  1
224 -1.2757987e+00 1.55e-09 1.08e-09  -6.8 1.58e-03    -  1.00e+00 1.00e+00h  1
  15    224 -1.0004278e+00 2.17e-07 1.08e-09 1.77e-09 1.8e-08 1.00e+08
Profiler ran for 8.07 s, capturing 1422567 events.

Host-side activity: calling CUDA APIs took 1.11 s (13.71% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                        │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────┤
│   41.85% │     3.38 s │ 10075 │ 335.06 µs ± 4523.92 ( 0.48 ‥ 63260.79)  │ cuStreamSynchronize         │
│    1.01% │   81.73 ms │ 16579 │   4.93 µs ± 0.97   (  3.58 ‥ 33.62)     │ cuLaunchKernel              │
│    0.60% │   48.36 ms │  3587 │  13.48 µs ± 2.32   ( 11.44 ‥ 123.98)    │ cuMemcpyDtoHAsync           │
│    0.37% │   29.65 ms │  9804 │   3.02 µs ± 1.43   (  1.19 ‥ 78.68)     │ cuMemAllocFromPoolAsync     │
│    0.20% │   16.13 ms │    65 │ 248.09 µs ± 47.4   (221.97 ‥ 399.35)    │ cuMemcpyHtoDAsync           │
│    0.20% │   15.78 ms │  3266 │   4.83 µs ± 0.83   (  3.81 ‥ 20.27)     │ cudaLaunchKernel            │
│    0.19% │   15.26 ms │  8278 │   1.84 µs ± 1.0    (  0.95 ‥ 62.47)     │ cuMemFreeAsync              │
│    0.06% │     4.6 ms │   351 │  13.12 µs ± 3.5    (  8.34 ‥ 31.95)     │ cudaMemcpyAsync             │
│    0.06% │    4.56 ms │   406 │  11.24 µs ± 1.42   (  8.11 ‥ 18.12)     │ cuMemcpyDtoDAsync           │
│    0.04% │    2.99 ms │   553 │   5.41 µs ± 3.0    (  2.38 ‥ 38.86)     │ cudaMemsetAsync             │
│    0.01% │   448.7 µs │   129 │   3.48 µs ± 0.83   (  2.62 ‥ 7.15)      │ cudaFuncGetAttributes       │
│    0.01% │  437.74 µs │   183 │   2.39 µs ± 0.29   (  1.67 ‥ 4.05)      │ cudaStreamSynchronize       │
│    0.00% │  208.62 µs │   645 │ 323.44 ns ± 189.77 (   0.0 ‥ 1430.51)   │ cudaStreamGetCaptureInfo_v2 │
│    0.00% │  167.13 µs │   129 │    1.3 µs ± 0.93   (  0.48 ‥ 3.58)      │ cudaEventRecord             │
│    0.00% │  165.46 µs │   894 │ 185.08 ns ± 303.1  (   0.0 ‥ 6675.72)   │ cudaGetLastError            │
│    0.00% │  129.22 µs │   812 │ 159.14 ns ± 139.64 (   0.0 ‥ 715.26)    │ cuCtxPushCurrent            │
│    0.00% │  111.58 µs │   812 │ 137.41 ns ± 135.58 (   0.0 ‥ 476.84)    │ cuCtxPopCurrent             │
│    0.00% │   79.15 µs │   812 │  97.48 ns ± 127.59 (   0.0 ‥ 476.84)    │ cuCtxGetDevice              │
│    0.00% │   62.47 µs │   820 │  76.18 ns ± 122.52 (   0.0 ‥ 476.84)    │ cuDeviceGet                 │
│    0.00% │  238.42 ns │     9 │  26.49 ns ± 79.47  (   0.0 ‥ 238.42)    │ cuDeviceGetCount            │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 5.0 s (62.03% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   51.61% │     4.16 s │    55 │ 75.69 ms ± 0.13   ( 75.42 ‥ 75.97)   │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1
│    3.41% │  275.07 ms │    55 │   5.0 ms ± 0.04   (  4.96 ‥ 5.18)    │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, in
│    1.76% │  141.62 ms │    55 │  2.57 ms ± 0.0    (  2.56 ‥ 2.59)    │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, 4>(int, int, int, int, double*, double*, long c
│    1.36% │  109.76 ms │   397 │ 276.49 µs ± 134.17 ( 17.4 ‥ 351.91)  │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int cons
│    0.67% │   54.43 ms │    84 │ 647.98 µs ± 0.57  (646.83 ‥ 650.41)  │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const*, int, int, double*, double*, int const*, long c
│    0.61% │   49.03 ms │    84 │ 583.74 µs ± 2.94  (578.17 ‥ 592.71)  │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const*, int const*, int, int, double*, double*, int co
│    0.53% │   42.56 ms │    84 │ 506.61 µs ± 1.97  (503.06 ‥ 512.12)  │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int const*, int, int, double*, int const*, long const*,
│    0.21% │    17.3 ms │   109 │ 158.68 µs ± 161.73 ( 17.88 ‥ 344.28) │ gpu__transfer_to_map_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, Dy
│    0.18% │   14.43 ms │    65 │ 221.97 µs ± 29.14 (207.19 ‥ 308.75)  │ [copy pageable to device memory]
│    0.12% │     9.6 ms │    84 │ 114.26 µs ± 2.47  ( 111.1 ‥ 122.79)  │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int const*, int, int, double*, int const*, long const
│    0.10% │    7.91 ms │  1511 │  5.23 µs ± 0.21   (  4.77 ‥ 6.2)     │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>, V
│    0.09% │    6.86 ms │   649 │ 10.57 µs ± 1.69   (   6.2 ‥ 12.16)   │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, 
│    0.08% │    6.69 ms │  3770 │  1.77 µs ± 0.17   (  1.43 ‥ 2.38)    │ [copy device to pageable memory]
│    0.07% │    5.82 ms │   574 │ 10.14 µs ± 4.84   (  5.72 ‥ 20.27)   │ [copy device to device memory]
│    0.06% │    5.24 ms │   719 │  7.28 µs ± 3.73   (  3.58 ‥ 17.17)   │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Ll1ELl1EE11BroadcastedI12CuArrayStyleILl1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S
│    0.06% │     4.9 ms │  1511 │  3.24 µs ± 0.16   (  2.62 ‥ 3.81)    │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<2l, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2l, Tuple<One
│    0.06% │    4.66 ms │  1511 │  3.08 µs ± 0.52   (  2.38 ‥ 5.48)    │ _34(CuKernelContext, CuDeviceArray<Bool, 1l, 1l>, Broadcasted<CuArrayStyle<1l, DeviceMemory>, Tuple<OneTo<Int64>>, _19<CuArraySty
│    0.06% │    4.65 ms │   252 │ 18.47 µs ± 4.2    (  12.4 ‥ 23.37)   │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, double, double, double, double, void>(cusparse::Ker
│    0.06% │    4.51 ms │   355 │ 12.72 µs ± 2.46   (  7.15 ‥ 15.74)   │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>
│    0.05% │    4.26 ms │    92 │ 46.26 µs ± 1.98   ( 43.39 ‥ 49.35)   │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float64Ll1ELl1EE11BroadcastedI12CuArrayStyleILl1E12DeviceMemoryE5TupleI5OneToI5In
│    0.04% │    2.99 ms │    55 │ 54.39 µs ± 0.31   ( 53.64 ‥ 55.07)   │ void cudss::copy_matrix_ker<long, double, int, 128>(int, int const*, long const*, double const*, double*, int)
│    0.04% │    2.86 ms │   551 │  5.18 µs ± 2.01   (  1.19 ‥ 9.3)     │ _6(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, Float64)
│    0.03% │    2.79 ms │   845 │   3.3 µs ± 0.39   (  2.62 ‥ 4.77)    │ getindex_kernel(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, CuDeviceArray<Float64, 1l, 1l>, Tuple<Int64>, CuDeviceArray<CuDe
│    0.03% │    2.79 ms │   253 │ 11.01 µs ± 6.02   (  4.77 ‥ 19.79)   │ void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>)
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

@amontoison
Copy link
Member Author

amontoison commented Jan 15, 2025

Note that I got this error when I tried to check the Jacobian on the GPU:

julia> jac(exa2, exa2.meta.x0)
ERROR: GPU compilation of MethodInstance for ExaModelsKernelAbstractions.gpu_kerj(::KernelAbstractions.CompilerMetadata{…}, ::Vector{…}, ::Vector{…}, ::ExaModels.SIMDFunction{…}, ::CuDeviceVector{…}, ::Nothing, ::Float64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type Vector{Int64}, which is not isbits:
  .ref is of type MemoryRef{Int64} which is not isbits.
    .mem is of type Memory{Int64} which is not isbits.


Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/validation.jl:92
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:92 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; toplevel::Bool, libraries::Bool, optimize::Bool, cleanup::Bool, validate::Bool, strip::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:90
  [5] codegen
    @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:82 [inlined]
  [6] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:79
  [7] compile
    @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:74 [inlined]
  [8] #1145
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:250 [inlined]
  [9] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:34
 [10] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:25
 [11] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:249
 [12] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:237
 [13] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:151
 [14] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:380 [inlined]
 [15] macro expansion
    @ ./lock.jl:273 [inlined]
 [16] cufunction(f::typeof(ExaModelsKernelAbstractions.gpu_kerj), tt::Type{Tuple{…}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:375
 [17] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:112 [inlined]
 [18] (::KernelAbstractions.Kernel{…})(::Vector{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/packages/CUDA/2kjXI/src/CUDAKernels.jl:103
 [19] Kernel
    @ ~/.julia/packages/CUDA/2kjXI/src/CUDAKernels.jl:89 [inlined]
 [20] sjacobian!
    @ ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:504 [inlined]
 [21] _jac_structure!(backend::CUDABackend, cons::ExaModels.Constraint{ExaModels.Constraint{…}, ExaModels.SIMDFunction{…}, CuArray{…}, Int64}, rows::Vector{Int64}, cols::Vector{Int64})
    @ ExaModelsKernelAbstractions ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:175
 [22] jac_structure!
    @ ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:170 [inlined]
 [23] jac_structure
    @ ~/.julia/packages/NLPModels/uC4QP/src/nlp/api.jl:171 [inlined]
 [24] jac(nlp::ExaModel{Float64, CuArray{…}, ExaModelsKernelAbstractions.KAExtension{…}, ExaModels.Objective{…}, ExaModels.Constraint{…}}, x::CuArray{Float64, 1, CUDA.DeviceMemory})
    @ NLPModels ~/.julia/packages/NLPModels/uC4QP/src/nlp/api.jl:271
 [25] top-level scope
    @ REPL[33]:1
Some type information was truncated. Use `show(err)` to see complete types.

I was still able to compute it on the CPU, and we get the following sparsity pattern:
[Screenshot: Jacobian sparsity pattern, 2025-01-15 09-34-13]
I can form J' * J in less than one second, but J * J' takes a very long time because part of that matrix is completely dense.

@jbcaillau Does it make sense to reformulate the problem to avoid the dense column and/or to use a different KKT formulation?
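A toy structural count shows why the two products behave so differently (pure Python; a hypothetical banded Jacobian plus one dense column standing in for the tf derivatives, not the actual Goddard data). Structurally, (J'J)[i,j] is nonzero when columns i and j share a row, so a dense column only densifies one row and one column of the result; (JJ')[i,j] is nonzero when rows i and j share a column, and every pair of rows shares the tf column, so JJ' is completely dense.

```python
# Structural nnz of A*A' and A'*A for a banded Jacobian plus one dense
# column (toy model of the free-final-time structure discussed above).

def pattern_nnz_AAt(rows):
    """nnz of A*A' given the column-support (a set) of each row of A."""
    m = len(rows)
    return sum(1 for i in range(m) for j in range(m) if rows[i] & rows[j])

def pattern_nnz_AtA(rows, n):
    """nnz of A'*A given the column-support of each row of A."""
    cols = [set() for _ in range(n)]   # row-support of each column
    for i, r in enumerate(rows):
        for c in r:
            cols[c].add(i)
    return sum(1 for a in range(n) for b in range(n) if cols[a] & cols[b])

m, n = 200, 202
# banded rows {i, i+1}, plus the shared dense column n-1 (the "tf" column)
rows = [{i, i + 1, n - 1} for i in range(m)]
print(pattern_nnz_AAt(rows))      # -> 40000: J*J' is structurally dense (m*m)
print(pattern_nnz_AtA(rows, n))   # -> 1004:  J'*J stays sparse
```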

@amontoison
Copy link
Member Author

amontoison commented Jan 15, 2025

I confirm that the dense column is the culprit!

[Screenshot: 2025-01-15 09-51-38]
[Screenshot: 2025-01-15 09-51-56]
