Optimize sparse condensed Hessian kernels on GPU #399
Comments
Could you also post the problem script?
Yes, it's the content of this file with
@jbcaillau, do you have a dense row in the constraint Jacobian of your Goddard problem?
@amontoison what should we make of the first two calls
@amontoison as should be clear from the constraint structure in goddard-exa2.jl, each Jacobian row should be very sparse. A typical constraint involves, at line [...] I am also writing a pure ADNLPModels version on which it will be easy to check the sparsity structure.
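One way to answer the dense-row question is to count stored entries per Jacobian row. The sketch below does this on the CPU with the SparseArrays standard library; the toy matrix and the helper name `nnz_per_row` are made up for illustration and are not taken from goddard-exa2.jl. It also computes the average row density implied by the solver log further down in this thread (1350017 Jacobian nonzeros for 300007 + 150004 constraints).

```julia
using SparseArrays

# Hypothetical toy Jacobian in COO form (row indices, column indices, values).
# In practice these would come from the model's Jacobian structure.
rows = [1, 1, 2, 2, 3, 3, 3]
cols = [1, 2, 2, 3, 1, 3, 4]
vals = ones(length(rows))
J = sparse(rows, cols, vals, 3, 4)

# Count stored entries per row. SparseMatrixCSC is column-oriented, so we walk
# the stored row indices instead of slicing rows one by one.
function nnz_per_row(A::SparseMatrixCSC)
    counts = zeros(Int, size(A, 1))
    for i in rowvals(A)
        counts[i] += 1
    end
    return counts
end

counts = nnz_per_row(J)
densest = maximum(counts)
println("nnz per row: ", counts, ", densest row: ", densest, " entries")

# Average density from the figures printed in the solver log of this thread.
avg_nnz_per_row = 1350017 / (300007 + 150004)
println("average Jacobian nnz per row in the log: ", round(avg_nnz_per_row; digits=2))
```

The average comes out at about 3 entries per row, which is consistent with very sparse rows, although an average alone cannot rule out a single dense row; the per-row maximum is the number to look at.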
It might be good to check whether (1) global-scope variables are causing the performance issue and (2) CUDA.@profile itself is affecting the timings. In my experience, the bottleneck for this type of problem is the symbolic factorization. I don't see anything in the problem that would cause a performance bottleneck.
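For point (1), a minimal sketch of the three usual ways a function can get at a problem size in Julia: a non-constant global, a `const` global, and an argument. All names here are hypothetical and unrelated to the actual Goddard script; the point is only that the three variants compute the same value while differing in what the compiler can infer.

```julia
# Hypothetical names throughout; this is not taken from the Goddard script.

# Non-constant global: its type may change at any time, so functions that
# read it cannot be specialized on its type.
n_global = 1_000

# Constant global: the type is fixed, so the compiler can specialize.
const N_CONST = 1_000

# Same computation, reading the size from three different places.
half_sum_global() = sum(i -> 0.5 * i, 1:n_global)
half_sum_const() = sum(i -> 0.5 * i, 1:N_CONST)
half_sum_arg(n) = sum(i -> 0.5 * i, 1:n)

results = (half_sum_global(), half_sum_const(), half_sum_arg(1_000))
println(results)
```

In the REPL, `@code_warntype half_sum_global()` flags the untyped global read, whereas the `const` and argument versions infer a concrete return type. Whether this matters for a given model script depends on whether the globals are read inside hot code paths.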
It means that these two kernels in MadNLP.jl are not efficient and should be improved (if this is not related to global scope).
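To put a number on this, the figures from the device-side profile table further down in this thread can be combined with a few lines of arithmetic. The sketch assumes that the two kernels meant here are `gpu__transfer_jtsj_kernel_` and `gpu__transfer_to_csc_kernel_`, the top two entries of that table; the variable names are illustrative.

```julia
# Figures copied from the profiler output in this thread.
trace_s = 12.8        # "Profiler ran for 12.8 s"
gpu_busy_s = 10.13    # "GPU was busy for 10.13 s"
jtsj_s, jtsj_calls = 5.15, 67    # gpu__transfer_jtsj_kernel_
csc_s, csc_calls = 2.83, 111     # gpu__transfer_to_csc_kernel_

# Combined cost of the two kernels, relative to the trace and to GPU-busy time.
two_kernels_s = jtsj_s + csc_s
share_of_trace = two_kernels_s / trace_s
share_of_busy = two_kernels_s / gpu_busy_s

# Mean time per launch, in milliseconds.
jtsj_ms_per_call = 1000 * jtsj_s / jtsj_calls
csc_ms_per_call = 1000 * csc_s / csc_calls

println("two kernels: ", round(two_kernels_s; digits=2), " s")
println("share of trace: ", round(100 * share_of_trace; digits=1), " %")
println("share of GPU-busy time: ", round(100 * share_of_busy; digits=1), " %")
println("ms per call: ", round(jtsj_ms_per_call; digits=1), " and ", round(csc_ms_per_call; digits=1))
```

Together the two kernels account for roughly 62% of the trace and about 79% of the time the GPU was busy, which dwarfs the cuDSS factorization kernels listed in the same table.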
@sshin23 @amontoison regarding point (1), I have changed the variables from global scope to constants in goddard-exa.jl, so there should not be any issue on this side. Below is the new run, still with
julia> CUDA.@profile madnlp(exa2; tol=tol)
This is MadNLP version v0.8.5, running with cuDSS v0.4.0
Number of nonzeros in constraint Jacobian............: 1350017
Number of nonzeros in Lagrangian Hessian.............: 1300008
Total number of variables............................: 350008
variables with only lower bounds: 0
variables with lower and upper bounds: 0
variables with only upper bounds: 0
Total number of equality constraints.................: 300007
Total number of inequality constraints...............: 150004
inequality constraints with only lower bounds: 50002
inequality constraints with lower and upper bounds: 100002
inequality constraints with only upper bounds: 0
[...]
Number of Iterations....: 55
(scaled) (unscaled)
Objective...............: -1.0264249724384082e+00 -1.0264249724384082e+00
Dual infeasibility......: 9.1620584206984594e-13 9.1620584206984594e-13
Constraint violation....: 8.1824980736335345e-10 8.1824980736335345e-10
Complementarity.........: 9.0909098505047791e-09 9.0909098505047791e-09
Overall NLP error.......: 9.0909098505047791e-09 9.0909098505047791e-09
Number of objective function evaluations = 56
Number of objective gradient evaluations = 56
Number of constraint evaluations = 56
Number of constraint Jacobian evaluations = 56
Number of Lagrangian Hessian evaluations = 55
Total wall-clock secs in solver (w/o fun. eval./lin. alg.) = 11.129
Total wall-clock secs in linear solver = 1.444
Total wall-clock secs in NLP function evaluations = 0.231
Total wall-clock secs = 12.803
EXIT: Optimal Solution Found (tol = 1.0e-07).
Profiler ran for 12.8 s, capturing 1265706 events.
Host-side activity: calling CUDA APIs took 2.39 s (18.69% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ 66.94% │ 8.57 s │ 7479 │ 1.15 ms ± 8.37 ( 0.0 ‥ 76.55) │ cuStreamSynchronize │
│ 5.79% │ 741.08 ms │ 386 │ 1.92 ms ± 22.82 ( 0.01 ‥ 401.2) │ cudaMemcpyAsync │
│ 1.14% │ 146.36 ms │ 14 │ 10.45 ms ± 38.78 ( 0.02 ‥ 145.19) │ cudaFree │
│ 1.05% │ 134.48 ms │ 12206 │ 11.02 µs ± 15.52 ( 2.86 ‥ 1585.01) │ cuLaunchKernel │
│ 0.53% │ 68.31 ms │ 63 │ 1.08 ms ± 1.04 ( 0.71 ‥ 8.08) │ cuMemcpyHtoDAsync │
│ 0.43% │ 54.86 ms │ 4714 │ 11.64 µs ± 5.37 ( 3.58 ‥ 57.46) │ cudaLaunchKernel │
│ 0.39% │ 49.56 ms │ 5930 │ 8.36 µs ± 20.97 ( 1.19 ‥ 1075.74) │ cuMemAllocFromPoolAsync │
│ 0.31% │ 39.88 ms │ 2228 │ 17.9 µs ± 6.56 ( 10.73 ‥ 70.1) │ cuMemcpyDtoHAsync │
│ 0.24% │ 30.21 ms │ 10263 │ 2.94 µs ± 1.94 ( 1.43 ‥ 41.96) │ cuMemFreeAsync │
│ 0.13% │ 16.78 ms │ 572 │ 29.33 µs ± 15.69 ( 7.63 ‥ 93.94) │ cuMemcpyDtoDAsync │
│ 0.10% │ 12.9 ms │ 784 │ 16.46 µs ± 13.86 ( 2.15 ‥ 76.77) │ cudaMemsetAsync │
│ 0.01% │ 1.49 ms │ 8 │ 186.15 µs ± 132.94 ( 38.15 ‥ 478.51) │ cudaMalloc │
│ 0.01% │ 710.49 µs │ 73 │ 9.73 µs ± 1.78 ( 1.91 ‥ 11.44) │ cudaStreamSynchronize │
│ 0.00% │ 535.01 µs │ 1102 │ 485.49 ns ± 468.02 ( 0.0 ‥ 2384.19) │ cudaGetLastError │
│ 0.00% │ 430.11 µs │ 1144 │ 375.97 ns ± 314.69 ( 0.0 ‥ 1668.93) │ cuCtxPushCurrent │
│ 0.00% │ 283.72 µs │ 1144 │ 248.01 ns ± 178.15 ( 0.0 ‥ 953.67) │ cuCtxPopCurrent │
│ 0.00% │ 243.9 µs │ 1144 │ 213.2 ns ± 169.28 ( 0.0 ‥ 715.26) │ cuDeviceGet │
│ 0.00% │ 240.09 µs │ 1144 │ 209.87 ns ± 182.07 ( 0.0 ‥ 953.67) │ cuCtxGetDevice │
│ 0.00% │ 84.4 µs │ 1 │ │ cuMemGetInfo │
│ 0.00% │ 53.41 µs │ 3 │ 17.8 µs ± 20.76 ( 4.53 ‥ 41.72) │ cuMemsetD32Async │
│ 0.00% │ 12.87 µs │ 6 │ 2.15 µs ± 0.99 ( 0.72 ‥ 3.34) │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│ 0.00% │ 2.15 µs │ 2 │ 1.07 µs ± 0.84 ( 0.48 ‥ 1.67) │ cuMemPoolGetAttribute │
│ 0.00% │ 1.91 µs │ 3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaDeviceGetAttribute │
│ 0.00% │ 1.67 µs │ 1 │ │ cudaGetDevice │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────────────────────────────────────────┘
Device-side activity: GPU was busy for 10.13 s (79.13% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────
│ 40.23% │ 5.15 s │ 67 │ 76.89 ms ± 0.2 ( 76.53 ‥ 77.38) │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cart ⋯
│ 22.10% │ 2.83 s │ 111 │ 25.5 ms ± 25.79 ( 0.05 ‥ 51.91) │ gpu__transfer_to_csc_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Ca ⋯
│ 2.63% │ 336.15 ms │ 67 │ 5.02 ms ± 0.03 ( 4.96 ‥ 5.22) │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>( ⋯
│ 1.75% │ 224.71 ms │ 2 │ 112.35 ms ± 157.64 ( 0.89 ‥ 223.82) │ void cudss::radix_sort_ker<int, int, 1, 20, 4, 0>(int, int const*, int*, int*, in ⋯
│ 1.35% │ 173.45 ms │ 67 │ 2.59 ms ± 0.0 ( 2.58 ‥ 2.6) │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, ⋯
│ 1.14% │ 145.91 ms │ 1 │ │ void cudss::radix_sort_ker<long, int, 1, 20, 4, 1>(int, long const*, int*, int*, ⋯
│ 1.05% │ 134.73 ms │ 1 │ │ void cudss::map_ker<long, int, int, 128, 1, 2>(int, int const*, int const*, int c ⋯
│ 1.01% │ 128.83 ms │ 497 │ 259.22 µs ± 155.13 ( 20.27 ‥ 382.42) │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double ⋯
│ 0.96% │ 123.07 ms │ 1 │ │ void cudss::nnz_per_col_ker<int, int, 1, 0>(int, int const*, int const*, int cons ⋯
│ 0.80% │ 102.91 ms │ 1 │ │ void cudss::trans_columns_ker<int, 2, 128>(int, int const*, int const*, int*, int ⋯
│ 0.74% │ 95.35 ms │ 147 │ 648.65 µs ± 0.62 (647.31 ‥ 650.64) │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const* ⋯
│ 0.68% │ 87.66 ms │ 147 │ 596.31 µs ± 3.41 (589.85 ‥ 605.11) │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const* ⋯
│ 0.57% │ 72.64 ms │ 147 │ 494.17 µs ± 1.89 (490.19 ‥ 499.25) │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int con ⋯
│ 0.44% │ 56.56 ms │ 66 │ 856.98 µs ± 1019.88 ( 1.43 ‥ 7880.45) │ [copy pageable to device memory] ⋯
│ 0.35% │ 45.4 ms │ 2 │ 22.7 ms ± 32.05 ( 0.04 ‥ 45.37) │ gpu__set_coo_to_csc_map_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, ⋯
│ 0.28% │ 35.58 ms │ 1 │ │ void cudss::trans_nnz_per_row_ker<int, 2, 128>(int, int const*, int const*, int*, ⋯
│ 0.19% │ 24.35 ms │ 23 │ 1.06 ms ± 0.15 ( 0.67 ‥ 1.25) │ comparator_small_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, CuDeviceA ⋯
│ 0.18% │ 22.6 ms │ 105 │ 215.25 µs ± 46.07 ( 95.61 ‥ 312.09) │ comparator_kernel(Tuple<CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, CuDeviceArray<I ⋯
│ 0.17% │ 22.08 ms │ 1 │ │ void cudss::adjncy_ker<int, int, 128, 2>(int, int const*, int const*, int*, int*, ⋯
│ 0.13% │ 16.86 ms │ 147 │ 114.69 µs ± 2.06 (112.53 ‥ 121.36) │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int c ⋯
│ 0.12% │ 15.2 ms │ 1 │ │ gpu__build_condensed_aug_symbolic_hess_kernel_(CompilerMetadata<DynamicSize, Dyna ⋯
│ 0.12% │ 14.86 ms │ 1 │ │ void cudss::map_offsets_ker<long, int, int, 128, 1>(int, int const*, int const*, ⋯
│ 0.10% │ 12.96 ms │ 876 │ 14.79 µs ± 9.87 ( 6.2 ‥ 40.05) │ [copy device to device memory] ⋯
│ 0.10% │ 12.72 ms │ 2307 │ 5.51 µs ± 159.21 ( 1.43 ‥ 7647.28) │ [copy device to pageable memory] ⋯
│ 0.09% │ 11.9 ms │ 110 │ 108.21 µs ± 1.25 (104.43 ‥ 110.86) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.09% │ 10.97 ms │ 42 │ 261.12 µs ± 35.6 (178.34 ‥ 303.03) │ comparator_small_kernel(CuDeviceArray<Tuple<Tuple<Int32, Int32>, Int64>, 1, 1>, I ⋯
│ 0.09% │ 10.97 ms │ 10 │ 1.1 ms ± 3.17 ( 0.0 ‥ 10.08) │ void cudss::dependency_map_ker<int, int, 32>(int, int const*, int const*, int con ⋯
│ 0.08% │ 9.92 ms │ 1 │ │ void cudss::xadj_ker<int, int, 128, 2>(int, int const*, int const*, int*, int*, i ⋯
│ 0.08% │ 9.92 ms │ 938 │ 10.57 µs ± 1.84 ( 6.2 ‥ 12.87) │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256 ⋯
│ 0.07% │ 8.64 ms │ 63 │ 137.16 µs ± 17.18 ( 96.56 ‥ 161.41) │ comparator_small_kernel(CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, Int32, Int32, I ⋯
│ 0.07% │ 8.43 ms │ 441 │ 19.11 µs ± 3.53 ( 13.83 ‥ 22.41) │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, dou ⋯
│ 0.06% │ 7.75 ms │ 1 │ │ void cudss::fwd_bwd_order_ker<long, int, 256>(int, int, int*, int*, long const*, ⋯
│ 0.06% │ 7.74 ms │ 463 │ 16.72 µs ± 3.96 ( 6.68 ‥ 20.27) │ partial_mapreduce_grid(norm, max, Float64, CartesianIndices<1, Tuple<OneTo<Int64> ⋯
│ 0.06% │ 7.45 ms │ 551 │ 13.52 µs ± 10.79 ( 5.72 ‥ 31.71) │ void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>) ⋯
│ 0.06% │ 7.24 ms │ 132 │ 54.86 µs ± 11.59 ( 25.75 ‥ 78.44) │ comparator_kernel(CuDeviceArray<Tuple<Tuple<Int32, Int32>, Int64>, 1, 1>, Int32, ⋯
│ 0.05% │ 5.89 ms │ 198 │ 29.75 µs ± 6.54 ( 14.07 ‥ 42.68) │ comparator_kernel(CuDeviceArray<Tuple<Int64, Int64>, 1, 1>, Int32, Int32, Int32, ⋯
│ 0.04% │ 5.69 ms │ 10 │ 568.7 µs ± 1743.12 ( 1.91 ‥ 5529.17) │ void cudss::csc_rows_ker<long, int, int, 256>(int, int, int const*, int const*, l ⋯
│ 0.04% │ 5.63 ms │ 2 │ 2.82 ms ± 0.01 ( 2.81 ‥ 2.82) │ void offsets_par_ker<int, int, int, 128, 1>(int, int*, int*, int*, int*, int) ⋯
│ 0.04% │ 5.1 ms │ 294 │ 17.33 µs ± 0.25 ( 16.69 ‥ 18.84) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.03% │ 4.41 ms │ 67 │ 65.78 µs ± 0.31 ( 65.09 ‥ 66.76) │ void cudss::copy_matrix_ker<long, double, int, 128>(int, int const*, long const*, ⋯
│ 0.03% │ 4.27 ms │ 404 │ 10.58 µs ± 1.68 ( 8.58 ‥ 13.83) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.03% │ 4.09 ms │ 294 │ 13.9 µs ± 2.44 ( 9.54 ‥ 17.64) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.03% │ 3.52 ms │ 603 │ 5.84 µs ± 2.56 ( 1.43 ‥ 12.16) │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic ⋯
│ 0.03% │ 3.43 ms │ 147 │ 23.31 µs ± 0.22 ( 22.89 ‥ 23.84) │ void cudss::diag_ker<long, double, int, 256>(int, int, double*, long, double cons ⋯
│ 0.02% │ 3.12 ms │ 1 │ │ void cudss::nnz_per_col_ker<int, int, 1, 1>(int, int const*, int const*, int cons ⋯
│ 0.02% │ 3.12 ms │ 1 │ │ void offsets_par_ker<int, int, int, 128, 2>(int, int*, int*, int*, int*, int) ⋯
│ 0.02% │ 3.09 ms │ 1 │ │ void offsets_par_ker<long, long, int, 128, 2>(long, long*, long*, int*, int*, int ⋯
│ 0.02% │ 3.07 ms │ 1 │ │ void offsets_par_ker<long, long, long, 128, 2>(long, long*, long*, long*, int*, i ⋯
│ 0.02% │ 2.98 ms │ 787 │ 3.79 µs ± 7.07 ( 0.95 ‥ 97.99) │ [set device memory] ⋯
│ 0.02% │ 2.97 ms │ 294 │ 10.11 µs ± 0.29 ( 9.3 ‥ 10.73) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.02% │ 2.9 ms │ 810 │ 3.58 µs ± 0.25 ( 2.86 ‥ 4.29) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<2, Tuple<OneTo<In ⋯
│ 0.02% │ 2.81 ms │ 1 │ │ void offsets_par_ker<long, int, int, 128, 1>(long, long*, int*, int*, int*, int) ⋯
│ 0.02% │ 2.76 ms │ 147 │ 18.76 µs ± 0.27 ( 18.12 ‥ 19.79) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.02% │ 2.5 ms │ 18 │ 138.65 µs ± 124.72 ( 16.21 ‥ 307.08) │ partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Bool, 1, 1>, Cart ⋯
│ 0.02% │ 2.46 ms │ 147 │ 16.76 µs ± 0.27 ( 16.21 ‥ 17.88) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.02% │ 2.27 ms │ 1 │ │ void cudss::modify_update_ker<long, int, 128>(int, int, long const*, int*, int co ⋯
│ 0.02% │ 2.08 ms │ 1 │ │ gpu__build_condensed_aug_symbolic_jt_kernel_(CompilerMetadata<DynamicSize, Dynami ⋯
│ 0.02% │ 2.08 ms │ 226 │ 9.19 µs ± 3.8 ( 6.2 ‥ 16.21) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.02% │ 2.07 ms │ 134 │ 15.47 µs ± 3.19 ( 11.68 ‥ 19.55) │ gpu__transfer_hessian_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, C ⋯
│ 0.02% │ 2.05 ms │ 337 │ 6.09 µs ± 2.49 ( 5.25 ‥ 29.33) │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<1, Tuple<OneTo<Int64>> ⋯
│ 0.02% │ 1.98 ms │ 224 │ 8.85 µs ± 2.54 ( 5.96 ‥ 11.92) │ partial_mapreduce_grid(ComposedFunction<float, norm>, _, Float64, CartesianIndice ⋯
│ 0.02% │ 1.95 ms │ 791 │ 2.47 µs ± 0.58 ( 1.91 ‥ 3.81) │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::AlignedVectorScalarMu ⋯
│ 0.01% │ 1.91 ms │ 336 │ 5.7 µs ± 4.91 ( 3.58 ‥ 55.55) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.79 ms │ 147 │ 12.17 µs ± 0.8 ( 10.73 ‥ 13.59) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.69 ms │ 148 │ 11.41 µs ± 0.31 ( 10.97 ‥ 14.31) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.63 ms │ 147 │ 11.12 µs ± 0.19 ( 10.73 ‥ 11.68) │ gpu__diag_operation_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.62 ms │ 110 │ 14.72 µs ± 0.94 ( 13.35 ‥ 15.97) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.61 ms │ 110 │ 14.63 µs ± 0.92 ( 13.35 ‥ 16.21) │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<1, Tuple<OneTo<Int6 ⋯
│ 0.01% │ 1.6 ms │ 499 │ 3.21 µs ± 0.19 ( 2.62 ‥ 3.81) │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<2, Tuple<OneTo<Int6 ⋯
│ 0.01% │ 1.53 ms │ 55 │ 27.8 µs ± 0.18 ( 27.42 ‥ 28.13) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.52 ms │ 55 │ 27.71 µs ± 0.23 ( 27.18 ‥ 28.37) │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<1, Tuple<OneTo<Int6 ⋯
│ 0.01% │ 1.45 ms │ 56 │ 25.94 µs ± 0.3 ( 25.51 ‥ 26.46) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.01% │ 1.43 ms │ 118 │ 12.09 µs ± 2.02 ( 9.78 ‥ 15.02) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.01% │ 1.4 ms │ 110 │ 12.75 µs ± 1.2 ( 10.49 ‥ 14.54) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.3 ms │ 147 │ 8.82 µs ± 0.19 ( 8.34 ‥ 9.3) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.29 ms │ 110 │ 11.72 µs ± 0.45 ( 10.97 ‥ 12.64) │ partial_mapreduce_grid(identity, _, Float64, CartesianIndices<1, Tuple<OneTo<Int6 ⋯
│ 0.01% │ 1.23 ms │ 118 │ 10.42 µs ± 1.46 ( 8.58 ‥ 12.4) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.01% │ 1.22 ms │ 110 │ 11.1 µs ± 0.25 ( 10.49 ‥ 11.68) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.01% │ 1.2 ms │ 110 │ 10.88 µs ± 0.65 ( 9.78 ‥ 12.16) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.01% │ 1.16 ms │ 341 │ 3.4 µs ± 0.56 ( 2.86 ‥ 6.68) │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<2, Tuple<OneTo<Int64>, ⋯
│ 0.01% │ 1.08 ms │ 114 │ 9.46 µs ± 0.27 ( 8.82 ‥ 10.49) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.01% │ 1.07 ms │ 167 │ 6.41 µs ± 0.86 ( 5.48 ‥ 8.11) │ partial_mapreduce_grid(_136<promote_<Float64>>, add_sum, Float64, CartesianIndice ⋯
│ 0.01% │ 1.04 ms │ 55 │ 18.89 µs ± 0.4 ( 18.36 ‥ 20.03) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.01% │ 1.03 ms │ 112 │ 9.23 µs ± 1.01 ( 7.87 ‥ 11.68) │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi ⋯
│ 0.01% │ 1.02 ms │ 1 │ │ void cudss::radix_sort_ker<long, int, 1, 20, 4, 0>(int, long const*, int*, int*, ⋯
│ 0.01% │ 945.57 µs │ 147 │ 6.43 µs ± 0.17 ( 5.96 ‥ 6.91) │ void cudss::perm_ker<double, int, int, 128, 1>(int, double*, double*, int*) ⋯
│ 0.01% │ 838.99 µs │ 147 │ 5.71 µs ± 0.16 ( 5.48 ‥ 5.96) │ void cudss::perm_ker<double, int, int, 128, 0>(int, double*, double*, int*) ⋯
│ 0.01% │ 803.47 µs │ 1 │ │ void cudss::compute_hybrid_minimum_chunk_size_ker<long, double, int, 128, 1>(int, ⋯
│ 0.01% │ 783.21 µs │ 55 │ 14.24 µs ± 0.61 ( 12.87 ‥ 15.97) │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.01% │ 761.51 µs │ 220 │ 3.46 µs ± 0.17 ( 2.86 ‥ 4.05) │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<2, Tuple<OneTo<In ⋯
│ 0.01% │ 760.79 µs │ 57 │ 13.35 µs ± 0.26 ( 12.87 ‥ 13.83) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.01% │ 733.61 µs │ 55 │ 13.34 µs ± 0.75 ( 11.68 ‥ 15.02) │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.01% │ 655.65 µs │ 111 │ 5.91 µs ± 1.45 ( 4.29 ‥ 7.87) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 625.37 µs │ 67 │ 9.33 µs ± 1.61 ( 5.72 ‥ 10.49) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 615.36 µs │ 55 │ 11.19 µs ± 1.5 ( 8.82 ‥ 15.02) │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.00% │ 606.54 µs │ 55 │ 11.03 µs ± 0.26 ( 10.49 ‥ 11.68) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 598.91 µs │ 55 │ 10.89 µs ± 1.11 ( 9.78 ‥ 14.31) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.00% │ 582.93 µs │ 31 │ 18.8 µs ± 23.52 ( 1.91 ‥ 65.09) │ aggregate_partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Int64, ⋯
│ 0.00% │ 579.83 µs │ 111 │ 5.22 µs ± 0.34 ( 4.77 ‥ 5.96) │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 557.18 µs │ 55 │ 10.13 µs ± 0.19 ( 9.54 ‥ 10.49) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 556.71 µs │ 114 │ 4.88 µs ± 0.22 ( 4.53 ‥ 6.2) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 544.07 µs │ 55 │ 9.89 µs ± 1.29 ( 8.58 ‥ 13.11) │ partial_mapreduce_grid(identity, min, Float64, CartesianIndices<1, Tuple<OneTo<In ⋯
│ 0.00% │ 538.11 µs │ 167 │ 3.22 µs ± 0.15 ( 2.86 ‥ 3.58) │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<2, Tuple<OneT ⋯
│ 0.00% │ 518.32 µs │ 55 │ 9.42 µs ± 0.27 ( 8.82 ‥ 10.01) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 505.45 µs │ 57 │ 8.87 µs ± 0.16 ( 8.58 ‥ 9.06) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 473.02 µs │ 15 │ 31.53 µs ± 28.49 ( 4.05 ‥ 78.2) │ findall ⋯
│ 0.00% │ 411.51 µs │ 80 │ 5.14 µs ± 2.05 ( 3.58 ‥ 8.58) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 400.3 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 380.99 µs │ 56 │ 6.8 µs ± 0.21 ( 6.44 ‥ 7.87) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 372.65 µs │ 1 │ │ void cudss::define_superpanel_ker<long, int, 256>(int, int, int, long const*, int ⋯
│ 0.00% │ 363.83 µs │ 57 │ 6.38 µs ± 0.42 ( 5.72 ‥ 7.63) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 337.12 µs │ 147 │ 2.29 µs ± 0.16 ( 1.91 ‥ 2.62) │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::UnalignedVectorScalar ⋯
│ 0.00% │ 332.83 µs │ 1 │ │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 288.72 µs │ 64 │ 4.51 µs ± 0.46 ( 4.05 ‥ 6.91) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 282.53 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 263.93 µs │ 57 │ 4.63 µs ± 0.16 ( 4.29 ‥ 5.01) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 262.26 µs │ 1 │ │ void cudss::count_dep_fwd_bwd_ker<long, int, 256>(int, int, int*, int*, long cons ⋯
│ 0.00% │ 241.04 µs │ 117 │ 2.06 µs ± 2.76 ( 1.19 ‥ 26.7) │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic ⋯
│ 0.00% │ 232.22 µs │ 57 │ 4.07 µs ± 0.16 ( 3.81 ‥ 4.53) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 230.07 µs │ 67 │ 3.43 µs ± 0.12 ( 3.1 ‥ 3.58) │ void cudss::independent_ker<long, double, int, double, 64, 1, 0, 0>(int, int, dou ⋯
│ 0.00% │ 215.77 µs │ 57 │ 3.79 µs ± 0.13 ( 3.58 ‥ 4.29) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 215.05 µs │ 57 │ 3.77 µs ± 0.14 ( 3.58 ‥ 4.05) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 214.34 µs │ 57 │ 3.76 µs ± 0.16 ( 3.58 ‥ 4.29) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 211.72 µs │ 57 │ 3.71 µs ± 0.14 ( 3.58 ‥ 4.05) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 206.71 µs │ 57 │ 3.63 µs ± 0.11 ( 3.58 ‥ 4.05) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 191.45 µs │ 57 │ 3.36 µs ± 0.1 ( 3.1 ‥ 3.58) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 190.02 µs │ 1 │ │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 182.87 µs │ 1 │ │ void cudss::updates_ker<long, int, 128>(int, int const*, int const*, long*, int c ⋯
│ 0.00% │ 167.85 µs │ 1 │ │ void cudss::updates_offsets_ker<long, int, 128>(int, int const*, long const*, int ⋯
│ 0.00% │ 162.12 µs │ 56 │ 2.9 µs ± 0.14 ( 2.62 ‥ 3.1) │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<1, Tuple<OneT ⋯
│ 0.00% │ 160.22 µs │ 13 │ 12.32 µs ± 8.06 ( 6.44 ‥ 37.67) │ partial_scan(add_sum, CuDeviceArray<Int64, 1, 1>, CuDeviceArray<Int64, 1, 1>, Car ⋯
│ 0.00% │ 148.53 µs │ 57 │ 2.61 µs ± 0.16 ( 2.38 ‥ 2.86) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 148.06 µs │ 57 │ 2.6 µs ± 0.15 ( 2.38 ‥ 2.86) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 146.63 µs │ 57 │ 2.57 µs ± 0.14 ( 2.38 ‥ 2.86) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 143.05 µs │ 57 │ 2.51 µs ± 0.14 ( 2.15 ‥ 2.62) │ gpu_compress_to_dense(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesian ⋯
│ 0.00% │ 135.9 µs │ 57 │ 2.38 µs ± 0.19 ( 1.91 ‥ 2.62) │ gpu_kerg(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 133.04 µs │ 31 │ 4.29 µs ± 1.69 ( 1.91 ‥ 7.15) │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi ⋯
│ 0.00% │ 124.22 µs │ 1 │ │ void cudss::supernode_map_ker<int, 1>(int, int const*, int const*, int*, int, int ⋯
│ 0.00% │ 122.55 µs │ 57 │ 2.15 µs ± 0.18 ( 1.91 ‥ 2.38) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 115.63 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 111.1 µs │ 56 │ 1.98 µs ± 0.18 ( 1.67 ‥ 2.38) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 103.47 µs │ 57 │ 1.82 µs ± 0.16 ( 1.43 ‥ 2.15) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 103.47 µs │ 57 │ 1.82 µs ± 0.16 ( 1.67 ‥ 2.15) │ gpu_kerf(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 102.04 µs │ 57 │ 1.79 µs ± 0.16 ( 1.43 ‥ 2.15) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 101.33 µs │ 57 │ 1.78 µs ± 0.18 ( 1.43 ‥ 2.15) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 101.09 µs │ 19 │ 5.32 µs ± 2.87 ( 1.91 ‥ 11.44) │ scan ⋯
│ 0.00% │ 100.61 µs │ 57 │ 1.77 µs ± 0.13 ( 1.67 ‥ 2.15) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 100.14 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 95.61 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 87.5 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 87.26 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 87.02 µs │ 31 │ 2.81 µs ± 0.84 ( 1.19 ‥ 4.29) │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic ⋯
│ 0.00% │ 82.73 µs │ 4 │ 20.68 µs ± 0.53 ( 20.03 ‥ 21.22) │ gpu_linear_copy_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi ⋯
│ 0.00% │ 82.49 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 77.25 µs │ 2 │ 38.62 µs ± 2.7 ( 36.72 ‥ 40.53) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 77.25 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 76.77 µs │ 2 │ 38.39 µs ± 8.09 ( 32.66 ‥ 44.11) │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 75.82 µs │ 1 │ │ void cudss::supernode_map_offsets_ker<int, 1>(int, int const*, int const*, int*, ⋯
│ 0.00% │ 75.1 µs │ 55 │ 1.37 µs ± 0.16 ( 1.19 ‥ 1.67) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 73.19 µs │ 55 │ 1.33 µs ± 0.16 ( 1.19 ‥ 1.67) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 72.48 µs │ 55 │ 1.32 µs ± 0.15 ( 1.19 ‥ 1.67) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 72.24 µs │ 55 │ 1.31 µs ± 0.14 ( 0.95 ‥ 1.67) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 72.0 µs │ 55 │ 1.31 µs ± 0.15 ( 1.19 ‥ 1.67) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 70.81 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 66.52 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 63.42 µs │ 2 │ 31.71 µs ± 3.03 ( 29.56 ‥ 33.86) │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 60.32 µs │ 55 │ 1.1 µs ± 0.17 ( 0.95 ‥ 1.43) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 60.08 µs │ 55 │ 1.09 µs ± 0.16 ( 0.95 ‥ 1.43) │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 59.6 µs │ 55 │ 1.08 µs ± 0.16 ( 0.95 ‥ 1.43) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 59.6 µs │ 55 │ 1.08 µs ± 0.17 ( 0.95 ‥ 1.43) │ gpu_kerh2(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, T ⋯
│ 0.00% │ 52.45 µs │ 1 │ │ gpu__set_con_scale_sparse_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, voi ⋯
│ 0.00% │ 50.07 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 50.07 µs │ 1 │ │ gpu__set_colptr_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, Cartesi ⋯
│ 0.00% │ 48.64 µs │ 2 │ 24.32 µs ± 0.67 ( 23.84 ‥ 24.8) │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 48.4 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 47.92 µs │ 2 │ 23.96 µs ± 0.17 ( 23.84 ‥ 24.08) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 43.63 µs │ 2 │ 21.82 µs ± 7.92 ( 16.21 ‥ 27.42) │ gpu__set_coo_to_colptr_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, ⋯
│ 0.00% │ 40.53 µs │ 12 │ 3.38 µs ± 0.09 ( 3.34 ‥ 3.58) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 39.34 µs │ 2 │ 19.67 µs ± 0.17 ( 19.55 ‥ 19.79) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 38.86 µs │ 1 │ │ void cudss::blocks_ker<long, int, 128>(int, int const*, long const*, int const*, ⋯
│ 0.00% │ 38.15 µs │ 5 │ 7.63 µs ± 2.68 ( 4.77 ‥ 10.97) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 37.67 µs │ 4 │ 9.42 µs ± 2.64 ( 6.2 ‥ 12.64) │ partial_mapreduce_grid(is_valid, _, Bool, CartesianIndices<1, Tuple<OneTo<Int64>> ⋯
│ 0.00% │ 35.05 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 33.86 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 31.95 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 29.33 µs │ 2 │ 14.66 µs ± 3.88 ( 11.92 ‥ 17.4) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 29.09 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 28.13 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 27.18 µs │ 1 │ │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 26.94 µs │ 3 │ 8.98 µs ± 0.28 ( 8.82 ‥ 9.3) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 26.7 µs │ 2 │ 13.35 µs ± 0.0 ( 13.35 ‥ 13.35) │ gpu_setindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 24.08 µs │ 2 │ 12.04 µs ± 0.84 ( 11.44 ‥ 12.64) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 23.84 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 21.7 µs │ 4 │ 5.42 µs ± 0.23 ( 5.25 ‥ 5.72) │ partial_mapreduce_grid(identity, add_sum, Int64, CartesianIndices<1, Tuple<OneTo< ⋯
│ 0.00% │ 20.74 µs │ 2 │ 10.37 µs ± 0.17 ( 10.25 ‥ 10.49) │ gpu_setindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 18.84 µs │ 3 │ 6.28 µs ± 0.14 ( 6.2 ‥ 6.44) │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 18.12 µs │ 1 │ │ void cudss::offsets_ker<int, 1>(int, int*) ⋯
│ 0.00% │ 17.88 µs │ 2 │ 8.94 µs ± 0.17 ( 8.82 ‥ 9.06) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 15.5 µs │ 3 │ 5.17 µs ± 0.96 ( 4.29 ‥ 6.2) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 15.02 µs │ 2 │ 7.51 µs ± 0.51 ( 7.15 ‥ 7.87) │ gpu_fill_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndic ⋯
│ 0.00% │ 14.54 µs │ 3 │ 4.85 µs ± 0.77 ( 4.29 ‥ 5.72) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 14.31 µs │ 2 │ 7.15 µs ± 1.01 ( 6.44 ‥ 7.87) │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 14.07 µs │ 1 │ │ gpu__force_lower_triangular_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, v ⋯
│ 0.00% │ 12.87 µs │ 2 │ 6.44 µs ± 0.34 ( 6.2 ‥ 6.68) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 12.64 µs │ 3 │ 4.21 µs ± 0.28 ( 4.05 ‥ 4.53) │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 12.4 µs │ 1 │ │ partial_mapreduce_grid(identity, _, Int64, CartesianIndices<1, Tuple<OneTo<Int64> ⋯
│ 0.00% │ 12.16 µs │ 4 │ 3.04 µs ± 0.23 ( 2.86 ‥ 3.34) │ partial_mapreduce_grid(identity, add_sum, Int64, CartesianIndices<2, Tuple<OneTo< ⋯
│ 0.00% │ 11.44 µs │ 1 │ │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 10.97 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 10.73 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 10.25 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 10.25 µs │ 1 │ │ void cudss::nnz_count_ker<long, int, 128>(int, int const*, int const*, long*, lon ⋯
│ 0.00% │ 8.34 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 7.87 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 6.91 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 6.44 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 6.2 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 6.2 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 6.2 µs │ 1 │ │ partial_mapreduce_grid(identity, _, Int64, CartesianIndices<2, Tuple<OneTo<Int64> ⋯
│ 0.00% │ 6.2 µs │ 1 │ │ gpu_getindex_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIn ⋯
│ 0.00% │ 5.48 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 5.48 µs │ 1 │ │ gpu_map_kernel(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices ⋯
│ 0.00% │ 5.01 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 4.29 µs │ 1 │ │ void cudss::supernode_dependant_ker<int, 128>(int, int*, int*, int*, int*) ⋯
│ 0.00% │ 4.29 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 4.05 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 3.58 µs │ 1 │ │ void cudss::set_default_ker<int, 128>(int, int*) ⋯
│ 0.00% │ 3.58 µs │ 1 │ │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, Car ⋯
│ 0.00% │ 3.1 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.86 µs │ 1 │ │ gpu_kerj(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.86 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.62 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.62 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.62 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.38 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.38 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.38 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 2.38 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
│ 0.00% │ 1.91 µs │ 1 │ │ gpu_kerh(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tu ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────── |
Now I think I see where the bottleneck is coming from. The variable tf appears in a large number of constraints. The kernels that are currently the bottleneck perform the compression from COO to CSC, and for each CSC entry we run a serial for loop. So far we have not had a problem where a large number of uncompressed COO entries map to a single CSC entry. |
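To make the load imbalance described above concrete, here is a minimal CPU-only sketch in plain Julia. It is not MadNLP's actual kernel: the names `group_by_target`, `compress!`, `coo_to_csc`, `ptr` and `perm` are hypothetical, and the toy pattern (1000 COO entries hitting one CSC entry) is only an assumption meant to mimic a variable such as tf that is shared by many constraints.

```julia
# Hypothetical illustration only (plain CPU Julia, no MadNLP internals).
# coo_to_csc[k] is the index of the CSC nonzero that COO entry k contributes to.

# Group COO entries by their CSC target: the sources of CSC entry j are
# perm[ptr[j]:(ptr[j + 1] - 1)].
function group_by_target(coo_to_csc::Vector{Int}, nnz_csc::Int)
    counts = zeros(Int, nnz_csc)
    for j in coo_to_csc
        counts[j] += 1
    end
    ptr = ones(Int, nnz_csc + 1)
    for j in 1:nnz_csc
        ptr[j + 1] = ptr[j] + counts[j]
    end
    perm = sortperm(coo_to_csc)
    return ptr, perm
end

# One work item per CSC entry; each work item runs a serial loop over its sources.
function compress!(csc_vals, coo_vals, ptr, perm)
    for j in eachindex(csc_vals)            # on a GPU, j would be the thread index
        acc = zero(eltype(csc_vals))
        for p in ptr[j]:(ptr[j + 1] - 1)    # serial inner loop
            acc += coo_vals[perm[p]]
        end
        csc_vals[j] = acc
    end
    return csc_vals
end

# Toy pattern: 5 COO entries map one-to-one, then 1000 entries all hit CSC entry 6.
coo_to_csc = vcat(1:5, fill(6, 1000))
coo_vals = ones(Float64, length(coo_to_csc))
ptr, perm = group_by_target(coo_to_csc, 6)
csc_vals = compress!(zeros(6), coo_vals, ptr, perm)

work = diff(ptr)    # length of the serial loop for each CSC entry
println("max work = ", maximum(work), ", mean work = ", sum(work) / length(work))
```

In this toy case one work item does 1000 additions while the others do one each, so with one thread per CSC entry the run time would be set by the longest serial loop. A segmented reduction or atomic accumulation over COO entries are possible ways to spread that work, but whether they fit MadNLP's kernels is not something this sketch establishes.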
@sshin23 good point. this is typical of problems with free final time ( |
@jbcaillau @sshin23 @frapac
julia> CUDA.@profile MadNCL.solve!(solver) # cuDSS + K2r
MadNCL algorithm
Total number of variables............................: 350008
Total number of constraints..........................: 450011
outer inner objective inf_pr inf_du η μ ρ
0 0 -5.8505490e+01 1.02e-08 1.09e-09 1.00e-01 1.0e-01 1.00e+02
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
50 -5.8505490e+01 1.84e+03 1.00e+00 -1.0 5.00e-01 - 1.00e+00 1.00e+00h 1
51 -5.5166787e+01 1.11e-01 1.09e-01 -1.0 1.84e+03 - 9.79e-01 1.00e+00h 1
1 51 -5.9272031e+01 5.50e-02 1.09e-01 2.00e-03 2.0e-02 1.00e+02
51 -4.6956298e+01 1.11e-01 5.50e+00 -1.7 1.84e+03 - 9.79e-01 1.00e+00h 1
52 -4.9640344e+01 9.55e-02 2.55e+00 -1.7 3.04e+01 - 2.41e-01 1.00e+00h 1
53 -5.1953591e+01 2.77e-02 2.43e-01 -1.7 3.81e+00 - 7.63e-01 1.00e+00h 1
2 53 -6.7131864e+01 3.11e-02 2.43e-01 2.00e-03 2.0e-02 1.00e+03
53 1.2008474e+01 2.77e-02 2.80e+01 -1.7 3.81e+00 - 7.63e-01 1.00e+00h 1
54 -1.5580249e+00 2.10e-02 3.59e+00 -1.7 5.33e+00 - 8.58e-01 8.72e-01h 1
55 8.6557668e+01 4.09e-03 6.72e-02 -1.7 3.59e+01 - 9.21e-01 1.00e+00h 1
3 55 -9.8649629e+01 1.57e-02 6.72e-02 2.00e-03 2.0e-02 1.00e+04
55 1.6508346e+03 4.09e-03 1.41e+02 -1.7 3.59e+01 - 9.21e-01 1.00e+00h 1
56 2.4071415e+02 3.74e-02 4.74e-02 -1.7 5.39e+01 - 1.00e+00 1.00e+00h 1
4 56 -4.4719603e+01 3.84e-03 4.74e-02 2.00e-03 2.0e-02 1.00e+05
56 2.7638501e+03 3.74e-02 3.45e+02 -1.7 5.39e+01 - 1.00e+00 1.00e+00h 1
57 1.7232742e+02 3.58e-02 1.71e-01 -1.7 3.28e+01 - 1.00e+00 1.00e+00h 1
5 57 -1.1885306e+01 1.15e-03 1.71e-01 4.00e-04 4.0e-03 1.00e+05
57 5.3821207e+02 3.58e-02 1.15e+02 -2.4 3.28e+01 - 1.00e+00 1.00e+00h 1
58 1.1329393e+02 3.42e-02 5.30e-02 -2.4 7.88e+00 - 9.94e-01 1.00e+00h 1
6 58 -4.0055188e+00 3.22e-04 5.30e-02 8.00e-05 8.0e-04 1.00e+05
58 1.4238671e+02 3.42e-02 3.22e+01 -3.1 7.88e+00 - 9.94e-01 1.00e+00h 1
59 3.3451678e+01 3.76e-02 5.50e-02 -3.1 2.23e+00 - 8.96e-01 1.00e+00h 1
7 59 -1.7762561e+00 1.06e-04 5.50e-02 8.00e-05 8.0e-04 1.00e+06
59 4.3728795e+01 3.76e-02 9.52e+01 -3.1 2.23e+00 - 8.96e-01 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
60 3.1460631e+01 1.66e-03 7.04e-01 -3.1 5.90e-01 - 1.00e+00 1.00e+00h 1
61 3.3545820e+01 9.10e-05 1.95e-04 -3.1 2.39e-01 - 1.00e+00 1.00e+00h 1
8 61 -1.6436231e+00 6.74e-05 1.95e-04 1.60e-05 1.6e-04 1.00e+06
61 4.7776421e+01 9.10e-05 6.74e+01 -3.8 2.39e-01 - 1.00e+00 1.00e+00h 1
62 6.4907327e+00 3.69e-02 6.74e-01 -3.8 7.72e-01 - 9.80e-01 1.00e+00h 1
63 7.4583765e+00 2.79e-04 7.20e-02 -3.8 1.41e+00 - 1.00e+00 1.00e+00h 1
64 7.3868438e+00 4.18e-06 9.48e-04 -3.8 2.60e-01 - 1.00e+00 1.00e+00h 1
9 64 -1.1206517e+00 1.16e-05 9.48e-04 3.20e-06 3.2e-05 1.00e+06
64 8.2356167e+00 4.18e-06 1.16e+01 -4.5 2.60e-01 - 1.00e+00 1.00e+00h 1
65 7.6833037e-01 3.68e-03 5.86e-01 -4.5 8.87e-01 - 1.00e+00 1.00e+00h 1
66 5.1726664e-01 4.57e-04 8.08e-03 -4.5 5.67e-02 -3.4 1.00e+00 1.00e+00h 1
67 4.9360534e-01 8.48e-06 1.70e-03 -4.5 1.30e-01 -3.9 1.00e+00 1.00e+00h 1
10 67 -1.0245727e+00 5.49e-06 1.70e-03 3.20e-06 3.2e-05 1.00e+07
67 1.0188374e+00 8.48e-06 4.95e+01 -4.5 1.30e-01 -3.9 1.00e+00 1.00e+00h 1
68 -5.9019994e-02 4.95e-03 7.09e+00 -4.5 2.49e+00 - 4.80e-01 1.00e+00H 1
69 -1.6284289e-02 6.94e-04 1.66e-02 -4.5 1.12e+00 - 1.00e+00 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
70 -1.7971694e-02 1.09e-04 8.74e-04 -4.5 7.59e-01 - 1.00e+00 1.00e+00h 1
11 70 -1.0248683e+00 7.54e-06 8.74e-04 3.20e-06 3.2e-05 1.00e+08
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
70 1.3064956e+00 1.09e-04 6.79e+02 -4.5 7.59e-01 - 1.00e+00 1.00e+00h 1
71 -2.2420547e-01 5.05e-03 2.93e+01 -4.5 4.97e-01 - 4.20e-01 1.00e+00h 1
72 -6.2036534e-02 5.85e-04 1.15e-03 -4.5 1.05e-01 - 1.00e+00 1.00e+00h 1
12 72 -1.0196324e+00 1.27e-06 1.15e-03 6.40e-07 6.4e-06 1.00e+08
72 3.8344221e-01 5.85e-04 1.27e+02 -5.2 1.05e-01 - 1.00e+00 1.00e+00h 1
73 -1.0994926e+00 7.86e-02 1.36e+00 -5.2 1.18e+00 - 7.82e-01 1.00e+00h 1
74 -1.0290048e+00 7.50e-04 7.83e-02 -5.2 8.21e-01 - 1.00e+00 1.00e+00h 1
75 -1.0506458e+00 3.24e-04 1.06e-02 -5.2 3.54e-01 - 1.00e+00 1.00e+00h 1
76 -1.0508501e+00 1.89e-06 1.40e-05 -5.2 2.62e-02 - 1.00e+00 1.00e+00h 1
13 76 -1.0063088e+00 3.52e-07 1.40e-05 1.16e-07 1.2e-06 1.00e+08
76 -1.0142486e+00 1.89e-06 3.52e+01 -5.9 2.62e-02 - 1.00e+00 1.00e+00h 1
77 -1.2114770e+00 2.09e-02 8.82e-02 -5.9 5.31e-01 - 8.86e-01 1.00e+00h 1
78 -1.2541092e+00 6.25e-03 3.20e-02 -5.9 4.56e-01 - 1.00e+00 7.58e-01h 1
79 -1.2627063e+00 9.16e-04 1.19e-02 -5.9 2.89e-01 - 1.00e+00 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
80 -1.2656895e+00 1.31e-04 8.35e-04 -5.9 1.52e-01 - 1.00e+00 1.00e+00h 1
81 -1.2666296e+00 1.34e-05 9.43e-05 -5.9 4.21e-02 - 1.00e+00 1.00e+00h 1
82 -1.2668871e+00 9.50e-07 7.85e-06 -5.9 1.86e-02 - 1.00e+00 1.00e+00h 1
14 82 -1.0018805e+00 2.03e-07 7.85e-06 1.65e-08 1.6e-07 1.00e+08
82 -1.2196642e+00 9.50e-07 2.03e+01 -6.8 1.86e-02 - 1.00e+00 1.00e+00h 1
83 -1.2374161e+00 6.10e-04 1.31e+01 -6.8 5.37e-01 - 9.26e-01 3.55e-01h 1
84 -1.2779401e+00 6.36e-03 3.47e-02 -6.8 9.97e-01 - 1.00e+00 1.00e+00h 1
85 -1.2759068e+00 4.48e-04 5.78e-05 -6.8 3.97e-01 - 1.00e+00 1.00e+00h 1
86 -1.2758056e+00 8.54e-06 4.20e-06 -6.8 7.20e-02 - 1.00e+00 1.00e+00h 1
87 -1.2758043e+00 1.55e-09 1.08e-09 -6.8 1.58e-03 - 1.00e+00 1.00e+00h 1
15 87 -1.0004278e+00 2.17e-07 1.08e-09 1.77e-09 1.8e-08 1.00e+08
Profiler ran for 4.12 s, capturing 867992 events.
Host-side activity: calling CUDA APIs took 396.24 ms (9.63% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────┤
│ 27.88% │ 1.15 s │ 6350 │ 180.68 µs ± 2271.95 ( 0.48 ‥ 31003.48) │ cuStreamSynchronize │
│ 1.24% │ 50.96 ms │ 10296 │ 4.95 µs ± 0.98 ( 3.58 ‥ 31.71) │ cuLaunchKernel │
│ 0.65% │ 26.62 ms │ 1987 │ 13.4 µs ± 1.27 ( 11.44 ‥ 24.08) │ cuMemcpyDtoHAsync │
│ 0.38% │ 15.76 ms │ 3356 │ 4.69 µs ± 0.62 ( 3.81 ‥ 11.44) │ cudaLaunchKernel │
│ 0.36% │ 14.91 ms │ 4286 │ 3.48 µs ± 1.33 ( 1.43 ‥ 19.31) │ cuMemAllocFromPoolAsync │
│ 0.32% │ 13.32 ms │ 55 │ 242.25 µs ± 42.7 (222.68 ‥ 383.38) │ cuMemcpyHtoDAsync │
│ 0.16% │ 6.53 ms │ 652 │ 10.02 µs ± 1.79 ( 7.63 ‥ 28.85) │ cuMemcpyDtoDAsync │
│ 0.10% │ 4.16 ms │ 331 │ 12.58 µs ± 2.96 ( 8.58 ‥ 23.6) │ cudaMemcpyAsync │
│ 0.06% │ 2.54 ms │ 504 │ 5.04 µs ± 2.05 ( 2.38 ‥ 31.23) │ cudaMemsetAsync │
│ 0.02% │ 1.02 ms │ 420 │ 2.44 µs ± 0.59 ( 1.43 ‥ 5.48) │ cuMemFreeAsync │
│ 0.01% │ 413.18 µs │ 1276 │ 323.81 ns ± 183.02 ( 0.0 ‥ 2384.19) │ cudaStreamGetCaptureInfo_v2 │
│ 0.01% │ 373.36 µs │ 108 │ 3.46 µs ± 0.76 ( 2.38 ‥ 7.15) │ cudaFuncGetAttributes │
│ 0.01% │ 360.73 µs │ 147 │ 2.45 µs ± 0.29 ( 1.91 ‥ 3.34) │ cudaStreamSynchronize │
│ 0.01% │ 280.14 µs │ 292 │ 959.39 ns ± 556.05 (238.42 ‥ 3814.7) │ cudaEventRecord │
│ 0.00% │ 205.76 µs │ 1380 │ 149.1 ns ± 172.67 ( 0.0 ‥ 1430.51) │ cudaGetLastError │
│ 0.00% │ 158.07 µs │ 1304 │ 121.22 ns ± 133.07 ( 0.0 ‥ 476.84) │ cuCtxPushCurrent │
│ 0.00% │ 144.48 µs │ 1304 │ 110.8 ns ± 132.49 ( 0.0 ‥ 476.84) │ cuCtxPopCurrent │
│ 0.00% │ 115.16 µs │ 1304 │ 88.31 ns ± 129.11 ( 0.0 ‥ 476.84) │ cuCtxGetDevice │
│ 0.00% │ 104.43 µs │ 1304 │ 80.08 ns ± 123.01 ( 0.0 ‥ 476.84) │ cuDeviceGet │
│ 0.00% │ 70.57 µs │ 80 │ 882.15 ns ± 664.03 (238.42 ‥ 3337.86) │ cudaFuncSetAttribute │
│ 0.00% │ 37.43 µs │ 92 │ 406.87 ns ± 167.8 (238.42 ‥ 953.67) │ cudaMalloc │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 2.62 s (63.56% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 21.10% │ 868.11 ms │ 40 │ 21.7 ms ± 0.35 ( 21.13 ‥ 22.93) │ void cudss::update_ker<long, double, int, double, 256, 1, 0, 0>(int, int, double*, double*, int const*, int const*, int*, int con ⋯
│ 9.43% │ 387.97 ms │ 40 │ 9.7 ms ± 0.02 ( 9.67 ‥ 9.78) │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, in ⋯
│ 5.83% │ 239.9 ms │ 40 │ 6.0 ms ± 0.05 ( 5.85 ‥ 6.07) │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, 4>(int, int, int, int, double*, double*, long c ⋯
│ 5.60% │ 230.33 ms │ 92 │ 2.5 ms ± 0.0 ( 2.5 ‥ 2.51) │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const*, int, int, double*, double*, int const*, long c ⋯
│ 5.39% │ 221.61 ms │ 40 │ 5.54 ms ± 0.0 ( 5.53 ‥ 5.55) │ void cudss::kernel<cudss::getrf_params_<double, 2, 256, 1, 64, 64, 68, 16, 1, 1>>(int, int, void*, int, void*, int, int, int, int ⋯
│ 4.37% │ 179.82 ms │ 92 │ 1.95 ms ± 0.01 ( 1.94 ‥ 1.97) │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int const*, int, int, double*, int const*, long const ⋯
│ 3.67% │ 150.89 ms │ 92 │ 1.64 ms ± 0.0 ( 1.63 ‥ 1.65) │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int const*, int, int, double*, int const*, long const*, ⋯
│ 3.27% │ 134.45 ms │ 92 │ 1.46 ms ± 0.03 ( 1.41 ‥ 1.53) │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const*, int const*, int, int, double*, double*, int co ⋯
│ 0.76% │ 31.46 ms │ 131 │ 240.18 µs ± 178.03 ( 27.66 ‥ 430.35) │ gpu__transfer_to_map_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, Dy ⋯
│ 0.64% │ 26.3 ms │ 92 │ 285.92 µs ± 0.96 (283.72 ‥ 288.25) │ void trsv_lt_exec<double, 32u, 32u, 4u, true, false>(int, double const*, long, double*, long, int*) ⋯
│ 0.47% │ 19.32 ms │ 92 │ 210.03 µs ± 0.83 (208.14 ‥ 212.67) │ void trsv_ln_exec<double, 32u, 32u, 4u, true>(int, double const*, long, double*, long, int*) ⋯
│ 0.29% │ 12.04 ms │ 55 │ 218.89 µs ± 27.14 (206.71 ‥ 309.71) │ [copy pageable to device memory] ⋯
│ 0.20% │ 8.31 ms │ 836 │ 9.93 µs ± 4.32 ( 5.48 ‥ 20.27) │ [copy device to device memory] ⋯
│ 0.12% │ 4.82 ms │ 236 │ 20.41 µs ± 6.66 ( 12.4 ‥ 28.61) │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, double, double, double, double, void>(cusparse::Ker ⋯
│ 0.11% │ 4.57 ms │ 344 │ 13.28 µs ± 2.83 ( 7.15 ‥ 18.36) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64> ⋯
│ 0.11% │ 4.55 ms │ 184 │ 24.74 µs ± 9.56 ( 14.54 ‥ 36.24) │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int cons ⋯
│ 0.11% │ 4.34 ms │ 922 │ 4.7 µs ± 1.85 ( 2.62 ‥ 10.73) │ _34(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, Broadcasted<CuArrayStyle<1l, DeviceMemory>, Tuple<OneTo<Int64>>, identity, D ⋯
│ 0.10% │ 4.24 ms │ 420 │ 10.1 µs ± 2.05 ( 6.2 ‥ 12.16) │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, ⋯
│ 0.10% │ 4.02 ms │ 40 │ 100.39 µs ± 0.79 ( 98.94 ‥ 102.52) │ void cudss::finalize_permute_ker<int, double, 256>(long, double*, long, int*, int*, int*, int*, int, int, int) ⋯
julia> CUDA.@profile MadNCL.solve!(solver2) # cuDSS + K1s
MadNCL algorithm
Total number of variables............................: 350008
Total number of constraints..........................: 450011
outer inner objective inf_pr inf_du η μ ρ
0 0 -5.8505490e+01 1.55e-09 1.08e-09 1.00e-01 1.0e-01 1.00e+02
178 -5.8505490e+01 1.84e+03 1.00e+00 -1.0 5.00e-01 - 1.00e+00 1.00e+00h 1
179 -5.5166787e+01 1.11e-01 1.09e-01 -1.0 1.84e+03 - 9.79e-01 1.00e+00h 1
1 179 -5.9272031e+01 5.50e-02 1.09e-01 2.00e-03 2.0e-02 1.00e+02
179 -4.6956298e+01 1.11e-01 5.50e+00 -1.7 1.84e+03 - 9.79e-01 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
180 -4.9640344e+01 9.55e-02 2.55e+00 -1.7 3.04e+01 - 2.41e-01 1.00e+00h 1
181 -5.1953591e+01 2.77e-02 2.43e-01 -1.7 3.81e+00 - 7.63e-01 1.00e+00h 1
2 181 -6.7131864e+01 3.11e-02 2.43e-01 2.00e-03 2.0e-02 1.00e+03
181 1.2008474e+01 2.77e-02 2.80e+01 -1.7 3.81e+00 - 7.63e-01 1.00e+00h 1
182 -1.5580249e+00 2.10e-02 3.59e+00 -1.7 5.33e+00 - 8.58e-01 8.72e-01h 1
183 8.6557668e+01 4.09e-03 6.72e-02 -1.7 3.59e+01 - 9.21e-01 1.00e+00h 1
3 183 -9.8649629e+01 1.57e-02 6.72e-02 2.00e-03 2.0e-02 1.00e+04
183 1.6508346e+03 4.09e-03 1.41e+02 -1.7 3.59e+01 - 9.21e-01 1.00e+00h 1
184 2.4071412e+02 3.74e-02 4.74e-02 -1.7 5.39e+01 - 1.00e+00 1.00e+00h 1
4 184 -4.4719599e+01 3.84e-03 4.74e-02 2.00e-03 2.0e-02 1.00e+05
184 2.7638497e+03 3.74e-02 3.45e+02 -1.7 5.39e+01 - 1.00e+00 1.00e+00h 1
185 1.7232743e+02 3.58e-02 1.71e-01 -1.7 3.28e+01 - 1.00e+00 1.00e+00h 1
5 185 -1.1885306e+01 1.15e-03 1.71e-01 4.00e-04 4.0e-03 1.00e+05
185 5.3821209e+02 3.58e-02 1.15e+02 -2.4 3.28e+01 - 1.00e+00 1.00e+00h 1
186 1.1329393e+02 3.42e-02 5.30e-02 -2.4 7.88e+00 - 9.94e-01 1.00e+00h 1
6 186 -4.0055187e+00 3.22e-04 5.30e-02 8.00e-05 8.0e-04 1.00e+05
186 1.4238670e+02 3.42e-02 3.22e+01 -3.1 7.88e+00 - 9.94e-01 1.00e+00h 1
187 3.3451678e+01 3.76e-02 5.50e-02 -3.1 2.23e+00 - 8.96e-01 1.00e+00h 1
7 187 -1.7762561e+00 1.06e-04 5.50e-02 8.00e-05 8.0e-04 1.00e+06
187 4.3728795e+01 3.76e-02 9.52e+01 -3.1 2.23e+00 - 8.96e-01 1.00e+00h 1
188 3.1460631e+01 1.66e-03 7.04e-01 -3.1 5.90e-01 - 1.00e+00 1.00e+00h 1
189 3.3545820e+01 9.10e-05 1.95e-04 -3.1 2.39e-01 - 1.00e+00 1.00e+00h 1
8 189 -1.6436231e+00 6.74e-05 1.95e-04 1.60e-05 1.6e-04 1.00e+06
189 4.7776421e+01 9.10e-05 6.74e+01 -3.8 2.39e-01 - 1.00e+00 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
190 6.4907326e+00 3.69e-02 6.74e-01 -3.8 7.72e-01 - 9.80e-01 1.00e+00h 1
191 7.4583766e+00 2.79e-04 7.20e-02 -3.8 1.41e+00 - 1.00e+00 1.00e+00h 1
192 7.3868438e+00 4.18e-06 9.48e-04 -3.8 2.60e-01 - 1.00e+00 1.00e+00h 1
9 192 -1.1206517e+00 1.16e-05 9.48e-04 3.20e-06 3.2e-05 1.00e+06
192 8.2356167e+00 4.18e-06 1.16e+01 -4.5 2.60e-01 - 1.00e+00 1.00e+00h 1
193 7.6833040e-01 3.68e-03 5.86e-01 -4.5 8.87e-01 - 1.00e+00 1.00e+00h 1
194 1.5744121e-01 2.44e-03 3.38e-01 -4.5 8.78e-01 -5.0 1.00e+00 1.00e+00h 1
195 -3.1471483e-01 1.39e-03 1.57e-01 -4.5 4.42e-01 -4.6 1.00e+00 1.00e+00h 1
196 -6.3479489e-01 4.24e-04 2.88e-02 -4.5 1.99e-01 -4.2 1.00e+00 1.00e+00h 1
197 -1.2323752e+00 3.12e-03 2.00e+00 -4.5 1.15e+00 -4.7 1.00e+00 5.00e-01h 2
198 -2.0016308e+00 3.30e-03 5.70e-02 -4.5 4.60e-01 -4.2 1.00e+00 1.00e+00h 1
199 -2.3625247e+00 4.59e-03 8.95e-01 -4.5 1.56e+02 - 1.13e-02 2.06e-02h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
200 -2.6550833e+00 4.33e-03 2.03e-01 -4.5 4.77e+00 - 4.13e-01 2.83e-01h 1
201 -2.8621516e+00 1.51e-03 1.06e-02 -4.5 6.39e-01 - 1.00e+00 1.00e+00h 1
202 -2.8558890e+00 6.83e-05 3.90e-04 -4.5 2.34e-01 - 1.00e+00 1.00e+00h 1
10 202 -1.0279300e+00 3.80e-05 3.90e-04 3.20e-06 3.2e-05 1.00e+07
202 2.1027269e+01 6.83e-05 3.42e+02 -4.5 2.34e-01 - 1.00e+00 1.00e+00h 1
203 3.1390477e-01 2.86e-02 1.20e+00 -4.5 2.36e+00 - 8.01e-01 1.00e+00h 1
204 2.4291486e-01 2.40e-02 1.98e+00 -4.5 4.32e+01 - 6.58e-01 1.58e-01h 1
205 -2.4868202e-02 7.69e-03 1.51e+00 -4.5 2.24e+00 - 4.01e-01 1.00e+00h 1
206 -1.7954904e-02 2.28e-05 5.91e-03 -4.5 5.43e+00 - 1.00e+00 1.00e+00h 1
207 -1.7298746e-02 8.88e-07 1.89e-04 -4.5 9.90e-01 - 1.00e+00 1.00e+00h 1
11 207 -1.0248676e+00 7.52e-06 1.89e-04 3.20e-06 3.2e-05 1.00e+08
207 1.3043705e+00 8.88e-07 6.77e+02 -4.5 9.90e-01 - 1.00e+00 1.00e+00h 1
208 -2.2420396e-01 5.05e-03 2.91e+01 -4.5 4.94e-01 - 4.22e-01 1.00e+00h 1
209 -6.2104087e-02 5.83e-04 1.13e-03 -4.5 1.01e-01 - 1.00e+00 1.00e+00h 1
12 209 -1.0196317e+00 1.27e-06 1.13e-03 6.40e-07 6.4e-06 1.00e+08
209 3.8332369e-01 5.83e-04 1.27e+02 -5.2 1.01e-01 - 1.00e+00 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
210 -1.0994861e+00 7.86e-02 1.36e+00 -5.2 1.18e+00 - 7.82e-01 1.00e+00h 1
211 -1.0290005e+00 7.50e-04 7.83e-02 -5.2 8.21e-01 - 1.00e+00 1.00e+00h 1
212 -1.0506425e+00 3.24e-04 1.06e-02 -5.2 3.54e-01 - 1.00e+00 1.00e+00h 1
213 -1.0508466e+00 1.90e-06 1.40e-05 -5.2 2.62e-02 - 1.00e+00 1.00e+00h 1
13 213 -1.0063088e+00 3.52e-07 1.40e-05 1.16e-07 1.2e-06 1.00e+08
213 -1.0142439e+00 1.90e-06 3.52e+01 -5.9 2.62e-02 - 1.00e+00 1.00e+00h 1
214 -1.2114719e+00 2.09e-02 8.82e-02 -5.9 5.31e-01 - 8.86e-01 1.00e+00h 1
215 -1.2541043e+00 6.25e-03 3.20e-02 -5.9 4.56e-01 - 1.00e+00 7.58e-01h 1
216 -1.2627006e+00 9.16e-04 1.19e-02 -5.9 2.89e-01 - 1.00e+00 1.00e+00h 1
217 -1.2656838e+00 1.31e-04 8.35e-04 -5.9 1.52e-01 - 1.00e+00 1.00e+00h 1
218 -1.2666239e+00 1.34e-05 9.43e-05 -5.9 4.21e-02 - 1.00e+00 1.00e+00h 1
219 -1.2668814e+00 9.50e-07 7.85e-06 -5.9 1.86e-02 - 1.00e+00 1.00e+00h 1
14 219 -1.0018806e+00 2.03e-07 7.85e-06 1.65e-08 1.6e-07 1.00e+08
219 -1.2196587e+00 9.50e-07 2.03e+01 -6.8 1.86e-02 - 1.00e+00 1.00e+00h 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
220 -1.2374110e+00 6.10e-04 1.31e+01 -6.8 5.37e-01 - 9.26e-01 3.55e-01h 1
221 -1.2779345e+00 6.36e-03 3.47e-02 -6.8 9.97e-01 - 1.00e+00 1.00e+00h 1
222 -1.2759012e+00 4.48e-04 5.78e-05 -6.8 4.44e-01 - 1.00e+00 1.00e+00h 1
223 -1.2757999e+00 1.02e-05 4.20e-06 -6.8 8.63e-02 - 1.00e+00 1.00e+00h 1
224 -1.2757987e+00 1.55e-09 1.08e-09 -6.8 1.58e-03 - 1.00e+00 1.00e+00h 1
15 224 -1.0004278e+00 2.17e-07 1.08e-09 1.77e-09 1.8e-08 1.00e+08
Profiler ran for 8.07 s, capturing 1422567 events.
Host-side activity: calling CUDA APIs took 1.11 s (13.71% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────┤
│ 41.85% │ 3.38 s │ 10075 │ 335.06 µs ± 4523.92 ( 0.48 ‥ 63260.79) │ cuStreamSynchronize │
│ 1.01% │ 81.73 ms │ 16579 │ 4.93 µs ± 0.97 ( 3.58 ‥ 33.62) │ cuLaunchKernel │
│ 0.60% │ 48.36 ms │ 3587 │ 13.48 µs ± 2.32 ( 11.44 ‥ 123.98) │ cuMemcpyDtoHAsync │
│ 0.37% │ 29.65 ms │ 9804 │ 3.02 µs ± 1.43 ( 1.19 ‥ 78.68) │ cuMemAllocFromPoolAsync │
│ 0.20% │ 16.13 ms │ 65 │ 248.09 µs ± 47.4 (221.97 ‥ 399.35) │ cuMemcpyHtoDAsync │
│ 0.20% │ 15.78 ms │ 3266 │ 4.83 µs ± 0.83 ( 3.81 ‥ 20.27) │ cudaLaunchKernel │
│ 0.19% │ 15.26 ms │ 8278 │ 1.84 µs ± 1.0 ( 0.95 ‥ 62.47) │ cuMemFreeAsync │
│ 0.06% │ 4.6 ms │ 351 │ 13.12 µs ± 3.5 ( 8.34 ‥ 31.95) │ cudaMemcpyAsync │
│ 0.06% │ 4.56 ms │ 406 │ 11.24 µs ± 1.42 ( 8.11 ‥ 18.12) │ cuMemcpyDtoDAsync │
│ 0.04% │ 2.99 ms │ 553 │ 5.41 µs ± 3.0 ( 2.38 ‥ 38.86) │ cudaMemsetAsync │
│ 0.01% │ 448.7 µs │ 129 │ 3.48 µs ± 0.83 ( 2.62 ‥ 7.15) │ cudaFuncGetAttributes │
│ 0.01% │ 437.74 µs │ 183 │ 2.39 µs ± 0.29 ( 1.67 ‥ 4.05) │ cudaStreamSynchronize │
│ 0.00% │ 208.62 µs │ 645 │ 323.44 ns ± 189.77 ( 0.0 ‥ 1430.51) │ cudaStreamGetCaptureInfo_v2 │
│ 0.00% │ 167.13 µs │ 129 │ 1.3 µs ± 0.93 ( 0.48 ‥ 3.58) │ cudaEventRecord │
│ 0.00% │ 165.46 µs │ 894 │ 185.08 ns ± 303.1 ( 0.0 ‥ 6675.72) │ cudaGetLastError │
│ 0.00% │ 129.22 µs │ 812 │ 159.14 ns ± 139.64 ( 0.0 ‥ 715.26) │ cuCtxPushCurrent │
│ 0.00% │ 111.58 µs │ 812 │ 137.41 ns ± 135.58 ( 0.0 ‥ 476.84) │ cuCtxPopCurrent │
│ 0.00% │ 79.15 µs │ 812 │ 97.48 ns ± 127.59 ( 0.0 ‥ 476.84) │ cuCtxGetDevice │
│ 0.00% │ 62.47 µs │ 820 │ 76.18 ns ± 122.52 ( 0.0 ‥ 476.84) │ cuDeviceGet │
│ 0.00% │ 238.42 ns │ 9 │ 26.49 ns ± 79.47 ( 0.0 ‥ 238.42) │ cuDeviceGetCount │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 5.0 s (62.03% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 51.61% │ 4.16 s │ 55 │ 75.69 ms ± 0.13 ( 75.42 ‥ 75.97) │ gpu__transfer_jtsj_kernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1 ⋯
│ 3.41% │ 275.07 ms │ 55 │ 5.0 ms ± 0.04 ( 4.96 ‥ 5.18) │ void cudss::factorize_ker<long, double, int, double, 32, 1, 0, 0, 1, 0, 0, 0, 1>(int, int, int, double*, double*, long const*, in ⋯
│ 1.76% │ 141.62 ms │ 55 │ 2.57 ms ± 0.0 ( 2.56 ‥ 2.59) │ void cudss::factorize_v3_ker<long, double, int, double, 256, 1, 0, 0, 1, 1, 0, 0, 4>(int, int, int, int, double*, double*, long c ⋯
│ 1.36% │ 109.76 ms │ 397 │ 276.49 µs ± 134.17 ( 17.4 ‥ 351.91) │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int cons ⋯
│ 0.67% │ 54.43 ms │ 84 │ 647.98 µs ± 0.57 (646.83 ‥ 650.41) │ void cudss::bwd_v2_ker<long, double, int, 32, 16, 1, 0, 0>(int const*, int const*, int, int, double*, double*, int const*, long c ⋯
│ 0.61% │ 49.03 ms │ 84 │ 583.74 µs ± 2.94 (578.17 ‥ 592.71) │ void cudss::bwd_ker<long, double, int, 128, 128, 16, 8, 8, 1, 0, 1, 0>(int const*, int const*, int, int, double*, double*, int co ⋯
│ 0.53% │ 42.56 ms │ 84 │ 506.61 µs ± 1.97 (503.06 ‥ 512.12) │ void cudss::fwd_v2_ker<long, double, int, 32, 1, 0, 0, 32, 1>(int const*, int const*, int, int, double*, int const*, long const*, ⋯
│ 0.21% │ 17.3 ms │ 109 │ 158.68 µs ± 161.73 ( 17.88 ‥ 344.28) │ gpu__transfer_to_map_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1l, Tuple<OneTo<Int64>>>, NDRange<1l, Dy ⋯
│ 0.18% │ 14.43 ms │ 65 │ 221.97 µs ± 29.14 (207.19 ‥ 308.75) │ [copy pageable to device memory] ⋯
│ 0.12% │ 9.6 ms │ 84 │ 114.26 µs ± 2.47 ( 111.1 ‥ 122.79) │ void cudss::fwd_v2_ker<long, double, int, 256, 1, 0, 0, 256, 1>(int const*, int const*, int, int, double*, int const*, long const ⋯
│ 0.10% │ 7.91 ms │ 1511 │ 5.23 µs ± 0.21 ( 4.77 ‥ 6.2) │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64>>>, V ⋯
│ 0.09% │ 6.86 ms │ 649 │ 10.57 µs ± 1.69 ( 6.2 ‥ 12.16) │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, ⋯
│ 0.08% │ 6.69 ms │ 3770 │ 1.77 µs ± 0.17 ( 1.43 ‥ 2.38) │ [copy device to pageable memory] ⋯
│ 0.07% │ 5.82 ms │ 574 │ 10.14 µs ± 4.84 ( 5.72 ‥ 20.27) │ [copy device to device memory] ⋯
│ 0.06% │ 5.24 ms │ 719 │ 7.28 µs ± 3.73 ( 3.58 ‥ 17.17) │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Ll1ELl1EE11BroadcastedI12CuArrayStyleILl1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S ⋯
│ 0.06% │ 4.9 ms │ 1511 │ 3.24 µs ± 0.16 ( 2.62 ‥ 3.81) │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<2l, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2l, Tuple<One ⋯
│ 0.06% │ 4.66 ms │ 1511 │ 3.08 µs ± 0.52 ( 2.38 ‥ 5.48) │ _34(CuKernelContext, CuDeviceArray<Bool, 1l, 1l>, Broadcasted<CuArrayStyle<1l, DeviceMemory>, Tuple<OneTo<Int64>>, _19<CuArraySty ⋯
│ 0.06% │ 4.65 ms │ 252 │ 18.47 µs ± 4.2 ( 12.4 ‥ 23.37) │ void cusparse::csrmv_v3_kernel<std::integral_constant<bool, false>, int, int, double, double, double, double, void>(cusparse::Ker ⋯
│ 0.06% │ 4.51 ms │ 355 │ 12.72 µs ± 2.46 ( 7.15 ‥ 15.74) │ partial_mapreduce_grid(identity, max, Float64, CartesianIndices<1l, Tuple<OneTo<Int64>>>, CartesianIndices<1l, Tuple<OneTo<Int64> ⋯
│ 0.05% │ 4.26 ms │ 92 │ 46.26 µs ± 1.98 ( 43.39 ‥ 49.35) │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float64Ll1ELl1EE11BroadcastedI12CuArrayStyleILl1E12DeviceMemoryE5TupleI5OneToI5In ⋯
│ 0.04% │ 2.99 ms │ 55 │ 54.39 µs ± 0.31 ( 53.64 ‥ 55.07) │ void cudss::copy_matrix_ker<long, double, int, 128>(int, int const*, long const*, double const*, double*, int) ⋯
│ 0.04% │ 2.86 ms │ 551 │ 5.18 µs ± 2.01 ( 1.19 ‥ 9.3) │ _6(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, Float64) ⋯
│ 0.03% │ 2.79 ms │ 845 │ 3.3 µs ± 0.39 ( 2.62 ‥ 4.77) │ getindex_kernel(CuKernelContext, CuDeviceArray<Float64, 1l, 1l>, CuDeviceArray<Float64, 1l, 1l>, Tuple<Int64>, CuDeviceArray<CuDe ⋯
│ 0.03% │ 2.79 ms │ 253 │ 11.01 µs ± 6.02 ( 4.77 ‥ 19.79) │ void axpy_kernel_val<double, double>(cublasAxpyParamsVal<double, double, double>) |
Note that I got this error when I wanted to check the Jacobian on GPU:
julia> jac(exa2, exa2.meta.x0)
ERROR: GPU compilation of MethodInstance for ExaModelsKernelAbstractions.gpu_kerj(::KernelAbstractions.CompilerMetadata{…}, ::Vector{…}, ::Vector{…}, ::ExaModels.SIMDFunction{…}, ::CuDeviceVector{…}, ::Nothing, ::Float64) failed
KernelError: passing and using non-bitstype argument
Argument 3 to your kernel function is of type Vector{Int64}, which is not isbits:
.ref is of type MemoryRef{Int64} which is not isbits.
.mem is of type Memory{Int64} which is not isbits.
Stacktrace:
[1] check_invocation(job::GPUCompiler.CompilerJob)
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/validation.jl:92
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:92 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
[4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; toplevel::Bool, libraries::Bool, optimize::Bool, cleanup::Bool, validate::Bool, strip::Bool, only_entry::Bool, parent_job::Nothing)
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:90
[5] codegen
@ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:82 [inlined]
[6] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:79
[7] compile
@ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:74 [inlined]
[8] #1145
@ ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:250 [inlined]
[9] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:34
[10] JuliaContext(f::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:25
[11] compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:249
[12] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:237
[13] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:151
[14] macro expansion
@ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:380 [inlined]
[15] macro expansion
@ ./lock.jl:273 [inlined]
[16] cufunction(f::typeof(ExaModelsKernelAbstractions.gpu_kerj), tt::Type{Tuple{…}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
@ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:375
[17] macro expansion
@ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:112 [inlined]
[18] (::KernelAbstractions.Kernel{…})(::Vector{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/2kjXI/src/CUDAKernels.jl:103
[19] Kernel
@ ~/.julia/packages/CUDA/2kjXI/src/CUDAKernels.jl:89 [inlined]
[20] sjacobian!
@ ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:504 [inlined]
[21] _jac_structure!(backend::CUDABackend, cons::ExaModels.Constraint{ExaModels.Constraint{…}, ExaModels.SIMDFunction{…}, CuArray{…}, Int64}, rows::Vector{Int64}, cols::Vector{Int64})
@ ExaModelsKernelAbstractions ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:175
[22] jac_structure!
@ ~/.julia/packages/ExaModels/CGCQ6/ext/ExaModelsKernelAbstractions.jl:170 [inlined]
[23] jac_structure
@ ~/.julia/packages/NLPModels/uC4QP/src/nlp/api.jl:171 [inlined]
[24] jac(nlp::ExaModel{Float64, CuArray{…}, ExaModelsKernelAbstractions.KAExtension{…}, ExaModels.Objective{…}, ExaModels.Constraint{…}}, x::CuArray{Float64, 1, CUDA.DeviceMemory})
@ NLPModels ~/.julia/packages/NLPModels/uC4QP/src/nlp/api.jl:271
[25] top-level scope
@ REPL[33]:1
Some type information was truncated. Use `show(err)` to see complete types.
I was still able to compute it on the CPU, and we have this sparsity pattern:
@jbcaillau Does it make sense to reformulate the problem to avoid a dense column and/or to use a different KKT formulation? |
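A possible reading of the `KernelError` above: frame [21] shows `_jac_structure!` receiving `rows::Vector{Int64}` and `cols::Vector{Int64}`, i.e. CPU index buffers handed to a CUDA kernel, and a CPU `Vector` is not `isbits`. The sketch below is untested and rests on assumptions: that the in-place `jac_structure!` and `jac_coord!` accept device arrays for this model, and that `jacobian_via_device_buffers` (a hypothetical helper name, not part of ExaModels or NLPModels) is an acceptable way to sidestep the CPU allocation.

```julia
using CUDA, NLPModels, SparseArrays

# A CPU Vector is a heap-allocated object, so it cannot be passed to a GPU kernel as-is.
@assert !isbitstype(Vector{Int64})
@assert isbitstype(Int64)

# Hypothetical helper: allocate the COO buffers on the device, fill them in place,
# then copy everything back to the host to build a SparseMatrixCSC for inspection.
# Assumption: `nlp` is a GPU-backed model and `x` is a CuVector.
function jacobian_via_device_buffers(nlp, x)
    nnzj = nlp.meta.nnzj
    rows = CUDA.zeros(Int, nnzj)        # row indices, stored on the GPU
    cols = CUDA.zeros(Int, nnzj)        # column indices, stored on the GPU
    vals = CUDA.zeros(eltype(x), nnzj)  # nonzero values, stored on the GPU
    jac_structure!(nlp, rows, cols)     # in-place sparsity pattern
    jac_coord!(nlp, x, vals)            # in-place Jacobian values at x
    return sparse(Array(rows), Array(cols), Array(vals), nlp.meta.ncon, nlp.meta.nvar)
end
```

With the model from this thread, the call would be `jacobian_via_device_buffers(exa2, exa2.meta.x0)`. If it works, the result should match the pattern computed on the CPU.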