From 5fb948b7f5de1daf1af3ab620985fc56e9c45002 Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Tue, 18 Jun 2024 13:43:39 +0300
Subject: [PATCH 01/11] first commit

---
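
Reviewer note: the four phases described in this patch can be illustrated with a small CPU sketch. It is only a toy under stated assumptions, not the GPU implementation: curve points are replaced by 64-bit integers under wrapping addition, scalars are 32-bit, the sorting and large-bucket handling of phases 1 and 2 are omitted, and the reduction uses the serial running-sum variant (the `is_big_triangle` behaviour) because it is the simplest to write down. The names `Point`, `msm_bucket_method` and `msm_naive` are hypothetical and do not exist in ICICLE.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for an elliptic-curve point (assumption): integers under wrapping addition.
using Point = std::uint64_t;

// Reference result: sum of scalar * point, computed directly.
Point msm_naive(const std::vector<std::uint32_t>& scalars, const std::vector<Point>& points)
{
  Point acc = 0;
  for (std::size_t i = 0; i < scalars.size(); ++i) acc += Point{scalars[i]} * points[i];
  return acc;
}

// Bucket method with window size c (1 <= c <= 16 in this sketch) over 32-bit scalars.
Point msm_bucket_method(const std::vector<std::uint32_t>& scalars, const std::vector<Point>& points, unsigned c)
{
  const unsigned bitsize = 32;
  const unsigned nof_windows = (bitsize + c - 1) / c; // one bucket module per window
  const std::uint32_t mask = (1u << c) - 1;
  std::vector<Point> window_sums(nof_windows, 0);

  for (unsigned w = 0; w < nof_windows; ++w) {
    // Phases 1 and 2: take the c-bit bucket index of each scalar and accumulate its point.
    std::vector<Point> buckets(std::size_t{1} << c, 0);
    for (std::size_t i = 0; i < scalars.size(); ++i) {
      const std::uint32_t idx = (scalars[i] >> (w * c)) & mask;
      if (idx != 0) buckets[idx] += points[i];
    }
    // Phase 3: running sum, which yields sum over k of k * buckets[k] using additions only.
    Point running = 0, total = 0;
    for (std::size_t k = buckets.size() - 1; k >= 1; --k) {
      running += buckets[k];
      total += running;
    }
    window_sums[w] = total;
  }

  // Phase 4: double-and-add over the windows, most significant window first.
  Point result = 0;
  for (unsigned w = nof_windows; w-- > 0;) {
    for (unsigned d = 0; d < c; ++d) result += result; // c doublings
    result += window_sums[w];
  }
  return result;
}
```

Calling `msm_bucket_method(scalars, points, 5)` exercises the case where `c` does not divide the scalar bit-size: the top window then holds only 2 bits, so its buckets receive far more points than the others, which is one of the two large-bucket cases mentioned in this patch.
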
 docs/docs/icicle/primitives/msm.md | 34 ++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index adfa8f025..62099a708 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -54,9 +54,39 @@ You can learn more about how MSMs work from this [video](https://www.youtube.com
 - [Golang](../golang-bindings/msm.md)
 - [Rust](../rust-bindings//msm.md)
 
-## Supported algorithms
+## Algorithm description
+
+We follow the bucket method algorithm. The GPU implementation consists of four phases:
+
+1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
+2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected avarage bucket size times the `large_bucket_factor` parameter.
+3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default this is done by and iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
+4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
+
+## MSM configuration
+
+
+
+## Choosing optimal parameters
+
+`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than 2^8) with a large batch (larger than 100) but this should be tested per scenario.
+Large buckets exist in two cases:
+1. When the scalar distribution isn't uniform.
+2. When `c` does not divide the scalar bit-size.
+`large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
+The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around log2(msm_size) - 4. However, in most protocols the points are known in advanced and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined we a `c` value around log2(msm_size) - 1.
+
+## Memory usage estimation
+
+The main memory requirements of the MSM are the following:
+
+Scalars - scalar_size * msm_size * batch_size
+Scalar indices - unsigned_size * nof_bms * msm_size * batch_size * 6
+Points - affine_size * msm_size * precomp_factor * batch_size
+Buckets - projective_size * nof_bms * 2^c * batch_size
+
+## Example parameters
 
-Our MSM implementation supports two algorithms `Bucket accumulation` and `Large triangle accumulation`.
 
 ### Bucket accumulation
 

From 91ce169b900b936a488c5aae9ef01b3015aa37df Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Tue, 18 Jun 2024 13:47:30 +0300
Subject: [PATCH 02/11] test

---
 docs/docs/icicle/primitives/msm.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index 62099a708..32ce9d98e 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -73,6 +73,7 @@ We follow the bucket method algorithm. The GPU implementation consists of four p
 Large buckets exist in two cases:
 1. When the scalar distribution isn't uniform.
 2. When `c` does not divide the scalar bit-size.
+
 `large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
 The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around log2(msm_size) - 4. However, in most protocols the points are known in advanced and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined we a `c` value around log2(msm_size) - 1.
 

From 3afe4d267fed4ca4c847ad21c0991186aba3027a Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Tue, 18 Jun 2024 16:08:37 +0300
Subject: [PATCH 03/11] docs fixed

---
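
Reviewer note: the memory estimate added in this patch can be written as a small helper, which may be handy when trying parameter combinations. This is a sketch under assumptions: the struct and function names are hypothetical (not part of ICICLE), `bitsize` must be passed explicitly, and the element sizes used in the usage note (32-byte scalars, 96-byte affine points, 144-byte projective points, 253-bit scalars for BLS12-377) are my assumptions rather than values stated in the docs.

```cpp
#include <algorithm>
#include <cstdint>

// Inputs of the estimate; all names here are hypothetical.
struct MsmMemoryInputs {
  std::uint64_t msm_size;          // number of points in a single MSM
  std::uint64_t batch_size;        // number of MSMs in the batch
  unsigned bitsize;                // number of bits of the largest scalar
  unsigned c;                      // window size in bits
  unsigned precompute_factor;      // 1 means no precomputation
  std::uint64_t scalar_bytes;      // sizeof(scalar_t)
  std::uint64_t affine_bytes;      // sizeof(affine_t)
  std::uint64_t projective_bytes;  // sizeof(projective_t)
};

inline unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

// nof_bucket_modules = ceil(ceil(bitsize / c) / precompute_factor)
inline unsigned nof_bucket_modules(unsigned bitsize, unsigned c, unsigned precompute_factor)
{
  return ceil_div(ceil_div(bitsize, c), precompute_factor);
}

// max(scalars + scalar indices, scalars + points + buckets), in bytes.
inline std::uint64_t estimate_msm_bytes(const MsmMemoryInputs& in)
{
  const std::uint64_t bms = nof_bucket_modules(in.bitsize, in.c, in.precompute_factor);
  const std::uint64_t scalars = in.scalar_bytes * in.msm_size * in.batch_size;
  const std::uint64_t indices = 6 * sizeof(unsigned) * bms * in.msm_size * in.batch_size;
  const std::uint64_t points = in.affine_bytes * in.msm_size * in.precompute_factor * in.batch_size;
  const std::uint64_t buckets = in.projective_bytes * bms * (std::uint64_t{1} << in.c) * in.batch_size;
  return std::max(scalars + indices, scalars + points + buckets);
}
```

With the assumed sizes, `estimate_msm_bytes({1ull << 22, 1, 253, 17, 1, 32, 96, 144})` gives about 1.64 GB and `estimate_msm_bytes({1ull << 24, 1, 253, 21, 7, 32, 96, 144})` gives about 12.4 GB, in line with the corresponding rows of the table in this patch.
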
 .../golang-bindings/msm-pre-computation.md    |  47 +++---
 docs/docs/icicle/golang-bindings/msm.md       |   6 +-
 docs/docs/icicle/primitives/msm.md            | 143 +++++++++++++-----
 .../rust-bindings/msm-pre-computation.md      |  33 ++--
 docs/docs/icicle/rust-bindings/msm.md         |   6 +-
 5 files changed, 146 insertions(+), 89 deletions(-)

diff --git a/docs/docs/icicle/golang-bindings/msm-pre-computation.md b/docs/docs/icicle/golang-bindings/msm-pre-computation.md
index 31fde8570..888a01dd5 100644
--- a/docs/docs/icicle/golang-bindings/msm-pre-computation.md
+++ b/docs/docs/icicle/golang-bindings/msm-pre-computation.md
@@ -4,9 +4,9 @@ To understand the theory behind MSM pre computation technique refer to Niall Emm
 
 ## Core package
 
-### MSM PrecomputeBases
+### MSM PrecomputePoints
 
-`PrecomputeBases` and `G2PrecomputeBases` exists for all supported curves.
+`PrecomputePoints` and `G2PrecomputePoints` exist for all supported curves.
 
 #### Description
 
@@ -14,21 +14,20 @@ This function extends each provided base point $(P)$ with its multiples $(2^lP,
 
 The precomputation process is crucial for optimizing MSM operations, especially when dealing with large sets of points and scalars. By precomputing and storing multiples of the base points, the MSM function can more efficiently compute the scalar-point multiplications.
 
-#### `PrecomputeBases`
+#### `PrecomputePoints`
 
-Precomputes bases for MSM by extending each base point with its multiples.
+Precomputes points for MSM by extending each base point with its multiples.
 
 ```go
-func PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
+func PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
 ```
 
 ##### Parameters
 
 - **`points`**: A slice of the original affine points to be extended with their multiples.
-- **`precomputeFactor`**: Determines the total number of points to precompute for each base point.
-- **`c`**: Currently unused; reserved for future compatibility.
-- **`ctx`**: CUDA device context specifying the execution environment.
-- **`outputBases`**: The device slice allocated for storing the extended bases.
+- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
+- **`cfg`**: The MSM configuration parameters.
+- **`outputBases`**: The device slice allocated for storing the extended points.
 
 ##### Example
 
@@ -50,28 +49,27 @@ func main() {
 	var precomputeOut core.DeviceSlice
 	precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
 
-	err := bn254.PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
+	err := bn254.PrecomputePoints(points, 1024, &cfg, precomputeOut)
 	if err != cr.CudaSuccess {
 		log.Fatalf("PrecomputeBases failed: %v", err)
 	}
 }
 ```
 
-#### `G2PrecomputeBases`
+#### `G2PrecomputePoints`
 
-This method is the same as `PrecomputeBases` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
+This method is the same as `PrecomputePoints` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
 
 ```go
-func G2PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
+func G2PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
 ```
 
 ##### Parameters
 
-- **`points`**: A slice of G2 curve points to be extended.
-- **`precomputeFactor`**: The total number of points to precompute for each base.
-- **`c`**: Reserved for future use to ensure compatibility with MSM operations.
-- **`ctx`**: Specifies the CUDA device context for execution.
-- **`outputBases`**: Allocated device slice for the extended bases.
+- **`points`**: A slice of the original affine points to be extended with their multiples.
+- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
+- **`cfg`**: The MSM configuration parameters.
+- **`outputBases`**: The device slice allocated for storing the extended points.
 
 ##### Example
 
@@ -93,20 +91,9 @@ func main() {
 	var precomputeOut core.DeviceSlice
 	precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
 
-	err := g2.G2PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
+	err := g2.G2PrecomputePoints(points, 1024, &cfg, precomputeOut)
 	if err != cr.CudaSuccess {
 		log.Fatalf("PrecomputeBases failed: %v", err)
 	}
 }
 ```
-
-### Benchmarks
-
-Benchmarks where performed on a Nvidia RTX 3090Ti.
-
-| Pre-computation factor | bn254 size `2^20` MSM, ms.  | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
-| ------------- | ------------- | ------------- | ------------- | ------------- |
-| 1  | 14.1  | 82.8  | 25.5  | 136.7  |
-| 2  | 11.8  | 76.6  | 20.3  | 123.8  |
-| 4  | 10.9  | 73.8  | 18.1  | 117.8  |
-| 8  | 10.6  | 73.7  | 17.2  | 116.0  |
diff --git a/docs/docs/icicle/golang-bindings/msm.md b/docs/docs/icicle/golang-bindings/msm.md
index 5b8ee2095..94a2757fc 100644
--- a/docs/docs/icicle/golang-bindings/msm.md
+++ b/docs/docs/icicle/golang-bindings/msm.md
@@ -122,7 +122,7 @@ func GetDefaultMSMConfig() MSMConfig
 
 ## How do I toggle between the supported algorithms?
 
-When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle accumulation and `cfg.Ctx.IsBigTriangle = false` will activate Bucket accumulation.
+When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle reduction and `cfg.Ctx.IsBigTriangle = false` will activate iterative reduction.
 
 ```go
 ...
@@ -152,6 +152,10 @@ out.Malloc(batchSize*p.Size(), p.Size())
 ...
 ```
 
+## Parameters for optimal performance
+
+Please reffer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+
 ## Support for G2 group
 
 To activate G2 support first you must make sure you are building the static libraries with G2 feature enabled as described in the [Golang building instructions](../golang-bindings.md#using-icicle-golang-bindings-in-your-project).
diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index 32ce9d98e..b0641b03a 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -63,58 +63,133 @@ We follow the bucket method algorithm. The GPU implementation consists of four p
 3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default this is done by and iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
 4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
 
-## MSM configuration
+## Batched MSM
+
+The MSM supports batch mode, which runs multiple MSMs in parallel. As long as enough memory is available, batch mode is preferable to running single MSMs serially. A batch can consist of MSMs that share the same points or of MSMs that use different points.
 
+## MSM configuration
 
+```c++
+  /**
+   * @struct MSMConfig
+   * Struct that encodes MSM parameters to be passed into the [MSM](@ref MSM) function. The intended use of this struct
+   * is to create it using [default_msm_config](@ref default_msm_config) function and then you'll hopefully only need to
+   * change a small number of default values for each of your MSMs.
+   */
+  struct MSMConfig {
+    device_context::DeviceContext ctx; /**< Details related to the device such as its id and stream id. */
+    int points_size;         /**< Number of points in the MSM. If a batch of MSMs needs to be computed, this should be
+                              *   a number of different points. So, if each MSM re-uses the same set of points, this
+                              *   variable is set equal to the MSM size. And if every MSM uses a distinct set of
+                              *   points, it should be set to the product of MSM size and [batch_size](@ref
+                              *   batch_size). Default value: 0 (meaning it's equal to the MSM size). */
+    int precompute_factor;   /**< The number of extra points to pre-compute for each point. See the
+                              *   [precompute_msm_points](@ref precompute_msm_points) function, `precompute_factor` passed
+                              *   there needs to be equal to the one used here. Larger values decrease the
+                              *   number of computations to make, on-line memory footprint, but increase the static
+                              *   memory footprint. Default value: 1 (i.e. don't pre-compute). */
+    int c;                   /**< \f$ c \f$ value, or "window bitsize" which is the main parameter of the "bucket
+                              *   method" that we use to solve the MSM problem. As a rule of thumb, larger value
+                              *   means more on-line memory footprint but also more parallelism and less computational
+                              *   complexity (up to a certain point). Currently pre-computation is independent of
+                              *   \f$ c \f$, however in the future value of \f$ c \f$ here and the one passed into the
+                              *   [precompute_msm_points](@ref precompute_msm_points) function will need to be identical.
+                              *    Default value: 0 (the optimal value of \f$ c \f$ is chosen automatically).  */
+    int bitsize;             /**< Number of bits of the largest scalar. Typically equals the bitsize of scalar field,
+                              *   but if a different (better) upper bound is known, it should be reflected in this
+                              *   variable. Default value: 0 (set to the bitsize of scalar field). */
+    int large_bucket_factor; /**< Variable that controls how sensitive the algorithm is to the buckets that occur
+                              *   very frequently. Useful for efficient treatment of non-uniform distributions of
+                              *   scalars and "top windows" with few bits. Can be set to 0 to disable separate
+                              *   treatment of large buckets altogether. Default value: 10. */
+    int batch_size;          /**< The number of MSMs to compute. Default value: 1. */
+    bool are_scalars_on_device;       /**< True if scalars are on device and false if they're on host. Default value:
+                                       *   false. */
+    bool are_scalars_montgomery_form; /**< True if scalars are in Montgomery form and false otherwise. Default value:
+                                       *   true. */
+    bool are_points_on_device; /**< True if points are on device and false if they're on host. Default value: false. */
+    bool are_points_montgomery_form; /**< True if coordinates of points are in Montgomery form and false otherwise.
+                                      *   Default value: true. */
+    bool are_results_on_device; /**< True if the results should be on device and false if they should be on host. If set
+                                 *   to false, `is_async` won't take effect because a synchronization is needed to
+                                 *   transfer results to the host. Default value: false. */
+    bool is_big_triangle;       /**< Whether to do "bucket accumulation" serially. Decreases computational complexity
+                                 *   but also greatly decreases parallelism, so only suitable for large batches of MSMs.
+                                 *   Default value: false. */
+    bool is_async;              /**< Whether to run the MSM asynchronously. If set to true, the MSM function will be
+                                 *   non-blocking and you'd need to synchronize it explicitly by running
+                                 *   `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the MSM
+                                 *   function will block the current CPU thread. */
+  };
+```
 
 ## Choosing optimal parameters
 
-`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than 2^8) with a large batch (larger than 100) but this should be tested per scenario.
+`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than 2^8^) with a large batch (larger than 100) but this should be tested per scenario.
 Large buckets exist in two cases:
 1. When the scalar distribution isn't uniform.
 2. When `c` does not divide the scalar bit-size.
 
 `large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
-The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around log2(msm_size) - 4. However, in most protocols the points are known in advanced and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined we a `c` value around log2(msm_size) - 1.
+The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $\log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $\log_2(msmSize) - 1$.
 
 ## Memory usage estimation
 
 The main memory requirements of the MSM are the following:
 
-Scalars - scalar_size * msm_size * batch_size
-Scalar indices - unsigned_size * nof_bms * msm_size * batch_size * 6
-Points - affine_size * msm_size * precomp_factor * batch_size
-Buckets - projective_size * nof_bms * 2^c * batch_size
-
-## Example parameters
-
-
-### Bucket accumulation
-
-The Bucket Accumulation algorithm is a method of dividing the overall MSM task into smaller, more manageable sub-tasks. It involves partitioning scalars and their corresponding points into different "buckets" based on the scalar values.
+- Scalars: `sizeof(scalar_t) * msm_size * batch_size`
+- Scalar indices: `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
+- Points: `sizeof(affine_t) * msm_size * precompute_factor * batch_size`
+- Buckets: `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
 
-Bucket Accumulation can be more parallel-friendly because it involves dividing the computation into smaller, independent tasks, distributing scalar-point pairs into buckets and summing points within each bucket. This division makes it well suited for parallel processing on GPUs.
+where `nof_bucket_modules =  ceil(ceil(bitsize / c) / precompute_factor)`
 
-#### When should I use Bucket accumulation?
+During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation fot the required memory is the following formula:
 
-In scenarios involving large MSM computations with many scalar-point pairs, the ability to parallelize operations makes Bucket Accumulation more efficient. The larger the MSM task, the more significant the potential gains from parallelization.
+$max(scalars + scalarIndices, scalars + points + buckets)$
 
-### Large triangle accumulation
+This gives a good approximation within 10% of the actuall required memory for most cases.
 
-Large Triangle Accumulation is a method for optimizing MSM which focuses on reducing the number of point doublings in the computation. This algorithm is based on the observation that the number of point doublings can be minimized by structuring the computation in a specific manner.
-
-#### When should I use Large triangle accumulation?
-
-The Large Triangle Accumulation algorithm is more sequential in nature, as it builds upon each step sequentially (accumulating sums and then performing doubling). This structure can make it less suitable for parallelization but potentially more efficient for a **large batch of smaller MSM computations**.
-
-## MSM Modes
-
-ICICLE MSM also supports two different modes `Batch MSM` and `Single MSM`
-
-Batch MSM allows you to run many MSMs with a single API call while single MSM will launch a single MSM computation.
-
-### Which mode should I use?
-
-This decision is highly dependent on your use case and design. However, if your design allows for it, using batch mode can significantly improve efficiency. Batch processing allows you to perform multiple MSMs simultaneously, leveraging the parallel processing capabilities of GPUs.
+## Example parameters
 
-Single MSM mode should be used when batching isn't possible or when you have to run a single MSM.
+The table below shows the parameters that gave the best results for different MSMs on the BLS12-377 curve, running on an NVIDIA GeForce RTX 3090 Ti. This is the configuration used:
+
+```c++
+  msm::MSMConfig config = {
+    ctx,            // DeviceContext
+    N,              // points_size
+    precomp_factor, // precompute_factor
+    user_c,         // c
+    0,              // bitsize
+    10,             // large_bucket_factor
+    batch_size,     // batch_size
+    false,          // are_scalars_on_device
+    false,          // are_scalars_montgomery_form
+    true,           // are_points_on_device
+    false,          // are_points_montgomery_form
+    true,           // are_results_on_device
+    false,          // is_big_triangle
+    true            // is_async
+  };
+```
+
+Here are the parameters and the results for the different cases:
+
+| MSM size (log2) | Batch size | Precompute factor | c | Memory estimation (GB) | Actual memory (GB) | Single MSM time (ms) |
+| --- | --- | --- | --- | --- | --- | --- |
+| 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
+| 10 | 1 | 23 | 11 | 0.00227 | 0.00272 | 1.76 |
+| 10 | 1000 | 1 | 7 | 0.00227 | 1.09 | 0.051 |
+| 10 | 1000 | 23 | 11 | 0.00227 | 2.74 | 0.025 |
+| 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
+| 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |
+| 15 | 100 | 1 | 11 | 1.91 | 1.92 | 0.84 |
+| 15 | 100 | 19 | 14 | 6.32 | 6.61 | 0.56 |
+| 18 | 1 | 1 | 14 | 0.128 | 0.128 | 14.4 |
+| 18 | 1 | 15 | 17 | 0.40 | 0.42 | 5.9 |
+| 22 | 1 | 1 | 17 | 1.64 | 1.65 | 68 |
+| 22 | 1 | 13 | 21 | 5.67 | 5.94 | 54 |
+| 24 | 1 | 1 | 18 | 6.58 | 6.61 | 232 |
+| 24 | 1 | 7 | 21 | 12.4 | 13.4 | 199 |
+
+The optimal values can vary per GPU and per curve. It is best to try a few combinations untill you get the best results for your specific case.
diff --git a/docs/docs/icicle/rust-bindings/msm-pre-computation.md b/docs/docs/icicle/rust-bindings/msm-pre-computation.md
index 37d33b590..687704fb2 100644
--- a/docs/docs/icicle/rust-bindings/msm-pre-computation.md
+++ b/docs/docs/icicle/rust-bindings/msm-pre-computation.md
@@ -2,26 +2,24 @@
 
 To understand the theory behind MSM pre computation technique refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
 
-## `precompute_bases`
+## `precompute_points`
 
 Precomputes bases for the multi-scalar multiplication (MSM) by extending each base point with its multiples, facilitating more efficient MSM calculations.
 
 ```rust
-pub fn precompute_bases<C: Curve + MSM<C>>(
-    points: &HostOrDeviceSlice<Affine<C>>,
-    precompute_factor: i32,
-    _c: i32,
-    ctx: &DeviceContext,
-    output_bases: &mut HostOrDeviceSlice<Affine<C>>,
+pub fn precompute_points<C: Curve + MSM<C>>(
+    points: &(impl HostOrDeviceSlice<Affine<C>> + ?Sized),
+    msm_size: i32,
+    cfg: &MSMConfig,
+    output_bases: &mut DeviceSlice<Affine<C>>,
 ) -> IcicleResult<()>
 ```
 
 ### Parameters
 
 - **`points`**: The original set of affine points (\(P_1, P_2, ..., P_n\)) to be used in the MSM. For batch MSM operations, this should include all unique points concatenated together.
-- **`precompute_factor`**: Specifies the total number of points to precompute for each base, including the base point itself. This parameter directly influences the memory requirements and the potential speedup of the MSM operation.
-- **`_c`**: Currently unused. Intended for future use to align with the `c` parameter in `MSMConfig`, ensuring the precomputation is compatible with the bucket method's window size used in MSM.
-- **`ctx`**: The device context specifying the device ID and stream for execution. This context determines where the precomputation is performed (e.g., on a specific GPU).
+- **`msm_size`**: The size of a single MSM, used to determine optimal parameters.
+- **`cfg`**: The MSM configuration parameters.
 - **`output_bases`**: The output buffer for the extended bases. Its size must be `points.len() * precompute_factor`. This buffer should be allocated on the device for GPU computations.
 
 #### Returns
@@ -37,22 +35,11 @@ The precomputation process is crucial for optimizing MSM operations, especially
 #### Example Usage
 
 ```rust
-let device_context = DeviceContext::default_for_device(0); // Use the default device
+let cfg = MSMConfig::default();
 let precompute_factor = 4; // Number of points to precompute
 let mut extended_bases = HostOrDeviceSlice::cuda_malloc(expected_size).expect("Failed to allocate memory for extended bases");
 
 // Precompute the bases using the specified factor
-precompute_bases(&points, precompute_factor, 0, &device_context, &mut extended_bases)
+precompute_points(&points, msm_size, &cfg, &mut extended_bases)
     .expect("Failed to precompute bases");
 ```
-
-### Benchmarks
-
-Benchmarks where performed on a Nvidia RTX 3090Ti.
-
-| Pre-computation factor | bn254 size `2^20` MSM, ms.  | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
-| ------------- | ------------- | ------------- | ------------- | ------------- |
-| 1  | 14.1  | 82.8  | 25.5  | 136.7  |
-| 2  | 11.8  | 76.6  | 20.3  | 123.8  |
-| 4  | 10.9  | 73.8  | 18.1  | 117.8  |
-| 8  | 10.6  | 73.7  | 17.2  | 116.0  |
diff --git a/docs/docs/icicle/rust-bindings/msm.md b/docs/docs/icicle/rust-bindings/msm.md
index 3541e00c2..4ffa3dffa 100644
--- a/docs/docs/icicle/rust-bindings/msm.md
+++ b/docs/docs/icicle/rust-bindings/msm.md
@@ -100,7 +100,7 @@ When performing MSM operations, it's crucial to match the size of the `scalars`
 
 ## How do I toggle between the supported algorithms?
 
-When creating your MSM Config you may state which algorithm you wish to use. `is_big_triangle=true` will activate Large triangle accumulation and `is_big_triangle=false` will activate Bucket accumulation.
+When creating your MSM Config you may state which algorithm you wish to use. `is_big_triangle=true` will activate Large triangle reduction and `is_big_triangle=false` will activate iterative reduction.
 
 ```rust
 ...
@@ -144,6 +144,10 @@ msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
 
 Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L108) to the code which automatically sets the batch size. For more MSM examples have a look [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/examples/rust/msm/src/main.rs#L1).
 
+## Parameters for optimal performance
+
+Please reffer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+
 ## Support for G2 group
 
 MSM also supports G2 group.

From 5a4413e77af5c4f5f0876d0beccfec6d8249e61c Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Tue, 18 Jun 2024 16:13:55 +0300
Subject: [PATCH 04/11] spell-check

---
 docs/docs/icicle/golang-bindings/msm.md | 2 +-
 docs/docs/icicle/primitives/msm.md      | 8 ++++----
 docs/docs/icicle/rust-bindings/msm.md   | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/docs/icicle/golang-bindings/msm.md b/docs/docs/icicle/golang-bindings/msm.md
index 94a2757fc..089c7ded4 100644
--- a/docs/docs/icicle/golang-bindings/msm.md
+++ b/docs/docs/icicle/golang-bindings/msm.md
@@ -154,7 +154,7 @@ out.Malloc(batchSize*p.Size(), p.Size())
 
 ## Parameters for optimal performance
 
-Please reffer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+Please refer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
 
 ## Support for G2 group
 
diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index b0641b03a..e62b9f6bf 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -59,7 +59,7 @@ You can learn more about how MSMs work from this [video](https://www.youtube.com
 We follow the bucket method algorithm. The GPU implementation consists of four phases:
 
 1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
-2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected avarage bucket size times the `large_bucket_factor` parameter.
+2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected average bucket size times the `large_bucket_factor` parameter.
 3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default this is done by and iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
 4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
 
@@ -144,11 +144,11 @@ Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
 
 where `nof_bucket_modules =  ceil(ceil(bitsize / c) / precompute_factor)`
 
-During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation fot the required memory is the following formula:
+During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation for the required memory is the following formula:
 
 $max(scalars + scalarIndices, scalars + points + buckets)$
 
-This gives a good approximation within 10% of the actuall required memory for most cases.
+This gives a good approximation within 10% of the actual required memory for most cases.
 
 ## Example parameters
 
@@ -192,4 +192,4 @@ Here are the parameters and the results for the different cases:
 | 24 | 1 | 1 | 18 | 6.58 | 6.61 | 232 |
 | 24 | 1 | 7 | 21 | 12.4 | 13.4 | 199 |
 
-The optimal values can vary per GPU and per curve. It is best to try a few combinations untill you get the best results for your specific case.
+The optimal values can vary per GPU and per curve. It is best to try a few combinations until you get the best results for your specific case.
diff --git a/docs/docs/icicle/rust-bindings/msm.md b/docs/docs/icicle/rust-bindings/msm.md
index 4ffa3dffa..f9e116f4e 100644
--- a/docs/docs/icicle/rust-bindings/msm.md
+++ b/docs/docs/icicle/rust-bindings/msm.md
@@ -146,7 +146,7 @@ Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961
 
 ## Parameters for optimal performance
 
-Please reffer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+Please refer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
 
 ## Support for G2 group
 

From 250bf90cb44fcbc9ddec0a2439065132df5858f9 Mon Sep 17 00:00:00 2001
From: HadarIngonyama <102164010+HadarIngonyama@users.noreply.github.com>
Date: Tue, 18 Jun 2024 16:22:13 +0300
Subject: [PATCH 05/11] Update docs/docs/icicle/primitives/msm.md

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
---
 docs/docs/icicle/primitives/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index e62b9f6bf..964ec29e9 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -60,7 +60,7 @@ We follow the bucket method algorithm. The GPU implementation consists of four p
 
 1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
 2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected average bucket size times the `large_bucket_factor` parameter.
-3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default this is done by and iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
+3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default, this is done by an iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
 4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
 
 ## Batched MSM

From c4fcaac2d7eadde2f5b64f3876d8d84ba8d802c3 Mon Sep 17 00:00:00 2001
From: HadarIngonyama <102164010+HadarIngonyama@users.noreply.github.com>
Date: Tue, 18 Jun 2024 16:22:30 +0300
Subject: [PATCH 06/11] Update docs/docs/icicle/rust-bindings/msm.md

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
---
 docs/docs/icicle/rust-bindings/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/rust-bindings/msm.md b/docs/docs/icicle/rust-bindings/msm.md
index f9e116f4e..8cfccf053 100644
--- a/docs/docs/icicle/rust-bindings/msm.md
+++ b/docs/docs/icicle/rust-bindings/msm.md
@@ -146,7 +146,7 @@ Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961
 
 ## Parameters for optimal performance
 
-Please refer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
 
 ## Support for G2 group
 

From 077e090b43e58daa29d6d2e38f8887b0481fb257 Mon Sep 17 00:00:00 2001
From: HadarIngonyama <102164010+HadarIngonyama@users.noreply.github.com>
Date: Tue, 18 Jun 2024 16:22:41 +0300
Subject: [PATCH 07/11] Update docs/docs/icicle/golang-bindings/msm.md

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
---
 docs/docs/icicle/golang-bindings/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/golang-bindings/msm.md b/docs/docs/icicle/golang-bindings/msm.md
index 089c7ded4..72710c551 100644
--- a/docs/docs/icicle/golang-bindings/msm.md
+++ b/docs/docs/icicle/golang-bindings/msm.md
@@ -154,7 +154,7 @@ out.Malloc(batchSize*p.Size(), p.Size())
 
 ## Parameters for optimal performance
 
-Please refer to the [primitive description](../primitives/msm.md#choosing-optimal-parameters)
+Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
 
 ## Support for G2 group
 

From f8f1e5afb66616a1af005e536f1ee08c7a44dc67 Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Tue, 18 Jun 2024 16:35:36 +0300
Subject: [PATCH 08/11] small fix

---
 docs/docs/icicle/primitives/msm.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index 964ec29e9..571b3102b 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -137,10 +137,10 @@ The two most important parameters for performance are `c` and the `precompute_fa
 
 The main memory requirements of the MSM are the following:
 
-Scalars - `sizeof(scalar_t) * msm_size * batch_size`
-Scalar indices - `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
-Points - `sizeof(affine_t) * msm_size * precomp_factor * batch_size`
-Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
+- Scalars - `sizeof(scalar_t) * msm_size * batch_size`
+- Scalar indices - `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
+- Points - `sizeof(affine_t) * msm_size * precomp_factor * batch_size`
+- Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
 
 where `nof_bucket_modules =  ceil(ceil(bitsize / c) / precompute_factor)`
 
@@ -178,9 +178,9 @@ Here are the parameters and the results for the different cases:
 | MSM size | Batch size | Precompute factor | c | Memory estimation (GB) | Actual memory (GB) | Single MSM time (ms) |
 | --- | --- | --- | --- | --- | --- | --- |
 | 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
-| 10 | 1 | 23 | 11 | 0.00227 | 0.00272 | 1.76 |
-| 10 | 1000 | 1 | 7 | 0.00227 | 1.09 | 0.051 |
-| 10 | 1000 | 23 | 11 | 0.00227 | 2.74 | 0.025 |
+| 10 | 1 | 23 | 11 | 0.00259 | 0.00272 | 1.76 |
+| 10 | 1000 | 1 | 7 | 0.094 | 1.09 | 0.051 |
+| 10 | 1000 | 23 | 11 | 2.59 | 2.74 | 0.025 |
 | 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
 | 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |
 | 15 | 100 | 1 | 11 | 1.91 | 1.92 | 0.84 |

From 67d6e6f9861f4289150c4fe6f8e563423a6742de Mon Sep 17 00:00:00 2001
From: hadaringonyama <hadar@ingonyama.com>
Date: Wed, 19 Jun 2024 10:58:40 +0300
Subject: [PATCH 09/11] small fix

---
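Note: the corrected estimate (0.94 instead of 0.094) can be cross-checked against the documented formula `max(scalars + scalarIndices, scalars + points + buckets)`. The sketch below is only a sanity check, not ICICLE code. It assumes BLS12-377 sizes that the docs do not state (32-byte scalars, 96-byte affine points, 144-byte projective points, 4-byte `unsigned`, 253-bit scalars), reads the "MSM size" column as log2 of the number of points, and takes GB as 10^9 bytes. `CurveSizes`, `kBls12_377`, `ceil_div` and `estimate_msm_bytes` are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed element sizes in bytes and scalar bit length (not taken from the docs).
struct CurveSizes {
  std::uint64_t scalar;     // sizeof(scalar_t)
  std::uint64_t affine;     // sizeof(affine_t)
  std::uint64_t projective; // sizeof(projective_t)
  std::uint64_t index;      // sizeof(unsigned)
  std::uint64_t bitsize;    // scalar bit length
};

constexpr CurveSizes kBls12_377 = {32, 96, 144, 4, 253};

constexpr std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) { return (a + b - 1) / b; }

// Evaluates max(scalars + scalarIndices, scalars + points + buckets) in bytes,
// using the four terms exactly as they are written in the docs.
constexpr std::uint64_t estimate_msm_bytes(
  const CurveSizes& s, std::uint64_t msm_size, std::uint64_t batch_size,
  std::uint64_t precompute_factor, std::uint64_t c)
{
  const std::uint64_t nof_bucket_modules = ceil_div(ceil_div(s.bitsize, c), precompute_factor);
  const std::uint64_t scalars = s.scalar * msm_size * batch_size;
  // The docs give this term as "~6 * ...", so it is approximate by construction.
  const std::uint64_t scalar_indices = 6 * s.index * nof_bucket_modules * msm_size * batch_size;
  const std::uint64_t points = s.affine * msm_size * precompute_factor * batch_size;
  const std::uint64_t buckets = s.projective * nof_bucket_modules * (std::uint64_t{1} << c) * batch_size;
  return std::max(scalars + scalar_indices, scalars + points + buckets);
}

// First four table rows (MSM size 2^10), checked at compile time.
static_assert(estimate_msm_bytes(kBls12_377, 1ULL << 10, 1, 1, 9) == 2269184ULL, "0.00227 GB");
static_assert(estimate_msm_bytes(kBls12_377, 1ULL << 10, 1, 23, 11) == 2588672ULL, "0.00259 GB");
static_assert(estimate_msm_bytes(kBls12_377, 1ULL << 10, 1000, 1, 7) == 942080000ULL, "0.94 GB");
static_assert(estimate_msm_bytes(kBls12_377, 1ULL << 10, 1000, 23, 11) == 2588672000ULL, "2.59 GB");
```

Under these assumptions the third row comes out at about 0.94 GB, and it is the one row of the four where the `scalars + scalarIndices` branch of the `max` dominates.
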
 docs/docs/icicle/primitives/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index 571b3102b..a3fb03548 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -179,7 +179,7 @@ Here are the parameters and the results for the different cases:
 | --- | --- | --- | --- | --- | --- | --- |
 | 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
 | 10 | 1 | 23 | 11 | 0.00259 | 0.00272 | 1.76 |
-| 10 | 1000 | 1 | 7 | 0.094 | 1.09 | 0.051 |
+| 10 | 1000 | 1 | 7 | 0.94 | 1.09 | 0.051 |
 | 10 | 1000 | 23 | 11 | 2.59 | 2.74 | 0.025 |
 | 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
 | 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |

From 73f39b951b5abf99cd169c6dde6ba54273802820 Mon Sep 17 00:00:00 2001
From: HadarIngonyama <102164010+HadarIngonyama@users.noreply.github.com>
Date: Wed, 19 Jun 2024 11:26:30 +0300
Subject: [PATCH 10/11] Update docs/docs/icicle/primitives/msm.md

Co-authored-by: Leon Hibnik <107353745+LeonHibnik@users.noreply.github.com>
---
 docs/docs/icicle/primitives/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index a3fb03548..82e086f17 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -69,7 +69,7 @@ The MSM supports batch mode - running multiple MSMs in parallel. It's always bet
 
 ## MSM configuration
 
-```c++
+```cpp
   /**
    * @struct MSMConfig
    * Struct that encodes MSM parameters to be passed into the [MSM](@ref MSM) function. The intended use of this struct

From 79e055dc0ac102900a823b383024b7aab872138c Mon Sep 17 00:00:00 2001
From: HadarIngonyama <102164010+HadarIngonyama@users.noreply.github.com>
Date: Wed, 19 Jun 2024 11:26:39 +0300
Subject: [PATCH 11/11] Update docs/docs/icicle/primitives/msm.md

Co-authored-by: Leon Hibnik <107353745+LeonHibnik@users.noreply.github.com>
---
 docs/docs/icicle/primitives/msm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/icicle/primitives/msm.md b/docs/docs/icicle/primitives/msm.md
index 82e086f17..7cb4a0abf 100644
--- a/docs/docs/icicle/primitives/msm.md
+++ b/docs/docs/icicle/primitives/msm.md
@@ -154,7 +154,7 @@ This gives a good approximation within 10% of the actual required memory for mos
 
 Here is a useful table showing optimal parameters for different MSMs. They are optimal for BLS12-377 curve when running on NVIDIA GeForce RTX 3090 Ti. This is the configuration used:
 
-```c++
+```cpp
   msm::MSMConfig config = {
     ctx,            // DeviceContext
     N,              // points_size