diff --git a/docs/README.md b/docs/README.md
index 9610d11ed..64ef3a89a 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,39 +1,155 @@
-# Website
+# ICICLE Developer Docs
 
-This website is built using [Docusaurus 2](https://docusaurus.io/), a modern static website generator.
+The ICICLE developer docs are a static website built using [Docusaurus](https://docusaurus.io/).
 
-### Installation
+## Requirements
 
-```
-$ npm i
-```
+Docusaurus is written in TypeScript and distributed as npm packages, so both Node.js and npm are prerequisites.
+
+If Node.js or npm isn't installed, we suggest using [nvm](https://github.com/nvm-sh/nvm?tab=readme-ov-file#installing-and-updating) to [install both](https://github.com/nvm-sh/nvm?tab=readme-ov-file#usage) at the same time.
 
-### Local Development
+## Install
 
+```sh
+npm install
 ```
-$ npm start
+
+## Versioning
+
+ICICLE docs are versioned, keeping the latest set of docs for previous major versions and the latest 4 sets of docs for the current major version.
+
+The [docs](./docs/) directory holds the next version's docs.
+All **released** versions are under the [versioned_docs](./versioned_docs/) directory.
+
+### Releasing new versions
+
+In order to create a new version, run the following:
+
+```sh
+npm run docusaurus docs:version <version to create>
 ```
 
-This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
+For example:
 
-### Build
+Assuming the next version is 5.6.0, we would run the following:
 
+```sh
+npm run docusaurus docs:version 5.6.0
 ```
-$ npm run build
+
+This command will:
+
+1. Add a new version for the specified `<version to create>` in the [versions file](./versions.json)
+2. Create a directory under [versioned_docs](./versioned_docs/) with the name `version-<version to create>` and copy everything in the [docs](./docs/) directory there.
+3. Create a file under [versioned_sidebars](./versioned_sidebars/) with the name `version-<version to create>-sidebars.json` and copy the current [sidebars.ts](./sidebars.ts) file there after converting it to a JSON object.
+
+### Removing old versions
+
+- Remove the version from versions.json.
+
+```json
+  [
+    "3.2.0",
+    "3.1.0",
+    "3.0.0",
+    "2.8.0",
+  - "1.10.1"
+  ]
 ```
 
-This command generates static content into the `build` directory and can be served using any static contents hosting service.
+- Delete the versioned docs directory for that version. Example: versioned_docs/version-1.10.1.
+- Delete the versioned sidebars file. Example: versioned_sidebars/version-1.10.1-sidebars.json.
+
+## Static assets
 
-### Deployment
+Static assets like images should be placed in the top level [static](./static/) directory **regardless** of which version they will be used in.
 
-Using SSH:
+Docusaurus adds all of the files in the directories listed as `staticDirectories` in the config to the root of the build output so they can be accessed directly from the root path.
 
+Read more on this in the [Docusaurus static assets docs](https://docusaurus.io/docs/static-assets).
+
+### Adding a new static directory
+
+To update where Docusaurus looks for static directories, add the directory name to the `staticDirectories` list in the config:
+
+```ts
+const config: Config = {
+  .
+  .
+  .
+  staticDirectories: ['static'/*, "another static dir" */],
+  .
+  .
+  .
+}
 ```
-$ USE_SSH=true npm run deploy
+
+### Linking to static assets in docs
+
+Since the static assets are at the root of the build output, static assets can be linked to directly from the root, maintaining the directory hierarchy they have in the static directory.
+
+For example:
+
+If an image is located at `static/images/poseidon.png`, it should be linked to as `/images/poseidon.png`.
+
+## Local development
+
+To render the site, run the following:
+
+```sh
+npm start
 ```
 
-Not using SSH:
+This command starts a local development server and opens up a browser window on port 3000. Most changes are reflected live (hot reloaded) without having to restart the server.
 
+By default, the next version's docs are not rendered. In order to view any changes in the next version's docs, update the following in the [config file](./docusaurus.config.ts):
+
+```ts
+const ingoPreset = {
+  docs: {
+    .
+    .
+    includeCurrentVersion: false, // Update this to true to render the next version's docs
+    .
+    .
+    .
+
+  },
+  .
+  .
+  .
+} satisfies Preset.Options
 ```
-$ GIT_USER=<Your GitHub username> npm run deploy
+
+### Updating docs in old versions
+
+In order to update docs for old versions, the files under the specific version's [versioned_docs](./versioned_docs/) directory must be updated.
+
+### Updating docs across versions
+
+If docs need updating across multiple versions, including future versions, they need to be updated in each previous version's [versioned_docs](./versioned_docs/) directory and in the next version's docs under the [docs](./docs/) directory.
+
+## Adding or removing docs from rendering
+
+Each released version has its own sidebars file (`version-<version>-sidebars.json`) located in the [versioned_sidebars](./versioned_sidebars/) directory.
+
+The next version's sidebar is found in [sidebars.ts](./sidebars.ts).
+
+You can add or remove a doc from there to change the sidebar and include or prevent docs from rendering.
+
+## Troubleshooting
+
+### LaTeX isn't rendering correctly
+
+LaTeX formulas must have the opening and closing `$$` on separate lines:
+
+```mdx
+$$
+M_{4} = \begin{pmatrix}
+5 & 7 & 1 & 3 \\
+4& 6 & 1 & 1 \\
+1 & 3 & 5 & 7\\
+1 & 1 & 4 & 6\\
+\end{pmatrix}
+$$
 ```
diff --git a/docs/docs/icicle/colab-instructions.md b/docs/docs/icicle/colab-instructions.md
index ff2f3d45f..5ed744362 100644
--- a/docs/docs/icicle/colab-instructions.md
+++ b/docs/docs/icicle/colab-instructions.md
@@ -9,11 +9,11 @@ First thing to do in a notebook is to set the runtime type to a T4 GPU.
 
 - in the upper corner click on the dropdown menu and select "change runtime type"
 
-![Change runtime](./static/img/colab_change_runtime.png)
+![Change runtime](/img/colab_change_runtime.png)
 
 - In the window select "T4 GPU" and press Save
 
-![T4 GPU](./static/img/t4_gpu.png)
+![T4 GPU](/img/t4_gpu.png)
 
 Installing Rust is rather simple, just execute the following command:
 
diff --git a/docs/docs/icicle/integrations.md b/docs/docs/icicle/integrations.md
index bb37a7284..3a9e29cd5 100644
--- a/docs/docs/icicle/integrations.md
+++ b/docs/docs/icicle/integrations.md
@@ -10,7 +10,7 @@ If you're interested in understanding these integrations better or learning how
 
 Lets illustrate an ICICLE integration, so you can understand the core API and design overview of ICICLE.
 
-![ICICLE architecture](./static/img/architecture-high-level.png)
+![ICICLE architecture](/img/architecture-high-level.png)
 
 Engineers usually use a cryptographic library to implement their ZK protocols. These libraries implement efficient primitives which are used as building blocks for the protocol; ICICLE is such a library. The difference is that ICICLE is designed from the start to run on GPUs; the Rust and Golang APIs abstract away all low level CUDA details. Our goal was to allow developers with no GPU experience to quickly get started with ICICLE.
 
diff --git a/docs/docs/icicle/libraries.md b/docs/docs/icicle/libraries.md
index eafcc5975..3f6295bc8 100644
--- a/docs/docs/icicle/libraries.md
+++ b/docs/docs/icicle/libraries.md
@@ -33,6 +33,7 @@ Each library has a corresponding crate. See [programmers guide](./programmers_gu
 | [Vector operations](./primitives/vec_ops) |                      ✅                       |                           ✅                           |                           ✅                           |                      ✅                      |    ✅     |
 | [Polynomials](./polynomials/overview)     |                      ✅                       |                           ✅                           |                           ✅                           |                      ✅                      |    ❌     |
 | [Poseidon](primitives/hash#poseidon)      |                      ✅                       |                           ✅                           |                           ✅                           |                      ✅                      |    ✅     |
+| [Poseidon2](primitives/hash#poseidon2)    |                      ✅                       |                           ✅                           |                           ✅                           |                      ✅                      |    ✅     |
 
 #### Supported fields and operations
 
@@ -43,6 +44,7 @@ Each library has a corresponding crate. See [programmers guide](./programmers_gu
 | [NTT](primitives/ntt)                     |                        ✅                         |                                                 ✅                                                  |   ❌   |
 | Extension Field                           |                        ✅                         |                                                 ❌                                                  |   ✅   |
 | [Poseidon](primitives/hash#poseidon)      |                        ✅                         |                                                 ✅                                                  |   ✅   |
+| [Poseidon2](primitives/hash#poseidon2)    |                        ✅                         |                                                 ✅                                                  |   ✅   |
 
 ### Misc
 
diff --git a/docs/docs/icicle/primitives/hash.md b/docs/docs/icicle/primitives/hash.md
index 5e7923bde..702487a5a 100644
--- a/docs/docs/icicle/primitives/hash.md
+++ b/docs/docs/icicle/primitives/hash.md
@@ -18,13 +18,13 @@ For scenarios where large datasets need to be hashed efficiently, ICICLE support
 
 ICICLE supports the following hash functions:
 
-1.  **Keccak-256**
-2.	**Keccak-512**
-3.	**SHA3-256**
-4.	**SHA3-512**
-5.	**Blake2s**
-6.	**Poseidon**
-7.	**Poseidon2**
+1. **Keccak-256**
+2. **Keccak-512**
+3. **SHA3-256**
+4. **SHA3-512**
+5. **Blake2s**
+6. **Poseidon**
+7. **Poseidon2**
 
 :::info
 Additional hash functions might be added in the future. Stay tuned!
@@ -40,18 +40,16 @@ Keccak can take input messages of any length and produce a fixed-size hash. It u
 
 [Blake2s](https://www.rfc-editor.org/rfc/rfc7693.txt) is an optimized cryptographic hash function that provides high performance while ensuring strong security. Blake2s is ideal for hashing small data (such as field elements), especially when speed is crucial. It produces a 256-bit (32-byte) output and is often used in cryptographic protocols.
 
-
 ### Poseidon
 
 [Poseidon](https://eprint.iacr.org/2019/458) is a cryptographic hash function designed specifically for field elements. It is highly optimized for zero-knowledge proofs (ZKPs) and is commonly used in ZK-SNARK systems. Poseidon’s main strength lies in its arithmetization-friendly design, meaning it can be efficiently expressed as arithmetic constraints within a ZK-SNARK circuit.
 
 Traditional hash functions, such as SHA-2, are difficult to represent within ZK circuits because they involve complex bitwise operations that don’t translate efficiently into arithmetic operations. Poseidon, however, is specifically designed to minimize the number of constraints required in these circuits, making it significantly more efficient for use in ZK-SNARKs and other cryptographic protocols that require hashing over field elements.
 
-Currently the Poseidon implementation is the Optimized Poseidon (https://hackmd.io/@jake/poseidon-spec#Optimized-Poseidon). Optimized Poseidon significantly decreases the calculation time of the hash.
+Currently the Poseidon implementation is the [Optimized Poseidon](https://hackmd.io/@jake/poseidon-spec#Optimized-Poseidon). Optimized Poseidon significantly decreases the calculation time of the hash.
 
 The optional `domain_tag` pointer parameter enables domain separation, allowing isolation of hash outputs across different contexts or applications.
 
-
 ### Poseidon2
 
 [Poseidon2](https://eprint.iacr.org/2023/323.pdf) is a cryptographic hash function designed specifically for field elements.
@@ -59,6 +57,19 @@ It is an improved version of the original [Poseidon](https://eprint.iacr.org/201
 
 The optional `domain_tag` pointer parameter enables domain separation, allowing isolation of hash outputs across different contexts or applications.
 
+:::info
+
+The supported values of state size ***t*** as defined in [eprint 2023/323](https://eprint.iacr.org/2023/323.pdf) are 2, 3, 4, 8, 12, 16, 20 and 24. Note that ***t*** sizes 8, 12, 16, 20 and 24 are supported only for small fields (babybear and m31).
+
+:::
+
+:::info
+
+The S-box power alpha, the number of full and partial rounds, the round constants, the MDS matrix, and the partial matrix for each field and ***t*** can be found in this [folder](https://github.com/ingonyama-zk/icicle/tree/9b1506cda9eab30fc6a8d0a338e2cfab877402f7/icicle/include/icicle/hash/poseidon2_constants/constants).
+
+:::
+
+In the current version, padding is not supported and must be performed by the user.
 
 ## Using Hash API
 
@@ -84,21 +95,20 @@ auto poseidon = Poseidon::create<scalar_t>(t);
 // The domain tag acts as the first input to the hash function, with the remaining t-1 inputs following it.
 scalar_t domain_tag = scalar_t::zero(); // Example using zero; this can be set to any valid field element.
 auto poseidon_with_domain_tag = Poseidon::create<scalar_t>(t, &domain_tag);
-// This version of the hasher with a domain tag expects t-1 additional inputs for hashing.
 // Poseidon2 requires specifying the field-type and t parameter (supported 2, 3, 4, 8, 12, 16, 20, 24) as defined by
-// the Poseidon2 paper. For large fields (field width >= 254) the supported values of t are 2, 3, 4.
+// the Poseidon2 paper. For large fields (field width >= 252) the supported values of t are 2, 3, 4.
 auto poseidon2 = Poseidon2::create<scalar_t>(t); 
 // Optionally, Poseidon2 can accept a domain-tag, which is a field element used to separate applications or contexts.
 // The domain tag acts as the first input to the hash function, with the remaining t-1 inputs following it.
 scalar_t domain_tag = scalar_t::zero(); // Example using zero; this can be set to any valid field element.
 auto poseidon2_with_domain_tag = Poseidon2::create<scalar_t>(t, &domain_tag);
-// This version of the hasher with a domain tag expects t-1 additional inputs for hashing.
+// This version of the hasher with a domain tag expects t-1 inputs per hasher.
 ```
 
 ### 2. Hashing Data
 
 Once you have a hasher object, you can hash any input data by passing the input, its size, a configuration, and an output buffer:
-   
+
 ```cpp
 /**
  * @brief Perform a hash operation.
@@ -160,11 +170,11 @@ eIcicleErr err = keccak256.hash(input.data(), input.size() / config.batch, confi
 
 ### 4. Poseidon sponge function
 
-Currently the poseidon sponge function (sponge function description could be found in Sec 2.1 of https://eprint.iacr.org/2019/458.pdf ) isn't implemented.
+Currently the Poseidon sponge mode (described in Sec 2.1 of [eprint 2019/458](https://eprint.iacr.org/2019/458.pdf)) isn't implemented.
 
 ### 5. Poseidon2 sponge function
 
-Currently the poseidon2 sponge function (sponge function description could be found in Sec 2.1 of https://eprint.iacr.org/2019/458.pdf ) isn't implemented.
+Currently Poseidon2 is implemented in compression mode only; the sponge mode discussed in [eprint 2023/323](https://eprint.iacr.org/2023/323.pdf) is not implemented.
 
 ### Supported Bindings
 
diff --git a/docs/docs/icicle/primitives/poseidon2.md b/docs/docs/icicle/primitives/poseidon2.md
index aa4dfed95..e3d5733b5 100644
--- a/docs/docs/icicle/primitives/poseidon2.md
+++ b/docs/docs/icicle/primitives/poseidon2.md
@@ -1,22 +1,17 @@
 # Poseidon2
 
-TODO update for V3
+[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups. Since the compression mode is efficient it is ideal for use in Merkle trees as well.
 
-[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
+An overview of the Poseidon2 hash is provided in the diagram below:
 
+![Poseidon2 overview diagram](/img/Poseidon2.png)
 
-## Using Poseidon2
+## Description
 
-ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
-We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
+### Round constants
 
-For example, for Poseidon2 of width 16, input rate 8, output elements 8 and input of size 1024 * 8, we would expect 1024 * 8 elements of output. Which means each input block would be of size 8, resulting in 1024 Poseidon2 hashes being performed.
-
-### Supported Bindings
-
-[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
-
-### Constants
+* In the first and last full rounds, the round constants have the structure $[c_0,c_1,\ldots , c_{t-1}]$, where $c_i\in \mathbb{F}$.
+* In the partial rounds, the round constant is added only to the first element: $[\tilde{c}_0,0,0,\ldots, 0_{t-1}]$, where $\tilde{c}_0\in \mathbb{F}$.
 
 Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
 
@@ -24,68 +19,161 @@ We support pre-calculated constants for each of the [supported curves](../librar
 
 You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290)
 
-### Rust API
+### S box
 
-This is the most basic way to use the Poseidon2 API.
+For a given prime $p$, $\alpha$ is the smallest integer $\alpha \geq 3$ such that $\gcd(\alpha,p-1)=1$.
 
-```rust
-let test_size = 1 << 10;
-let width = 16;
-let rate = 8;
-let ctx = get_default_device_context();
-let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
-let config = HashConfig::default();
+For ICICLE supported curves/fields:
 
-let inputs = vec![F::one(); test_size * rate as usize];
-let outputs = vec![F::zero(); test_size];
-let mut input_slice = HostOrDeviceSlice::on_host(inputs);
-let mut output_slice = HostOrDeviceSlice::on_host(outputs);
-
-poseidon.hash_many::<F>(
-    &mut input_slice,
-    &mut output_slice,
-    test_size as u32,
-    rate as u32,
-    8, // Output length
-    &config,
-)
-.unwrap();
-```
+* Mersenne $\alpha = 5$
+* Babybear $\alpha=7$
+* Bls12-377 $\alpha =11$
+* Bls12-381 $\alpha=5$
+* BN254 $\alpha = 5$
+* Grumpkin $\alpha = 5$
+* Stark252 $\alpha=3$
+
+### MDS matrix structure
+
+There are only two matrices: one for the full rounds and another for the partial rounds. Two cases are available: state size $t=4\cdot t'$ (where $t'$ is an integer) and $t=2,3$.
+
+#### $t=4\cdot t'$ where $t'$ is an integer
+
+**Full Matrix** $M_{full}$ (referred to in the paper as $M_{\mathcal{E}}$). It is hard coded (the same for all primes $p>2^{30}$) for any fixed state size $t=4\cdot t'$ where $t'$ is an integer, and is built from the following $4\times 4$ block:
+
+$$
+M_{4} = \begin{pmatrix}
+5 & 7 & 1 & 3 \\
+4& 6 & 1 & 1 \\
+1 & 3 & 5 & 7\\
+1 & 1 & 4 & 6\\
+\end{pmatrix}
+$$
+
+As per the [paper](https://eprint.iacr.org/2023/323.pdf) this structure is always maintained and is always MDS for any prime $p>2^{30}$.
+
+For example, for $t=8$ the matrix looks like:
+$$
+M_{full}^{8\times 8} = \begin{pmatrix}
+2\cdot M_4 & M_4 \\
+M_4 & 2\cdot M_4 \\
+\end{pmatrix}
+$$
+
+**Partial Matrix** $M_{partial}$ (referred to in the paper as $M_{\mathcal{I}}$). There is only ONE partial matrix for all the partial rounds; it has nonzero entries along the diagonal and $1$ everywhere else.
 
-In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
+$$
+M_{Partial}^{t\times t} = \begin{pmatrix}
+\mu_0 &1 & \ldots & 1 \\
+1 &\mu_1 & \ldots & 1 \\
+\vdots & \vdots & \ddots & \vdots \\
+ 1 & 1 &\ldots & \mu_{t-1}\\
+\end{pmatrix}
+$$
 
-## The Tree Builder
+where $\mu_i \in \mathbb{F}$. In general this matrix is different for each prime, since the $\mu_i$ values must satisfy some inequalities in the field. However, unlike Poseidon, there is only one $M_{partial}$ for all partial rounds.
 
-Similar to Poseidon1, you can use Poseidon2 in a tree builder.
+#### $t=2,3$
+
+These are special state sizes. In all ICICLE supported curves/fields the matrices for $t=3$ are
+
+$$
+M_{full} = \begin{pmatrix}
+2 & 1 &  1 \\
+1 & 2 & 1 \\
+1 & 1 & 2 \\
+\end{pmatrix} \ , \ M_{Partial} = \begin{pmatrix}
+2 & 1 &  1 \\
+1 & 2 & 1 \\
+1 & 1 & 3 \\
+\end{pmatrix}
+$$
+
+and the matrices for $t=2$ are
+
+$$
+M_{full} = \begin{pmatrix}
+2 & 1 \\
+1 & 2 \\
+\end{pmatrix} \ , \ M_{Partial} = \begin{pmatrix}
+2 & 1  \\
+1 & 3  \\
+\end{pmatrix}
+$$
+
+## Supported Bindings
+
+[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
+
+## Rust API
+
+This is the most basic way to use the Poseidon2 API. See the [examples/poseidon2](https://github.com/ingonyama-zk/icicle/tree/b12d83e6bcb8ee598409de78015bd118458a55d0/examples/rust/poseidon2) folder for the relevant code.
 
 ```rust
-use icicle_bn254::tree::Bn254TreeBuilder;
-use icicle_bn254::poseidon2::Poseidon2;
-
-let mut config = TreeBuilderConfig::default();
-let arity = 2;
-config.arity = arity as u32;
-let input_block_len = arity;
-let leaves = vec![F::one(); (1 << height) * arity];
-let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
-
-let leaves_slice = HostSlice::from_slice(&leaves);
-let digests_slice = HostSlice::from_mut_slice(&mut digests);
-
-let ctx = device_context::DeviceContext::default();
-let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
-
-let mut config = TreeBuilderConfig::default();
-config.keep_rows = 5;
-Bn254TreeBuilder::build_merkle_tree(
-    leaves_slice,
-    digests_slice,
-    height,
-    input_block_len,
-    &hash,
-    &hash,
-    &config,
-)
-.unwrap();
+let test_size = 4;
+let poseidon = Poseidon2::new::<F>(test_size,None).unwrap();
+let config = HashConfig::default();
+let inputs = vec![F::one(); test_size];
+let input_slice = HostSlice::from_slice(&inputs);
+//digest is a single element
+let out_init:F = F::zero();
+let mut binding = [out_init];
+let out_init_slice = HostSlice::from_mut_slice(&mut binding);
+
+poseidon.hash(input_slice, &config, out_init_slice).unwrap();
+println!("computed digest: {:?} ",out_init_slice.as_slice().to_vec()[0]);
 ```
 
+## Merkle Tree Builder
+
+You can use Poseidon2 in a Merkle tree builder. See the [examples/poseidon2](https://github.com/ingonyama-zk/icicle/tree/b12d83e6bcb8ee598409de78015bd118458a55d0/examples/rust/poseidon2) folder for the relevant code.
+
+```rust
+pub fn compute_binary_tree<F:FieldImpl>(
+    mut test_vec: Vec<F>,
+    leaf_size: u64,
+    hasher: Hasher,
+    compress: Hasher,
+    mut tree_config: MerkleTreeConfig,
+) -> MerkleTree
+{
+    let tree_height: usize = test_vec.len().ilog2() as usize;
+    //just to be safe
+    tree_config.padding_policy = PaddingPolicy::ZeroPadding;
+    let layer_hashes: Vec<&Hasher> = std::iter::once(&hasher)
+        .chain(std::iter::repeat(&compress).take(tree_height))
+        .collect();
+    let vec_slice: &mut HostSlice<F> = HostSlice::from_mut_slice(&mut test_vec[..]);
+    let merkle_tree: MerkleTree = MerkleTree::new(&layer_hashes, leaf_size, 0).unwrap();
+
+    let _ = merkle_tree
+        .build(vec_slice,&tree_config);
+    merkle_tree
+}
+
+//poseidon2 supports t=2,3,4,8,12,16,20,24. In this example we build a binary tree with Poseidon2 t=2.
+let poseidon_state_size = 2; 
+let leaf_size:u64 = 4;// each leaf is a 32 bit element 32/8 = 4 bytes
+
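+let mut test_vec: Vec<F> = (0..1024 * (poseidon_state_size as usize)).map(|_| F::from_u32(random::<u32>())).collect(); // draw a fresh random value per element; vec![x; n] would repeat a single value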
+println!("Generated random vector of size {:?}", 1024* (poseidon_state_size as usize));
+//to use later for merkle proof
+let mut binding = test_vec.clone();
+let test_vec_slice = HostSlice::from_mut_slice(&mut binding);
+//define hash and compression functions (You can use different hashes here)
+//note:"None" does not work with generics, use F= Fm31, Fbabybear etc
+let hasher :Hasher = Poseidon2::new::<F>(poseidon_state_size.try_into().unwrap(),None).unwrap();
+let compress: Hasher = Poseidon2::new::<F>((hasher.output_size()*2).try_into().unwrap(),None).unwrap();
+//tree config
+let tree_config = MerkleTreeConfig::default();
+let merk_tree = compute_binary_tree(test_vec.clone(), leaf_size, hasher, compress,tree_config.clone());
+println!("computed Merkle root {:?}", merk_tree.get_root::<F>().unwrap());
+
+let random_test_index = rand::thread_rng().gen_range(0..1024*(poseidon_state_size as usize));
+print!("Generating proof for element {:?} at random test index {:?} ",test_vec[random_test_index], random_test_index);
+let merkle_proof = merk_tree.get_proof::<F>(test_vec_slice, random_test_index.try_into().unwrap(), false, &tree_config).unwrap();
+
+//actually should construct verifier tree :) 
+assert!(merk_tree.verify(&merkle_proof).unwrap());
+println!("\n Merkle proof verified successfully!");
+```
diff --git a/docs/sidebars.ts b/docs/sidebars.ts
index 998d0bd80..32f264d70 100644
--- a/docs/sidebars.ts
+++ b/docs/sidebars.ts
@@ -76,6 +76,11 @@ const cppApi = [
     label: "Hash",
     id: "icicle/primitives/hash",
   },
+  {
+    type: "doc",
+    label: "Poseidon2",
+    id: "icicle/primitives/poseidon2",
+  },
   {
     type: "doc",
     label: "Merkle-Tree",
diff --git a/docs/static/img/Poseidon2.png b/docs/static/img/Poseidon2.png
new file mode 100644
index 000000000..1204c15e0
Binary files /dev/null and b/docs/static/img/Poseidon2.png differ
diff --git a/docs/versioned_docs/version-3.2.0/icicle/primitives/hash.md b/docs/versioned_docs/version-3.2.0/icicle/primitives/hash.md
index 5e7923bde..46ec8a1f2 100644
--- a/docs/versioned_docs/version-3.2.0/icicle/primitives/hash.md
+++ b/docs/versioned_docs/version-3.2.0/icicle/primitives/hash.md
@@ -18,13 +18,13 @@ For scenarios where large datasets need to be hashed efficiently, ICICLE support
 
 ICICLE supports the following hash functions:
 
-1.  **Keccak-256**
-2.	**Keccak-512**
-3.	**SHA3-256**
-4.	**SHA3-512**
-5.	**Blake2s**
-6.	**Poseidon**
-7.	**Poseidon2**
+1. **Keccak-256**
+2. **Keccak-512**
+3. **SHA3-256**
+4. **SHA3-512**
+5. **Blake2s**
+6. **Poseidon**
+7. **Poseidon2**
 
 :::info
 Additional hash functions might be added in the future. Stay tuned!
@@ -40,18 +40,16 @@ Keccak can take input messages of any length and produce a fixed-size hash. It u
 
 [Blake2s](https://www.rfc-editor.org/rfc/rfc7693.txt) is an optimized cryptographic hash function that provides high performance while ensuring strong security. Blake2s is ideal for hashing small data (such as field elements), especially when speed is crucial. It produces a 256-bit (32-byte) output and is often used in cryptographic protocols.
 
-
 ### Poseidon
 
 [Poseidon](https://eprint.iacr.org/2019/458) is a cryptographic hash function designed specifically for field elements. It is highly optimized for zero-knowledge proofs (ZKPs) and is commonly used in ZK-SNARK systems. Poseidon’s main strength lies in its arithmetization-friendly design, meaning it can be efficiently expressed as arithmetic constraints within a ZK-SNARK circuit.
 
 Traditional hash functions, such as SHA-2, are difficult to represent within ZK circuits because they involve complex bitwise operations that don’t translate efficiently into arithmetic operations. Poseidon, however, is specifically designed to minimize the number of constraints required in these circuits, making it significantly more efficient for use in ZK-SNARKs and other cryptographic protocols that require hashing over field elements.
 
-Currently the Poseidon implementation is the Optimized Poseidon (https://hackmd.io/@jake/poseidon-spec#Optimized-Poseidon). Optimized Poseidon significantly decreases the calculation time of the hash.
+Currently the Poseidon implementation is the [Optimized Poseidon](https://hackmd.io/@jake/poseidon-spec#Optimized-Poseidon). Optimized Poseidon significantly decreases the calculation time of the hash.
 
 The optional `domain_tag` pointer parameter enables domain separation, allowing isolation of hash outputs across different contexts or applications.
 
-
 ### Poseidon2
 
 [Poseidon2](https://eprint.iacr.org/2023/323.pdf) is a cryptographic hash function designed specifically for field elements.
@@ -59,6 +57,10 @@ It is an improved version of the original [Poseidon](https://eprint.iacr.org/201
 
 The optional `domain_tag` pointer parameter enables domain separation, allowing isolation of hash outputs across different contexts or applications.
 
+The supported values of the state size ***t*** (or ***T***) as defined in [eprint 2023/323](https://eprint.iacr.org/2023/323.pdf) are 2, 3, 4, 8, 12, 16, 20 and 24. Note that ***t*** values of 8, 12, 16, 20 and 24 are supported only for the small fields (babybear and m31).
+The alpha, the number of full and partial rounds, the round constants, the MDS matrix, and the partial matrix for each field and ***t*** can be found in the appropriate file in this [folder](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/icicle/hash/poseidon2_constants/constants).
+
+In the current version, padding is not supported and must be performed by the user.
 
 ## Using Hash API
 
@@ -84,21 +86,20 @@ auto poseidon = Poseidon::create<scalar_t>(t);
 // The domain tag acts as the first input to the hash function, with the remaining t-1 inputs following it.
 scalar_t domain_tag = scalar_t::zero(); // Example using zero; this can be set to any valid field element.
 auto poseidon_with_domain_tag = Poseidon::create<scalar_t>(t, &domain_tag);
-// This version of the hasher with a domain tag expects t-1 additional inputs for hashing.
 // Poseidon2 requires specifying the field-type and t parameter (supported 2, 3, 4, 8, 12, 16, 20, 24) as defined by
-// the Poseidon2 paper. For large fields (field width >= 254) the supported values of t are 2, 3, 4.
+// the Poseidon2 paper. For large fields (field width >= 252) the supported values of t are 2, 3, 4.
 auto poseidon2 = Poseidon2::create<scalar_t>(t); 
 // Optionally, Poseidon2 can accept a domain-tag, which is a field element used to separate applications or contexts.
 // The domain tag acts as the first input to the hash function, with the remaining t-1 inputs following it.
 scalar_t domain_tag = scalar_t::zero(); // Example using zero; this can be set to any valid field element.
 auto poseidon2_with_domain_tag = Poseidon2::create<scalar_t>(t, &domain_tag);
-// This version of the hasher with a domain tag expects t-1 additional inputs for hashing.
+// This version of the hasher with a domain tag expects t-1 inputs per hash.
 ```
 
 ### 2. Hashing Data
 
 Once you have a hasher object, you can hash any input data by passing the input, its size, a configuration, and an output buffer:
-   
+
 ```cpp
 /**
  * @brief Perform a hash operation.
@@ -160,11 +161,11 @@ eIcicleErr err = keccak256.hash(input.data(), input.size() / config.batch, confi
 
 ### 4. Poseidon sponge function
 
-Currently the poseidon sponge function (sponge function description could be found in Sec 2.1 of https://eprint.iacr.org/2019/458.pdf ) isn't implemented.
+Currently, the Poseidon sponge mode (described in Sec. 2.1 of [eprint 2019/458](https://eprint.iacr.org/2019/458.pdf)) isn't implemented.
 
 ### 5. Poseidon2 sponge function
 
-Currently the poseidon2 sponge function (sponge function description could be found in Sec 2.1 of https://eprint.iacr.org/2019/458.pdf ) isn't implemented.
+Currently, Poseidon2 is implemented in compression mode only; the sponge mode discussed in [eprint 2023/323](https://eprint.iacr.org/2023/323.pdf) is not implemented.
 
 ### Supported Bindings
 
diff --git a/docs/versioned_docs/version-3.2.0/icicle/primitives/poseidon2.md b/docs/versioned_docs/version-3.2.0/icicle/primitives/poseidon2.md
index aa4dfed95..fc66dfec2 100644
--- a/docs/versioned_docs/version-3.2.0/icicle/primitives/poseidon2.md
+++ b/docs/versioned_docs/version-3.2.0/icicle/primitives/poseidon2.md
@@ -1,22 +1,17 @@
 # Poseidon2
 
-TODO update for V3
+[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released, optimized version of Poseidon. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Second, Poseidon2 is instantiated with new linear layers that are more efficient than those of Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups. Because the compression mode is efficient, it is also well suited for use in Merkle trees.
 
-[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
+An overview of the Poseidon2 hash is provided in the diagram below:
 
+![Poseidon2 overview diagram](/img/Poseidon2.png)
 
-## Using Poseidon2
+## Description
 
-ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
-We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
+### Round constants
 
-For example, for Poseidon2 of width 16, input rate 8, output elements 8 and input of size 1024 * 8, we would expect 1024 * 8 elements of output. Which means each input block would be of size 8, resulting in 1024 Poseidon2 hashes being performed.
-
-### Supported Bindings
-
-[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
-
-### Constants
+* In the first and last full rounds, the round constants have the structure $[c_0,c_1,\ldots , c_{t-1}]$, where $c_i\in \mathbb{F}$
+* In the partial rounds, a round constant is added only to the first element: $[\tilde{c}_0,0,0,\ldots, 0_{t-1}]$, where $\tilde{c}_0\in \mathbb{F}$
 
 Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
 
@@ -24,68 +19,161 @@ We support pre-calculated constants for each of the [supported curves](../librar
 
 You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290)
 
-### Rust API
+### S box
+
+For a given prime $p$, the S-box exponent $\alpha$ is the smallest integer $\alpha \geq 3$ such that $\gcd(\alpha,p-1)=1$.
+
+For ICICLE-supported curves/fields:
+
+* Mersenne $\alpha = 5$
+* Babybear $\alpha=7$
+* BLS12-377 $\alpha =11$
+* BLS12-381 $\alpha=5$
+* BN254 $\alpha = 5$
+* Grumpkin $\alpha = 5$
+* Stark252 $\alpha=3$
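+
+As a worked check of this rule for the two small fields (a sketch using the factorizations of $p-1$ for Babybear, $p=15\cdot 2^{27}+1$, and Mersenne, $p=2^{31}-1$; even candidates are skipped because $p-1$ is even):
+
+$$
+\begin{aligned}
+\text{Babybear: } & p-1 = 2^{27}\cdot 3\cdot 5, && \gcd(3,p-1)=3,\ \gcd(5,p-1)=5,\ \gcd(7,p-1)=1 \ \Rightarrow\ \alpha=7 \\
+\text{Mersenne: } & p-1 = 2\cdot 3^{2}\cdot 7\cdot 11\cdot 31\cdot 151\cdot 331, && \gcd(3,p-1)=3,\ \gcd(5,p-1)=1 \ \Rightarrow\ \alpha=5
+\end{aligned}
+$$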
+
+### MDS matrix structure
+
+There are only two matrices: one for the full rounds and another for the partial rounds. There are two cases: one for state size $t=4\cdot t'$ and another for $t=2,3$.
+
+#### $t=4\cdot t'$ where $t'$ is an integer
+
+**Full Matrix** $M_{full}$ (referred to in the paper as $M_{\mathcal{E}}$). These are hard-coded (same for all primes $p>2^{30}$) for any fixed state size $t=4\cdot t'$ where $t'$ is an integer, and are built from the following $4\times 4$ block:
+$$M_{4} = \begin{pmatrix}
+5 & 7 & 1 & 3 \\
+4& 6 & 1 & 1 \\
+1 & 3 & 5 & 7\\
+1 & 1 & 4 & 6\\
+\end{pmatrix}
+$$
+
+As per the [paper](https://eprint.iacr.org/2023/323.pdf) this structure is always maintained and is always MDS for any prime $p>2^{30}$.
+
+For example, for $t=8$ the matrix is
+
+$$
+M_{full}^{8\times 8} = \begin{pmatrix}
+2\cdot M_4 & M_4 \\
+M_4 & 2\cdot M_4 \\
+\end{pmatrix}
+$$
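+
+The same block pattern extends to larger state sizes. As a sketch, assuming the construction used for $t=8$ above (blocks $2\cdot M_4$ on the diagonal and $M_4$ everywhere else), the matrix for $t=12$ would be
+
+$$
+M_{full}^{12\times 12} = \begin{pmatrix}
+2\cdot M_4 & M_4 & M_4 \\
+M_4 & 2\cdot M_4 & M_4 \\
+M_4 & M_4 & 2\cdot M_4 \\
+\end{pmatrix}
+$$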
+
+**Partial Matrix** $M_{partial}$ (referred to in the paper as $M_{\mathcal{I}}$). There is only one partial matrix for all the partial rounds; it has non-zero entries $\mu_i$ along the diagonal and $1$ everywhere else.
 
-This is the most basic way to use the Poseidon2 API.
+$$
+M_{Partial}^{t\times t} = \begin{pmatrix}
+\mu_0 &1 & \ldots & 1 \\
+1 &\mu_1 & \ldots & 1 \\
+\vdots & \vdots & \ddots & \vdots \\
+ 1 & 1 &\ldots & \mu_{t-1}\\
+\end{pmatrix}
+$$
+
+where $\mu_i \in \mathbb{F}$. In general, this matrix is different for each prime, since the $\mu_i$ must satisfy certain inequalities in the field. However, unlike Poseidon, there is only one $M_{partial}$ for all partial rounds.
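+
+One consequence of this structure, given here as a short derivation rather than a description of the ICICLE implementation: writing $s=\sum_{j=0}^{t-1}x_j$ for a state $x=(x_0,\ldots,x_{t-1})$, each output element needs only the shared sum and one multiplication.
+
+$$
+(M_{partial}\cdot x)_i = \mu_i x_i + \sum_{j\neq i} x_j = s + (\mu_i-1)\,x_i
+$$
+
+For instance, with $t=3$, illustrative values $(\mu_0,\mu_1,\mu_2)=(2,2,3)$ and $x=(1,2,3)$, we get $s=6$ and $M_{partial}\cdot x=(7,8,12)$.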
+
+#### $t=2,3$
+
+These are special state sizes. For all ICICLE-supported curves/fields, the matrices for $t=3$ are
+
+$$
+M_{full} = \begin{pmatrix}
+2 & 1 &  1 \\
+1 & 2 & 1 \\
+1 & 1 & 2 \\
+\end{pmatrix} \ , \ M_{Partial} = \begin{pmatrix}
+2 & 1 &  1 \\
+1 & 2 & 1 \\
+1 & 1 & 3 \\
+\end{pmatrix}
+$$
+
+and the matrices for $t=2$ are
+
+$$
+M_{full} = \begin{pmatrix}
+2 & 1 \\
+1 & 2 \\
+\end{pmatrix} \ , \ M_{Partial} = \begin{pmatrix}
+2 & 1  \\
+1 & 3  \\
+\end{pmatrix}
+$$
+
+## Supported Bindings
+
+[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
+
+## Rust API
+
+This is the most basic way to use the Poseidon2 API. See the [examples/poseidon2](https://github.com/ingonyama-zk/icicle/tree/b12d83e6bcb8ee598409de78015bd118458a55d0/examples/rust/poseidon2) folder for the relevant code.
 
 ```rust
-let test_size = 1 << 10;
-let width = 16;
-let rate = 8;
-let ctx = get_default_device_context();
-let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
+let test_size = 4; // state size t, also used as the number of input elements
+let poseidon = Poseidon2::new::<F>(test_size as u32, None).unwrap();
 let config = HashConfig::default();
+let inputs = vec![F::one(); test_size];
+let input_slice = HostSlice::from_slice(&inputs);
+//digest is a single element
+let out_init:F = F::zero();
+let mut binding = [out_init];
+let out_init_slice = HostSlice::from_mut_slice(&mut binding);
 
-let inputs = vec![F::one(); test_size * rate as usize];
-let outputs = vec![F::zero(); test_size];
-let mut input_slice = HostOrDeviceSlice::on_host(inputs);
-let mut output_slice = HostOrDeviceSlice::on_host(outputs);
-
-poseidon.hash_many::<F>(
-    &mut input_slice,
-    &mut output_slice,
-    test_size as u32,
-    rate as u32,
-    8, // Output length
-    &config,
-)
-.unwrap();
-```
+poseidon.hash(input_slice, &config, out_init_slice).unwrap();
+println!("computed digest: {:?} ",out_init_slice.as_slice().to_vec()[0]);
 
-In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
+```
 
-## The Tree Builder
+## Merkle Tree Builder
 
-Similar to Poseidon1, you can use Poseidon2 in a tree builder.
+You can use Poseidon2 in a Merkle tree builder. See the [examples/poseidon2](https://github.com/ingonyama-zk/icicle/tree/b12d83e6bcb8ee598409de78015bd118458a55d0/examples/rust/poseidon2) folder for the relevant code.
 
 ```rust
-use icicle_bn254::tree::Bn254TreeBuilder;
-use icicle_bn254::poseidon2::Poseidon2;
-
-let mut config = TreeBuilderConfig::default();
-let arity = 2;
-config.arity = arity as u32;
-let input_block_len = arity;
-let leaves = vec![F::one(); (1 << height) * arity];
-let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
-
-let leaves_slice = HostSlice::from_slice(&leaves);
-let digests_slice = HostSlice::from_mut_slice(&mut digests);
-
-let ctx = device_context::DeviceContext::default();
-let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
-
-let mut config = TreeBuilderConfig::default();
-config.keep_rows = 5;
-Bn254TreeBuilder::build_merkle_tree(
-    leaves_slice,
-    digests_slice,
-    height,
-    input_block_len,
-    &hash,
-    &hash,
-    &config,
-)
-.unwrap();
+pub fn compute_binary_tree<F:FieldImpl>(
+    mut test_vec: Vec<F>,
+    leaf_size: u64,
+    hasher: Hasher,
+    compress: Hasher,
+    mut tree_config: MerkleTreeConfig,
+) -> MerkleTree
+{
+    let tree_height: usize = test_vec.len().ilog2() as usize;
+    //just to be safe
+    tree_config.padding_policy = PaddingPolicy::ZeroPadding;
+    let layer_hashes: Vec<&Hasher> = std::iter::once(&hasher)
+        .chain(std::iter::repeat(&compress).take(tree_height))
+        .collect();
+    let vec_slice: &mut HostSlice<F> = HostSlice::from_mut_slice(&mut test_vec[..]);
+    let merkle_tree: MerkleTree = MerkleTree::new(&layer_hashes, leaf_size, 0).unwrap();
+
+    let _ = merkle_tree
+        .build(vec_slice,&tree_config);
+    merkle_tree
+}
+
+//poseidon2 supports t=2,3,4,8,12,16,20,24. In this example we build a binary tree with Poseidon2 t=2.
+let poseidon_state_size = 2;
+let leaf_size:u64 = 4;// each leaf is a 32 bit element 32/8 = 4 bytes
+
+let mut test_vec: Vec<F> = (0..1024 * (poseidon_state_size as usize)).map(|_| F::from_u32(random::<u32>())).collect(); // vec![x; n] would repeat a single random value
+println!("Generated random vector of size {:?}", 1024* (poseidon_state_size as usize));
+//to use later for merkle proof
+let mut binding = test_vec.clone();
+let test_vec_slice = HostSlice::from_mut_slice(&mut binding);
+//define hash and compression functions (You can use different hashes here)
+//note:"None" does not work with generics, use F= Fm31, Fbabybear etc
+let hasher :Hasher = Poseidon2::new::<F>(poseidon_state_size.try_into().unwrap(),None).unwrap();
+let compress: Hasher = Poseidon2::new::<F>((hasher.output_size()*2).try_into().unwrap(),None).unwrap();
+//tree config
+let tree_config = MerkleTreeConfig::default();
+let merk_tree = compute_binary_tree(test_vec.clone(), leaf_size, hasher, compress,tree_config.clone());
+println!("computed Merkle root {:?}", merk_tree.get_root::<F>().unwrap());
+
+let random_test_index = rand::thread_rng().gen_range(0..1024*(poseidon_state_size as usize));
+print!("Generating proof for element {:?} at random test index {:?} ",test_vec[random_test_index], random_test_index);
+let merkle_proof = merk_tree.get_proof::<F>(test_vec_slice, random_test_index.try_into().unwrap(), false, &tree_config).unwrap();
+
+//actually should construct verifier tree :)
+assert!(merk_tree.verify(&merkle_proof).unwrap());
+println!("\n Merkle proof verified successfully!");
 ```
-
diff --git a/docs/versioned_sidebars/version-3.2.0-sidebars.json b/docs/versioned_sidebars/version-3.2.0-sidebars.json
index e948901ba..07c05ac03 100644
--- a/docs/versioned_sidebars/version-3.2.0-sidebars.json
+++ b/docs/versioned_sidebars/version-3.2.0-sidebars.json
@@ -116,6 +116,11 @@
               "label": "Hash",
               "id": "icicle/primitives/hash"
             },
+            {
+              "type": "doc",
+              "label": "Poseidon2",
+              "id": "icicle/primitives/poseidon2"
+            },
             {
               "type": "doc",
               "label": "Merkle-Tree",
diff --git a/examples/rust/Cargo.toml b/examples/rust/Cargo.toml
index 5cf43c028..dafa73bdd 100644
--- a/examples/rust/Cargo.toml
+++ b/examples/rust/Cargo.toml
@@ -5,7 +5,8 @@ members = [
     "ntt",
     "polynomials",
     "arkworks-icicle-conversions",
-    "hash-and-merkle",
+    "hash-and-merkle", 
+    "poseidon2",
 ]
 
 [workspace.dependencies]
@@ -15,6 +16,7 @@ icicle-core = { path = "../../wrappers/rust/icicle-core" }
 icicle-bn254 = { path = "../../wrappers/rust/icicle-curves/icicle-bn254" }
 icicle-bls12-377 = { path = "../../wrappers/rust/icicle-curves/icicle-bls12-377" }
 icicle-babybear = { path = "../../wrappers/rust/icicle-fields/icicle-babybear" }
+icicle-m31 = {path = "../../wrappers/rust/icicle-fields/icicle-m31" }
 rand = "0.8"
 clap = { version = "<=4.4.12", features = ["derive"] }
 
diff --git a/examples/rust/poseidon2/Cargo.toml b/examples/rust/poseidon2/Cargo.toml
new file mode 100644
index 000000000..9416d34f8
--- /dev/null
+++ b/examples/rust/poseidon2/Cargo.toml
@@ -0,0 +1,22 @@
+[package]
+name = "poseidon2"
+version = "0.1.0"
+edition = "2018"
+
+[dependencies]
+icicle-core = {path = "../../../wrappers/rust/icicle-core" }
+icicle-runtime = {path = "../../../wrappers/rust/icicle-runtime" }
+icicle-hash = {path = "../../../wrappers/rust/icicle-hash" }
+icicle-babybear = {path = "../../../wrappers/rust/icicle-fields/icicle-babybear" }
+icicle-m31 = {path = "../../../wrappers/rust/icicle-fields/icicle-m31" }
+hex = "0.4" 
+rand = "0.8"
+clap = { version = "<=4.4.12", features = ["derive"] }
+
+[features]
+cuda = [
+  "icicle-runtime/cuda_backend",
+  "icicle-hash/cuda_backend",
+  "icicle-babybear/cuda_backend",
+  "icicle-m31/cuda_backend",
+]
diff --git a/examples/rust/poseidon2/Readme.md b/examples/rust/poseidon2/Readme.md
new file mode 100644
index 000000000..aea79188a
--- /dev/null
+++ b/examples/rust/poseidon2/Readme.md
@@ -0,0 +1,17 @@
+# Using Poseidon2 API
+
+This example sanity-checks the Poseidon2 API against the output of the attached Sage script. The script, originally from [Horizen Labs code](https://github.com/HorizenLabs/poseidon2/blob/055bde3f4782731ba5f5ce5888a440a94327eaf3/poseidon2_rust_params.sage#L1), has been modified to print in a standard format.
+
+::: info
+
+Note that the digest of the Poseidon2 hash API is the element at index 1 of the Poseidon2 output state.
+
+:::
+
+Run the example with
+```sh
+cargo run --release
+```
+
+* The first example runs the Poseidon2 API for the babybear and m31 fields for state sizes $t=2,3,4,8,12,16,20,24$ and prints the results. You can compare them with the results from the attached Sage script.
+* The second example builds a binary Merkle tree with Poseidon2 $t=2$ using the Merkle tree builder API and verifies the path for an arbitrarily chosen leaf.
\ No newline at end of file
diff --git a/examples/rust/poseidon2/poseidon2_rust_params.sage b/examples/rust/poseidon2/poseidon2_rust_params.sage
new file mode 100644
index 000000000..c10459b75
--- /dev/null
+++ b/examples/rust/poseidon2/poseidon2_rust_params.sage
@@ -0,0 +1,757 @@
+# Remark: This script contains functionality for GF(2^n), but currently works only over GF(p)! A few small adaptations are needed for GF(2^n).
+from sage.rings.polynomial.polynomial_gf2x import GF2X_BuildIrred_list
+from math import *
+import itertools
+
+###########################################################################
+# ICICLE: change curve here: ICICLE supported curves/fields
+# p = 2147483647 # Mersenne
+# p = 18446744069414584321 # GoldiLocks
+# p = 2013265921 # BabyBear
+# p = 52435875175126190479447740508185965837690552500527637822603658699938581184513 # BLS12-381
+# p = 8444461749428370424248824938781546531375899335154063827935233455917409239041 #BLS12-377
+p = 21888242871839275222246405745257275088548364400416034343698204186575808495617 # BN254/BN256
+# p = 21888242871839275222246405745257275088696311157297823662689037894645226208583 #Grumpkin
+# p = 3618502788666131213697322783095070105623107215331596699973092056135872020481 #stark252
+
+## ICICLE unsupported
+# p = 28948022309329048855892746252171976963363056481941560715954676764349967630337 # Pasta (Pallas)
+# p = 28948022309329048855892746252171976963363056481941647379679742748393362948097 # Pasta (Vesta)
+
+n = len(p.bits()) # bit
+# paper stuff, ignore
+# t = 12 # GoldiLocks (t = 12 for sponge, t = 8 for compression)
+# t = 16 # BabyBear (t = 24 for sponge, t = 16 for compression)
+# t = 3 # BN254/BN256, BLS12-381, Pallas, Vesta (t = 3 for sponge, t = 2 for compression)
+
+# change t here: For ICICLE: t=3,4 for large fields and small fields for now.
+
+# ICICLE - CHANGE state size HERE: t=2,3,4,8,12,16,20,24 are covered in paper. (Mersenne, Goldilocks, BabyBear: all sizes)
+# large fields 255 bits recommended t =2,3,4,8
+t = 3
+
+FIELD = 1
+SBOX = 0
+FIELD_SIZE = n
+NUM_CELLS = t
+
+def get_alpha(p):
+    for alpha in range(3, p):
+        if gcd(alpha, p-1) == 1:
+            break
+    return alpha
+
+alpha = get_alpha(p)
+
+def get_sbox_cost(R_F, R_P, N, t):
+    return int(t * R_F + R_P)
+
+def get_size_cost(R_F, R_P, N, t):
+    n = ceil(float(N) / t)
+    return int((N * R_F) + (n * R_P))
+
+def poseidon_calc_final_numbers_fixed(p, t, alpha, M, security_margin):
+    # [Min. S-boxes] Find best possible for t and N
+    n = ceil(log(p, 2))
+    N = int(n * t)
+    cost_function = get_sbox_cost
+    ret_list = []
+    (R_F, R_P) = find_FD_round_numbers(p, t, alpha, M, cost_function, security_margin)
+    min_sbox_cost = cost_function(R_F, R_P, N, t)
+    ret_list.append(R_F)
+    ret_list.append(R_P)
+    ret_list.append(min_sbox_cost)
+
+    # [Min. Size] Find best possible for t and N
+    # Minimum number of S-boxes for fixed n results in minimum size also (round numbers are the same)!
+    min_size_cost = get_size_cost(R_F, R_P, N, t)
+    ret_list.append(min_size_cost)
+
+    return ret_list # [R_F, R_P, min_sbox_cost, min_size_cost]
+
+def find_FD_round_numbers(p, t, alpha, M, cost_function, security_margin):
+    n = ceil(log(p, 2))
+    N = int(n * t)
+
+    sat_inequiv = sat_inequiv_alpha
+
+    R_P = 0
+    R_F = 0
+    min_cost = float("inf")
+    max_cost_rf = 0
+    # Brute-force approach
+    for R_P_t in range(1, 500):
+        for R_F_t in range(4, 100):
+            if R_F_t % 2 == 0:
+                if (sat_inequiv(p, t, R_F_t, R_P_t, alpha, M) == True):
+                    if security_margin == True:
+                        R_F_t += 2
+                        R_P_t = int(ceil(float(R_P_t) * 1.075))
+                    cost = cost_function(R_F_t, R_P_t, N, t)
+                    if (cost < min_cost) or ((cost == min_cost) and (R_F_t < max_cost_rf)):
+                        R_P = ceil(R_P_t)
+                        R_F = ceil(R_F_t)
+                        min_cost = cost
+                        max_cost_rf = R_F
+    return (int(R_F), int(R_P))
+
+def sat_inequiv_alpha(p, t, R_F, R_P, alpha, M):
+    N = int(FIELD_SIZE * NUM_CELLS)
+    
+    if alpha > 0:
+        R_F_1 = 6 if M <= ((floor(log(p, 2) - ((alpha-1)/2.0))) * (t + 1)) else 10 # Statistical
+        R_F_2 = 1 + ceil(log(2, alpha) * min(M, FIELD_SIZE)) + ceil(log(t, alpha)) - R_P # Interpolation
+        R_F_3 = (log(2, alpha) * min(M, log(p, 2))) - R_P # Groebner 1
+        R_F_4 = t - 1 + log(2, alpha) * min(M / float(t + 1), log(p, 2) / float(2)) - R_P # Groebner 2
+        R_F_5 = (t - 2 + (M / float(2 * log(alpha, 2))) - R_P) / float(t - 1) # Groebner 3
+        R_F_max = max(ceil(R_F_1), ceil(R_F_2), ceil(R_F_3), ceil(R_F_4), ceil(R_F_5))
+        
+        # Addition due to https://eprint.iacr.org/2023/537.pdf
+        r_temp = floor(t / 3.0)
+        over = (R_F - 1) * t + R_P + r_temp + r_temp * (R_F / 2.0) + R_P + alpha
+        under = r_temp * (R_F / 2.0) + R_P + alpha
+        binom_log = log(binomial(over, under), 2)
+        if binom_log == inf:
+            binom_log = M + 1
+        cost_gb4 = ceil(2 * binom_log) # Paper uses 2.3727, we are more conservative here
+
+        return ((R_F >= R_F_max) and (cost_gb4 >= M))
+    else:
+        print("Invalid value for alpha!")
+        exit(1)
+
+R_F_FIXED, R_P_FIXED, _, _ = poseidon_calc_final_numbers_fixed(p, t, alpha, 128, True)
+print("+++ R_F = {0}, R_P = {1} +++".format(R_F_FIXED, R_P_FIXED))
+
+# For STARK TODO
+# r_p_mod = R_P_FIXED % NUM_CELLS
+# if r_p_mod != 0:
+#     R_P_FIXED = R_P_FIXED + NUM_CELLS - r_p_mod
+
+###########################################################################
+
+INIT_SEQUENCE = []
+
+PRIME_NUMBER = p
+# if FIELD == 1 and len(sys.argv) != 8:
+#     print("Please specify a prime number (in hex format)!")
+#     exit()
+# elif FIELD == 1 and len(sys.argv) == 8:
+#     PRIME_NUMBER = int(sys.argv[7], 16) # e.g. 0xa7, 0xFFFFFFFFFFFFFEFF, 0xa1a42c3efd6dbfe08daa6041b36322ef
+
+F = GF(PRIME_NUMBER)
+
+def grain_sr_generator():
+    bit_sequence = INIT_SEQUENCE
+    for _ in range(0, 160):
+        new_bit = bit_sequence[62] ^^ bit_sequence[51] ^^ bit_sequence[38] ^^ bit_sequence[23] ^^ bit_sequence[13] ^^ bit_sequence[0]
+        bit_sequence.pop(0)
+        bit_sequence.append(new_bit)
+
+    while True:
+        new_bit = bit_sequence[62] ^^ bit_sequence[51] ^^ bit_sequence[38] ^^ bit_sequence[23] ^^ bit_sequence[13] ^^ bit_sequence[0]
+        bit_sequence.pop(0)
+        bit_sequence.append(new_bit)
+        while new_bit == 0:
+            new_bit = bit_sequence[62] ^^ bit_sequence[51] ^^ bit_sequence[38] ^^ bit_sequence[23] ^^ bit_sequence[13] ^^ bit_sequence[0]
+            bit_sequence.pop(0)
+            bit_sequence.append(new_bit)
+            new_bit = bit_sequence[62] ^^ bit_sequence[51] ^^ bit_sequence[38] ^^ bit_sequence[23] ^^ bit_sequence[13] ^^ bit_sequence[0]
+            bit_sequence.pop(0)
+            bit_sequence.append(new_bit)
+        new_bit = bit_sequence[62] ^^ bit_sequence[51] ^^ bit_sequence[38] ^^ bit_sequence[23] ^^ bit_sequence[13] ^^ bit_sequence[0]
+        bit_sequence.pop(0)
+        bit_sequence.append(new_bit)
+        yield new_bit
+grain_gen = grain_sr_generator()
+
+def grain_random_bits(num_bits):
+    random_bits = [next(grain_gen) for i in range(0, num_bits)]
+    # random_bits.reverse() ## Remove comment to start from least significant bit
+    random_int = int("".join(str(i) for i in random_bits), 2)
+    return random_int
+
+def init_generator(field, sbox, n, t, R_F, R_P):
+    # Generate initial sequence based on parameters
+    bit_list_field = [_ for _ in (bin(FIELD)[2:].zfill(2))]
+    bit_list_sbox = [_ for _ in (bin(SBOX)[2:].zfill(4))]
+    bit_list_n = [_ for _ in (bin(FIELD_SIZE)[2:].zfill(12))]
+    bit_list_t = [_ for _ in (bin(NUM_CELLS)[2:].zfill(12))]
+    bit_list_R_F = [_ for _ in (bin(R_F)[2:].zfill(10))]
+    bit_list_R_P = [_ for _ in (bin(R_P)[2:].zfill(10))]
+    bit_list_1 = [1] * 30
+    global INIT_SEQUENCE
+    INIT_SEQUENCE = bit_list_field + bit_list_sbox + bit_list_n + bit_list_t + bit_list_R_F + bit_list_R_P + bit_list_1
+    INIT_SEQUENCE = [int(_) for _ in INIT_SEQUENCE]
+
+def generate_constants(field, n, t, R_F, R_P, prime_number):
+    round_constants = []
+    # num_constants = (R_F + R_P) * t # Poseidon
+    num_constants = (R_F * t) + R_P # Poseidon2
+
+    if field == 0:
+        for i in range(0, num_constants):
+            random_int = grain_random_bits(n)
+            round_constants.append(random_int)
+    elif field == 1:
+        for i in range(0, num_constants):
+            random_int = grain_random_bits(n)
+            while random_int >= prime_number:
+                # print("[Info] Round constant is not in prime field! Taking next one.")
+                random_int = grain_random_bits(n)
+            round_constants.append(random_int)
+            # Add (t-1) zeroes for Poseidon2 if partial round
+            if i >= ((R_F/2) * t) and i < (((R_F/2) * t) + R_P):
+                round_constants.extend([0] * (t-1))
+    return round_constants
+
+def print_round_constants(round_constants, n, field):
+    print("Number of round constants:", len(round_constants))
+
+    if field == 0:
+        print("Round constants for GF(2^n):")
+    elif field == 1:
+        print("Round constants for GF(p):")
+    hex_length = int(ceil(float(n) / 4)) + 2 # +2 for "0x"
+    print(["{0:#0{1}x}".format(entry, hex_length) for entry in round_constants])
+
+def create_mds_p(n, t):
+    M = matrix(F, t, t)
+
+    # Sample random distinct indices and assign to xs and ys
+    while True:
+        flag = True
+        rand_list = [F(grain_random_bits(n)) for _ in range(0, 2*t)]
+        while len(rand_list) != len(set(rand_list)): # Check for duplicates
+            rand_list = [F(grain_random_bits(n)) for _ in range(0, 2*t)]
+        xs = rand_list[:t]
+        ys = rand_list[t:]
+        # xs = [F(ele) for ele in range(0, t)]
+        # ys = [F(ele) for ele in range(t, 2*t)]
+        for i in range(0, t):
+            for j in range(0, t):
+                if (flag == False) or ((xs[i] + ys[j]) == 0):
+                    flag = False
+                else:
+                    entry = (xs[i] + ys[j])^(-1)
+                    M[i, j] = entry
+        if flag == False:
+            continue
+        return M
+
+def generate_vectorspace(round_num, M, M_round, NUM_CELLS):
+    t = NUM_CELLS
+    s = 1
+    V = VectorSpace(F, t)
+    if round_num == 0:
+        return V
+    elif round_num == 1:
+        return V.subspace(V.basis()[s:])
+    else:
+        mat_temp = matrix(F)
+        for i in range(0, round_num-1):
+            add_rows = []
+            for j in range(0, s):
+                add_rows.append(M_round[i].rows()[j][s:])
+            mat_temp = matrix(mat_temp.rows() + add_rows)
+        r_k = mat_temp.right_kernel()
+        extended_basis_vectors = []
+        for vec in r_k.basis():
+            extended_basis_vectors.append(vector([0]*s + list(vec)))
+        S = V.subspace(extended_basis_vectors)
+
+        return S
+
+def subspace_times_matrix(subspace, M, NUM_CELLS):
+    t = NUM_CELLS
+    V = VectorSpace(F, t)
+    subspace_basis = subspace.basis()
+    new_basis = []
+    for vec in subspace_basis:
+        new_basis.append(M * vec)
+    new_subspace = V.subspace(new_basis)
+    return new_subspace
+
+# Returns True if the matrix is considered secure, False otherwise
+def algorithm_1(M, NUM_CELLS):
+    t = NUM_CELLS
+    s = 1
+    r = floor((t - s) / float(s))
+
+    # Generate round matrices
+    M_round = []
+    for j in range(0, t+1):
+        M_round.append(M^(j+1))
+
+    for i in range(1, r+1):
+        mat_test = M^i
+        entry = mat_test[0, 0]
+        mat_target = matrix.circulant(vector([entry] + ([F(0)] * (t-1))))
+
+        if (mat_test - mat_target) == matrix.circulant(vector([F(0)] * (t))):
+            return [False, 1]
+
+        S = generate_vectorspace(i, M, M_round, t)
+        V = VectorSpace(F, t)
+
+        basis_vectors= []
+        for eigenspace in mat_test.eigenspaces_right(format='galois'):
+            if (eigenspace[0] not in F):
+                continue
+            vector_subspace = eigenspace[1]
+            intersection = S.intersection(vector_subspace)
+            basis_vectors += intersection.basis()
+        IS = V.subspace(basis_vectors)
+
+        if IS.dimension() >= 1 and IS != V:
+            return [False, 2]
+        for j in range(1, i+1):
+            S_mat_mul = subspace_times_matrix(S, M^j, t)
+            if S == S_mat_mul:
+                print("S.basis():\n", S.basis())
+                return [False, 3]
+
+    return [True, 0]
+
+# Returns True if the matrix is considered secure, False otherwise
+def algorithm_2(M, NUM_CELLS):
+    t = NUM_CELLS
+    s = 1
+
+    V = VectorSpace(F, t)
+    trail = [None, None]
+    test_next = False
+    I = range(0, s)
+    I_powerset = list(sage.misc.misc.powerset(I))[1:]
+    for I_s in I_powerset:
+        test_next = False
+        new_basis = []
+        for l in I_s:
+            new_basis.append(V.basis()[l])
+        IS = V.subspace(new_basis)
+        for i in range(s, t):
+            new_basis.append(V.basis()[i])
+        full_iota_space = V.subspace(new_basis)
+        for l in I_s:
+            v = V.basis()[l]
+            while True:
+                delta = IS.dimension()
+                v = M * v
+                IS = V.subspace(IS.basis() + [v])
+                if IS.dimension() == t or IS.intersection(full_iota_space) != IS:
+                    test_next = True
+                    break
+                if IS.dimension() <= delta:
+                    break
+            if test_next == True:
+                break
+        if test_next == True:
+            continue
+        return [False, [IS, I_s]]
+
+    return [True, None]
+
+# Returns True if the matrix is considered secure, False otherwise
+def algorithm_3(M, NUM_CELLS):
+    t = NUM_CELLS
+    s = 1
+
+    V = VectorSpace(F, t)
+
+    l = 4*t
+    for r in range(2, l+1):
+        next_r = False
+        res_alg_2 = algorithm_2(M^r, t)
+        if res_alg_2[0] == False:
+            return [False, None]
+
+        # if res_alg_2[1] == None:
+        #     continue
+        # IS = res_alg_2[1][0]
+        # I_s = res_alg_2[1][1]
+        # for j in range(1, r):
+        #     IS = subspace_times_matrix(IS, M, t)
+        #     I_j = []
+        #     for i in range(0, s):
+        #         new_basis = []
+        #         for k in range(0, t):
+        #             if k != i:
+        #                 new_basis.append(V.basis()[k])
+        #         iota_space = V.subspace(new_basis)
+        #         if IS.intersection(iota_space) != iota_space:
+        #             single_iota_space = V.subspace([V.basis()[i]])
+        #             if IS.intersection(single_iota_space) == single_iota_space:
+        #                 I_j.append(i)
+        #             else:
+        #                 next_r = True
+        #                 break
+        #     if next_r == True:
+        #         break
+        # if next_r == True:
+        #     continue
+        # return [False, [IS, I_j, r]]
+
+    return [True, None]
+
+def check_minpoly_condition(M, NUM_CELLS):
+    max_period = 2*NUM_CELLS
+    all_fulfilled = True
+    M_temp = M
+    for i in range(1, max_period + 1):
+        if not ((M_temp.minimal_polynomial().degree() == NUM_CELLS) and (M_temp.minimal_polynomial().is_irreducible() == True)):
+            all_fulfilled = False
+            break
+        M_temp = M * M_temp
+    return all_fulfilled
+
+def generate_matrix(FIELD, FIELD_SIZE, NUM_CELLS):
+    if FIELD == 0:
+        print("Matrix generation not implemented for GF(2^n).")
+        exit(1)
+    elif FIELD == 1:
+        mds_matrix = create_mds_p(FIELD_SIZE, NUM_CELLS)
+        result_1 = algorithm_1(mds_matrix, NUM_CELLS)
+        result_2 = algorithm_2(mds_matrix, NUM_CELLS)
+        result_3 = algorithm_3(mds_matrix, NUM_CELLS)
+        while result_1[0] == False or result_2[0] == False or result_3[0] == False:
+            mds_matrix = create_mds_p(FIELD_SIZE, NUM_CELLS)
+            result_1 = algorithm_1(mds_matrix, NUM_CELLS)
+            result_2 = algorithm_2(mds_matrix, NUM_CELLS)
+            result_3 = algorithm_3(mds_matrix, NUM_CELLS)
+        return mds_matrix
+
+def generate_matrix_full(NUM_CELLS):
+    M = None
+    if t == 2:
+        M = matrix.circulant(vector([F(2), F(1)]))
+    elif t == 3:
+        M = matrix.circulant(vector([F(2), F(1), F(1)]))
+    elif t == 4:
+        M = matrix(F, [[F(5), F(7), F(1), F(3)], [F(4), F(6), F(1), F(1)], [F(1), F(3), F(5), F(7)], [F(1), F(1), F(4), F(6)]])
+    elif (t % 4) == 0:
+        M = matrix(F, t, t)
+        # M_small = matrix.circulant(vector([F(3), F(2), F(1), F(1)]))
+        M_small = matrix(F, [[F(5), F(7), F(1), F(3)], [F(4), F(6), F(1), F(1)], [F(1), F(3), F(5), F(7)], [F(1), F(1), F(4), F(6)]])
+        small_num = t // 4
+        for i in range(0, small_num):
+            for j in range(0, small_num):
+                if i == j:
+                    M[i*4:(i+1)*4,j*4:(j+1)*4] = 2* M_small
+                else:
+                    M[i*4:(i+1)*4,j*4:(j+1)*4] = M_small
+    else:
+        print("Error: No matrix for these parameters.")
+        exit(1)
+    return M
+
+def generate_matrix_partial(FIELD, FIELD_SIZE, NUM_CELLS): ## TODO: Prioritize small entries
+    entry_max_bit_size = FIELD_SIZE
+    if FIELD == 0:
+        print("Matrix generation not implemented for GF(2^n).")
+        exit(1)
+    elif FIELD == 1:
+        M = None
+        if t == 2:
+            M = matrix(F, [[F(2), F(1)], [F(1), F(3)]])
+        elif t == 3:
+            M = matrix(F, [[F(2), F(1), F(1)], [F(1), F(2), F(1)], [F(1), F(1), F(3)]])
+        else:
+            M_circulant = matrix.circulant(vector([F(0)] + [F(1) for _ in range(0, NUM_CELLS - 1)]))
+            M_diagonal = matrix.diagonal([F(grain_random_bits(entry_max_bit_size)) for _ in range(0, NUM_CELLS)])
+            M = M_circulant + M_diagonal
+            # while algorithm_1(M, NUM_CELLS)[0] == False or algorithm_2(M, NUM_CELLS)[0] == False or algorithm_3(M, NUM_CELLS)[0] == False:
+            while check_minpoly_condition(M, NUM_CELLS) == False:
+                M_diagonal = matrix.diagonal([F(grain_random_bits(entry_max_bit_size)) for _ in range(0, NUM_CELLS)])
+                M = M_circulant + M_diagonal
+
+        if(algorithm_1(M, NUM_CELLS)[0] == False or algorithm_2(M, NUM_CELLS)[0] == False or algorithm_3(M, NUM_CELLS)[0] == False):
+            print("Error: Generated partial matrix is not secure w.r.t. subspace trails.")
+            exit(1)
+        return M
+
+def generate_matrix_partial_small_entries(FIELD, FIELD_SIZE, NUM_CELLS):
+    if FIELD == 0:
+        print("Matrix generation not implemented for GF(2^n).")
+        exit(1)
+    elif FIELD == 1:
+        M_circulant = matrix.circulant(vector([F(0)] + [F(1) for _ in range(0, NUM_CELLS - 1)]))
+        combinations = list(itertools.product(range(2, 6), repeat=NUM_CELLS))
+        for entry in combinations:
+            M = M_circulant + matrix.diagonal(vector(F, list(entry)))
+            print(M)
+            # if M.is_invertible() == False or algorithm_1(M, NUM_CELLS)[0] == False or algorithm_2(M, NUM_CELLS)[0] == False or algorithm_3(M, NUM_CELLS)[0] == False:
+            if M.is_invertible() == False or check_minpoly_condition(M, NUM_CELLS) == False:
+                continue
+            return M
+
+def matrix_partial_m_1(matrix_partial, NUM_CELLS):
+    M_circulant = matrix.identity(F, NUM_CELLS)
+    return matrix_partial - M_circulant
+
+def print_linear_layer(M, n, t):
+    print("n:", n)
+    print("t:", t)
+    print("N:", (n * t))
+    print("Result Algorithm 1:\n", algorithm_1(M, NUM_CELLS))
+    print("Result Algorithm 2:\n", algorithm_2(M, NUM_CELLS))
+    print("Result Algorithm 3:\n", algorithm_3(M, NUM_CELLS))
+    hex_length = int(ceil(float(n) / 4)) + 2 # +2 for "0x"
+    print("Prime number:", hex(PRIME_NUMBER))
+    matrix_string = "["
+    for i in range(0, t):
+        matrix_string += str(["{0:#0{1}x}".format(int(entry), hex_length) for entry in M[i]])
+        if i < (t-1):
+            matrix_string += ","
+    matrix_string += "]"
+    print("MDS matrix:\n", matrix_string)
+
+def calc_equivalent_matrices(MDS_matrix_field):
+    # Following idea: Split M into M' * M'', where M'' is "cheap" and M' can move before the partial nonlinear layer
+    # The "previous" matrix layer is then M * M'. Due to the construction of M', the M[0,0] and v values will be the same for the new M' (and I also, obviously)
+    # Thus: Compute the matrices, store the w_hat and v_hat values
+
+    MDS_matrix_field_transpose = MDS_matrix_field.transpose()
+
+    w_hat_collection = []
+    v_collection = []
+    v = MDS_matrix_field_transpose[[0], list(range(1,t))]
+
+    M_mul = MDS_matrix_field_transpose
+    M_i = matrix(F, t, t)
+    for i in range(R_P_FIXED - 1, -1, -1):
+        M_hat = M_mul[list(range(1,t)), list(range(1,t))]
+        w = M_mul[list(range(1,t)), [0]]
+        v = M_mul[[0], list(range(1,t))]
+        v_collection.append(v.list())
+        w_hat = M_hat.inverse() * w
+        w_hat_collection.append(w_hat.list())
+
+        # Generate new M_i, and multiplication M * M_i for "previous" round
+        M_i = matrix.identity(t)
+        M_i[list(range(1,t)), list(range(1,t))] = M_hat
+        M_mul = MDS_matrix_field_transpose * M_i
+
+    return M_i, v_collection, w_hat_collection, MDS_matrix_field_transpose[0, 0]
+
+def calc_equivalent_constants(constants, MDS_matrix_field):
+    constants_temp = [constants[index:index+t] for index in range(0, len(constants), t)]
+
+    MDS_matrix_field_transpose = MDS_matrix_field.transpose()
+
+    # Start moving round constants up
+    # Calculate c_i' = M^(-1) * c_(i+1)
+    # Split c_i': Add c_i'[0] AFTER the S-box, add the rest to c_i
+    # I.e.: Store c_i'[0] for each of the partial rounds, and make c_i = c_i + c_i' (where now c_i'[0] = 0)
+    num_rounds = R_F_FIXED + R_P_FIXED
+    R_f = int(R_F_FIXED / 2)
+    for i in range(num_rounds - 2 - R_f, R_f - 1, -1):
+        inv_cip1 = list(vector(constants_temp[i+1]) * MDS_matrix_field_transpose.inverse())
+        constants_temp[i] = list(vector(constants_temp[i]) + vector([0] + inv_cip1[1:]))
+        constants_temp[i+1] = [inv_cip1[0]] + [0] * (t-1)
+
+    return constants_temp
+
+def poseidon(input_words, matrix, round_constants):
+
+    R_f = int(R_F_FIXED / 2)
+
+    round_constants_counter = 0
+
+    state_words = list(input_words)
+
+    # First full rounds
+    for r in range(0, R_f):
+        # Round constants, nonlinear layer, matrix multiplication
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        for i in range(0, t):
+            state_words[i] = (state_words[i])^alpha
+        state_words = list(matrix * vector(state_words))
+
+    # Middle partial rounds
+    for r in range(0, R_P_FIXED):
+        # Round constants, nonlinear layer, matrix multiplication
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        state_words[0] = (state_words[0])^alpha
+        state_words = list(matrix * vector(state_words))
+
+    # Last full rounds
+    for r in range(0, R_f):
+        # Round constants, nonlinear layer, matrix multiplication
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        for i in range(0, t):
+            state_words[i] = (state_words[i])^alpha
+        state_words = list(matrix * vector(state_words))
+
+    return state_words
+
+def poseidon2(input_words, matrix_full, matrix_partial, round_constants):
+
+    R_f = int(R_F_FIXED / 2)
+
+    round_constants_counter = 0
+
+    state_words = list(input_words)
+
+    print("Begin Poseidon2\n")
+    # First matrix mul
+    print("\n Field Modulus: p = ",p)
+    print("\n Full Rounds: R_F = ",R_F_FIXED)
+    print("\n Partial Rounds: R_P = ",R_P_FIXED)
+    print("\n Alpha = ",alpha)
+    print("\n Full Matrix")
+    print(matrix_full)
+    print("\n Partial Matrix")
+    print(matrix_partial)
+    state_words = list(matrix_full * vector(state_words))
+
+    # First full rounds
+    print("\n First full rounds\n")
+    print("Round constants:")
+    for r in range(0, R_f):
+        # Round constants, nonlinear layer, matrix multiplication
+        print("First full round number: ",r)
+        print(round_constants[round_constants_counter:round_constants_counter+t])
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        for i in range(0, t):
+            state_words[i] = (state_words[i])^alpha
+        state_words = list(matrix_full * vector(state_words))
+
+    # Middle partial rounds
+    print("\n Middle partial rounds\n")
+    print("Round constants:")
+    for r in range(0, R_P_FIXED):
+        # Round constants, nonlinear layer, matrix multiplication
+        print("Middle partial round number: ",r)
+        print(round_constants[round_constants_counter:round_constants_counter+t])
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        state_words[0] = (state_words[0])^alpha
+        state_words = list(matrix_partial * vector(state_words))
+
+    # Last full rounds
+    print("\n Last full rounds\n")
+    print("Round constants:")
+    for r in range(0, R_f):
+        # Round constants, nonlinear layer, matrix multiplication
+        print("Last full round number: ",r)
+        print(round_constants[round_constants_counter:round_constants_counter+t])
+        for i in range(0, t):
+            state_words[i] = state_words[i] + round_constants[round_constants_counter]
+            round_constants_counter += 1
+        for i in range(0, t):
+            state_words[i] = (state_words[i])^alpha
+        state_words = list(matrix_full * vector(state_words))
+
+    return state_words
+
+# Init
+init_generator(FIELD, SBOX, FIELD_SIZE, NUM_CELLS, R_F_FIXED, R_P_FIXED)
+
+# Round constants
+round_constants = generate_constants(FIELD, FIELD_SIZE, NUM_CELLS, R_F_FIXED, R_P_FIXED, PRIME_NUMBER)
+# print_round_constants(round_constants, FIELD_SIZE, FIELD)
+
+# Matrix
+# MDS = generate_matrix(FIELD, FIELD_SIZE, NUM_CELLS)
+MATRIX_FULL = generate_matrix_full(NUM_CELLS)
+MATRIX_PARTIAL = generate_matrix_partial(FIELD, FIELD_SIZE, NUM_CELLS)
+MATRIX_PARTIAL_DIAGONAL_M_1 = [matrix_partial_m_1(MATRIX_PARTIAL, NUM_CELLS)[i,i] for i in range(0, NUM_CELLS)]
+
+def to_hex(value):
+    l = len(hex(p - 1))
+    if l % 2 == 1:
+        l = l + 1
+    value = hex(int(value))[2:]
+    value = "0x" + value.zfill(l - 2)
+    print("from_hex(\"{}\"),".format(value))
+
+# IK removed this: relevant for rust code with bellman
+#print("use super::poseidon::PoseidonParams;")
+#print("use bellman_ce::pairing::{bls12_381::Bls12, ff::ScalarEngine, from_hex};")
+#print("type Scalar = <Bls12 as ScalarEngine>::Fr;")
+#print("use lazy_static::lazy_static;")
+#print("use std::sync::Arc;")
+#print()
+#print("lazy_static! {")
+
+
+# # MDS
+# print("pub static ref MDS{}: Vec<Vec<Scalar>> = vec![".format(t))
+# for vec in MDS:
+#     print("vec![", end="")
+#     for val in vec:
+#         to_hex(val)
+#     print("],")
+# print("];")
+# print()
+
+# Efficient partial matrix (diagonal - 1)
+# IK removed this: relevant for rust code with bellman
+#print("pub static ref MAT_DIAG{}_M_1: Vec<Scalar> = vec![".format(t))
+#for val in MATRIX_PARTIAL_DIAGONAL_M_1:
+#    to_hex(val)
+#print("];")
+#print()
+
+# Efficient partial matrix (full)
+# IK removed this: relevant for rust code with bellman
+# print("pub static ref MAT_INTERNAL{}: Vec<Vec<Scalar>> = vec![".format(t))
+#for vec in MATRIX_PARTIAL:
+#    print("vec![", end="")
+#    for val in vec:
+#        to_hex(val)
+#    print("],")
+#print("];")
+#print()
+
+# Round constants
+# IK removed this: relevant for rust code with bellman
+#print("pub static ref RC{}: Vec<Vec<Scalar>> = vec![".format(t))
+#for (i,val) in enumerate(round_constants):
+#    if i % t == 0:
+#        print("vec![", end="")
+#    to_hex(val)
+#    if i % t == t - 1:
+#        print("],")
+#print("];")
+#print()
+
+#print("pub static ref POSEIDON_{}_PARAMS: Arc<PoseidonParams<Scalar>> = Arc::new(PoseidonParams::new({}, {}, {}, {}, &MAT_DIAG{}_M_1, &RC{}));".format(t, t, alpha, R_F_FIXED, R_P_FIXED , t, t))
+
+#print("}")
+#print()
+#print()
+
+state_in  = vector([F(i) for i in range(t)])
+# state_out = poseidon(state_in, MDS, round_constants)
+state_out = poseidon2(state_in, MATRIX_FULL, MATRIX_PARTIAL, round_constants)
+
+print("\n\n Test Vectors\n")
+print("Input state\n ",state_in)
+print("Output state\n ",state_out)
+
+# IK removed this: relevant for rust code with bellman
+#for (i,val) in enumerate(state_in):
+#    if i % t == 0:
+#        print("vec![", end="")
+#    to_hex(val)
+#    if i % t == t - 1:
+#        print("],")
+#print("];")
+
+#for (i,val) in enumerate(state_out):
+#    if i % t == 0:
+#        print("vec![", end="")
+#    to_hex(val)
+#    if i % t == t - 1:
+#        print("],")
+#print("];")
diff --git a/examples/rust/poseidon2/run.sh b/examples/rust/poseidon2/run.sh
new file mode 100755
index 000000000..8a8f4800f
--- /dev/null
+++ b/examples/rust/poseidon2/run.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+
+# Exit immediately if a command exits with a non-zero status
+set -e
+
+# Function to display usage information
+show_help() {
+  echo "Usage: $0 [-d DEVICE_TYPE] [-b BACKEND_INSTALL_DIR]"
+  echo
+  echo "Options:"
+  echo "  -d DEVICE_TYPE            Specify the device type (default: CPU)"
+  echo "  -b BACKEND_INSTALL_DIR    Specify the backend installation directory (default: empty)"
+  echo "  -h                        Show this help message"
+  exit 0
+}
+
+# Parse command line options
+while getopts ":d:b:h" opt; do
+  case ${opt} in
+    d )
+      DEVICE_TYPE=$OPTARG
+      ;;
+    b )
+      ICICLE_BACKEND_INSTALL_DIR="$(realpath "${OPTARG}")"
+      ;;
+    h )
+      show_help
+      ;;
+    \? )
+      echo "Invalid option: -$OPTARG" 1>&2
+      show_help
+      ;;
+    : )
+      echo "Invalid option: -$OPTARG requires an argument" 1>&2
+      show_help
+      ;;
+  esac
+done
+
+# Set default values if not provided
+: "${DEVICE_TYPE:=CPU}"
+: "${ICICLE_BACKEND_INSTALL_DIR:=}"
+
+DEVICE_TYPE_LOWERCASE=$(echo "$DEVICE_TYPE" | tr '[:upper:]' '[:lower:]')
+
+ICICLE_DIR=$(realpath "../../../icicle/")
+ICICLE_BACKEND_SOURCE_DIR="${ICICLE_DIR}/backend/${DEVICE_TYPE_LOWERCASE}"
+
+# Build Icicle and the example app that links to it
+if [ "$DEVICE_TYPE" != "CPU" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_BACKEND_SOURCE_DIR}" ]; then
+  echo "Building icicle and ${DEVICE_TYPE} backend"
+  cargo build --release --features="${DEVICE_TYPE_LOWERCASE}"
+  export ICICLE_BACKEND_INSTALL_DIR=$(realpath "../target/release/deps/icicle/lib/backend")
+  cargo run --release --features="${DEVICE_TYPE_LOWERCASE}" -- --device-type "${DEVICE_TYPE}"
+else
+  echo "Building icicle without backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
+  export ICICLE_BACKEND_INSTALL_DIR="${ICICLE_BACKEND_INSTALL_DIR}"
+  cargo run --release -- --device-type "${DEVICE_TYPE}"
+fi
+
diff --git a/examples/rust/poseidon2/src/main.rs b/examples/rust/poseidon2/src/main.rs
new file mode 100644
index 000000000..5e5f3aede
--- /dev/null
+++ b/examples/rust/poseidon2/src/main.rs
@@ -0,0 +1,150 @@
+use icicle_core::{
+    hash::{HashConfig, Hasher},
+    merkle::{MerkleProof, MerkleTree, MerkleTreeConfig, PaddingPolicy},
+    poseidon2::Poseidon2, traits::FieldImpl
+};
+use std::convert::TryInto;
+use icicle_babybear::field::ScalarField as Frbb;
+use icicle_m31::field::ScalarField as Frm31;
+use rand::{random, Rng};
+use icicle_runtime::memory::HostSlice;
+
+pub fn hash_test<F:FieldImpl>(
+    test_vec: Vec<F>,
+    config: HashConfig,
+    hash: Hasher,
+) {
+    let input_slice = HostSlice::from_slice(&test_vec);
+    let out_init:F = F::zero();
+    let mut binding = [out_init];
+    let out_init_slice = HostSlice::from_mut_slice(&mut binding);
+    hash.hash(input_slice, &config, out_init_slice).unwrap();
+    println!("computed digest: {:?} ",out_init_slice.as_slice().to_vec()[0]);
+}
+
+pub fn compute_binary_tree<F:FieldImpl>(
+    mut test_vec: Vec<F>,
+    leaf_size: u64,
+    hasher: Hasher,
+    compress: Hasher,
+    mut tree_config: MerkleTreeConfig,
+) -> MerkleTree {
+    let tree_height: usize = test_vec.len().ilog2() as usize;
+    tree_config.padding_policy = PaddingPolicy::ZeroPadding;
+    let layer_hashes: Vec<&Hasher> = std::iter::once(&hasher)
+        .chain(std::iter::repeat(&compress).take(tree_height))
+        .collect();
+    //binary tree
+    let vec_slice: &mut HostSlice<F> = HostSlice::from_mut_slice(&mut test_vec[..]);
+    let merkle_tree: MerkleTree = MerkleTree::new(&layer_hashes, leaf_size, 0).unwrap();
+
+    merkle_tree
+        .build(vec_slice, &tree_config).expect("failed to build Merkle tree");
+    merkle_tree
+}
+
+pub fn main(){
+// digest = output_state[1]
+// Sage output Baby bear
+// t = 2
+// Input state (0, 1)
+// Output state [869011615, 833751247]
+// t= 3
+// Input state (0, 1, 2)
+// Output state [1704654064, 1850148672, 1532353406]
+// t = 4
+// Input state (0, 1, 2, 3)
+// Output state [741579827, 472702774, 852055751, 1266116070]
+// t= 8 
+// Input state (0, 1, 2, 3, 4, 5, 6, 7)
+// Output state [1231724177, 1077997898, 146576824, 919391229, 302461086, 1311223212, 679569792, 681685934]
+// t = 12
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
+// Output state [1540343413, 1605336739, 1201446587, 1251783394, 440826505, 1691696232, 904498569, 1312737773, 1464207073, 133812423, 1144748001, 1160609856]
+// t = 16
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
+// Output state [896560466, 771677727, 128113032, 1378976435, 160019712, 1452738514, 682850273, 223500421, 501450187, 1804685789, 1671399593, 1788755219, 1736880027, 1352180784, 1928489698, 1128802977]
+// t = 20
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
+// Output state [1637625426, 1224149815, 185762176, 1743975927, 215506, 846181926, 1805239884, 1583247763, 40890463, 1769635047, 1593365708, 543030243, 190381160, 114174693, 528766946, 107317631, 199017750, 946546831, 188856465, 89693326]
+// t = 24
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
+// Output state [785637949, 311566256, 241540729, 1641553353, 851108667, 1648913123, 510139232, 616108837, 707720633, 1357404478, 1539840236, 275323287, 899761440, 732341189, 664618988, 1426148993, 1498654335, 792736017, 1804085503, 402731039, 659103866, 1036635937, 1016617890, 1470732388]
+    let t_vec = [2,3,4,8,12,16,20,24];
+    let expected_digest_bb:Vec<Frbb> = vec![Frbb::from_u32(833751247),Frbb::from_u32(1850148672),Frbb::from_u32(472702774),Frbb::from_u32(1077997898),Frbb::from_u32(1605336739),Frbb::from_u32(771677727),Frbb::from_u32(1224149815),Frbb::from_u32(311566256)];
+    println!("Baby Bear");
+    let config = HashConfig::default();
+    for (t,digest) in t_vec.iter().zip(expected_digest_bb.iter()){
+        let input_state_bb:Vec<Frbb> = (0..*t).map(Frbb::from_u32).collect();
+        println!("test vector {:?}",input_state_bb);
+        println!("expected digest {:?}",digest);
+        hash_test::<Frbb>(input_state_bb, config.clone(), Poseidon2::new::<Frbb>(*t,None).unwrap());
+        println!(" ");
+    }
+
+// digest = output_state[1]
+// Sage output m31
+//t=2
+// Input state (0, 1)
+// Output state [1259525573, 1321841424]
+//t=3
+// Input state (0, 1, 2)
+// Output state [1965998969, 1808522380, 1146513877]
+// t=4
+// Input state (0, 1, 2, 3)
+// Output state [1062794947, 1937028579, 518022994, 1790851810]
+// t= 8
+// Input state (0, 1, 2, 3, 4, 5, 6, 7)
+// Output state [1587676993, 1040745210, 1362579098, 1364533986, 505714447, 371333953, 24021099, 1307077870]
+// t =12
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
+// Output state [1352296093, 495013829, 721412628, 551472485, 1402861161, 1099939525, 56806196, 322927204, 1743775127, 1737182096, 1637144312, 482990946]
+// t= 16
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
+// Output state [1348310665, 996460804, 2044919169, 1269301599, 615961333, 595876573, 1377780500, 1776267289, 715842585, 1823756332, 1870636634, 1979645732, 311256455, 1364752356, 58674647, 323699327]
+// t = 20
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
+// Output state [2145869251, 33722680, 323999981, 1338601227, 1335935383, 1569616976, 1025767832, 1219571145, 1312283131, 517961801, 1182517165, 1896142496, 1426432276, 386540698, 1519857378, 840037603, 431686357, 2045496595, 609478066, 1695781828]
+// t = 24
+// Input state (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
+// Output state [813042329, 956159494, 2017691352, 906353481, 1909737181, 1568930368, 1051192156, 1915448194, 114779228, 1695016063, 56353577, 991257558, 1283398606, 1782986529, 89100699, 1011002020, 71058136, 1382771657, 1734747710, 184579357, 1201113333, 2002016011, 1347833245, 1026595486]
+
+    let expected_digest_m31:Vec<Frm31> = vec![Frm31::from_u32(1321841424),Frm31::from_u32(1808522380),Frm31::from_u32(1937028579),Frm31::from_u32(1040745210),
+    Frm31::from_u32(495013829),Frm31::from_u32(996460804),Frm31::from_u32(33722680),Frm31::from_u32(956159494)];
+    println!("M31");
+    let config = HashConfig::default();
+    for (t,digest) in t_vec.iter().zip(expected_digest_m31.iter()){
+        let input_state_m31:Vec<Frm31> = (0..*t).map(Frm31::from_u32).collect();
+        println!("test vector {:?}",input_state_m31);
+        println!("expected digest {:?}",digest);
+        hash_test::<Frm31>(input_state_m31, config.clone(), Poseidon2::new::<Frm31>(*t,None).unwrap());
+        println!(" ");
+    }
+
+    println!("\n Merkle tree test with poseidon 2: m31");
+
+
+    // for binary tree Poseidon(t1,t2) -> n1 
+    let poseidon_state_size = 2; 
+    let leaf_size:u64 = 4;// each leaf is a 32 bit element 32/8 = 4 bytes
+
+    let test_vec: Vec<Frm31> = (0..1024 * (poseidon_state_size as usize)).map(|_| Frm31::from_u32(random::<u32>())).collect();
+    println!("Generated random vector of size {:?}", 1024* (poseidon_state_size as usize));
+    //to use later for merkle proof
+    let mut binding = test_vec.clone();
+    let test_vec_slice = HostSlice::from_mut_slice(&mut binding);
+    //define hash and compression functions
+    let hasher :Hasher = Poseidon2::new::<Frm31>(poseidon_state_size.try_into().unwrap(),None).unwrap();
+    let compress: Hasher = Poseidon2::new::<Frm31>((hasher.output_size()*2).try_into().unwrap(),None).unwrap();
+    //tree config
+    let tree_config = MerkleTreeConfig::default();
+    let merk_tree = compute_binary_tree(test_vec.clone(), leaf_size, hasher, compress,tree_config.clone());
+    println!("computed Merkle root {:?}", merk_tree.get_root::<Frm31>().unwrap());
+
+    let random_test_index = rand::thread_rng().gen_range(0..1024*(poseidon_state_size as usize));
+    print!("Generating proof for element {:?} at random test index {:?} ",test_vec[random_test_index], random_test_index);
+    let merkle_proof = merk_tree.get_proof::<Frm31>(test_vec_slice, random_test_index.try_into().unwrap(), false, &tree_config).unwrap();
+
+    assert!(merk_tree.verify(&merkle_proof).unwrap());
+    println!("\n Merkle proof verified successfully!");
+}
\ No newline at end of file