It is not obvious to me why mamba-minimal is able to train, and it isn't explicitly documented. Is it due to the state-selection implementation? I noticed the mamba implementation mentions discarding all but the last state. I am able to train Mamba locally (the original mamba example, not mamba-minimal) using some older Candle code of mine that I rigged up with L-BFGS from candle-optimisers, though the loss is admittedly not very convex. Any help would be appreciated, or a pointer in the right direction (papers?).
Is it because the gradients for prior steps in the sequence are not part of the backpropagation graph? I feel like I am guessing when the answer might be obvious.
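To make my question concrete, here is a toy sketch of what I mean. This is not mamba code, just a hand-rolled linear recurrence `h_t = a*h_{t-1} + b*x_t` with a loss on only the final state. Even when the loss touches only `h_T`, backpropagating through the recurrence by hand still produces nonzero gradients for every earlier input, which is why I'm confused about what "discarding all but the last state" changes for training:

```python
# Toy linear recurrence h_t = a*h_{t-1} + b*x_t, with loss L = h_T.
# All names here are my own illustration, not taken from mamba/mamba-minimal.

def forward(a, b, xs):
    """Run the recurrence and return every intermediate state."""
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def grad_wrt_inputs(a, b, xs):
    """dL/dx_t for L = h_T, computed by the chain rule back through time.

    dh_t/dx_t = b and dh_t/dh_{t-1} = a, so earlier inputs pick up an
    extra factor of a per step: dL/dx_t = a**(T-1-t) * b.
    """
    T = len(xs)
    grads = [0.0] * T
    g = 1.0  # dL/dh_T
    for t in range(T - 1, -1, -1):
        grads[t] = g * b   # gradient flowing into input x_t
        g = g * a          # propagate one step further back
    return grads

xs = [1.0, 2.0, 3.0]
print(grad_wrt_inputs(0.5, 2.0, xs))  # → [0.5, 1.0, 2.0]
```

So even a final-state-only loss gives every timestep a gradient, as long as the intermediate states remain in the graph during the backward pass (discarding them after backward is just memory management). Is the mamba "discard all but the last state" comment only about inference/state caching, or does it actually truncate the backward graph?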