It is not obvious to me why mamba-minimal is able to train, and it isn't explicitly documented. Is it due to the state-selection implementation? I noticed the mamba implementation mentions discarding all but the last state. I am able to train Mamba locally (the original mamba example, not mamba-minimal) using some older Candle code of mine that I rigged up with L-BFGS from candle-optimisers, though the loss is admittedly not very convex. Any help would be appreciated, or a pointer in the right direction (papers?).
Is it because the gradients for prior steps in the sequence are not part of the backpropagation graph? I feel like I am guessing when the answer might be obvious.
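To make my question concrete, here is a toy sketch of what I mean. This is not mamba code, just a hand-rolled linear recurrence `h_t = a*h_{t-1} + b*x_t` with a loss on only the final state. Even when the loss touches only `h_T`, backpropagating through the recurrence by hand still produces nonzero gradients for every earlier input, which is why I'm confused about what "discarding all but the last state" changes for training:

```python
# Toy linear recurrence h_t = a*h_{t-1} + b*x_t, with loss L = h_T.
# All names here are my own illustration, not taken from mamba/mamba-minimal.

def forward(a, b, xs):
    """Run the recurrence and return every intermediate state."""
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def grad_wrt_inputs(a, b, xs):
    """dL/dx_t for L = h_T, computed by the chain rule back through time.

    dh_t/dx_t = b and dh_t/dh_{t-1} = a, so earlier inputs pick up an
    extra factor of a per step: dL/dx_t = a**(T-1-t) * b.
    """
    T = len(xs)
    grads = [0.0] * T
    g = 1.0  # dL/dh_T
    for t in range(T - 1, -1, -1):
        grads[t] = g * b   # gradient flowing into input x_t
        g = g * a          # propagate one step further back
    return grads

xs = [1.0, 2.0, 3.0]
print(grad_wrt_inputs(0.5, 2.0, xs))  # → [0.5, 1.0, 2.0]
```

So even a final-state-only loss gives every timestep a gradient, as long as the intermediate states remain in the graph during the backward pass (discarding them after backward is just memory management). Is the mamba "discard all but the last state" comment only about inference/state caching, or does it actually truncate the backward graph?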