Could you share inference code for S^2 Attention? #120
Comments
Hi, for inference you can simply replace the attention forward function with this one: LongLoRA/llama_attn_replace_sft.py, line 24 in 39866af. A PR merged yesterday has already fixed that part. Regards,
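For readers wondering what "replace the forward function" looks like in practice, here is a minimal sketch of the usual monkey-patching approach: overwrite transformers' LlamaAttention.forward with the replacement forward before loading the model. The function name forward_flashattn_inference and the checkpoint path are assumptions drawn from this thread, not a verified snippet from the repo, and the transformers module path may differ between versions.

```python
# Minimal sketch: patch LLaMA attention before instantiating the model so every
# LlamaAttention layer picks up the replacement forward at inference time.
# Assumptions: the function name forward_flashattn_inference (from
# llama_attn_replace_sft.py) and the checkpoint path are illustrative only.
import transformers
from llama_attn_replace_sft import forward_flashattn_inference

def replace_llama_attn_for_inference():
    transformers.models.llama.modeling_llama.LlamaAttention.forward = (
        forward_flashattn_inference
    )

replace_llama_attn_for_inference()

model = transformers.AutoModelForCausalLM.from_pretrained(
    "path/to/longlora-checkpoint",  # hypothetical path
    torch_dtype="auto",
)
```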
@yukang2017 The code there looks different from what I had in mind. If I understand correctly, it is for offline inference? What I was hoping for is an online version of S^2 Attention, but as far as I can see the online path only has full attention implemented with FlashAttention. The online version does raise some open questions: for example, while generating each token, the cached KV length may not be divisible by 4, in which case something extra is needed, such as padding or truncation?
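To make the divisibility concern concrete, here is a hypothetical sketch (my own illustration, not code from the LongLoRA repo) of right-padding cached K/V so the sequence length becomes a multiple of the S^2 group size before grouped attention; a matching attention mask would still be needed so the padded positions are ignored.

```python
import torch
import torch.nn.functional as F

def pad_kv_to_group_size(key_states, value_states, group_size=4):
    """Hypothetical helper: right-pad cached K/V along the sequence axis so
    the length divides evenly into S^2 attention groups.
    Assumed shapes: (batch, num_heads, seq_len, head_dim)."""
    seq_len = key_states.shape[2]
    pad_len = (group_size - seq_len % group_size) % group_size
    if pad_len == 0:
        return key_states, value_states, 0
    # F.pad fills dims from the last one backwards:
    # (head_dim_left, head_dim_right, seq_left, seq_right)
    key_states = F.pad(key_states, (0, 0, 0, pad_len))
    value_states = F.pad(value_states, (0, 0, 0, pad_len))
    return key_states, value_states, pad_len
```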
I see it stated here that S^2 attention is not needed at inference time, but your code also contains forward_flashattn_inference, so is S^2 attention actually required for inference? When using an inference framework such as vLLM, is the default flash_attn enough?
@coranholmes Hi, forward_flashattn_inference is just standard attention for inference, not S^2 attention. Using the default is fine. @hxs91 Hi, I indeed have not implemented S^2 attention + KV cache inference yet. The current forward_flashattn version no longer needs padding or any divisibility handling, so you could give it a try.
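Since the maintainer says standard attention is enough at inference, below is a minimal sketch of serving a merged LongLoRA checkpoint with vLLM's default attention backend; the model path is hypothetical and the LoRA weights are assumed to already be merged into the base model.

```python
from vllm import LLM, SamplingParams

# Hypothetical merged checkpoint; vLLM loads it as a plain LLaMA model and
# runs its default full attention during decoding (no S^2 attention needed).
llm = LLM(model="path/to/merged-longlora-llama")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize the S^2 attention idea in one sentence."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```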
As the title says.