[BUG] argsort metal kernel yields incorrect output with > 1024 elements #2570
Comments
That's actually expected, though we should have a proper error message for it. The candle sort operator uses a bitonic sort, which requires the whole data to fit in a single thread-group/CUDA block (the same approach is used by llama.cpp). The idea is to use this operator for things like mixture of experts, where the number of elements to sort is very small; it cannot be applied to larger sets of elements.
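To make the size limit concrete, here is a minimal CPU sketch of a bitonic argsort in Rust. It is not candle's Metal or CUDA kernel; `bitonic_argsort` and its padding scheme are hypothetical names and choices made for illustration. Each value of `i` in the inner loop stands in for one GPU thread, and every pass over `i` must finish before the next begins. On a GPU that ordering is enforced with a barrier, which only synchronizes threads inside one thread-group/block, so a kernel written this way is bounded by the maximum group size (commonly 1024 threads).

```rust
/// CPU sketch of an ascending bitonic argsort.
/// Hypothetical helper for illustration; assumes the input contains no NaN.
fn bitonic_argsort(values: &[f32]) -> Vec<usize> {
    let n = values.len();
    // A bitonic network needs a power-of-two length, so pad the index
    // array with a sentinel that compares greater than any real element.
    let padded = n.next_power_of_two();
    let mut idx: Vec<usize> = (0..padded)
        .map(|i| if i < n { i } else { usize::MAX })
        .collect();

    // Returns true when the element at index `a` sorts after the one at `b`.
    let greater = |a: usize, b: usize| -> bool {
        match (a == usize::MAX, b == usize::MAX) {
            (true, true) => false,
            (true, false) => true,
            (false, true) => false,
            (false, false) => values[a] > values[b],
        }
    };

    let mut k = 2;
    while k <= padded {
        let mut j = k / 2;
        while j > 0 {
            // On a GPU, each `i` would be one thread, and a thread-group
            // barrier would separate this pass from the next one.
            for i in 0..padded {
                let partner = i ^ j;
                if partner > i {
                    let ascending = (i & k) == 0;
                    let swap = if ascending {
                        greater(idx[i], idx[partner])
                    } else {
                        greater(idx[partner], idx[i])
                    };
                    if swap {
                        idx.swap(i, partner);
                    }
                }
            }
            j /= 2;
        }
        k *= 2;
    }

    // Drop the sentinels, which have all moved to the end.
    idx.truncate(n);
    idx
}
```

On the CPU nothing limits the input length, so the same network sorts more than 1024 elements correctly. Bitonic sort is not stable, so the order of equal values is unspecified.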
Yes, I realized it's bitonic sort once I went through the code; I didn't realize it's by design. A generic implementation would be helpful (in my case speeding up Torch delegates CUDA sort to And from what I could gather, Torch relies on According to this implementation, I'm working on an implementation of it and if things go well and the port to Metal works I'll probably create a PR where I'll call the Lot of
Reproduction:
Edit: removed incorrect diagnosis.