Better understanding of function names/class names #709
Oh, well. It can use other embedding functions, and I tried some ollama models with different levels of success, and then WordLlama. It looks promising, and it's lightning-fast even on a laptop with an AMD Ryzen 5 4500U.
I think what is going on here is that the sorting mechanism is not perfect: it is not based on an actual understanding of the query and the results. As you say, using a better embedding model can help with this, since semantic distance is one of the main criteria for sorting, so we should definitely experiment with that.

Another thing to add is that I believe there are different potential ways of using the tool. For instance, in your use case the result would probably have shown up somewhere towards the top of the list, but not at the very top. This could be improved by also using an LLM to understand the query: for instance, with a RAG workflow we could get a list of results that fits into the context limit of a local ollama model, and use that model to formulate the final answer. The upside would be that you don't have to "manually" peruse several lines of results to find what you are looking for. The downside is that the model could hallucinate or format the answer incorrectly, which could be addressed with some validation.

Yet another thing is to improve the chunking: currently we use the actual code lines (based on some heuristics to ignore irrelevant lines) as well as the file names to create the embeddings. Instead of this, we could use a generative model to actually understand the function of each code line and add additional context to the embedding. This should be fairly simple to do, but would greatly slow down the chunking process. But if we actually have a faster model now, it would be a good time to experiment with it.
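To make the "semantic distance as a sorting criterion" point concrete, here is a minimal sketch of similarity-based ranking (toy 2-D vectors stand in for real embeddings; this is an illustration, not the project's actual implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vector, chunks):
    """Sort (text, vector) chunks by similarity to the query, best first."""
    return sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk[1]),
        reverse=True,
    )

# Toy example: two code chunks with made-up 2-D "embeddings"
chunks = [
    ("def parse_config(...)", [0.9, 0.1]),
    ("def render_html(...)", [0.1, 0.9]),
]
query_vector = [0.8, 0.2]  # hypothetical embedding of the user's query
best_text, _ = rank_chunks(query_vector, chunks)[0]
```

A better embedding model changes only the quality of the vectors, not this ranking step, which is why swapping models can improve results without touching the sorting code.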
I was thinking that the embedding function is supposed to do this. Maybe it could be achieved by using larger chunks, probably functions/classes or some other top-level structures, with a model like this?
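For Python sources, function/class-level chunking like this can be sketched with the stdlib `ast` module (a sketch of the idea, not how the project currently chunks; the sample source below is made up):

```python
import ast

def top_level_chunks(source):
    """Split Python source into one (name, code) chunk per top-level
    function or class, which could then be embedded as a whole."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks

# Hypothetical file contents to chunk
source = '''\
def load_user(user_id):
    return db.get(user_id)

class Cache:
    def clear(self):
        self.data = {}
'''
chunks = top_level_chunks(source)
```

Each chunk carries its full body, so the embedding sees the whole function rather than isolated lines; the trade-off is coarser-grained search results.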
Thanks for this project, it looks really promising.
I just started using it, and here's what I found; the example is this repo:
But when I split the name into words, it cannot find this function.
That's probably my main use case - to find something without knowing the exact name. I'm happy to help with fixing this and doing some research or writing patches.
Maybe you have an idea of how to improve this? I see there's issue #354 about trying different models, and probably some other "code-search-oriented" model could improve this.
My absolutely non-AI-related guess: when encountering snake_case or SomeOtherCase names, convert them to normal words and let it index those. But code-search-oriented models are probably already doing this...
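For what it's worth, the identifier-splitting part needs no model at all; a small regex-based sketch (the example names are hypothetical, not from the project):

```python
import re

def split_identifier(name):
    """Split snake_case, kebab-case, and CamelCase identifiers into
    lowercase words, e.g. for indexing alongside the raw identifier."""
    # First break on explicit separators, then on case transitions.
    words = []
    for part in re.split(r"[_\-]+", name):
        # Acronym followed by a word ("HTTP" in "HTTPResponse"),
        # a capitalized or lowercase word, a trailing acronym, or digits.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [w.lower() for w in words if w]
```

Indexing both the original identifier and these split words would let a query like "parse response" match `parseHTTPResponse` without relying on the embedding model to understand naming conventions.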