Idea: Add an option to make the output more friendly with RAG engines #202
Comments
Hi, @tranquochuy645! I see you're considering using AST for code splitting - I'm curious about your planned implementation approach. Since Repomix is a Node.js tool, I'd like to avoid introducing Python dependencies (like Langchain's RecursiveCharacterTextSplitter) as that would require users to maintain both Node.js and Python environments. I've been thinking tree-sitter could be a good fit here since it:
What are your thoughts on implementation details? I'm curious to hear more about how you're planning to handle the splitting logic. I'm excited about this feature and looking forward to hearing your ideas!
Thanks so much for the encouragement, @yamadashy - really appreciate it! I totally get that adding Python dependencies to a Node.js package is a no-go. After digging deeper into the LangChain codebase, I found two approaches they use for code parsing:
The first option isn’t great, and the second isn’t supported in LangChainJS yet. But there are some tools we can use:
My plan is to mimic the LangChain language parser module in JavaScript for the splitting part; the rest is pretty straightforward. For now, I’ll keep exploring Repomix’s codebase. Once I have a better understanding, I’ll open a draft PR so we can chat more about the implementation details.
Also, node-tree-sitter's documentation is not that great, especially for the Query API. Posting this link here for later investigation.
Thank you for such detailed research! I really appreciate your thorough investigation into potential approaches. I've been wanting to tackle this issue but hadn't been able to start, so this is incredibly helpful. Speaking of Tree-sitter implementations, I recently came across an interesting article about how Aider uses Tree-sitter for their codebase analysis: Also, Cline is a great JavaScript-based reference for Tree-sitter implementation: It might provide some useful insights for our implementation. I completely agree that handling large codebases is currently Repomix's biggest challenge. I'm very excited to move forward with this enhancement.
Hello @yamadashy ,
The tool is already great, but larger repositories will never fit into an LLM's context window.
I know enterprise-level RAG systems have been around for a while, but sticking to user-friendly solutions, here’s what I’m thinking:
What to do: Make the output of Repomix easier to parse, and more meaningful when retrieved by any RAG engine, such as the "chat with documents" feature that’s common in most LLM applications.
How to do it: Split large code files into smaller chunks using the AST, then merge them back into a single output. Also, add separators between chunks that are recognizable by most text splitters.
I’ve recently tried LangChain’s source code loader, and this approach should be easy to implement with a few additional dependencies.
If this sounds good, I’d be happy to open a PR for it! Let me know your thoughts.
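To make the splitting-and-merging idea above a bit more concrete, here is a minimal TypeScript sketch. It is not Repomix code and does not call tree-sitter: it assumes a parser has already reduced the AST to a list of top-level node line ranges. The names `NodeRange`, `Chunk`, `splitByRanges`, `renderChunks`, and the `CHUNK_SEPARATOR` string are all hypothetical, chosen only for illustration.

```typescript
// Hypothetical shape: a top-level AST node reduced to the lines it spans.
// A real parser (for example tree-sitter) would be the source of these ranges.
interface NodeRange {
  kind: string;      // e.g. "function" or "class"
  startLine: number; // 0-based, inclusive
  endLine: number;   // 0-based, inclusive
}

interface Chunk {
  kind: string;
  startLine: number;
  endLine: number;
  text: string;
}

// Assumed separator: any string a text splitter can be configured to split on.
const CHUNK_SEPARATOR = "\n\n=== chunk ===\n\n";

// Cut the source at node boundaries. Lines not covered by any node
// (imports, loose statements) become "other" chunks; blank gaps are dropped.
// Assumes the ranges do not overlap.
function splitByRanges(source: string, ranges: NodeRange[]): Chunk[] {
  const lines = source.split("\n");
  const sorted = [...ranges].sort((a, b) => a.startLine - b.startLine);
  const chunks: Chunk[] = [];
  let cursor = 0;

  const push = (kind: string, start: number, end: number): void => {
    const text = lines.slice(start, end + 1).join("\n");
    if (text.trim().length > 0) {
      chunks.push({ kind, startLine: start, endLine: end, text });
    }
  };

  for (const r of sorted) {
    if (r.startLine > cursor) push("other", cursor, r.startLine - 1);
    push(r.kind, r.startLine, r.endLine);
    cursor = r.endLine + 1;
  }
  if (cursor < lines.length) push("other", cursor, lines.length - 1);
  return chunks;
}

// Merge chunks back into one output. Each chunk gets a header with the file
// path and 1-based line range so it stays meaningful when retrieved alone.
function renderChunks(filePath: string, chunks: Chunk[]): string {
  return chunks
    .map((c) => `// ${filePath}:${c.startLine + 1}-${c.endLine + 1} (${c.kind})\n${c.text}`)
    .join(CHUNK_SEPARATOR);
}
```

The per-chunk header is a design guess: a retriever returns chunks in isolation, so each one carries its own file path and line range. The `//` comment prefix in the header is also an assumption and would need to vary by language.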