Idea: Add an option to make the output more friendly with RAG engines #202

Open
tranquochuy645 opened this issue Dec 10, 2024 · 4 comments
Labels
idea needs discussion Issues needing discussion and a decision to be made before action can be taken

Comments

@tranquochuy645
Contributor

Hello @yamadashy ,

The tool is already great, but larger repositories will never fit into an LLM's context window.

I know enterprise-level RAG systems have been around for a while, but sticking to user-friendly solutions, here’s what I’m thinking:

  • What to do: Make the output of Repomix easier to parse, and more meaningful when retrieved by any RAG engine, such as the "chat with documents" feature that’s common in most LLM applications.

  • How to do it: Split large code files into smaller chunks along AST boundaries, then merge the chunks back into a single output, with separators between chunks that most text splitters can recognize.
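
As a minimal sketch of the merge step only, assuming the chunks have already been produced by some AST-based splitter: the `Chunk` shape, the `mergeChunks` name, and the `=====` header format below are all hypothetical and not part of Repomix. The idea is to put each chunk under a header line that is unlikely to occur in source code, and to separate chunks with a blank line.

```typescript
// Hypothetical chunk shape for illustration; Repomix does not define this type.
interface Chunk {
  path: string;      // file the chunk came from
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;      // source text of the chunk
}

// Assumed header format: a marker line that is unlikely to appear in real
// code, so a text splitter can be configured to cut on it.
function chunkHeader(chunk: Chunk): string {
  return `===== ${chunk.path}:${chunk.startLine}-${chunk.endLine} =====`;
}

// Merge chunks into one output string. Each chunk gets its header, and
// chunks are separated by a blank line.
function mergeChunks(chunks: Chunk[]): string {
  return chunks
    .map((chunk) => `${chunkHeader(chunk)}\n${chunk.text.trimEnd()}`)
    .join("\n\n");
}
```

A retrieval tool that splits on the `=====` marker would then get one chunk per piece of code, with the file path and line range kept alongside the text.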

I’ve recently tried LangChain’s source code loader, and this approach should be easy to implement with a few additional dependencies.

If this sounds good, I’d be happy to open a PR for it! Let me know your thoughts.

tranquochuy645 changed the title from "Idea: Pre-processing to make the output more friendly with RAG engines" to "Idea: Make the output more friendly with RAG engines" on Dec 10, 2024
tranquochuy645 changed the title from "Idea: Make the output more friendly with RAG engines" to "Idea: Add an option to make the output more friendly with RAG engines" on Dec 10, 2024
@yamadashy yamadashy added the idea label Dec 11, 2024
@yamadashy
Owner

Hi, @tranquochuy645 !
Thank you for this great proposal! I completely agree that improving RAG compatibility would be valuable for handling larger codebases.

I see you're considering using AST for code splitting - I'm curious about your planned implementation approach. Since Repomix is a Node.js tool, I'd like to avoid introducing Python dependencies (like Langchain's RecursiveCharacterTextSplitter) as that would require users to maintain both Node.js and Python environments.

I've been thinking tree-sitter could be a good fit here since it:

  • Provides accurate AST parsing for multiple languages
  • Is well-maintained and reliable

What are your thoughts on implementation details? I'm curious to hear more about how you're planning to handle the splitting logic.

I'm excited about this feature and looking forward to hearing your ideas!

@yamadashy yamadashy added the needs discussion Issues needing discussion and a decision to be made before action can be taken label Dec 11, 2024
@tranquochuy645
Contributor Author

tranquochuy645 commented Dec 11, 2024

Thanks so much for the encouragement, @yamadashy - really appreciate it!

I totally get that adding Python dependencies to a Node.js package is a no-go.

After digging deeper into the LangChain codebase, I found two approaches they use for code parsing:

The first option isn’t great, and the second isn’t supported in LangChainJS yet.

But there are some tools we can use:

  • node-tree-sitter: A Tree-sitter implementation for Node.js.

  • The Tree-sitter queries from Python LangChain, like this one.

My plan is to mimic the LangChain language parser module in JavaScript for the splitting part. The rest is pretty straightforward.
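
A rough sketch of what the splitting part could look like, under stated assumptions: one chunk per top-level function or class, plus one chunk for the remaining top-level code. This is only loosely modeled on that idea and is not LangChain's implementation. `NodeLike`, `Segment`, and `splitTopLevel` are hypothetical names; the `type`, `startIndex`, `endIndex`, and `children` fields mirror the ones on node-tree-sitter's syntax nodes, and the offsets are assumed to be usable directly with `String.prototype.slice`.

```typescript
// Minimal structural view of a syntax node. Only this subset of fields is
// assumed; a real tree-sitter node has many more.
interface NodeLike {
  type: string;
  startIndex: number; // offset of the node's first character in the source
  endIndex: number;   // offset just past the node's last character
  children: NodeLike[];
}

interface Segment {
  kind: "definition" | "remainder";
  text: string;
}

// Assumed set of node types that become standalone chunks. The names follow
// the tree-sitter JavaScript grammar; other languages need their own set.
const DEFINITION_TYPES = new Set(["function_declaration", "class_declaration"]);

// Walk the root's direct children: each definition becomes its own segment,
// and everything else is collected into a single trailing "remainder" segment.
function splitTopLevel(source: string, root: NodeLike): Segment[] {
  const segments: Segment[] = [];
  const remainder: string[] = [];
  for (const child of root.children) {
    const text = source.slice(child.startIndex, child.endIndex);
    if (DEFINITION_TYPES.has(child.type)) {
      segments.push({ kind: "definition", text });
    } else {
      remainder.push(text);
    }
  }
  if (remainder.length > 0) {
    segments.push({ kind: "remainder", text: remainder.join("\n") });
  }
  return segments;
}
```

Typing the input as a plain structural interface keeps the sketch independent of any particular parser binding, so the same function could be fed a tree from whichever Tree-sitter package ends up being chosen.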

For now, I’ll keep exploring Repomix’s codebase. Once I have a better understanding, I’ll open a draft PR so we can chat more about the implementation details.

@tranquochuy645
Contributor Author

Also, node-tree-sitter's documentation is not that great, especially for the Query API.

Posting this link here for later investigation.

tree-sitter/node-tree-sitter#70 (comment)

@yamadashy
Owner

yamadashy commented Dec 13, 2024

Thank you for such detailed research! I really appreciate your thorough investigation into potential approaches. I've been wanting to tackle this issue but hadn't been able to start, so this is incredibly helpful.

Speaking of Tree-sitter implementations, I recently came across an interesting article about how Aider uses Tree-sitter for their codebase analysis:
https://aider.chat/2023/10/22/repomap.html

Also, Cline is a great JavaScript-based reference for Tree-sitter implementation:
https://github.com/cline/cline/tree/main/src/services/tree-sitter

It might provide some useful insights for our implementation.

I completely agree that handling large codebases is currently Repomix's biggest challenge. I'm very excited to move forward with this enhancement.
