Idea: Add an option to make the output more friendly with RAG engines #202

tranquochuy645 · 2024-12-10T16:56:53Z

The tool is already great, but larger repositories will never fit into an LLM's context window.

I know enterprise-level RAG systems have been around for a while, but sticking to user-friendly solutions, here’s what I’m thinking:

What to do: Make Repomix's output easier to parse and more meaningful when retrieved by a RAG engine, such as the "chat with documents" feature that’s common in most LLM applications.
How to do it: Split large code files into smaller chunks using the AST, then merge the chunks back into a single output, with separators between chunks that most text splitters recognize.
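
To make the split-then-merge idea concrete, here is a minimal sketch in TypeScript. It is only an illustration under assumptions: the `NodeRange` shape, the `CHUNK_SEPARATOR` string, and the chunk header format are hypothetical and not part of Repomix, and in practice the line ranges would come from an AST parser rather than being written by hand.

```typescript
// Hypothetical shape for a top-level AST node (0-based lines, end-exclusive).
// A real implementation would obtain these ranges from a parser.
interface NodeRange {
  name: string;
  startLine: number;
  endLine: number;
}

// Hypothetical separator: any distinctive string a text splitter can be
// configured to split on would work.
const CHUNK_SEPARATOR = "\n\n---- chunk ----\n\n";

// Cut a source file into chunks along the given node boundaries.
function splitByRanges(source: string, ranges: NodeRange[]): string[] {
  const lines = source.split("\n");
  const sorted = [...ranges].sort((a, b) => a.startLine - b.startLine);
  const chunks: string[] = [];
  let cursor = 0;
  for (const range of sorted) {
    // Skip the part of a range that overlaps a chunk already emitted.
    const start = Math.max(range.startLine, cursor);
    if (start >= range.endLine) continue;
    // Code between declarations (imports, comments) becomes its own chunk.
    if (start > cursor) {
      chunks.push(lines.slice(cursor, start).join("\n"));
    }
    chunks.push(lines.slice(start, range.endLine).join("\n"));
    cursor = range.endLine;
  }
  // Whatever follows the last declaration is the final chunk.
  if (cursor < lines.length) {
    chunks.push(lines.slice(cursor).join("\n"));
  }
  // Drop chunks that contain only whitespace.
  return chunks.filter((chunk) => chunk.trim().length > 0);
}

// Merge the chunks back into one output, each labeled with its file path.
function mergeChunks(filePath: string, chunks: string[]): string {
  return chunks
    .map((chunk, i) => `File: ${filePath} (chunk ${i + 1}/${chunks.length})\n${chunk}`)
    .join(CHUNK_SEPARATOR);
}
```

Repeating the file path in every chunk header is a design choice in this sketch: a retriever that returns a single chunk still carries enough context to tell where the code came from.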

I’ve recently tried LangChain’s source code loader, and this approach should be easy to implement with a few additional dependencies.

If this sounds good, I’d be happy to open a PR for it! Let me know your thoughts.

yamadashy · 2024-12-11T15:35:22Z

Hi, @tranquochuy645!
Thank you for this great proposal! I completely agree that improving RAG compatibility would be valuable for handling larger codebases.

I see you're considering using an AST for code splitting, and I'm curious about your planned implementation approach. Since Repomix is a Node.js tool, I'd like to avoid introducing Python dependencies (like LangChain's RecursiveCharacterTextSplitter), as that would require users to maintain both Node.js and Python environments.

I've been thinking tree-sitter could be a good fit here since it:

Provides accurate AST parsing for multiple languages
Is well-maintained and reliable

What are your thoughts on implementation details? I'm curious to hear more about how you're planning to handle the splitting logic.

I'm excited about this feature and looking forward to hearing your ideas!

tranquochuy645 · 2024-12-11T18:23:19Z

Thanks so much for the encouragement, @yamadashy - really appreciate it!

I totally get that adding Python dependencies to a Node.js package is a no-go.

After digging deeper into the LangChain codebase, I found two approaches they use for code parsing:

RecursiveCharacterTextSplitter: It uses language-specific separators and splits code like plain text.
Source Code: This one uses Tree-sitter for parsing.

The first option isn’t great, and the second isn’t supported in LangChainJS yet.
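
For reference, the separator-based approach can be sketched roughly like this in TypeScript. This is a simplified illustration of the general technique, not LangChain's actual implementation: the `SEPARATORS` list and the `maxLen` parameter are assumptions, and the step that merges small pieces back together is left out.

```typescript
// Hypothetical separators for JavaScript-like code, ordered from coarse to fine.
const SEPARATORS = ["\nclass ", "\nfunction ", "\n\n", "\n", " "];

// Split text into pieces of at most maxLen characters, trying the coarsest
// separator first and falling back to finer ones for oversized pieces.
function recursiveSplit(
  text: string,
  maxLen: number,
  separators: string[] = SEPARATORS,
): string[] {
  if (text.length === 0) return [];
  if (text.length <= maxLen) return [text];

  // Pick the first (coarsest) separator that actually occurs in the text.
  const index = separators.findIndex((sep) => text.includes(sep));
  if (index === -1) {
    // No separator left: fall back to a hard cut every maxLen characters.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) {
      pieces.push(text.slice(i, i + maxLen));
    }
    return pieces;
  }

  const sep = separators[index];
  const finer = separators.slice(index + 1);
  // Keep the separator at the start of each following part so no text is lost.
  const parts = text.split(sep).map((part, i) => (i === 0 ? part : sep + part));
  // Parts that are still too long are split again with the finer separators.
  return parts.flatMap((part) => recursiveSplit(part, maxLen, finer));
}
```

Because the splitter only looks at characters, a long function body gets cut at whatever newline or space comes next, regardless of syntax. That is the limitation an AST-based splitter is meant to avoid.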

But there are some tools we can use:

node-tree-sitter: A Tree-sitter implementation for Node.js.
The Tree-sitter queries from Python LangChain, like this one.

My plan is to mimic the LangChain language parser module in JavaScript for the splitting part; the rest is pretty straightforward.

For now, I’ll keep exploring Repomix’s codebase. Once I have a better understanding, I’ll open a draft PR so we can chat more about the implementation details.

tranquochuy645 · 2024-12-11T19:20:04Z

Also, node-tree-sitter's documentation is not great, especially for the Query API.

Posting this link here for later investigation.

tree-sitter/node-tree-sitter#70 (comment)

yamadashy · 2024-12-13T09:52:08Z

Thank you for such detailed research! I really appreciate your thorough investigation into potential approaches. I've been wanting to tackle this issue but hadn't been able to start, so this is incredibly helpful.

Speaking of Tree-sitter implementations, I recently came across an interesting article about how Aider uses Tree-sitter for their codebase analysis:
https://aider.chat/2023/10/22/repomap.html

Also, Cline is a great JavaScript-based reference for Tree-sitter implementation:
https://github.com/cline/cline/tree/main/src/services/tree-sitter

It might provide some useful insights for our implementation.

I completely agree that handling large codebases is currently Repomix's biggest challenge. I'm very excited to move forward with this enhancement.

tranquochuy645 changed the title ~~Idea: Pre-processing to make the output more friendly with RAG engines~~ Idea: Make the output more friendly with RAG engines Dec 10, 2024

tranquochuy645 changed the title ~~Idea: Make the output more friendly with RAG engines~~ Idea: Add an option to make the output more friendly with RAG engines Dec 10, 2024

yamadashy added the idea label Dec 11, 2024

yamadashy added the needs discussion label (Issues needing discussion and a decision to be made before action can be taken) Dec 11, 2024
