
Unstable removeComments and removeEmptyLines #208

Open
rockmandash opened this issue Dec 20, 2024 · 6 comments
Labels: bug Something isn't working
@rockmandash

My codebase has 15,382,523 tokens. When I set removeComments and removeEmptyLines to true, the terminal gets stuck forever; even Ctrl-C doesn't work.

@yamadashy yamadashy self-assigned this Dec 28, 2024
@yamadashy yamadashy added the bug Something isn't working label Dec 28, 2024
@yamadashy
Owner

Hi, @rockmandash !
Thank you for reporting this issue!

I've analyzed the problem and identified that the main thread is hanging during the processing. I plan to implement non-blocking operations similar to what we're already using in the security check functionality.

I'll work on the fix and update you soon.

@yamadashy
Owner

@rockmandash
This issue has also been fixed and released in v0.2.7! Please try it out.

https://github.com/yamadashy/repomix/releases/tag/v0.2.7

@rockmandash
Author

Unfortunately it didn't help; my terminal completely freezes!

It's stuck at ⠦ Processing file... (973/1578) api/index.php.develop

My repo is around 15 million tokens.

@rockmandash
Author

I just ran repomix on your repomix repo and asked o1 pro; here is the response:

Handling Performance Issues in Repomix

It’s very likely that the combination of removeComments and removeEmptyLines is causing a huge amount of per-file text manipulation, which in turn makes Node.js appear to “hang” (in reality it’s just locked up processing). In particular, the Python docstring/comment removal code is known to be somewhat slow and can even end up stuck in an edge case if certain triple-quote patterns appear multiple times on the same line. When you have a codebase large enough to generate 15+ million tokens, the overhead is tremendous.

Main Points and Potential Fixes

1. Why It “Hangs” or Takes So Long

Huge Input Size

With a 15 million–token codebase, you probably have hundreds of thousands (or even millions) of lines of code. Doing comment/empty-line stripping across all files is a lot of text processing in JavaScript.

Python Docstring Removal

If your repository includes Python code, the docstring removal logic in fileManipulate.ts is fairly complicated. It does multi-pass scanning for triple quotes (like """ and ''') and can get stuck in large files that have unusual quoting patterns. If the code is pinned in a problematic logic path for certain lines, it may never break out (effectively an infinite loop).

High Concurrency

Repomix internally uses p-map with concurrency set to CPU count - 1. If your machine has 8+ cores, it might be spawning 7 parallel tasks each processing extremely large files. On top of that, removing comments line-by-line is CPU heavy. All that concurrency can overwhelm your system’s CPU or memory.

Event Loop Starvation

Because all these operations are basically CPU-bound JavaScript loops, the main thread can get starved. That’s why even Ctrl-C doesn’t respond right away—Node.js can’t handle signals promptly if it’s in the middle of a tight loop.

2. Potential Solutions

A) Turn Off removeComments or removeEmptyLines

If you only occasionally need them, try disabling both to see if Repomix can complete. For extremely large repositories, many users find it more practical to skip comments/empty-line removal in one pass and only do it later for smaller subsets of files.

In your repomix.config.json:

{
  "output": {
    "removeComments": false,
    "removeEmptyLines": false
  }
}

Or via CLI flags (the inverse of your usage):

repomix --no-remove-comments --no-remove-empty-lines

(Repomix currently doesn’t support these exact flags, but you get the idea—removing those lines from your config or leaving them off is enough.)

B) Process Only a Subset of the Files

If you truly need comments removed, try doing it in smaller chunks. For instance, process only src/ first, then tests/, etc., or use --include to narrow your scope:

repomix src --include "src/**/*.py"

After it succeeds on partial sets, combine them or do further merges.

C) Investigate the Python Docstring Logic

The function removeDocStrings() in fileManipulate.ts can get stuck in tricky triple-quote patterns (especially if you have weird combinations of """ and ''' in a single line). You could:

  • Disable Python docstring removal temporarily by removing or commenting out the relevant lines in fileManipulate.ts (the PythonManipulator).
  • Add logging to see which file(s) it’s stuck on. If there is a single file with a pathological string, you can exclude that file from your Repomix run.

D) Lower Concurrency

Repomix uses p-map with concurrency = getProcessConcurrency(), which is basically #CPUs - 1. On large codebases, that might be too high. You can force fewer concurrency threads by editing (or patching) processConcurrency.ts to return a smaller number (like 2). This will run fewer files in parallel, which can help reduce CPU thrashing.

E) Watch Memory Usage

With ~15 million tokens, the full content is enormous. If Node runs out of available RAM or starts heavily swapping, it can appear locked up. Make sure you have enough free memory, or run Node with increased heap size:

node --max-old-space-size=8192 bin/repomix.cjs

(Adjust the number as needed—8GB, 10GB, etc.)

3. Practical Recommendations

Most Common Fix: Temporarily disable comment/empty-line removal in repomix.config.json. This is the quickest way to get a successful run on a huge codebase.

If You Must Remove Comments:

  • Try restricting include or concurrency to lighten the load.
  • Consider patching out the Python docstring logic (or skipping Python files).

Split the Repository:

If you don’t really need the entire codebase in a single file, break it up. Feed LLMs in partial chunks (e.g., front-end, back-end, etc.).

Check for a Single Problematic File:

Sometimes a single file with multiple triple-quoted strings can cause PythonManipulator to spin in an infinite loop. Logging or removing that file from the run can be a quick fix.

Summary

  • Cause: The removeComments + removeEmptyLines steps do heavy text manipulation. Python docstring removal is especially complicated and can trigger infinite loops or extreme CPU usage on large files.
  • Solution: Try disabling these features or processing smaller sets of files. If you really need them, do a deeper dive to debug Python docstring removal or patch concurrency.

Hopefully that helps you get Repomix running again on your large codebase! If you narrow it down to a specific infinite-loop scenario in the Python manipulator, please consider opening an issue on GitHub so we can address it in a future release.

@rockmandash
Author

I suggest completely removing these two options, since they cause the terminal to freeze, and taking the following approach instead:

  1. Clone the entire repo to the OS temp dir, using a lib like https://www.npmjs.com/package/temp-dir
  2. Process files one by one using streams, in order to handle extremely large files
  3. Use a language-specific minify tool for better performance

@yamadashy
Owner

yamadashy commented Dec 31, 2024

Thanks for the detailed report, @rockmandash!

While the process seems to be getting stuck at api/index.php.develop, it's hard to say with certainty if that's the sole cause. However, the fact that your repo includes Python code is very relevant.

Current handling for large Python files can indeed be a bottleneck. The existing implementation may cause some performance issues, and can sometimes cause the terminal to freeze.

Regarding the use of a temporary directory for repos, I'm a bit skeptical that it alone would solve the issue, but copying large files to a temp directory and then processing them with streams might be a good approach. However, that's a big change, so for a faster fix, using worker threads to process files outside the main thread will likely be a more effective solution.

We are also exploring using tree-sitter in the future, which might improve processing performance even further.
