Add file truncation #149

Open
klntsky opened this issue Nov 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@klntsky

klntsky commented Nov 2, 2024

The use case is: I have multiple JSON data files. I want to include them in the LLM input, but only to show their structure, not the contents. I'd like to be able to specify that I just want to include the first N lines.

@yamadashy
Owner

yamadashy commented Nov 3, 2024

Hi @klntsky!

I'm thinking of implementing this with a new process config option. Does this kind of structure match what you had in mind?

repomix.config.json

{
  "output": {
    // ... output config
  },
  "process": {
    "maxLines": 100,             // Default limit for all files
    "patterns": [
      {
        "pattern": "**/*.json",  // Special limits for JSON files
        "maxLines": 20
      }
    ]
  }
}

The output would look like:

{
  "users": [
    {
      "id": 1,
      "name": "John"
    }
  ]
... (truncated)

Let me know if this is heading in the right direction!
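
To make the proposed `maxLines` behaviour concrete, here is a minimal sketch of line-based truncation. The function name `truncateLines` and the `marker` parameter are hypothetical and are not part of repomix; the marker text is taken from the sample output above.

```typescript
// Hypothetical helper (not a repomix API): keep the first `maxLines` lines
// and append a marker line when anything was dropped.
function truncateLines(
  content: string,
  maxLines: number,
  marker: string = "... (truncated)",
): string {
  const lines = content.split("\n");
  if (lines.length <= maxLines) {
    // Nothing to cut, so return the content unchanged (no marker).
    return content;
  }
  return [...lines.slice(0, maxLines), marker].join("\n");
}

// Usage: a three-line input limited to two lines.
const preview = truncateLines("{\n  \"users\": [\n    1\n  ]\n}", 2);
console.log(preview);
```

Note that a file ending in a newline counts its empty final segment as a line in this sketch, so a real implementation may want to normalise trailing newlines first.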

@klntsky
Author

klntsky commented Nov 3, 2024

In some cases it may be useful to limit characters or words rather than lines (e.g. for unformatted JSON, where the whole file is a single line). Maybe all three should be configurable?

@yamadashy
Owner

yamadashy commented Nov 3, 2024

@klntsky
If I'm understanding your intention correctly, I think the underlying issue here is that including entire file contents can consume a large number of tokens, which is a common problem for projects using repomix with LLMs.

Given this context and considering how LLMs process text, I think focusing on token count would be the most appropriate approach initially. Something like:

{
  "process": {
    "maxTokens": 1000,          // Global token limit
    "patterns": [
      {
        "pattern": "**/*.json",  
        "maxTokens": 500        // Pattern-specific token limit
      }
    ]
  }
}

I'd like to start with this simpler requirement to minimize potential bugs.

What do you think about this approach?
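
One way the global limit and the per-pattern limits could be resolved for a given file is sketched below. This assumes the first matching pattern wins and the global `maxTokens` is the fallback; the thread does not specify the precedence, so treat that as an assumption. The names `resolveMaxTokens`, `ProcessConfig` and `Matcher` are hypothetical, and glob matching is passed in as a function rather than tied to any particular library.

```typescript
// Shape of the proposed "process" section (field names from the proposal above).
interface PatternLimit {
  pattern: string;
  maxTokens: number;
}

interface ProcessConfig {
  maxTokens?: number;
  patterns?: PatternLimit[];
}

// Hypothetical: the caller supplies the glob matcher (for example a wrapper
// around whichever glob library the project already uses).
type Matcher = (pattern: string, filePath: string) => boolean;

// Assumption: the first matching pattern wins; otherwise use the global limit.
// Returns undefined when no limit applies at all.
function resolveMaxTokens(
  filePath: string,
  config: ProcessConfig,
  matches: Matcher,
): number | undefined {
  for (const entry of config.patterns ?? []) {
    if (matches(entry.pattern, filePath)) {
      return entry.maxTokens;
    }
  }
  return config.maxTokens;
}

// Usage with a deliberately naive stand-in matcher that only understands
// the "**/*.json" pattern from the example config.
const naiveMatcher: Matcher = (pattern, filePath) =>
  pattern === "**/*.json" && filePath.endsWith(".json");

const exampleConfig: ProcessConfig = {
  maxTokens: 1000,
  patterns: [{ pattern: "**/*.json", maxTokens: 500 }],
};

console.log(resolveMaxTokens("data/users.json", exampleConfig, naiveMatcher));
console.log(resolveMaxTokens("src/index.ts", exampleConfig, naiveMatcher));
```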

@klntsky
Author

klntsky commented Nov 3, 2024

Yep, token limits seem to cover both cases, but I'd like to have line limits too: it's not immediately clear how many tokens a given part of a file contains, whereas lines can be counted visually.

@yamadashy
Owner

That makes sense.
We could support both maxLines and maxTokens, truncating when either limit is reached.

Let me think about this a bit more.
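
A rough sketch of the "truncate when either limit is reached" idea follows. It is an illustration under stated assumptions, not a proposed implementation: `truncateByLimits` and `Limits` are hypothetical names, the token counter is injected by the caller, and tokens are summed per line, which only approximates what a real tokenizer would report for the whole text.

```typescript
interface Limits {
  maxLines?: number;
  maxTokens?: number;
}

// Hypothetical helper: keep whole lines until adding the next one would
// exceed either limit, then append a marker line.
function truncateByLimits(
  content: string,
  limits: Limits,
  countTokens: (text: string) => number,
): string {
  const lines = content.split("\n");
  const kept: string[] = [];
  let tokens = 0;

  for (const line of lines) {
    // Stop once the line budget is used up.
    if (limits.maxLines !== undefined && kept.length >= limits.maxLines) {
      break;
    }
    // Stop if this line would push the running total over the token budget.
    const lineTokens = countTokens(line);
    if (limits.maxTokens !== undefined && tokens + lineTokens > limits.maxTokens) {
      break;
    }
    kept.push(line);
    tokens += lineTokens;
  }

  if (kept.length === lines.length) {
    return content; // Nothing was dropped, so no marker is added.
  }
  return [...kept, "... (truncated)"].join("\n");
}

// Stand-in counter for the example: whitespace-separated words, NOT real tokens.
const countWords = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

console.log(truncateByLimits("a b\nc d\ne f", { maxLines: 5, maxTokens: 4 }, countWords));
```

Because this sketch cuts only at line boundaries, a single-line file (such as the unformatted JSON mentioned earlier) that exceeds `maxTokens` would be reduced to just the marker; handling that case would need cutting within a line.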

yamadashy added the enhancement (New feature or request) label on Nov 6, 2024