Add file truncation #149

klntsky · 2024-11-02T13:46:29Z

The use case is: I have multiple JSON data files. I want to include them in the LLM input, but only to show their structure, not the contents. I'd like to be able to specify that I just want to include the first N lines.

yamadashy · 2024-11-03T03:10:51Z

Hi @klntsky!

I'm thinking of implementing this with a new process config option. Does this kind of structure match what you had in mind?

repomix.config.json

{
  "output": {
    // ... output config
  },
  "process": {
    "maxLines": 100,             // Default limit for all files
    "patterns": [
      {
        "pattern": "**/*.json",  // Special limits for JSON files
        "maxLines": 20
      }
    ]
  }
}

The output would look like:

{
  "users": [
    {
      "id": 1,
      "name": "John"
    }
  ]
... (truncated)
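
To make the intended behavior concrete, here is a minimal sketch of line-based truncation. `truncateLines` is a hypothetical helper written for illustration, not an existing repomix function, and the `... (truncated)` marker is simply copied from the example output above.

```typescript
// Hypothetical helper (not part of repomix): keep only the first maxLines lines.
function truncateLines(content: string, maxLines: number): string {
  const lines = content.split("\n");
  if (lines.length <= maxLines) {
    return content; // short enough, leave the file untouched
  }
  // Keep the head of the file and append the marker used in the example output.
  return [...lines.slice(0, maxLines), "... (truncated)"].join("\n");
}

const sample = [
  "{",
  '  "users": [',
  "    {",
  '      "id": 1,',
  '      "name": "John"',
  "    }",
  "  ]",
  "}",
].join("\n");

// With a limit of 7 lines, the closing brace is dropped and the marker is appended.
console.log(truncateLines(sample, 7));
```

Files at or under the limit are returned unchanged, so the marker only appears when something was actually cut.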

Let me know if this is heading in the right direction!

klntsky · 2024-11-03T05:49:09Z

In some cases it may be useful to limit characters or words rather than lines (e.g. for unformatted JSON). Maybe all three should be configurable?

yamadashy · 2024-11-03T07:59:12Z

@klntsky
If I'm understanding your intention correctly, I think the underlying issue here is that including entire file contents can consume a large number of tokens, which is a common problem for projects using repomix with LLMs.

Given this context and considering how LLMs process text, I think focusing on token count would be the most appropriate approach initially. Something like:

{
  "process": {
    "maxTokens": 1000,          // Global token limit
    "patterns": [
      {
        "pattern": "**/*.json",  
        "maxTokens": 500        // Pattern-specific token limit
      }
    ]
  }
}

I'd like to start with this simpler requirement to minimize potential bugs.

What do you think about this approach?
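
As a rough sketch of how such a config might be applied: the first matching pattern overrides the global limit, and content is cut once the token budget is exceeded. Everything here is an assumption made for illustration. `globToRegExp`, `resolveMaxTokens`, and `truncateTokens` are hypothetical names, the glob matcher only understands `**/` and `*`, and whitespace-separated chunks stand in for real tokens (an actual implementation would use a proper tokenizer and glob library).

```typescript
interface PatternLimit {
  pattern: string;
  maxTokens: number;
}

interface ProcessConfig {
  maxTokens: number;
  patterns: PatternLimit[];
}

// Simplified glob matcher (assumption): supports only "**/" and "*".
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:.*/)?"; // any number of directories, including none
      i += 3;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // any characters within a single path segment
      i += 1;
    } else {
      // Escape regex metacharacters in literal parts of the pattern.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// The first matching pattern wins; otherwise fall back to the global limit.
function resolveMaxTokens(path: string, config: ProcessConfig): number {
  for (const entry of config.patterns) {
    if (globToRegExp(entry.pattern).test(path)) {
      return entry.maxTokens;
    }
  }
  return config.maxTokens;
}

// Stand-in tokenizer (assumption): each run of non-whitespace counts as one token.
function truncateTokens(content: string, maxTokens: number): string {
  const tokenPattern = /\S+/g;
  let count = 0;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(content)) !== null) {
    count += 1;
    if (count > maxTokens) {
      // Cut just before the first token that exceeds the budget.
      return content.slice(0, match.index).replace(/\s+$/, "") + "\n... (truncated)";
    }
  }
  return content;
}

const config: ProcessConfig = {
  maxTokens: 1000,
  patterns: [{ pattern: "**/*.json", maxTokens: 500 }],
};

console.log(resolveMaxTokens("data/users.json", config));
console.log(resolveMaxTokens("src/index.ts", config));
```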

klntsky · 2024-11-03T14:21:06Z

Yep, token limits seem to cover both cases, but I'd like to have line limits too, because it's not immediately clear how many tokens a given part of a file contains, while lines can be counted visually.

yamadashy · 2024-11-04T15:22:17Z

That makes sense.
We could support both maxLines and maxTokens, truncating when either limit is reached.
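
A minimal sketch of the "whichever limit is hit first" behavior, under stated assumptions: `truncate` and `countTokens` are hypothetical helpers, tokens are approximated by whitespace-separated chunks rather than a real tokenizer, and cutting always happens on a line boundary so the result stays easy to inspect.

```typescript
interface Limits {
  maxLines?: number;
  maxTokens?: number;
}

// Stand-in token counter (assumption): counts runs of non-whitespace characters.
function countTokens(text: string): number {
  const matches = text.match(/\S+/g);
  return matches ? matches.length : 0;
}

// Keep whole lines until adding the next one would break either limit.
function truncate(content: string, limits: Limits): string {
  const maxLines = limits.maxLines ?? Infinity;
  const maxTokens = limits.maxTokens ?? Infinity;
  const kept: string[] = [];
  let tokens = 0;

  for (const line of content.split("\n")) {
    const lineTokens = countTokens(line);
    if (kept.length >= maxLines || tokens + lineTokens > maxTokens) {
      return kept.concat("... (truncated)").join("\n");
    }
    kept.push(line);
    tokens += lineTokens;
  }
  return content; // neither limit was reached
}

const text = "a b\nc d\ne f";
console.log(truncate(text, { maxLines: 2 }));
console.log(truncate(text, { maxTokens: 3 }));
```

Leaving a limit unset treats it as unbounded, so the same function covers the lines-only and tokens-only cases.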

Let me think about this a bit more.

yamadashy added the enhancement label (New feature or request) on Nov 6, 2024
