detect-secrets not identifying all Github token occurrences in a file #858

Open
1 task done
karamuz opened this issue Jun 20, 2024 · 0 comments

karamuz commented Jun 20, 2024

  • I'm submitting a ...

    • bug report
  • What is the current behavior?

For example, given the file test_ghp.txt:

GITHUB_USERID=11111111
PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1
GITHUB_USERID=99999999
PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q

When I scan the file, I get these results:

  "results": {
    "test_ghp.txt": [
      {
        "type": "GitHub Token",
        "filename": "fast.txt",
        "hashed_secret": "e175c6f5f2a92e8623bd9a4820edb4e8c1b0fd10",
        "is_verified": false,
        "line_number": 2
      }
    ]
  },
  "generated_at": "2024-06-20T12:54:36Z"

As referenced in #493, if the same secret is written into a file at multiple locations, only the first one is identified by detect-secrets. The problem here is that multiple GitHub tokens with different values in the same file are still treated as if they were the same secret.

In the regular expression used here:

(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}

there is one capturing group: (ghp|gho|ghu|ghs|ghr). This group matches and captures the prefix part of a GitHub token.

Because of this capturing group, when findall() processes a string matching this pattern, it does not return the entire match ("ghp_...36 characters..."). Instead, it returns only the part of the match that corresponds to the capturing group, which for the test file above is "ghp" for both tokens (and would be "gho", "ghu", etc. for other token types).

Example:
If you were to run findall() on a string like "Test ghp_abc123...", given the regex above, the output would be:

['ghp']  # Instead of ['ghp_abc123...']

This output occurs because, when the pattern contains a capturing group, findall() returns the group's text rather than the entire match.
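To make the behavior concrete, here is a minimal sketch using only Python's standard `re` module and the two tokens from test_ghp.txt. It does not call detect-secrets itself; the line-by-line loop is an assumption made for illustration, and the names `GITHUB_TOKEN`, `lines`, `found`, and `distinct` are hypothetical.

```python
import re

# Pattern as quoted in this issue; the prefix alternation is a capturing group.
GITHUB_TOKEN = re.compile(r'(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')

# The two token lines from test_ghp.txt.
lines = [
    'PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1',
    'PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q',
]

# With exactly one capturing group, findall() returns the group text only.
found = [value for line in lines for value in GITHUB_TOKEN.findall(line)]
print(found)  # ['ghp', 'ghp']

# Two different tokens collapse to one distinct value.
distinct = set(found)
print(len(distinct))  # 1
```

Both tokens reduce to the same string, which would be consistent with the single result reported above if secrets are deduplicated by value.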

  • What is the expected behavior?

The expected behavior would be to capture all the different secrets in a file.

  • Please tell us about your environment:

    • detect-secrets Version: 1.5.0
    • Python Version: 3.12.4
    • OS Version: macOS Sonoma 14.4.1
  • Other information

In the analyze_string function, using finditer() instead of findall() might solve the issue, since it makes it possible to retrieve the entire matching string:

for match in regex.finditer(string):
    yield match.group(0) # Returns the entire matched string

finditer() yields match objects from which you can extract specific groups or the entire match (via match.group(0)), providing flexibility and precision in handling regex matches.
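A self-contained sketch of that suggestion, again using only the standard `re` module. The helper `iter_full_matches` is hypothetical and is not the detect-secrets implementation. The second pattern shows an alternative that is not part of the original proposal: making the group non-capturing with `(?:...)`, so that findall() returns whole matches.

```python
import re
from typing import Iterator

# Pattern as quoted in this issue (with a capturing group).
GITHUB_TOKEN = re.compile(r'(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')


def iter_full_matches(regex: re.Pattern, string: str) -> Iterator[str]:
    """Hypothetical helper: yield the whole matched text for every match."""
    for match in regex.finditer(string):
        yield match.group(0)  # group(0) is the entire match, not just the prefix


# Contents of test_ghp.txt from this issue.
text = (
    'GITHUB_USERID=11111111\n'
    'PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1\n'
    'GITHUB_USERID=99999999\n'
    'PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q\n'
)

tokens = list(iter_full_matches(GITHUB_TOKEN, text))
print(tokens)

# Alternative (assumption, not from the issue): a non-capturing group
# makes findall() return the full match as well.
NON_CAPTURING = re.compile(r'(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')
tokens_alt = NON_CAPTURING.findall(text)
```

Either approach yields two distinct token strings for the test file. Which one fits better depends on whether other detectors rely on capturing groups to select part of a match.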
