very slow processing speed for lines with a large number of consecutive spaces #868

Open
kagahd opened this issue Jul 19, 2024 · 0 comments


  • I'm submitting a bug report

  • What is the current behavior?

Adding Jupyter notebook files (.ipynb) to my codebase increased the execution time of detect-secrets from a few seconds to almost an hour. The CI/CD pipeline was barely usable because detect-secrets was so slow.

I analyzed the problem and found that the slowdown is caused by lines with a large number of consecutive spaces.
Jupyter notebooks may contain hundreds of such "problematic" lines, each with over 800 consecutive spaces and ending with a quote or a comma.

To reproduce the issue, I generated a file in which each line has 100 more spaces than the previous line, ending with some ASCII characters.
Each of those lines is extracted into its own file within a dedicated folder, which detect-secrets then has to analyze.
As the following output shows, detect-secrets takes almost the same time (0.2 seconds) for lines that contain up to 400 consecutive spaces.
However, it needs more than ten times as long (2.2 seconds) for a line with 1000 consecutive spaces.

secrets-scanner@26f48f23a489:/tmp/notebooks/spaces$ ./scan.sh 
Scanning file 1 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_aa
Time for /tmp/tmp.cTxM6THbjX/split_notebook_aa: 0.2 seconds
Scanning file 2 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ab
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ab: 0.2 seconds
Scanning file 3 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ac
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ac: 0.2 seconds
Scanning file 4 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ad
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ad: 0.2 seconds
Scanning file 5 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ae
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ae: 0.3 seconds
Scanning file 6 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_af
Time for /tmp/tmp.cTxM6THbjX/split_notebook_af: 0.5 seconds
Scanning file 7 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ag
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ag: 0.8 seconds
Scanning file 8 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ah
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ah: 1.1 seconds
Scanning file 9 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ai
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ai: 1.6 seconds
Scanning file 10 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_aj
Time for /tmp/tmp.cTxM6THbjX/split_notebook_aj: 2.2 seconds
  • If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem

Create a file create_notebook_file.sh with the following content:

#!/bin/bash

output_file="notebook.ipynb"
> $output_file
spaces=""

for i in {1..10}
do
  echo "${spaces}foo" >> $output_file
  spaces+="                                                                                                    "
done

Make it executable: chmod +x create_notebook_file.sh
Create a file scan.sh with the following content:

#!/bin/bash

NOTEBOOK="notebook.ipynb"
LINES_PER_FILE=1

TEMP_DIR=$(mktemp -d)
TEMP_RESULTS=$(mktemp)

split -l $LINES_PER_FILE "$NOTEBOOK" "$TEMP_DIR/split_notebook_"

total_files=$(ls "$TEMP_DIR"/split_notebook_* | wc -l)
current_file=0

measure_scan_time() {
    local file=$1
    local dir=$2
    mkdir -p "$dir"
    mv "$file" "$dir"
    start_time=$(date +%s%N)
    detect-secrets -C "$dir" scan --all-files &>/dev/null
    end_time=$(date +%s%N)
    duration=$((end_time - start_time))
    echo "$duration $file"
}

for file in "$TEMP_DIR"/split_notebook_*; do
    current_file=$((current_file + 1))
    dir="$TEMP_DIR/dir_$current_file"
    echo "Scanning file $current_file of $total_files: $file"
    scan_time=$(measure_scan_time "$file" "$dir")
    echo "$scan_time" >> "$TEMP_RESULTS"
    echo "Time for $file: $(echo $scan_time | awk '{printf "%.1f", $1/1000000000}') seconds"
done

Make it executable: chmod +x scan.sh
Execute the scripts:

  • ./create_notebook_file.sh
  • ./scan.sh
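
The generator script above hard-codes a 100-space literal, which is easy to miscount or damage when copying. As a minimal sketch, assuming bash, the same kind of file could be produced with a parametrized helper. The names make_line and make_notebook are hypothetical and not part of the original report:

```shell
#!/bin/bash

# Hypothetical helper (not from the original report): print a line made of
# N spaces followed by a suffix (default "foo").
make_line() {
  local n=$1 suffix=${2:-foo}
  # %*s pads an empty string to width n, which yields exactly n spaces.
  printf '%*s%s\n' "$n" '' "$suffix"
}

# Hypothetical helper: write COUNT lines with 0, STEP, 2*STEP, ... spaces.
make_notebook() {
  local output_file=$1 count=${2:-10} step=${3:-100}
  local i
  : > "$output_file"   # truncate or create the output file
  for ((i = 0; i < count; i++)); do
    make_line $((i * step)) >> "$output_file"
  done
}
```

Calling make_notebook notebook.ipynb 10 100 should give a file comparable to the one created by create_notebook_file.sh, and a larger step or count makes it easy to probe longer runs of spaces.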
  • What is the expected behavior?

I'd expect the runtime to grow linearly, i.e. O(n), for a line with n consecutive spaces followed by a non-space ASCII character.
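
One rough way to check this expectation against the measurements: each scanned file adds the same number of spaces, so under O(n) the increase in scan time from one file to the next should stay roughly constant. The sketch below, assuming bash and awk, prints those increases. The name timing_deltas is hypothetical:

```shell
#!/bin/bash

# Hypothetical helper: read one timing in seconds per line on stdin and
# print the increase relative to the previous line.
timing_deltas() {
  awk 'NR > 1 { printf "%.1f\n", $1 - prev } { prev = $1 }'
}
```

Feeding it the ten timings reported above, for example with printf '%s\n' 0.2 0.2 0.2 0.2 0.3 0.5 0.8 1.1 1.6 2.2 | timing_deltas, gives increases that rise from 0.0 to 0.6 seconds rather than staying flat. That is consistent with superlinear growth, although the timings are rounded to 0.1 seconds and include process startup, so this is only an indication.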

  • What is the motivation / use case for changing the behavior?

We are not able to use detect-secrets in our CI/CD pipeline if it takes so long to execute.

  • Please tell us about your environment:
    • detect-secrets Version: 1.5.0
    • Python Version: 3.12.4
    • OS Version: macOS Sonoma 14.5
    • File type (if applicable): Jupyter notebook files (.ipynb)