very slow processing speed for lines with a large number of consecutive spaces #868

kagahd · 2024-07-19T10:01:46Z

I'm submitting a bug report
What is the current behavior?

Adding Jupyter notebook files (.ipynb) to my codebase increased the execution time of detect-secrets from a few seconds to almost an hour. The CI/CD pipeline was barely usable because detect-secrets was so slow.

I analyzed the problem and found that the slowdown is caused by lines with a large number of consecutive spaces.
Jupyter notebooks may contain hundreds of such "problematic" lines, each with over 800 consecutive spaces and ending with a quote or a comma.
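To check whether a given notebook contains such lines, something like the following could be used. It is a minimal sketch, not part of detect-secrets: the function name `long_space_runs` and the default threshold of 800 are assumptions chosen for illustration. It uses awk rather than a `grep` interval expression because some `grep` implementations cap repetition counts below 800.

```shell
#!/bin/bash

# Hypothetical helper: print "<line number>: <longest run of spaces>" for
# every line whose longest run of consecutive spaces is >= the threshold.
# Usage: long_space_runs notebook.ipynb 800
long_space_runs() {
    local file=$1
    local threshold=${2:-800}
    awk -v t="$threshold" '
    {
        max = 0
        rest = $0
        # Walk through every run of spaces on the line, keeping the longest.
        while (match(rest, / +/)) {
            if (RLENGTH > max) max = RLENGTH
            rest = substr(rest, RSTART + RLENGTH)
        }
        if (max >= t) print NR ": " max
    }' "$file"
}
```

Piping the output through `wc -l` gives the number of affected lines in the file.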

To reproduce the issue, I generated a file where each line has 100 more spaces than the previous line, ending with some ASCII characters.
Each line is then extracted into its own file in a dedicated folder, which detect-secrets has to scan.
As you can see in the following output, detect-secrets has almost the same execution time (0.2 seconds) for lines that contain up to 400 consecutive spaces.
However, detect-secrets needs more than ten times as much time (2.2 seconds) for a line with 1000 consecutive spaces.

secrets-scanner@26f48f23a489:/tmp/notebooks/spaces$ ./scan.sh 
Scanning file 1 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_aa
Time for /tmp/tmp.cTxM6THbjX/split_notebook_aa: 0.2 seconds
Scanning file 2 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ab
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ab: 0.2 seconds
Scanning file 3 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ac
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ac: 0.2 seconds
Scanning file 4 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ad
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ad: 0.2 seconds
Scanning file 5 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ae
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ae: 0.3 seconds
Scanning file 6 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_af
Time for /tmp/tmp.cTxM6THbjX/split_notebook_af: 0.5 seconds
Scanning file 7 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ag
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ag: 0.8 seconds
Scanning file 8 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ah
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ah: 1.1 seconds
Scanning file 9 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_ai
Time for /tmp/tmp.cTxM6THbjX/split_notebook_ai: 1.6 seconds
Scanning file 10 of 10: /tmp/tmp.cTxM6THbjX/split_notebook_aj
Time for /tmp/tmp.cTxM6THbjX/split_notebook_aj: 2.2 seconds

If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem

Create a file create_notebook_file.sh with the following content:

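#!/bin/bash

output_file="notebook.ipynb"
# Start with an empty output file.
: > "$output_file"
spaces=""

for i in {1..10}
do
  echo "${spaces}foo" >> "$output_file"
  # Add 100 more spaces for the next line.
  spaces+="$(printf '%100s' '')"
done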

Make it executable: chmod +x create_notebook_file.sh
Create a file scan.sh with the following content:

#!/bin/bash

NOTEBOOK="notebook.ipynb"
LINES_PER_FILE=1

TEMP_DIR=$(mktemp -d)
TEMP_RESULTS=$(mktemp)

split -l $LINES_PER_FILE "$NOTEBOOK" "$TEMP_DIR/split_notebook_"

total_files=$(ls "$TEMP_DIR"/split_notebook_* | wc -l)
current_file=0

measure_scan_time() {
    local file=$1
    local dir=$2
    mkdir -p "$dir"
    mv "$file" "$dir"
    start_time=$(date +%s%N)
    detect-secrets -C "$dir" scan --all-files &>/dev/null
    end_time=$(date +%s%N)
    duration=$((end_time - start_time))
    echo "$duration $file"
}

for file in "$TEMP_DIR"/split_notebook_*; do
    current_file=$((current_file + 1))
    dir="$TEMP_DIR/dir_$current_file"
    echo "Scanning file $current_file of $total_files: $file"
    scan_time=$(measure_scan_time "$file" "$dir")
    echo "$scan_time" >> "$TEMP_RESULTS"
    echo "Time for $file: $(echo $scan_time | awk '{printf "%.1f", $1/1000000000}') seconds"
done

Make it executable: chmod +x scan.sh
Execute the scripts:

./create_notebook_file.sh
./scan.sh
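The scan.sh script relies on `date +%s%N`, which is a GNU extension and may not be available with the BSD `date` shipped on macOS. A more portable variant, sketched here under that assumption, uses bash's `time` keyword instead. The helper names `make_space_line` and `time_scan_for_spaces` and the file name `line.txt` are made up for illustration; the `detect-secrets` invocation is the same one used in scan.sh.

```shell
#!/bin/bash

# Print one line consisting of N spaces followed by "foo".
make_space_line() {
    printf '%*sfoo\n' "$1" ''
}

# Scan a fresh directory that holds a single line with N spaces and
# print "<N> <elapsed seconds>". Requires detect-secrets on PATH.
# Usage: for n in 100 200 400 800; do time_scan_for_spaces "$n"; done
time_scan_for_spaces() {
    local n=$1 dir seconds
    dir=$(mktemp -d) || return 1
    make_space_line "$n" > "$dir/line.txt"
    # %R makes the time keyword print only the elapsed real time.
    local TIMEFORMAT=%R
    seconds=$( { time detect-secrets -C "$dir" scan --all-files >/dev/null 2>&1; } 2>&1 )
    rm -rf "$dir"
    echo "$n $seconds"
}
```

Because the number of spaces is passed as an argument, line lengths other than the ten in notebook.ipynb can be timed without regenerating the file.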

What is the expected behavior?

I'd expect a runtime of O(n) for a line with n consecutive spaces ending with a non-space ASCII character.
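One rough way to compare measurements against this expectation is to estimate the exponent k in t ∝ n^k from two data points: a value near 1 would match O(n), a value near 2 would suggest quadratic growth. The sketch below is only an illustration. The function name `growth_exponent` and the input format (one "n seconds" pair per line on stdin) are assumptions, and the optional argument is a fixed startup cost in seconds to subtract from each measurement before comparing.

```shell
#!/bin/bash

# Hypothetical helper: estimate k in t ~ n^k from the first and last
# "n seconds" pairs read from stdin.
# Usage: printf '100 1\n200 4\n' | growth_exponent [baseline_seconds]
growth_exponent() {
    local baseline=${1:-0}
    awk -v b="$baseline" '
        # Store each pair, subtracting the fixed baseline from the time.
        NF >= 2 { n[++c] = $1; t[c] = $2 - b }
        END {
            if (c < 2 || n[1] <= 0 || n[c] <= 0 || n[c] == n[1] ||
                t[1] <= 0 || t[c] <= 0) {
                print "need at least two usable points" > "/dev/stderr"
                exit 1
            }
            # Slope of the line through the two points on a log-log scale.
            printf "%.2f\n", log(t[c] / t[1]) / log(n[c] / n[1])
        }'
}
```

With only a few coarse timings the estimate is noisy, so it is best read as a hint rather than a proof of the complexity class.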

What is the motivation / use case for changing the behavior?

We are not able to use detect-secrets in our CI/CD pipeline if it takes so long to execute.

Please tell us about your environment:
- detect-secrets Version: 1.5.0
- Python Version: 3.12.4
- OS Version: macOS Sonoma 14.5
- File type (if applicable): Jupyter notebook files (.ipynb)
