Data Corruption on dcgm_fi_dev_gpu_util Metric #199

Open
TortoiseHam opened this issue Nov 13, 2024 · 3 comments

@TortoiseHam

Hi all,

Currently using dcgm_fi_dev_gpu_util to monitor GPU utilization, but running into an issue where it will occasionally spit out a data point that isn't between 0 and 100. The highest observed value was 4294967295 (the maximum of a UINT32, which might be a hint), but most often it's in the range of 1k to 200k. This appears to happen both when there is load on the GPUs and when the GPUs are sitting at 0% before and after the erroneous data point. Has anyone else encountered problems with this metric?

I've seen it suggested elsewhere that there's a newer DCGM_FI_PROF_GR_ENGINE_ACTIVE metric which might replace it, but I don't know whether the root cause here is the metric itself or something in the collection code. Does anyone know whether collecting the 'prof' metric would incur a greater performance penalty than the 'dev' metric?

Thanks!

(cross post of NVIDIA/go-dcgm#75, since I'm not sure whether this is a problem with the metric itself or with the Go wrapper being used to extract it)
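
In case it helps anyone compare the two metrics, here is a minimal sketch of sampling both fields side by side with dcgmi. It assumes field 203 is DCGM_FI_DEV_GPU_UTIL and field 1001 is DCGM_FI_PROF_GR_ENGINE_ACTIVE; the IDs are worth verifying with dcgmi dmon -l against your own DCGM version:

#!/bin/bash
# Sample the legacy utilization field and the prof-based engine-active field
# together so the two can be eyeballed against each other.
# Assumed field IDs: 203 = DCGM_FI_DEV_GPU_UTIL, 1001 = DCGM_FI_PROF_GR_ENGINE_ACTIVE
# (verify with: dcgmi dmon -l)
dcgmi dmon -e 203,1001 -d 1000 -c 60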

@TortoiseHam
Author

Confirmed that this problem is with dcgmi directly, and not with the Go wrapper translation layer. To reproduce, you can run the following script as a background job for a few days:

#!/bin/bash

output_file="gpu_high_values.log"

echo "Monitoring GPU utilization..."
echo "Logging anomolies to $output_file"
echo "Monitoring started at $(date)" >> "$output_file"

while true; do
	# Field 203 = DCGM_FI_DEV_GPU_UTIL; --count 1 collects a single sample
	output=$(dcgmi dmon -e 203 --count 1)

	# Process each line
	echo "$output" | while read -r line; do
		# Extract GPU ID and Value
		if [[ $line =~ GPU[[:space:]]+[0-9]+[[:space:]]+([0-9]+) ]]; then
			value="${BASH_REMATCH[1]}"  # extract numeric value
			if (( value > 100 )); then
				echo "Value > 100 detected at $(date):" >> "$output_file"
				echo "$line" >> "$output_file"
				echo "---" >> "$output_file"
				echo "Value > 100 detected! Logged the output."
				exit 0  # exits only this piped subshell; the outer loop keeps sampling
			fi
		fi
	done
done
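
To run it unattended as described, something like the following should work (the filename here is just illustrative):

chmod +x monitor_gpu_util.sh
nohup ./monitor_gpu_util.sh > monitor.log 2>&1 &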

@nikkon-dev
Collaborator

@TortoiseHam,

Could you check whether the values you get are derived from DCGM_INT32_BLANK (DCGM_INT32_NOT_FOUND, DCGM_INT32_NOT_SUPPORTED, etc.)?
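
For reference, here's a rough way to check candidate readings against the blank sentinels; the constants are copied from dcgm_structs.h and are worth double-checking against the headers shipped with your DCGM version:

#!/bin/bash
# Compare suspect readings against DCGM's "blank" sentinel values.
# Constants assumed from dcgm_structs.h; verify against your installed headers.
DCGM_INT32_BLANK=2147483632                      # 0x7ffffff0
DCGM_INT32_NOT_FOUND=$((DCGM_INT32_BLANK + 1))
DCGM_INT32_NOT_SUPPORTED=$((DCGM_INT32_BLANK + 2))
DCGM_INT32_NOT_PERMISSIONED=$((DCGM_INT32_BLANK + 3))
DCGM_INT64_BLANK=9223372036854775792             # 0x7ffffffffffffff0

for v in "$@"; do
	case "$v" in
		"$DCGM_INT32_BLANK")            echo "$v = DCGM_INT32_BLANK" ;;
		"$DCGM_INT32_NOT_FOUND")        echo "$v = DCGM_INT32_NOT_FOUND" ;;
		"$DCGM_INT32_NOT_SUPPORTED")    echo "$v = DCGM_INT32_NOT_SUPPORTED" ;;
		"$DCGM_INT32_NOT_PERMISSIONED") echo "$v = DCGM_INT32_NOT_PERMISSIONED" ;;
		"$DCGM_INT64_BLANK")            echo "$v = DCGM_INT64_BLANK" ;;
		*)                              echo "$v does not match a known blank sentinel" ;;
	esac
done

Usage (script name is made up): ./check_blanks.sh 345 1980 4294967295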

@TortoiseHam
Author

@nikkon-dev, interesting thought, but it seems like the values are more diverse than that. In the past 14 days I'm seeing:

345
1980
70281
75902
116170
200000
249389
625512
632707
637317
637805
662389
667418
4294967295

The only one that would match something from that list would be DCGM_INT64_BLANK.
