6800 crashing after applying DPM3 settings #21

Open
quasd opened this issue Dec 20, 2020 · 44 comments

Comments

@quasd

quasd commented Dec 20, 2020

After getting upp working I ended up finding this project and branch.

I am trying this with
https://aur.archlinux.org/packages/powerupp-git/
by modifying the branch to bignavi.

The GUI opens up just fine and I can load active settings. But if I click "Apply current" I get the following errors from the kernel, and the settings don't appear to change (reading from /sys/kernel/debug/dri/$index/amdgpu_pm_info).

Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000034, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
Dec 20 19:06:26 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: failed send message: TransferTableDram2Smu (19)         param: 0x00000000 response 0xffffffc2
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to transfer pptable to SMC!
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to setup smc hw!
Dec 20 19:06:28 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu reset failed, ret = -62

If I launch a game, I get a freeze fairly quickly.

Kernel in use

Linux quasd 5.9.14-1-ck #1 SMP PREEMPT Fri, 18 Dec 2020 06:58:44 +0000 x86_64 GNU/Linux

linux-firmware in use

linux-firmware-git 20201130.7455a36-1

Is this a problem of running on 5.9?
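
For reference, the version numbers in the first log line can be unpacked by hand. Below is a minimal Python sketch, assuming the 32-bit firmware version packs major, minor and patch into successive bytes (an assumption, but one consistent with the kernel printing 0x003a3100 as 58.49.0). The function name is hypothetical.

```python
from typing import Tuple


def decode_smu_fw_version(raw: int) -> Tuple[int, int, int]:
    """Split a packed SMU firmware version into (major, minor, patch).

    Assumes one byte per field, which matches the kernel printing
    0x003a3100 as 58.49.0 in the log above.
    """
    major = (raw >> 16) & 0xFF
    minor = (raw >> 8) & 0xFF
    patch = raw & 0xFF
    return major, minor, patch


# Interface versions from the same log line: driver 0x34, firmware 0x3b.
driver_if, fw_if = 0x34, 0x3B
print(decode_smu_fw_version(0x003A3100), driver_if == fw_if)
```

The two interface versions differ, which is what the "SMU driver if version not matched" message reports. Whether that mismatch is related to the failed table transfer is not established in this thread.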

@quasd
Author

quasd commented Dec 20, 2020

I also tried lowering the mem clock by 1 MHz, which resulted in the following:

Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000034, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
Dec 20 19:23:25 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is initialized successfully!

Probably not a big enough difference to trigger a change.

@azeam
Owner

azeam commented Dec 20, 2020

Interesting to see some 6800 testing! I am not the maintainer of the Arch AUR package, which is a bit outdated, and support for the 6000 series is only available in this experimental branch. It also requires the latest version of UPP (not available on pip yet).

If you have already installed UPP via pip, you could do a quick-and-hacky update by overwriting the files in the upp lib folder (for example ~/.local/lib/python3.8/site-packages/upp) with the latest files from GitHub; otherwise install it from source and make sure the upp command is available and runs the latest version. For powerupp, download the bignavi branch and run make && sudo make install.

Edit: sorry, I didn't read your initial post properly. Is UPP also from the current git repo?

@quasd
Author

quasd commented Dec 20, 2020

Hello

I also did a bit more testing. Only the mem settings seem to be the problem. I was able to use the static voltage, Graphics card power (seems to be limited to 257, more testing to do) and Gfx clock frequency settings.

@azeam
Owner

azeam commented Dec 20, 2020

Sounds promising! I don't know if you attempted to set the DPM 0 frequency; that will likely cause the GPU to freak out. I would expect at least the DPM 3 frequency to be adjustable, though.

@quasd
Author

quasd commented Dec 20, 2020

I was doing it to DPM 3.

Here is how it looks to me.
[screenshot: powerupp]

Notes so far:

  • Graphics card power seems to be bugged: as long as I set it to the maximum allowed, I can keep increasing it. I think the default is 233, and in the screenshot it is 439.
  • Can't increase the GPU clock past 2475 MHz (it will drop down to 2D clocks)
  • With the above settings I am pretty much locked to 2450 MHz

@quasd
Author

quasd commented Dec 20, 2020

> is UPP also from the current git repo?

Technically it's from my own branch (based on master); I only added one comma, though.

@quasd
Author

quasd commented Dec 20, 2020

More testing done and things seem promising. I was able to confirm that the OC works.

  • Just simple testing and staring at the wall. 10fps increase from stock.
  • The power limit seems to be still locked to 233 even though I can increase it endlessly
  • Biggest limitation is Gfx clock frequency, if this can't be changed there won't be much point putting this card on water :(

@quasd
Author

quasd commented Dec 20, 2020

I guess the next step for my oc would be to flash a 6800 XT bios or something.

Anything you would like me to test?

quasd changed the title from "6800 crashing after applying anything" to "6800 crashing after applying DPM3 settings" on Dec 20, 2020

@azeam
Owner

azeam commented Dec 20, 2020

Many thanks for the report! As for the power limit it's been tricky to set with the navi 10 cards as well, only working well with certain firmware/kernel version combinations. Are the changes that you make reflected in cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap ?

It's a pity if the Gfx clock can't be adjusted; unfortunately it is not unlikely that the card is "hard limited" like the 5600 XT was.

@quasd
Author

quasd commented Dec 20, 2020

> Many thanks for the report! As for the power limit it's been tricky to set with the navi 10 cards as well, only working well with certain firmware/kernel version combinations. Are the changes that you make reflected in cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap ?

It is reflected there too.

 cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
360000000
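
A note on units: hwmon power attributes such as power1_cap are expressed in microwatts, so the value above corresponds to 360 W. The following Python sketch shows the conversion and a glob-based equivalent of the `$(ls -1 ...)` lookup; the function names are hypothetical and the reader function assumes an amdgpu card exposed as card0.

```python
from pathlib import Path


def microwatts_to_watts(value: int) -> float:
    """hwmon power attributes are expressed in microwatts."""
    return value / 1_000_000


def read_power_cap_watts(card: str = "card0") -> float:
    """Read power1_cap for a DRM card, resolving the hwmonN directory by glob."""
    hwmon_root = Path("/sys/class/drm") / card / "device" / "hwmon"
    cap_file = next(hwmon_root.glob("hwmon*/power1_cap"))
    return microwatts_to_watts(int(cap_file.read_text().strip()))


# Only the pure conversion is run here; the reader needs real hardware.
print(microwatts_to_watts(360_000_000))
```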

And regarding the memory settings, I tried again. It still crashes after a few seconds, without any settings actually changing (at least according to powerupp and /sys/kernel/debug/dri/$index/amdgpu_pm_info). I tried both lowering and increasing the DPM3 clock frequency; both seem to trigger this.

@azeam
Owner

azeam commented Dec 20, 2020

Can you try to set upp set smc_pptable/DcModeMaxFreq/0=2500 smc_pptable/DcModeMaxFreq/2=1050 --write and see if that makes any difference to setting the Gfx and DPM 3 frequencies (should raise the limits used by OD to 2500/1050 MHz)? You could also see if turning the amdgpu.ppfeaturemask=0xffffffff boot flag on or off makes any difference.

Do you know if and to what extent the memory is possible to adjust in Windows/Radeon Software?

@azeam
Owner

azeam commented Dec 20, 2020

Also try to set only the DPM 3 clock (1020 MHz in the example below) using UPP, to confirm that the same thing happens and that it's not something buggy in powerupp: upp set smc_pptable/FreqTableUclk/3=1020 --write
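
For anyone scripting these tests, the upp invocations used in this thread can be assembled programmatically. This is only a sketch: it uses the `set`, `--write` and `--pp-file` arguments that appear in this thread and nothing else, and the helper names are hypothetical.

```python
import subprocess
from typing import Dict, List, Optional


def build_upp_set_args(values: Dict[str, int], pp_file: Optional[str] = None) -> List[str]:
    """Build the argument list for `upp set key=value ... --write`."""
    args = ["upp"]
    if pp_file is not None:
        args += ["--pp-file", pp_file]
    args.append("set")
    args += [f"{key}={value}" for key, value in values.items()]
    args.append("--write")
    return args


def apply_upp_values(values: Dict[str, int], pp_file: Optional[str] = None) -> None:
    """Run upp; needs root and a supported card, so it is not called here."""
    subprocess.run(build_upp_set_args(values, pp_file), check=True)


print(build_upp_set_args({"smc_pptable/FreqTableUclk/3": 1020}))
```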

@quasd
Author

quasd commented Dec 20, 2020

Hello

I already have the featuremask.

[root@quasd ~]# grep -o amdgpu.ppfeaturemask=0xffffffff /proc/cmdline 
amdgpu.ppfeaturemask=0xffffffff
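
The same check can be done in Python when a script needs the mask as a number. A minimal sketch; the function name and the sample command line are made up for illustration.

```python
from typing import Optional


def ppfeaturemask_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the amdgpu.ppfeaturemask value from a kernel command line, if set."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "amdgpu.ppfeaturemask" and sep:
            return int(value, 0)  # base 0 accepts the 0x prefix
    return None


# Hypothetical command line; on a real system read /proc/cmdline instead.
sample = "root=/dev/sda1 rw amdgpu.ppfeaturemask=0xffffffff quiet"
print(hex(ppfeaturemask_from_cmdline(sample)))
```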

After adjusting the voltage offset to -115 and setting the power limit to 233 (I don't know the magic strings, so I have to do that from powerupp), the command below seems ineffective.

[root@quasd ~]# upp set smc_pptable/DcModeMaxFreq/0=2560 --write
Changing smc_pptable.DcModeMaxFreq.0 from 2460 to 2560 at 0x626
Commiting changes to '/sys/class/drm/card0/device/pp_table'.

No errors, but it seems to be ineffective. The core clock is still stuck at 2450.

For the DPM3

upp set smc_pptable/DcModeMaxFreq/2=1050 --write

seems to be also ineffective.

Other notes

  • Temps are around 80c, is this some magic point where it stops boosting?
  • Power usage is jumping around 200 W
  • Testing with Overwatch in the practice range
  • Setting fan speed to 100% and waiting for the card to cool down and retrying allowed me to reach 2470
  • edit: typos

@quasd
Author

quasd commented Dec 20, 2020

> Also try to set only the DPM 3 clock (1020 MHz in the example below) using UPP to confirm that the same thing happens and that it's not something buggy in powerupp upp set smc_pptable/FreqTableUclk/3=1020 --write.

[root@quasd ~]# upp set smc_pptable/FreqTableUclk/3=1020 --write
Changing smc_pptable.FreqTableUclk.3 from 1000 to 1020 at 0x584
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
[root@quasd ~]#

@quasd
Author

quasd commented Dec 20, 2020

Setting it twice in a row makes the following happen.

[root@quasd ~]# upp set smc_pptable/FreqTableUclk/3=1020 --write
Changing smc_pptable.FreqTableUclk.3 from 1020 to 1020 at 0x584
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
Traceback (most recent call last):
  File "/usr/bin/upp", line 33, in <module>
    sys.exit(load_entry_point('upp==0.0.7.post2', 'console_scripts', 'upp')())
  File "/usr/lib/python3.9/site-packages/upp/upp.py", line 336, in main
    cli(obj={})()
  File "/usr/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/upp/upp.py", line 318, in set
    decode._write_pp_tables_file(pp_file, pp_bytes)
  File "/usr/lib/python3.9/site-packages/upp/decode.py", line 47, in _write_pp_tables_file
    f.close()
OSError: [Errno 62] Timer expired
[root@quasd ~]# 

On the journalctl side the following happens. The line about fan speed keeps repeating indefinitely, about 4 times per second.

Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: failed send message: TransferTableDram2Smu (19)         param: 0x00000000 response 0xffffffc2
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to transfer pptable to SMC!
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to setup smc hw!
Dec 20 21:53:08 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: smu reset failed, ret = -62
Dec 20 21:53:08 quasd kernel: amdgpu: manual fan speed control should be enabled first

And amdgpu_pm_info changes to the following

[root@quasd ~]# cat /sys/kernel/debug/dri/0/amdgpu_pm_info 
Clock Gating Flags Mask: 0x38118305
	Graphics Medium Grain Clock Gating: On
	Graphics Medium Grain memory Light Sleep: Off
	Graphics Coarse Grain Clock Gating: On
	Graphics Coarse Grain memory Light Sleep: Off
	Graphics Coarse Grain Tree Shader Clock Gating: Off
	Graphics Coarse Grain Tree Shader Light Sleep: Off
	Graphics Command Processor Light Sleep: Off
	Graphics Run List Controller Light Sleep: Off
	Graphics 3D Coarse Grain Clock Gating: On
	Graphics 3D Coarse Grain memory Light Sleep: Off
	Memory Controller Light Sleep: On
	Memory Controller Medium Grain Clock Gating: On
	System Direct Memory Access Light Sleep: Off
	System Direct Memory Access Medium Grain Clock Gating: Off
	Bus Interface Medium Grain Clock Gating: Off
	Bus Interface Light Sleep: Off
	Unified Video Decoder Medium Grain Clock Gating: Off
	Video Compression Engine Medium Grain Clock Gating: Off
	Host Data Path Light Sleep: On
	Host Data Path Medium Grain Clock Gating: On
	Digital Right Management Medium Grain Clock Gating: Off
	Digital Right Management Light Sleep: Off
	Rom Medium Grain Clock Gating: Off
	Data Fabric Medium Grain Clock Gating: Off
	Address Translation Hub Medium Grain Clock Gating: On
	Address Translation Hub Light Sleep: On

dpm not enabled
[root@quasd ~]# 

All the clock, temperature, etc. information is missing. The system also becomes unresponsive occasionally.
edit: remove speculation of what might be the cause
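
The numbers in these logs line up: the SMU response 0xffffffc2, the "ret = -62" from the reset, and the Errno 62 in the traceback. The sketch below assumes the response is a 32-bit two's-complement error code (an assumption, not something confirmed in this thread); on Linux, errno 62 is ETIME, "Timer expired".

```python
import errno
import os


def to_signed32(value: int) -> int:
    """Interpret a 32-bit unsigned value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


code = to_signed32(0xFFFFFFC2)
print(code)  # -62
# Name and message depend on the platform's errno table.
print(errno.errorcode.get(-code), os.strerror(-code))
```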

@azeam
Owner

azeam commented Dec 20, 2020

> • Temps are around 80c, is this some magic point where it stops boosting?

The 6800 test file that I have has a fan target temperature of 80 degrees; I'm not sure if it's the same model that you have, but it seems likely that they have the same target at least. You could try to increase it, but be careful not to overheat the card: upp set smc_pptable/FanTargetTemperature=85 --write

What monitor frequency are you running at? Dual monitors? Can you try to change the frequency, and maybe switch to a single monitor and change the connection (HDMI/DP) if possible, and see if it makes any difference for the memory clock setting? The repeating error messages can be caused by periodic hwmon readings from powerupp (or another application).

@quasd
Author

quasd commented Dec 20, 2020

> What monitor frequency are you running at? Dual monitors? Can you try to change the frequency and maybe set to single monitor and change connection (HDMI/DP) if possible and see if it makes any difference for the memory clock setting. The repeating error messages can be caused by powerupps (or other application) periodical hwmon readings.

Main screen 240 Hz 1440p, secondary monitor 60 Hz 1080p, both DP.
Tried with only 1 screen on HDMI. The memory setting still didn't work, and setting memory more than once results in instability/freezes.

upp set smc_pptable/FreqTableUclk/3=1020 --write

edit: also I am now running kernel 5.10.1-1 with a few patches (hopefully irrelevant), and that didn't improve the situation.

  "enable_additional_cpu_optimizations-$_gcc_more_v.tar.gz::https://github.com/graysky2/kernel_gcc_patch/archive/$_gcc_more_v.tar.gz"
  0015-zfs.patch
  "0001-futex-patches.patch::https://raw.githubusercontent.com/Frogging-Family/linux-tkg/master/linux59-tkg/linux59-tkg-patches/0007-v5.9-fsync.patch"
  0001-ZEN-Add-sysctl-and-CONFIG-to-disallow-unprivileged-C.patch
  0002-Bluetooth-Fix-LL-PRivacy-BLE-device-fails-to-connect.patch
  0003-Bluetooth-Fix-attempting-to-set-RPA-timeout-when-uns.patch
  0004-HID-quirks-Add-Apple-Magic-Trackpad-2-to-hid_have_sp.patch

@quasd
Author

quasd commented Dec 20, 2020

> upp set smc_pptable/FanTargetTemperature=85 --write

Sadly, even with this the max I can get is 2475 MHz.

root@quasd ~# upp set smc_pptable/DcModeMaxFreq/0=2550 --write
Changing smc_pptable.DcModeMaxFreq.0 from 2475 to 2550 at 0x626
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
root@quasd ~# upp set smc_pptable/FanTargetTemperature=90 --write
Changing smc_pptable.FanTargetTemperature from 80 to 90 at 0x720
Commiting changes to '/sys/class/drm/card0/device/pp_table'.
root@quasd ~# 

@sibradzic
Collaborator

sibradzic commented Dec 21, 2020

Setting the mem clock on my RX5700 is also very flaky: it would only accept certain values and crash with most others. Often, a difference of just one MHz would cause the card to crash (mostly unrecoverable, needs a HW reset), and it does not matter whether you increase or decrease the clock. It also does not matter whether you use upp (pp_table interface) or radeon-clocks (kernel sysfs API) to change these clocks; it is something in the firmware/SMU/RAM timings that makes it crash.

I had to determine a certain set of "safe" clocks by trial and error :|

@azeam
Owner

azeam commented Dec 21, 2020

> edit: also I am now running kernel 5.10.1-1 with few patches ( hopefully irrelevant ). And that didn't improve the situation.

For now, the 5.10 kernel seems to at least break the ability to set the power limit on Navi 10 for some reason; I still need to do some further digging to understand why this happens and whether there are any workarounds. But since you also tried 5.9, that shouldn't be the main culprit. There might still be driver/firmware issues that AMD will sort out eventually.

@azeam
Owner

azeam commented Dec 21, 2020

Are you able to lower the power limit by the way (set it to 150 W for example, using powerupp)?

@asiantuntija

I'm dropping in since I'm testing on 6800XT. PowerUPP reads all values correctly but nothing changes when trying to apply something new.

cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap doesn't show change when applying something with PowerUPP, echo 293000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap however does work and sets the power limit to maximum. I also tried undervolting heavily but nothing seemed to crash the system so that probably isn't working either.

PowerUPP does ask for permissions when applying so I wouldn't think it's a problem with permissions. Glad to help with debugging, not really familiar with the interfaces so can't do it by myself.

@azeam
Owner

azeam commented Dec 21, 2020

> I'm dropping in since I'm testing on 6800XT. PowerUPP reads all values correctly but nothing changes when trying to apply something new.
>
> cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap doesn't show change when applying something with PowerUPP, echo 293000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap however does work and sets the power limit to maximum. I also tried undervolting heavily but nothing seemed to crash the system so that probably isn't working either.
>
> PowerUPP does ask for permissions when applying so I wouldn't think it's a problem with permissions. Glad to help with debugging, not really familiar with the interfaces so can't do it by myself.

Thanks, can you try to run PowerUPP from terminal (powerupp) while setting the values and see if anything strange shows there?
Getting any errors with dmesg | grep amdgpu?

@azeam
Owner

azeam commented Dec 21, 2020

> Main screen 240hz 1440p, secondary monitor 60hz 1080p both DP
> Tried with only 1 screen hdmi. Memory setting still didn't work. And setting memory more than 1 results in instability/freezes.

I skimmed through some forums and it appears to be difficult to adjust the memory clock at least on high refresh rate monitors (note that the memory clocks are reported differently under Windows and Linux, I believe the values are halved under Linux).

@asiantuntija

asiantuntija commented Dec 21, 2020

> > I'm dropping in since I'm testing on 6800XT. PowerUPP reads all values correctly but nothing changes when trying to apply something new.
> > cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap doesn't show change when applying something with PowerUPP, echo 293000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap however does work and sets the power limit to maximum. I also tried undervolting heavily but nothing seemed to crash the system so that probably isn't working either.
> > PowerUPP does ask for permissions when applying so I wouldn't think it's a problem with permissions. Glad to help with debugging, not really familiar with the interfaces so can't do it by myself.
>
> Thanks, can you try to run PowerUPP from terminal (powerupp) while setting the values and see if anything strange shows there?
> Getting any errors with dmesg | grep amdgpu?

That was helpful. I ran PowerUPP in a terminal and got the following:
Sorry, user root is not allowed to execute '/bin/zsh -c /usr/bin/upp --pp-file /sys/class/drm/card0/device/pp_table set --write smc_pptable/MaxVoltageGfx=4000 smc_pptable/SocketPowerLimitAc/0=293 smc_pptable/FreqTableGfx/1=2577 smc_pptable/MemMvddVoltage/3=5400 smc_pptable/MemVddciVoltage/3=3400 smc_pptable/FreqTableUclk/3=1000 smc_pptable/MaxVoltageSoc=4600 smc_pptable/FreqTableSocclk/1=1200 smc_pptable/qStaticVoltageOffset/0/c=0.000000 smc_pptable/MemMvddVoltage/0=5000 smc_pptable/MemVddciVoltage/0=2700 smc_pptable/FreqTableUclk/0=97 smc_pptable/MemMvddVoltage/1=5400 smc_pptable/MemVddciVoltage/1=3200 smc_pptable/FreqTableUclk/1=457 smc_pptable/MemMvddVoltage/2=5400 smc_pptable/MemVddciVoltage/2=3400 smc_pptable/FreqTableUclk/2=674 smc_pptable/MinVoltageGfx=3524 smc_pptable/MinVoltageSoc=3800' as user on arch-pc.
Sorry, user root is not allowed to execute '/usr/sbin/tee /sys/class/hwmon/hwmon3/power1_cap' as root on arch-pc.

After trying the command with sudo it worked: it lowered the voltage and changed the power limit. Any ideas how to get it to work? Is root not a member of some group? It would also probably be useful to show some sort of error message in the GUI if this happens.

edit: changing gfx frequency doesn't seem to work, if I raise it the gpu doesn't boost anymore. But this is probably driver issue.

@azeam
Owner

azeam commented Dec 21, 2020

> After trying the command with sudo it worked and lowered the voltage and changed power limit. Any ideas how to get it work, root not a user of some group?

It would appear that root is not allowed to issue sudo commands? Maybe try something like this?

> And would probably be useful to insert some sort of error message in the GUI if this happens.

Absolutely. I just pushed a new commit to the bignavi branch, before you fix the cause let me know if this works as intended please.

> edit: changing gfx frequency doesn't seem to work, if I raise it the gpu doesn't boost anymore. But this is probably driver issue.

It seems so unfortunately, hopefully something that will get sorted out.

@quasd
Author

quasd commented Dec 21, 2020

> Are you able to lower the power limit by the way (set it to 150 W for example, using powerupp)?

Setting it to 150W seems to work just fine and it's also reflected in the power draw.

However when trying to go back to 233 W I got the following.

Dec 21 19:27:00 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 172

:D

@azeam
Owner

azeam commented Dec 21, 2020

> However when trying to go back to 233 W I got the following.
>
> Dec 21 19:27:00 eki-ryzen kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 172

Interesting. For navi 10 this error message only seems to appear under kernel 5.10 and when not using the amdgpu.ppfeaturemask=0xffffffff flag, do you still have it set?

Probably not important, but notable: contrary to navi 10 it seems to properly calculate the max allowed (150 + 15%); with navi 10 it would have said "max allowed 150" (the actual max value set in the powerplay table). The message (at least for navi 10) is not triggered when setting the powerplay table values but when applying the value to sysfs.

Can you set the power cap manually (echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap) after getting this error? If that works, it would seem like there's a timing issue (PowerUPP tries to set the sysfs value before the re-initialization of the powerplay table is complete), which would be fixable.
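
The "150 + 15%" arithmetic mentioned above can be checked quickly. In this sketch the integer truncation is an assumption, chosen because the kernel message reports 172 rather than 172.5; the function name is hypothetical.

```python
def max_power_limit(table_limit_w: int, od_percent: int) -> int:
    """Table limit plus an overdrive percentage, truncated to whole watts."""
    return table_limit_w * (100 + od_percent) // 100


print(max_power_limit(150, 15))  # 172
```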

@asiantuntija

> > After trying the command with sudo it worked and lowered the voltage and changed power limit. Any ideas how to get it work, root not a user of some group?
>
> It would appear that root is not allowed to issue sudo commands? Maybe try something like this?

That was it; that line is commented out by default in the Arch sudo package. Thanks.

@quasd
Author

quasd commented Dec 21, 2020

> Interesting. For navi 10 this error message only seems to appear under kernel 5.10 and when not using the amdgpu.ppfeaturemask=0xffffffff flag, do you still have it set?

The test below is without the flag.

[root@quasd ~]# cat /proc/cmdline | grep amd
[root@quasd ~]# 

> Can you set the power cap manually echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap after getting this error? If it's possible it would seem like there's a timing issue (PowerUPP tries to set the sysfs value before the re-initialization of the powerplay table is complete), which would be fixable.

It still appears to be broken.

[root@quasd ~]# echo 233000000 | sudo tee /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
233000000
[root@quasd ~]# cat /sys/class/hwmon/$(ls -1 /sys/class/drm/card0/device/hwmon)/power1_cap
150000000
[root@quasd ~]#

and in the syslog:

Dec 21 22:10:12 quasd kernel: amdgpu 0000:0c:00.0: amdgpu: New power limit (233) is over the max allowed 150
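
Since tee echoes the requested value even when the kernel keeps the old one, a write followed by a read-back is the reliable check. Here is a minimal sketch with a hypothetical helper, demonstrated on a temporary file so that it runs without root or a GPU.

```python
import tempfile
from pathlib import Path


def write_and_verify(path: Path, value: int) -> bool:
    """Write an integer to a sysfs-style file and check that it reads back unchanged."""
    try:
        path.write_text(f"{value}\n")
    except OSError:
        # The kernel may reject the write outright (for example with EINVAL).
        return False
    return int(path.read_text().strip()) == value


if __name__ == "__main__":
    # Demonstrated on a temporary file; on real hardware the path would be
    # the power1_cap file used above, and writing it needs root.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "power1_cap"
        print(write_and_verify(demo, 233_000_000))
```

On the card in this thread the read-back would return 150000000, so the helper would report False for the 233 W request.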

@quasd
Author

quasd commented Dec 21, 2020

Some more findings, after updating to 5.10.2 + linux-firmware-git 20201218.646f159-1 and rebuilding upp and powerupp.

I can now change the power limit and it seems to really change too. Still weird behaviour where I have to slowly bring it up step by step, but I am now seeing power usage > 250 W every now and then.

The main limiting factor GFX clock frequency (2475) still remains.

@azeam
Owner

azeam commented Dec 21, 2020

> Some more findings. After updating to 5.10.2 + linux-firmware-git 20201218.646f159-1 + and rebuilding upp and powerupp.
>
> I can now change the power limit and it seems to really change too. Still weird behaviour where I have to slowly bring it up step by step, but I am now seeing power usage > 250 W every now and then.
>
> The main limiting factor GFX clock frequency (2475) still remains.

Perfect! I've noticed some similar things recently with my 5700 XT as well, the other day setting the power limit had no effect and today it works all of a sudden (not quite sure if/what I updated...) but sometimes crashes after setting it (never had that happen before). If the Gfx clock is a driver/firmware issue it will hopefully be fixed soon as well.

@asiantuntija have you tried to change the memory DPM 3 clock and does it cause a crash for you as well?

@quasd
Author

quasd commented Dec 21, 2020

Remarks regarding amdgpu.ppfeaturemask=0xffffffff

  • Without it: really stuck at 203 W
  • With it: can now increase with no limits? (temp/MHz limited)

And the DPM3 settings still show the following behaviour even with all the updates

  1. The first time I change the DPM3 memory freq, there are no errors but the settings don't really change
  2. The second time I set a DPM3 memory freq, I get the errors mentioned earlier in this thread and things start freezing/crashing

@asiantuntija

> @asiantuntija have you tried to change the memory DPM 3 clock and does it cause a crash for you as well?

Yes, I tried it a couple of times and I'm having the same symptoms as @quasd. I remember having the same sort of crashes with Navi10 when it was fresh: if I had radeon-profile-daemon running in the background it would freeze the PC shortly after booting. IIRC it had something to do with a race condition in requesting information from the GPU. I will try a bit more with memory tomorrow; I rebooted a million times today already because one of my RAM sticks died on me.

Seems like fixed undervolting is the way to go, getting stable 2575MHz (max possible) with -150mV, Unigine Superposition score went from stock 9119 to 9619, and lower temperatures as well.

@quasd
Author

quasd commented Dec 31, 2020

I got my hands on a 6800 XT and it seems to act the same way as the 6800; the same problems occur on it too.

The max core frequency is now 2577 MHz compared to 2475 MHz.

Does anyone know what is currently the limiting/bugging part that stops us from setting higher frequencies? AFAIK these cards should be limited to 2.7 GHz, not 2.58 GHz.

@quasd
Author

quasd commented Jan 2, 2021

Did a bit of poking around and found this:
Linux 5.12 To Support Radeon RX 6000 Series OverDrive Overclocking

I'm now on 5.11 rc1 + a few patches from drm-next, and I have working memory and core frequency control through pp_od_clk_voltage (up to 2.8 GHz/1075 MHz) + voltage offset.

cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2600Mhz
OD_MCLK:
0: 97Mhz
1: 1075MHz
OD_VDDGFX_OFFSET:
50mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz

Powerupp/upp still don't work with memory clock or increasing the core frequency.
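
For scripts that need the allowed ranges before writing anything, the pp_od_clk_voltage output shown above can be parsed into a dictionary. This is a sketch based only on the layout in this comment; other kernels or cards may print different sections, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Optional

SAMPLE = """OD_SCLK:
0: 500Mhz
1: 2600Mhz
OD_MCLK:
0: 97Mhz
1: 1075MHz
OD_VDDGFX_OFFSET:
50mV
OD_RANGE:
SCLK:     500Mhz       2800Mhz
MCLK:     674Mhz       1075Mhz
"""


def leading_int(text: str) -> int:
    """Return the integer at the start of a token such as '2600Mhz'."""
    return int(re.match(r"-?\d+", text).group())


def parse_od_table(text: str) -> Dict[str, List[int]]:
    """Map each section (and each OD_RANGE row) to the integers it lists."""
    table: Dict[str, List[int]] = {}
    section: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.fullmatch(r"OD_\w+:", line):
            section = line[:-1]
            if section != "OD_RANGE":
                table[section] = []
        elif section == "OD_RANGE":
            # "SCLK:  500Mhz  2800Mhz" -> RANGE_SCLK: [min, max]
            name, *values = line.split()
            table[f"RANGE_{name[:-1]}"] = [leading_int(v) for v in values]
        elif section is not None:
            # "1: 2600Mhz" carries a level index; "50mV" is a bare value.
            table[section].append(leading_int(line.split(":")[-1].strip()))
    return table


print(parse_od_table(SAMPLE))
```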

@moaalseiari

> Did bit of poking around and found this
> Linux 5.12 To Support Radeon RX 6000 Series OverDrive Overclocking
>
> Now on 5.11 rc1 + few patches form drm-next and I have working memory and core frequency control through pp_od_clk_voltage ( up to 2.8ghz/1075mhz) + voltage offset.
>
> cat /sys/class/drm/card0/device/pp_od_clk_voltage
> OD_SCLK:
> 0: 500Mhz
> 1: 2600Mhz
> OD_MCLK:
> 0: 97Mhz
> 1: 1075MHz
> OD_VDDGFX_OFFSET:
> 50mV
> OD_RANGE:
> SCLK:     500Mhz       2800Mhz
> MCLK:     674Mhz       1075Mhz
>
> Powerupp/upp still don't work with memory clock or increasing the core frequency.

Hello brother, can you please share the patches?
I would like to use them myself.
I tried the patches from the link below and one of them failed to compile.

https://cgit.freedesktop.org/~agd5f/linux/diff/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c?h=drm-next&id=37a58f691551dfdff4f1035ee119c9ebdb9eb119

Best regards.

@quasd
Author

quasd commented Jan 24, 2021

> Hello brother, can you please share the patches please.
> I would like to use them myself.
> I tried the patches from the link below and one of them failed to compile.
>
> https://cgit.freedesktop.org/~agd5f/linux/diff/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c?h=drm-next&id=37a58f691551dfdff4f1035ee119c9ebdb9eb119
>
> Best regards.

5.11 rc kernel and these 4

https://cgit.freedesktop.org/~agd5f/linux/commit/?id=78d907e2b8ba89c936b7f0c3344261c653668a62
https://cgit.freedesktop.org/~agd5f/linux/commit/?id=aa75fa34e04c842d93a45087adac66ab3a2a7f33
https://cgit.freedesktop.org/~agd5f/linux/commit/?id=37a58f691551dfdff4f1035ee119c9ebdb9eb119
https://cgit.freedesktop.org/~agd5f/linux/commit/?id=a2b6df4fd6e3c0ba088b00fc00579dac263b0a64

However, I have now stopped using these due to random black screens after a match in Overwatch (don't know if it's relevant; I haven't reproduced it yet after removing the patches).
Seems like this was a problem with Wine, please ignore.

@moaalseiari

> > Hello brother, can you please share the patches please.
> > I would like to use them myself.
> > I tried the patches from the link below and one of them failed to compile.
> > https://cgit.freedesktop.org/~agd5f/linux/diff/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c?h=drm-next&id=37a58f691551dfdff4f1035ee119c9ebdb9eb119
> > Best regards.
>
> 5.11 rc kernel and these 4
>
> https://cgit.freedesktop.org/~agd5f/linux/commit/?id=78d907e2b8ba89c936b7f0c3344261c653668a62
> https://cgit.freedesktop.org/~agd5f/linux/commit/?id=aa75fa34e04c842d93a45087adac66ab3a2a7f33
> https://cgit.freedesktop.org/~agd5f/linux/commit/?id=37a58f691551dfdff4f1035ee119c9ebdb9eb119
> https://cgit.freedesktop.org/~agd5f/linux/commit/?id=a2b6df4fd6e3c0ba088b00fc00579dac263b0a64
>
> however I have now stopped using these due to random black screens after the match in Overwatch. ( dunno if relevant, haven't reproduces yet after removing the patches)
> Seems like this was problem with wine, please ignore

Thanks for your reply.
Can you please explain how to manually modify the voltage and settings like you did ?
I will build the kernel now with patches.

@quasd
Author

quasd commented Jan 27, 2021

# Power Limit
echo "400000000" > "/sys/class/drm/card0/device/hwmon/hwmon1/power1_cap" 
# Offset voltage
echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage
# Core freq
echo "s 1 2590" > /sys/class/drm/card0/device/pp_od_clk_voltage
# Memory freq
echo "m 1 1075" > /sys/class/drm/card0/device/pp_od_clk_voltage
# Make settings active
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage
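
The same sequence can be generated from a script. The sketch below only reproduces the command strings shown above ("vo", "s 1", "m 1", then "c" to commit); the helper names are hypothetical, the hard-coded card0 path is an assumption about the system, and writing requires root plus a kernel with the OverDrive patches discussed in this thread.

```python
from typing import List

OD_FILE = "/sys/class/drm/card0/device/pp_od_clk_voltage"


def build_od_commands(voltage_offset_mv: int, sclk_max_mhz: int, mclk_max_mhz: int) -> List[str]:
    """Return the pp_od_clk_voltage commands in the order used above."""
    return [
        f"vo {voltage_offset_mv}",  # voltage offset in mV
        f"s 1 {sclk_max_mhz}",      # core clock, level 1
        f"m 1 {mclk_max_mhz}",      # memory clock, level 1
        "c",                        # commit the pending values
    ]


def apply_od_commands(commands: List[str], od_file: str = OD_FILE) -> None:
    """Write each command with its own open/write, mirroring the separate echo calls."""
    for command in commands:
        with open(od_file, "w") as handle:
            handle.write(command + "\n")


# Only the command list is built here; nothing is written to sysfs.
print(build_od_commands(-10, 2590, 1075))
```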

@moaalseiari

> # Power Limit
> echo "400000000" > "/sys/class/drm/card0/device/hwmon/hwmon1/power1_cap" 
> # Offset voltage
> echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Core freq
> echo "s 1 2590" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Memory freq
> echo "m 1 1075" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Make settings active
> echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

Thank you very much, it worked perfectly.

@moaalseiari

> # Power Limit
> echo "400000000" > "/sys/class/drm/card0/device/hwmon/hwmon1/power1_cap" 
> # Offset voltage
> echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Core freq
> echo "s 1 2590" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Memory freq
> echo "m 1 1075" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Make settings active
> echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

Sorry to bother you here, but

echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage

is like a decrease of 10 mV? When I run sudo watch sensors, the voltage shows 1150 mV for the GPU.
Is this the only way to control the core freq voltage?
I dropped the offset to -200 mV with echo "vo -200" > /sys/class/drm/card0/device/pp_od_clk_voltage
and the card crashed, which suggests it does use the offset voltage.

Could you please explain? Thanks so much for sharing your knowledge.

@quasd
Author

quasd commented Jan 27, 2021

> echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage
>
> is like a decrease of -10mv ? when i run sudo watch sensors, the voltage shows 1150mv for the gpu.

Yes.

> is this the only way to control core freq voltage ?

Yes, AFAIK.

@koiakoia

> # Power Limit
> echo "400000000" > "/sys/class/drm/card0/device/hwmon/hwmon1/power1_cap" 
> # Offset voltage
> echo "vo -10" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Core freq
> echo "s 1 2590" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Memory freq
> echo "m 1 1075" > /sys/class/drm/card0/device/pp_od_clk_voltage
> # Make settings active
> echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

This allowed me to at least get above the initial 1000 hurdle. However, once I have executed these commands I am not able to execute them again; it says invalid argument. So I am now running at 1075, which is better than stock at least.

I am not sure on the programming side of things what the challenges are but wanted to confirm that for me at least this allowed the overclock. Although Powerupp still does not make any changes to these settings for me.
