Recently, Setting a Persistent Save Powerupp table breaks everything #16

Open
gardotd426 opened this issue Aug 10, 2020 · 10 comments

@gardotd426

So, using a persistent save table used to work perfectly fine, but as of about two weeks ago, having any sort of persistent save (even if it's just raising the power limit and setting a voltage offset) will cause flickering and a temperature reading of -0.0001 for the GPU, as well as prevent the fans from working whatsoever. I've reproduced this on completely separate systems (one with Manjaro and one with vanilla Arch).

At first, I thought I'd broken my Arch installation, so I restored a timeshift snapshot to back before the issue was present, but it was still there. So I moved to my Manjaro install and the issue was gone. I started using it while trying to figure out what the hell happened to Arch, and eventually I set up a persistent powerupp table on Manjaro, and boom - next time I reboot, same issue. I had already started to suspect powerupp at this point, so I jumped to a vtty and deleted the powerupp startup script that gets provided when you click "Persistent Save" and then rebooted. Sure enough, it was gone. I chrooted into my Arch install and deleted that same file, booted into Arch, and sure enough, the issue was gone.

I don't know what's going on, but now it seems setting a persistent save causes the powerplay table to be corrupted in the same way that just using Powerupp at all with a 5600 XT did (remember all that, when we were trying to get it to work with the 5600 XT? This feels kind of similar).

I'm happy to get any information you need, but yeah, at this point it seems confirmed that at least some settings in Powerupp will break a bunch of stuff if you have it set to persistent. Doing it after boot and not setting it to persistent still works perfectly fine. I can set the exact same table as a one-time thing and it works perfectly fine; only when it tries to set those same values at boot does it cause issues.

Arch Linux, Manjaro Linux

Gigabyte Gaming OC RX 5700 XT

Kernels: 5.7.13-arch, 5.8.0-2-MANJARO, 5.8.0-rc7-tkg-pds, 5.8.0-1-tkg-pds, a bunch of others.

@azeam
Owner

azeam commented Aug 10, 2020

Thanks. I installed a fresh copy of Manjaro in order to troubleshoot but I am not able to reproduce this on my end (tested kernels in Manjaro Settings Manager 5.6.19-2, 5.7.9-1 and 5.8.0-1).

What is the content of your /usr/bin/powerupp_startup_script_card0.sh (or other number)?

Can you see anything strange with sudo dmesg | grep power?

Can you reset to default values and try to change only, for example, the Gfx clock and see if there is a certain setting causing this?

Can you try to delay the udev startup with sudo mv /etc/udev/rules.d/80-powerupp0.rules /etc/udev/rules.d/99-powerupp0.rules && sudo udevadm control --reload-rules && sudo udevadm trigger?
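
The rename suggested above can be sketched as a small helper. This is only an illustration: `delay_powerupp_rule` is a hypothetical name, not something powerupp ships, and the sketch assumes the rule files follow the `80-powerupp<N>.rules` naming seen in this thread and that everything runs as root, since reloading udev rules needs root privileges.

```shell
#!/bin/bash
# Sketch only: rename the powerupp udev rule so it sorts later (80- -> 99-).
# delay_powerupp_rule is a hypothetical helper name, not part of powerupp.
delay_powerupp_rule() {
    local rules_dir="$1"   # normally /etc/udev/rules.d
    local card="$2"        # card index, e.g. 0
    local old="$rules_dir/80-powerupp${card}.rules"
    local new="$rules_dir/99-powerupp${card}.rules"

    # Refuse to continue if there is nothing to rename.
    if [ ! -f "$old" ]; then
        echo "no rule file at $old" >&2
        return 1
    fi

    # Print the new path on success so the caller can log it.
    mv -- "$old" "$new" && echo "$new"
}

# Typical use, as root:
#   delay_powerupp_rule /etc/udev/rules.d 0 \
#       && udevadm control --reload-rules \
#       && udevadm trigger
```

Taking the rules directory as a parameter makes it possible to try the rename on a scratch directory before touching /etc/udev/rules.d.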

@gardotd426
Author

Yeah, like I said I had to delete the file to get my system to not freak out, so I'll redo it and get the contents.

But as far as this:

> Can you reset to default values and try to change only, for example, the Gfx clock and see if there is a certain setting causing this?

Like I said, I never even changed the GFX clock. I literally only set a -35mV voltage offset and raised the power limit from 190W to 235W. But yeah I'll re-set the persistent save and make sure I get the same issue, and then get the contents as well as try just changing one of those two things at a time.

@gardotd426
Author

So I'm having some issues on Manjaro with reproducing (though I only tried once), but on Arch I can reproduce it every single time.

I checked the dmesg logs, and the only relevant line I got was:

[ 0.204575] pci 0000:11:00.1: D0 power state depends on 0000:11:00.0

Which I get with or without powerupp's startup script enabled.

Here's a screenshot of the temperature monitor, but the flickering obviously doesn't show up in screenshots (though it's very bad; if you ever had the whole "flickering with more than one monitor with ppfeaturemask enabled" bug before they fixed it, it's exactly like that):

[Image: Screen Capture_20200810133643]

Also, I tried to load up powerupp while it was happening, seeing if setting everything back to default would fix it, and - surprise, surprise - my card isn't even detected, I can't change any settings, nothing whatsoever. So this is clearly something to do with powerupp (or upp).

@gardotd426
Author

gardotd426 commented Aug 10, 2020

Also, sensors says the device can't be read or something crazy for the GPU, and there are numerous other issues (frequency can't be read, etc).

EDIT: Also, I'm trying with the adjusted udev rules as well, I actually had considered something like that as well (just that the timing might be the issue).

@azeam
Owner

azeam commented Aug 10, 2020

Tested with Arch now as well but no luck for me (or rather bad luck, it's working fine). Paste the /usr/bin/powerupp_startup_script_card0.sh contents just to confirm that nothing strange is being set there and also check sudo dmesg | grep amdgpu
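
The two dmesg checks asked for in this thread (`grep power` and `grep amdgpu`) can be combined into one filter. The sketch below is just a convenience; `filter_gpu_log` is a hypothetical name, and the extra `pp_table` pattern is an assumption about what else might be worth catching, not something the maintainer requested.

```shell
#!/bin/bash
# Sketch only: keep kernel log lines that mention the GPU driver or
# power management. filter_gpu_log is a hypothetical helper name.
filter_gpu_log() {
    # -i: case-insensitive match, -E: extended regex with alternation.
    # Reads the log from stdin and prints matching lines.
    grep -iE 'amdgpu|power|pp_table'
}

# Typical use, as root:
#   dmesg | filter_gpu_log > powerupp-dmesg.txt
```

Running it once with the startup script enabled and once without gives two files that can be compared side by side.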

@gardotd426
Author

Sorry, I thought I'd already done that (pasted the contents of the script), and realized I'd forgotten:

#!/bin/bash
chmod 666 /sys/class/drm/card0/device/pp_table
sudo -i -u matt /usr/bin/upp --pp-file /sys/class/drm/card0/device/pp_table set --write smc_pptable/MaxVoltageGfx=4800 smc_pptable/SocketPowerLimitAc/0=235 smc_pptable/FreqTableGfx/1=2100 smc_pptable/MemMvddVoltage/3=5400 smc_pptable/MemVddciVoltage/3=3400 smc_pptable/FreqTableUclk/3=875 smc_pptable/MaxVoltageSoc=4200 smc_pptable/FreqTableSocclk/1=1267 smc_pptable/qStaticVoltageOffset/0/c=-0.035000 smc_pptable/MemMvddVoltage/0=5000 smc_pptable/MemVddciVoltage/0=2700 smc_pptable/FreqTableUclk/0=100 smc_pptable/MemMvddVoltage/1=5400 smc_pptable/MemVddciVoltage/1=3400 smc_pptable/FreqTableUclk/1=500 smc_pptable/MemMvddVoltage/2=5400 smc_pptable/MemVddciVoltage/2=3400 smc_pptable/FreqTableUclk/2=625 smc_pptable/MinVoltageGfx=3000 smc_pptable/MinVoltageSoc=3000
chmod 644 /sys/class/drm/card0/device/pp_table
echo 235000000 | tee /sys/class/hwmon/hwmon5/power1_cap

@azeam
Owner

azeam commented Aug 10, 2020

Looks normal, you could try to comment out the line echo 235000000 | tee /sys/class/hwmon/hwmon5/power1_cap and see if that makes any difference.
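
One thing worth ruling out alongside that test: the startup script addresses the power cap as hwmon5, and hwmon numbers are assigned in probe order, so they are not guaranteed to stay the same across boots or kernels. Nothing in this thread confirms that this is the cause. The sketch below looks the file up through the card's own device directory instead; `find_power_cap` is a hypothetical helper name, and the layout `device/hwmon/hwmon*/power1_cap` is assumed from how amdgpu normally exposes its hwmon node.

```shell
#!/bin/bash
# Sketch only: find power1_cap through the card's own device directory
# instead of a fixed hwmonN index. find_power_cap is a hypothetical helper.
find_power_cap() {
    local device_dir="$1"   # e.g. /sys/class/drm/card0/device
    local f

    # If the glob matches nothing, $f is the literal pattern and -e fails.
    for f in "$device_dir"/hwmon/hwmon*/power1_cap; do
        if [ -e "$f" ]; then
            echo "$f"
            return 0
        fi
    done

    echo "no power1_cap under $device_dir/hwmon" >&2
    return 1
}

# Typical use, as root (power1_cap is in microwatts, so 235000000 = 235 W):
#   cap=$(find_power_cap /sys/class/drm/card0/device) && echo 235000000 > "$cap"
```

If the printed path ends in something other than hwmon5 on a boot where the problem appears, the hard-coded line in the startup script is writing to a different device's hwmon node.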

@azeam
Owner

azeam commented Aug 10, 2020

Not sure that this will help but found an unescaped string that could theoretically cause the power limit workaround to be set to the incorrect path.

@azeam
Owner

azeam commented Aug 24, 2020

Any updates on this? Did you try to delay the udev script startup and/or comment out the hwmon line?

@gardotd426
Author

gardotd426 commented Aug 24, 2020 via email
