nvidia: Patching of 565.57.01 for Linux kernel 6.12 #417

Closed
Binary-Eater opened this issue Nov 14, 2024 · 2 comments · Fixed by #418

Comments

@Binary-Eater

Binary-Eater commented Nov 14, 2024

Hi,

Commit 6c22aadbf6fd ("drm/fbdev-helper: Remove
drm_fb_helper_output_poll_changed()") in the Linux kernel removed support for
the .output_poll_changed interface in drm_mode_config_funcs. This callback was
used to handle hotplug events in place of the hotplug interface provided through
fbdev emulation. We check whether this callback is present in the kernel
installation the driver is being built against and omit it if it is not.
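
As a rough userspace illustration of the "omit it if not present" approach, the sketch below compiles a callback into a function table only when a feature macro is defined. Every name here (DEMO_HAVE_OUTPUT_POLL_CHANGED, the demo_* structs and functions) is hypothetical and stands in for the kernel types; it is not the driver's actual code, and the real build-time probe is not shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical feature macro. A build-time probe would define it only when
 * the target kernel's struct still has the .output_poll_changed member.
 * Uncomment to model an older kernel.
 */
/* #define DEMO_HAVE_OUTPUT_POLL_CHANGED */

struct demo_device {
    int hotplug_count;
};

/* Mock stand-in for the kernel's drm_mode_config_funcs (assumption). */
struct demo_mode_config_funcs {
    int (*atomic_check)(struct demo_device *dev);
#if defined(DEMO_HAVE_OUTPUT_POLL_CHANGED)
    void (*output_poll_changed)(struct demo_device *dev);
#endif
};

static int demo_atomic_check(struct demo_device *dev)
{
    assert(dev != NULL);
    return 0;
}

#if defined(DEMO_HAVE_OUTPUT_POLL_CHANGED)
/* Driver-side hotplug handler, only built when the member exists. */
static void demo_output_poll_changed(struct demo_device *dev)
{
    dev->hotplug_count++;
}
#endif

static const struct demo_mode_config_funcs demo_funcs = {
    .atomic_check = demo_atomic_check,
#if defined(DEMO_HAVE_OUTPUT_POLL_CHANGED)
    .output_poll_changed = demo_output_poll_changed,
#endif
};

/* Returns 1 if the driver-side hotplug callback was compiled in, else 0. */
static int demo_has_driver_hotplug_callback(void)
{
#if defined(DEMO_HAVE_OUTPUT_POLL_CHANGED)
    return demo_funcs.output_poll_changed != NULL;
#else
    return 0;
#endif
}
```

With the macro left undefined, the table carries no hotplug member at all, which mirrors building against a kernel where the interface has been removed.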

If this callback is not present, it is safe to assume that filling modes for the
connectors has become the responsibility of the core DRM stack. No replacement
callback needs to be defined for hotplug support to continue working as it
currently does on released kernel trees. The assumption that the core stack
populates the supported modes for each connector rests on
drm_fb_helper_hotplug_event, which is reached through the following call chain.

drm_fbdev_ttm_setup
  |_ drm_client_init
    |_ drm_fbdev_ttm_client_funcs
      |_ .hotplug (drm_fbdev_ttm_client_hotplug)
        |_ drm_fb_helper_hotplug_event
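
The chain above can be modeled in plain C as a client whose function table carries a hotplug callback that the core invokes. This is only a simplified sketch: the demo_* names are hypothetical, and the functions loosely mirror the roles of drm_client_init, drm_fbdev_ttm_client_hotplug and drm_fb_helper_hotplug_event rather than reproducing their real signatures or behavior.

```c
#include <assert.h>
#include <stddef.h>

struct demo_client;

/* Mock of a client function table with a .hotplug member (assumption). */
struct demo_client_funcs {
    int (*hotplug)(struct demo_client *client);
};

struct demo_client {
    const struct demo_client_funcs *funcs;
    int modes_probed; /* how many times connector modes were re-probed */
};

/* Plays the role of drm_fb_helper_hotplug_event: re-probe the modes. */
static int demo_fb_helper_hotplug_event(struct demo_client *client)
{
    client->modes_probed++;
    return 0;
}

/* Plays the role of drm_fbdev_ttm_client_hotplug. */
static int demo_fbdev_client_hotplug(struct demo_client *client)
{
    return demo_fb_helper_hotplug_event(client);
}

static const struct demo_client_funcs demo_fbdev_client_funcs = {
    .hotplug = demo_fbdev_client_hotplug,
};

/* Plays the role of drm_client_init: bind the function table to the client. */
static void demo_client_init(struct demo_client *client,
                             const struct demo_client_funcs *funcs)
{
    assert(client != NULL && funcs != NULL);
    client->funcs = funcs;
    client->modes_probed = 0;
}

/* The core delivers a hotplug event; the driver supplies no callback. */
static int demo_core_hotplug(struct demo_client *client)
{
    if (client->funcs == NULL || client->funcs->hotplug == NULL)
        return 0;
    return client->funcs->hotplug(client);
}
```

In this model the driver never registers a handler of its own; each call to demo_core_hotplug reaches the mode re-probe through the fbdev client's table.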

Because of this, calling drm_client_register and keeping
nv_drm_output_poll_changed is unnecessary for kernel 6.12 since the core stack
handles hotplug events on our behalf through the DRM fbdev API. We noticed users
reporting issues related to this on the NVIDIA forum.

https://forums.developer.nvidia.com/t/getting-kernel-null-pointer-dereference-when-unloading-modprobe-r-nvidia-drm/312331

Based on their dmesg logs, users are hitting the "Failed to initialize the
nv-hotplug-helper DRM client" failure case in the patched logic. We believe
that initialization of the DRM device then fails (a failure path is entered),
but the device pointer is still on the list walked by
nv_drm_remove_devices, which leads to the kernel panic seen by the user.

BUG: kernel NULL pointer dereference, address: 00000000000000a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 8 UID: 0 PID: 70806 Comm: modprobe Tainted: P S   U     OE      6.12.0-rc6-1-cachyos-rc-gcc #1 4192acde4edb66f2f1f68c607ab66f58138591f5
Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
RIP: 0010:drm_client_dev_unregister+0xd/0xf0
Call Trace:
 <TASK>
 ? __die_body.cold+0x8/0x12
 ? page_fault_oops+0x15a/0x2e0
 ? exc_page_fault+0x81/0x190
 ? asm_exc_page_fault+0x26/0x30
 ? drm_client_dev_unregister+0xd/0xf0
 drm_dev_unregister+0x21/0x1c0
 nv_drm_remove_devices+0x2d/0x60 [nvidia_drm 713ad65fe3ef08e6e23794e19a16790721d8c08f]
 __do_sys_delete_module+0x1d1/0x310
 do_syscall_64+0x82/0x190
 ? __x64_sys_openat+0x1f5/0x230
 ? syscall_exit_to_user_mode+0x10/0x210
 ? do_syscall_64+0x8e/0x190
 ? __x64_sys_openat+0x1f5/0x230
 ? syscall_exit_to_user_mode+0x10/0x210
 ? do_syscall_64+0x8e/0x190
 ? syscall_exit_to_user_mode+0x10/0x210
 ? do_syscall_64+0x8e/0x190
 ? exc_page_fault+0x81/0x190
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>
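
The suspected failure mode behind the trace above can be illustrated with a small userspace model: if a device whose setup failed stays on the removal list, teardown later operates on a device that was never fully registered. The sketch below shows the defensive pattern of adding a device to the list only after every setup step succeeded. All demo_* names are hypothetical and this is not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_dev {
    int registered;        /* set only when setup fully succeeded */
    struct demo_dev *next;
};

static struct demo_dev *demo_dev_list = NULL;

/* Hypothetical setup step that can fail, e.g. client initialization. */
static int demo_client_setup(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* Put the device on the list only after every setup step has succeeded. */
static int demo_register_device(int client_setup_fails)
{
    struct demo_dev *dev = calloc(1, sizeof(*dev));

    if (dev == NULL)
        return -1;

    if (demo_client_setup(client_setup_fails) != 0) {
        free(dev); /* failure path: the device never reaches the list */
        return -1;
    }

    dev->registered = 1;
    dev->next = demo_dev_list;
    demo_dev_list = dev;
    return 0;
}

/* Module-unload analogue: tear down every listed device, return the count. */
static int demo_remove_devices(void)
{
    int removed = 0;

    while (demo_dev_list != NULL) {
        struct demo_dev *next = demo_dev_list->next;

        /* Every listed device must have completed registration. */
        assert(demo_dev_list->registered);
        free(demo_dev_list);
        demo_dev_list = next;
        removed++;
    }
    return removed;
}
```

Registering one device with a failing setup and one with a successful setup leaves a single entry on the list, so the removal loop only ever sees fully initialized devices.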

We believe you can safely reduce your patch to what we have shared on
our forum regarding our current 565 driver release and kernel 6.12
compatibility. Our next 565 driver release will patch this as well, but
it will unfortunately likely be released a couple of weeks after Linux
kernel 6.12 is tagged.

Thanks,
Rahul Rameshbabu

1Naim added a commit that referenced this issue Nov 14, 2024
Unsurprisingly, my hack of a rebase has caused issues on some systems. Thankfully, NVIDIA has reached out to us with their
own patch that should fix things for 6.12 kernels without causing any issues.

I have also taken the chance to sync the patch file names with the ones in our tree for coherency and consistency with
the patch numbering.

Closes #417

Suggested-by: Rahul Rameshbabu <[email protected]>
Signed-off-by: Eric Naim <[email protected]>
@1Naim
Member

1Naim commented Nov 14, 2024

Acknowledged! Thank you so much for reaching out to us with this. I have created a PR that will use this patch instead of the hacky fix. I have also gotten a few users to test this patch and it seems to work as intended. Once @ptr1337 is available, the PR should get merged and it will be in our repos. I will also sync the patches used in our kernel modules soon.

1Naim added a commit to CachyOS/kernel-patches that referenced this issue Nov 14, 2024
@ptr1337
Member

ptr1337 commented Nov 14, 2024

Thanks for reaching out to us! I will also send this to the rpmfusion maintainers and push it into the archlinux repository.
