This repository has been archived by the owner on May 20, 2024. It is now read-only.

Got the driver to work in kernel 6.2rc5 but need some help troubleshooting an issue #18

Open
tommy-bolger opened this issue Feb 6, 2023 · 9 comments

Comments

@tommy-bolger

tommy-bolger commented Feb 6, 2023

I forked strank/lg4k-linux, reverted the ID changes to work with my GC573, and fixed a few method calls that prevented this driver from building in Linux Kernel 6.2. Here are my changes: strank#1

I'm running Ubuntu 22.10 with kernel 6.2rc5 from mainline.

I can build the driver, and my Live Gamer 4K works after a reboot, but I am getting severe stutter in OBS from the capture card: it captures a couple of frames, freezes for a second, captures a few more frames, and so on. Adjusting the video format, resolution, frame rate, and color range in the card's properties in OBS doesn't seem to make any difference. Checking Autoreset on timeout and adjusting Frames Until Timeout does seem to make a difference, but that setting can only go down as far as 2. So it seems like the card is timing out constantly?

Anyone have any ideas/tips on how to troubleshoot this? I haven't touched C in years, and have never done this kind of desktop development for a device driver since I'm a fullstack web engineer. I think it's possible to get this card fully functional on the latest kernel if I can figure out what's causing this issue.

This is my build log if that helps.

@tommy-bolger
Author

I loaded the 6.2rc7 kernel and still had the same issue. I also noticed that if I kept OBS open with the capture card running, the whole system would hard lock.

I had a look at the syslog, and it's being flooded nonstop with error messages related to the capture card. Here's one of the repeated entries, and this is one of those logs.

I'm guessing that the syslog being spammed is what's causing the capture card to stutter horribly and the system to lock up.

@kkiyama117

kkiyama117 commented Feb 8, 2023

I got the same error, but I can't find what causes it. In my environment, the audio lags behind the video and eventually freezes, but I don't know what that means. It's the same with ALSA and PulseAudio.

@kkiyama117

kkiyama117 commented Feb 8, 2023

I found the entry "last function: sys_work_func:gc573" in the log at shutdown.
./driver/utils/thread/task_model.c may be the cause of the problem.

@tommy-bolger
Author

I did some searching around for the error BUG: scheduling while atomic: kworker and found this: https://stackoverflow.com/a/3544277

"Scheduling while atomic" indicates that you've tried to sleep somewhere that you shouldn't - like within a spinlock-protected critical section or an interrupt handler.

Common examples of things that can sleep are mutex_lock(), kmalloc(..., GFP_KERNEL), get_user() and put_user().

I looked around for mutex_lock() and came across i2c_model_read() in i2c_model.c. It's a part of the stack trace from my log above. Both this function and i2c_model_write() are using mutex_lock() and mutex_unlock():

int i2c_model_read(i2c_model_bus_handle_t bus_handle, U8_T slaveAddr, U32_T subAddr, U8_T *pBuf, U8_T bufLen)
{
    i2c_model_bus_t *bus = bus_handle;
    int ret = -1;

    if (bus->read_func)
    {
        /* mutex_lock() may sleep, so this path must not run in atomic context */
        mutex_lock(&bus->lock);
        ret = bus->read_func(bus->ref_cxt, 0, slaveAddr, subAddr, 1, pBuf, bufLen);
        mutex_unlock(&bus->lock);
    }

    return ret;
}

int i2c_model_write(i2c_model_bus_handle_t bus_handle, U8_T slaveAddr, U32_T subAddr, U8_T *pBuf, U8_T bufLen)
{
    i2c_model_bus_t *bus = bus_handle;
    int ret = -1;

    if (bus->write_func)
    {
        /* mutex_lock() may sleep, so this path must not run in atomic context */
        mutex_lock(&bus->lock);
        ret = bus->write_func(bus->ref_cxt, 0, slaveAddr, subAddr, 1, pBuf, bufLen);
        mutex_unlock(&bus->lock);
    }

    return ret;
}
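
To make the quoted explanation a bit more concrete, here is a small userspace model in C of the rule it describes. This is only a sketch: every name in it (model_enter_atomic, model_might_sleep, and so on) is made up for illustration, and none of it is kernel or driver code. It just tracks whether execution is currently "atomic" and flags any sleeping operation attempted in that state, which is roughly the condition the kernel reports as "scheduling while atomic".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only: hypothetical names, not kernel APIs. */
static int atomic_depth; /* > 0 while "inside" a spinlock section or interrupt handler */
static int bug_count;    /* sleeping operations attempted while atomic */

static void model_enter_atomic(void)
{
    atomic_depth++;
}

static void model_exit_atomic(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

/* Every operation that may sleep checks this first. */
static bool model_might_sleep(const char *what)
{
    if (atomic_depth > 0) {
        bug_count++;
        fprintf(stderr, "BUG (model): %s called while atomic\n", what);
        return false;
    }
    return true;
}

/* Stand-in for a sleeping lock such as the mutex taken in i2c_model_read(). */
static bool model_sleeping_lock(void)
{
    return model_might_sleep("model_sleeping_lock");
}

/*
 * Stand-in for an allocation. may_sleep == true models a request like
 * kmalloc(..., GFP_KERNEL); may_sleep == false models a non-sleeping
 * request like GFP_ATOMIC, which is allowed in atomic context.
 */
static bool model_alloc(bool may_sleep)
{
    return may_sleep ? model_might_sleep("model_alloc") : true;
}
```

In this model, calling model_sleeping_lock() between model_enter_atomic() and model_exit_atomic() is reported as a bug, while the same call outside that window is fine. That mirrors why a mutex_lock() that is harmless in normal process context becomes a problem if the same code path is reached from an interrupt handler.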

@kkiyama117

kkiyama117 commented Feb 19, 2023

@tommy-bolger I'm a beginner at programming, but I did a little digging...
In the stack trace, it looks like i2c_model_read and sys_wait_sem_timer are both called. Considering that the last gc573 call is down() in sys_wait_sem_timer (driver/utils/misc/sys.c), is it possible that this and mutex_lock are interfering with each other? That function waits on a semaphore. I may be totally off base, but I thought it might be possible given the Stack Overflow answer you linked.

@tommy-bolger
Author

I've hired a Linux driver dev on Fiverr to attempt to fix this bug based on our findings. Their delivery window is roughly 8 days so I hope they can get this problem fixed. I'll post any updates here.

@tommy-bolger
Author

A small update for this driver. There's good news and bad news.

The good news is that I have a version of the driver that functions! However, the same error messages still flood the syslog. Since the logging is constant, even when the capture card isn't being used, it could wear out an SSD. The dev and I have been going back and forth trying new versions of the driver to stop it from logging those error messages.

The bad news is that we're starting to hit a limitation with how far we can get with this driver. A portion of it is closed source and cannot be edited. So we're attempting to work around this issue with some hacks. The hacks stop the error message from being spammed but the capture card can't resolve DRM handshakes so nothing actually shows in OBS.

We're still going back and forth and trying new drivers. I am still hopeful we can get something working this month.

@cdorn0

cdorn0 commented Mar 14, 2023

Did you find out where the error messages come from? I am pretty sure that the messages come from memory that is allocated with the wrong flags. I dissected the closed object files, and my guess is that the problem is somewhere in the open files. AFAIR the closed files contain fairly simple control stuff (besides the FPGA interface driver, which is actually quite complex).

@tommy-bolger
Author

tommy-bolger commented Mar 14, 2023

I updated my branch for the PR with the first fixed version of the driver: https://github.com/strank/lg4k-linux/pull/1/files

If you use this driver then your syslog will be spammed with a new error. This was what the developer said about it when I was discussing it with him:

"This error come out from the xilinx_xaver module, which is a closed source code."

After several driver revisions this is what he stated was the cause of the syslog spamming issue:

"Problem is that, developers of the drivers used mutex and semaphore in the interrupt context (this is a BUG on Linux). The same problem we have on the sys_wait_sem_timer function (the down function triggers the schedule to preempt the kthread during the interrupt context). And to complicate the situation, this function is called on the closed source xilinx_xavier module."
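
For anyone following along, a common general way to avoid sleeping in interrupt context is to have the interrupt side only record that something happened and leave the work that needs a mutex or semaphore to a worker that runs later, where sleeping is allowed. Below is a minimal userspace sketch of that idea in C. It is an assumption-laden model, not the fix used in the PR: all names (event_queue, model_irq_enqueue, model_worker_drain, model_handle) are hypothetical, it is single-threaded, and real driver code would also need proper synchronization between the interrupt and the worker. Whether this approach can be applied here at all is unclear, since the call in question is made from the closed-source module.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Fixed-size ring buffer of pending events (hypothetical structure). */
struct event_queue {
    int items[QUEUE_CAP];
    size_t head;  /* index of the oldest queued event */
    size_t count; /* number of queued events */
};

/* "Interrupt" side: records the event and returns at once; it never waits. */
static bool model_irq_enqueue(struct event_queue *q, int event)
{
    if (q->count == QUEUE_CAP)
        return false; /* queue full: drop the event rather than block */
    q->items[(q->head + q->count) % QUEUE_CAP] = event;
    q->count++;
    return true;
}

/*
 * "Worker" side: runs later, outside the interrupt, and hands each queued
 * event to the callback. Returns the number of events handled.
 */
static size_t model_worker_drain(struct event_queue *q, void (*handle)(int))
{
    size_t handled = 0;

    while (q->count > 0) {
        int event = q->items[q->head];

        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        handle(event); /* this is where sleeping locks could be taken */
        handled++;
    }
    return handled;
}

/* Example handler: sums the events so the effect is observable. */
static int handled_sum;

static void model_handle(int event)
{
    handled_sum += event;
}
```

The point of the split is that model_irq_enqueue() has no code path that can wait, so it stays safe to call from an atomic context, while anything that might sleep is confined to the handler invoked by model_worker_drain().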
