Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pci device plugin fix orphan goroutine and remove unused channel #67

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

WebberHuang1118
Copy link
Member

@WebberHuang1118 WebberHuang1118 commented Feb 20, 2024

Problem:
After a PCI device plugin is disabled, there is Go routine (invoked as the device plugin started) blocked by chan stop, however, chan stop is never touched, this brings an orphan Go routine once a PCI device is enabled and disabled.

Solution:
PCI device plugin fix the orphan goroutine and removes unused channel

Related Issue:
harvester/harvester#5179

Additional Context:
The CodeFactor Complex Method error could be fixed once PR #66 merged

@bk201 bk201 marked this pull request as draft February 20, 2024 06:29
@WebberHuang1118 WebberHuang1118 marked this pull request as ready for review February 20, 2024 10:27
@Yu-Jack
Copy link
Contributor

Yu-Jack commented Feb 29, 2024

I traced the original stop usage in kubevirt.

The heartBeat will trigger kubevirt to refresh the permitted devices.
-> Then it will start and stop all devices.
-> Then the DeviceController uses controlledDevice to close stopChan

In device_plugin part, it uses errChan to wait which waits for grpc server and health check error in Start function.
-> and there is a healthCheck function to detect stopChan to return nil if stopChan is closed.

The structure is one DeviceController uses multiple controlledDevice, and one controlledDevice is mapped to one devicePlugin. But, in our structure, we just used devicePlugin, and our device plugin Start function doesn't wait for health check, so I think that's why stop chan is weird in our structure.

@WebberHuang1118
Copy link
Member Author

WebberHuang1118 commented Mar 4, 2024

I traced the original stop usage in kubevirt.

The heartBeat will trigger kubevirt to refresh the permitted devices. -> Then it will start and stop all devices. -> Then the DeviceController uses controlledDevice to close stopChan

In device_plugin part, it uses errChan to wait which waits for grpc server and health check error in Start function. -> and there is a healthCheck function to detect stopChan to return nil if stopChan is closed.

The structure is one DeviceController uses multiple controlledDevice, and one controlledDevice is mapped to one devicePlugin. But, in our structure, we just used devicePlugin, and our device plugin Start function doesn't wait for health check, so I think that's why stop chan is weird in our structure.

@Yu-Jack, Thanks for your explanation,

IIUC, in the current implementation stop chan doesn't take effect, its existence is just because the PCI controller references some Kubevirt code snippet from the beginning.

IMHO, since this PR doesn't introduce any structural or control flow change, I'd like to remove the unused code path. WDYT? cc @bk201 @ibrokethecloud , thanks.

@Yu-Jack
Copy link
Contributor

Yu-Jack commented Mar 4, 2024

Just leave link to original comment #66 (comment) if we don't want to remove existing codes.

Copy link
Contributor

@Yu-Jack Yu-Jack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to delete stop channel. This stop is a control flag to control device plugin. For example, if we remove this stop, it can't return in health check function. Instead, we can remove done channel in this case.

@ibrokethecloud
Copy link
Collaborator

Looks like one channel is enough. we could either use done or stop to trigger cleanup. that channel should be used in the healthcheck.

this is legacy stuff which we copied over when we tried to setup the device plugins.

Thanks for the PR @WebberHuang1118 to help clean this up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants