Performance optimization #20
Some performance stats... If we can improve the delay and bring it down closer to 0.5 seconds, even at the cost of more CPU usage, I think it would be a good tradeoff, especially for those who have CPU bandwidth to spare. Perhaps there could even be a CPU-utilization/delay-time tradeoff config value that allows allocating more resources to reduce the video delay.
I get about 0.1 s per mainloop iteration on a Ryzen 8-core CPU with 100% load on one core. Evaluating the network seems to take 0.01-0.02 s, and getting the images and scheduling them seems to be fast, so body-pix plus the filters probably account for most of it. Maybe the segmentation for the next image could run in another thread before the current one is scheduled? In principle you could process one image per core by just grabbing the next frame whenever a core becomes idle again. For limiting the CPU usage, we could limit the framerate, but it should probably be at least 10 for a good stream.
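The framerate-limiting idea could be sketched with a simple sleep-based cap on the main loop. This is only an illustration, not code from the project; `process_frame` is a hypothetical stand-in for one full loop iteration:

```python
import time

def run_loop(process_frame, max_fps=10, num_frames=30):
    """Run the main loop, sleeping as needed to cap it at max_fps."""
    min_interval = 1.0 / max_fps
    for _ in range(num_frames):
        start = time.monotonic()
        process_frame()
        elapsed = time.monotonic() - start
        # Sleep off the remainder of the frame budget, freeing the CPU
        # for other processes instead of spinning.
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
```

A config value like the one suggested above could then simply map "more CPU allowed" to a higher `max_fps`.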
How are you measuring the loop iteration? In my case I just hold up a stopwatch in front of the camera and observe how far the fake video lags behind the real one.
This may not be the best profiler, but it's enough to get a rough impression of which parts may be slow. The latency between the webcam and the fake webcam is another issue, but for finding the slow parts the most important thing is what happens between `cap.read` and `schedule_frame`.
Right, so correctly stated, my latency is about a second. I just reviewed the mainloop code between `cap.read` and `schedule_frame`; there is a lot going on per frame there.
Some thoughts about performance
Using another thread for grabbing frames slightly increases the speed.
@Nerdyvedi Did you test it in some way? I think it will have quite a few issues. When grabbing the next frame at the beginning of the loop, you drop the in-between frames automatically. When grabbing in a separate thread, you need a buffer. By the time the filter thread is ready to process the next frame, the one in the buffer is already stale, so you would need to constantly fill a stack and drop the older frames yourself while processing the newest one from the top. The benefit of avoiding the delay of grabbing a frame is probably not worth the buffering and synchronization issues, unless you have numbers showing it is a lot faster.
Maybe no stack is needed, just double buffering and correct locking. I am still not convinced this is the most important part to optimize at the cost of complexity.
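A minimal double-buffering grabber along those lines could look like this: a background thread keeps overwriting a single slot with the newest frame, protected by a lock, so the consumer always sees the latest frame and in-between frames are dropped implicitly. This is a sketch, not project code; the `grab` callable stands in for something like `cap.read`:

```python
import threading

class LatestFrameGrabber:
    """Grab frames in a background thread, keeping only the newest one."""

    def __init__(self, grab):
        self._grab = grab              # callable returning the next frame
        self._lock = threading.Lock()
        self._frame = None
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while self._running:
            frame = self._grab()       # may block until the camera delivers
            with self._lock:
                self._frame = frame    # overwrite: older frames are dropped

    def latest(self):
        """Return the most recently grabbed frame (None before the first)."""
        with self._lock:
            return self._frame

    def stop(self):
        self._running = False
        self._thread.join()
```

The lock only guards the slot swap, so the consumer never reads a half-written frame reference.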
@allo-
Should I create a PR for the 2nd and 3rd points?
@allo- |
I am not that convinced by threading. One would like to have the most recent frame when the current one is finished (discarding all in-between frames) to minimize latency, but to get it you need to capture frames as fast as possible. Assuming capturing is a bottleneck, you would need to read the last complete frame, not the currently captured one (which is still unfinished, and reading it is costly), and prevent the race condition of the frame becoming complete and shifting the buffer.

It may be easier, at the cost of latency, to capture one frame and stop until the current one is processed, then process the (by then already quite old) frame and start capturing the next in the background thread. In both cases there needs to be a lock, for when capturing is slower than processing, to wait for the next frame.

In my experiments, capturing with MJPEG is limited by the webcam framerate and not by reading the frame. Capturing in h264 mode may be a bit slower, but MJPEG is probably optimal for frame-by-frame processing anyway (and for many cams the default or only format). And Python threads have some extra gotchas with the GIL and similar issues. So I am really skeptical whether shaving off something like 10 ms of capturing is worth it when the model evaluation takes 100 ms and the filters 200 ms.
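The "capture one frame ahead and wait until the current one is processed" variant maps naturally onto a single-slot queue, which provides exactly the blocking behavior described: the producer blocks while its one slot is occupied, and the consumer blocks when capture is slower than processing. A sketch under those assumptions (the `grab` and `process` callables are illustrative placeholders):

```python
import queue
import threading

def capture_loop(grab, frames, num_frames):
    """Producer: capture one frame ahead; put() blocks until the
    consumer has taken the previous frame (queue holds at most one)."""
    for _ in range(num_frames):
        frames.put(grab())

def process_all(grab, process, num_frames):
    """Consumer: process frames while the next one is captured in the
    background; get() blocks when capturing is slower than processing."""
    frames = queue.Queue(maxsize=1)
    producer = threading.Thread(
        target=capture_loop, args=(grab, frames, num_frames), daemon=True)
    producer.start()
    results = []
    for _ in range(num_frames):
        results.append(process(frames.get()))
    producer.join()
    return results
```

`queue.Queue` handles the locking internally, which avoids hand-rolling the condition-variable logic described above.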
I created a branch for benchmarking: https://github.com/allo-/virtual_webcam_background/tree/benchmark_webcam_fps — set the size options; the maximum I get for 800x600, with 30 fps supported by the cam, is 15 FPS with buffering (comment the …
When I lift the webcam cover and the room is bright enough, I get the full framerate for HD easily. The cam seems to lower the fps when the image is not bright enough, so I would say capturing is probably not a bottleneck. It might be interesting to measure not the fps in a loop but the time for grabbing a single frame, to see whether it can be read instantly from a buffer (in the cam, Linux, or cv) or whether it blocks for one frame, but I still think this is not a bottleneck with high priority. It could be interesting to parallelize filters, e.g. the input filters (blurring) and the foreground filters (e.g. color effects). Blur especially is quite CPU-intensive.
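Parallelizing the independent filter chains could be sketched with a small thread pool. The filter chains here are placeholders, not the project's actual filter API; the approach can pay off because NumPy and OpenCV release the GIL during heavy operations, so the two chains can genuinely overlap:

```python
from concurrent.futures import ThreadPoolExecutor

def apply_filters_parallel(frame, input_filters, foreground_filters):
    """Run the independent input and foreground filter chains concurrently
    on the same frame and return both results."""
    def run_chain(filters, data):
        for f in filters:
            data = f(data)
        return data

    with ThreadPoolExecutor(max_workers=2) as pool:
        input_future = pool.submit(run_chain, input_filters, frame)
        foreground_future = pool.submit(run_chain, foreground_filters, frame)
        return input_future.result(), foreground_future.result()
```

This only helps when the two chains really are independent; filters that feed each other would still have to run in sequence.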
@allo- |
Yes. Do you allow us to use your patches under the MIT license?
@Nerdyvedi
Can you have a look at why this does not work? I thought the …
After profiling, I found that most of the time is spent in the calculation of …
This would be a good idea. I just wonder whether this shouldn't already be the fastest part when there is no bottleneck; I would guess opencv, tensorflow, and numpy are able to compute something like that quickly. Do you have a good setup for benchmarking, or does it take much time for you as well? Looking at the code (I currently don't have the time for debugging it), I think some parts could be done in tensorflow only:
I think I used the pattern … to get the tensor as a numpy array, but one may be able to do more calculations in tensorflow (and so possibly on the GPU, when CUDA is set up) before converting to numpy. After converting the tensors to numpy for operations like averaging, they are used with opencv for dilation/erosion/blurring. So a fast fix would be to disable computing unnecessary image layers. A good feature would be letting plugins request which layers they need. This should also account for different models providing different layers (in different orders). And there is a new, faster mobilenet model that is not supported yet. It will probably make segmentation much faster (and, I guess, not provide as many part masks).
I just added timestamps with time.time() to every code section and printed the theoretical FPS (1/(endtime-starttime)) that each section produces. I will attach the code when I'm back at my PC.
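That per-section measurement could look roughly like this (a sketch; the section names and functions are illustrative, not the project's):

```python
import time

def timed_sections(sections, frame):
    """Run each named section on the frame and report the theoretical
    FPS each section would allow if it were the only cost per frame."""
    report = {}
    for name, fn in sections:
        start = time.time()
        frame = fn(frame)
        elapsed = time.time() - start
        report[name] = 1.0 / elapsed if elapsed > 0 else float("inf")
    return frame, report
```

The lowest per-section FPS then points at the dominant bottleneck.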
The code mentioned is in #61 |
Even with tensorflow-gpu, the program uses the CPU (one core) quite heavily.
Find out what the slow parts are and how well they can be optimized: cv2 operations?