Benchmarking the performance of JupyterWgpuCanvas #378

Closed · kushalkolar opened this issue Oct 16, 2023 · 3 comments

kushalkolar (Contributor) commented Oct 16, 2023

I went down a rabbit hole of testing out vispy/jupyter_rfb#76 and figuring out why the JupyterWgpuCanvas could never seem to exceed 30 fps. simplejpeg is very fast: encoding takes a few milliseconds, or less than 1 ms for astronaut.png. Increasing widget.max_buffered_frames worked for a basic RemoteFrameBuffer, but made no difference with the WgpuCanvas. So I found this:

def _request_draw(self):
    if not self._request_draw_timer_running:
        self._request_draw_timer_running = True
        call_later(self._get_draw_wait_time(), RemoteFrameBuffer.request_draw, self)

Dividing the draw_wait_time on L89 by 2 seems to increase performance! The framerate doubles with small canvases around 512x512, and is significantly higher for larger canvases.

I added a JupyterWgpuCanvas.delay_divisor, which divides the draw_wait_time in the call_later call, to test things.
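
Roughly, the change looks like this (a sketch only; the attribute name and default are just what I used for experimenting, not a final API):

# sketch of the patched _request_draw in wgpu/gui/jupyter.py (illustrative only)
def _request_draw(self):
    if not self._request_draw_timer_running:
        self._request_draw_timer_running = True
        # delay_divisor defaults to 1, i.e. the current behaviour
        call_later(
            self._get_draw_wait_time() / self.delay_divisor,
            RemoteFrameBuffer.request_draw,
            self,
        )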

I ran these on a Radeon RX 570 (an old GPU); I'll test on a more modern GPU later today.

This gives me 54 fps:

import fastplotlib as fpl
import numpy as np

a = np.random.rand(100, 100)

plot = fpl.Plot(size=(500, 500), name=f"fps: {0}")

plot.add_image(a)

i = 0

buffer_size = range(2, 200)
def update_frame(p):
    p.graphics[0].data = np.random.rand(100, 100)
    fps = p.canvas.get_stats()["fps"]
    plot.set_title(f"fps: {fps:.01f}")

plot.add_animations(update_frame)

# 30 fps without these two lines
plot.canvas.max_buffered_frames = 20
plot.canvas.delay_divisor = 2

plot.show(sidecar=False)

With delay_divisor = 2, the lag is barely perceptible during interaction; if it's increased to 4, the lag becomes very noticeable and the framerate barely increases. The pygfx controller also has damping, which probably helps reduce the effects of lag (if present) with delay_divisor = 2.

delay_divisor = 1: [video: d1.mp4]

delay_divisor = 2: [video: d2.mp4]

delay_divisor = 4 (lag is very obvious): [video: d4.mp4]

I benchmarked this with a larger canvas at 1700x900 and got the heatmap below, which seems to suggest that dividing the delay by 2 and buffering 10-20 frames gives the best performance. If there's a way to measure input lag, that would be nice to factor in as well!

[heatmap: fps as a function of delay divisor and max buffered frames]

Benchmarking code:

from itertools import product

import fastplotlib as fpl
import numpy as np
import pandas as pd
import seaborn as sns

delays = range(1, 10)

max_buffered = range(2, 100, 5)

test_grid = list(product(delays, max_buffered))

results = pd.DataFrame(index=delays, columns=max_buffered)

plot = fpl.Plot(size=(1700, 900))

# pre-make image
img = np.random.rand(900, 1700)

plot.add_image(img)

i = 0

# requires the experimental delay_divisor patch described above
plot.canvas.delay_divisor = test_grid[i][0]
plot.canvas.max_buffered_frames = test_grid[i][1]

def update_frame(p):
    global i
    stats = p.canvas.get_stats()
    fps = stats["fps"]
    if stats["sent_frames"] > 500:
        if i == len(test_grid):
            plot.set_title("done!")
            return

        # record fps for the (delay_divisor, max_buffered_frames) pair that was active
        results.loc[test_grid[i]] = fps
        i += 1

        if i < len(test_grid):
            # switch to the next parameter pair and reset the stats
            p.canvas.reset_stats()
            plot.canvas.delay_divisor = test_grid[i][0]
            plot.canvas.max_buffered_frames = test_grid[i][1]

    plot.set_title(f"fps: {fps:.01f}")

plot.add_animations(update_frame)
plot.show(sidecar=False)

ax = sns.heatmap(results.astype(float).round(), cmap="viridis", cbar_kws={"label": "fps"})
ax.set_ylabel("delay divisor")
ax.set_xlabel("max buffered frames")

Side note: the drop in fps with ~7 buffered frames is really odd.

Questions before doing a PR:

  1. Is JupyterWgpuCanvas._get_draw_wait_time() derived from a round-trip measurement, which is why dividing it by 2 increases performance?
  2. Do you think it would be a good idea to add delay_divisor as a @property to JupyterWgpuCanvas? Or is there some other caveat? The only drawback I found is input lag if the delay_divisor is large, but the benchmarks seem to show that increasing it beyond 2 doesn't increase performance anyway.
  3. I still need to test this without vispy/jupyter_rfb#76 (feat: avoid base64-encoding image data) to see if that is rate limiting.

Now this makes me wonder what the real bottleneck is when the canvas is very large, like near 4K. simplejpeg starts to slow down at these resolutions; I might look into nvjpeg.
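
For reference, a quick way to time the encode step at different canvas sizes (simplejpeg.encode_jpeg is the real API; the sizes and quality value here are arbitrary):

import time

import numpy as np
import simplejpeg

# rough encode-time check at a few canvas sizes
for w, h in [(512, 512), (1700, 900), (3840, 2160)]:
    frame = (np.random.rand(h, w, 3) * 255).astype(np.uint8)
    n = 10
    t0 = time.perf_counter()
    for _ in range(n):
        simplejpeg.encode_jpeg(frame, quality=90, colorspace="RGB")
    dt = (time.perf_counter() - t0) / n
    print(f"{w}x{h}: {dt * 1000:.1f} ms per encode")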

almarklein (Member) commented

Thanks for taking the time to do these kinds of benchmarks!

  1. Is JupyterWgpuCanvas._get_draw_wait_time() derived from a round-trip measurement, which is why dividing it by 2 increases performance?

The _get_draw_wait_time is not from jupyter_rfb, but a wgpu gui feature that implements max_fps, so creating a WgpuCanvas(max_fps=60) should have the same effect as delay_divisor=2:

wgpu-py/wgpu/gui/base.py, lines 178 to 182 in 2ffacd9:

def _get_draw_wait_time(self):
    """Get time (in seconds) to wait until the next draw in order to honour max_fps."""
    now = time.perf_counter()
    target_time = self._last_draw_time + 1.0 / self._max_fps
    return max(0, target_time - now)

  2. Do you think it would be a good idea to add delay_divisor as a @property to JupyterWgpuCanvas?

No, because we already have max_fps. It could be made a property (now it can only be set at initialization).
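
For example, setting max_fps at construction time (whether fastplotlib accepts a pre-made canvas like this is an assumption here, not something established in this thread):

import fastplotlib as fpl
from wgpu.gui.jupyter import JupyterWgpuCanvas

# max_fps is a constructor argument of the wgpu canvas base class;
# max_fps=60 should roughly correspond to the delay_divisor=2 experiment above
canvas = JupyterWgpuCanvas(max_fps=60)

# assumption: fpl.Plot can take a pre-made canvas
plot = fpl.Plot(canvas=canvas)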


To explain a bit about max_buffered_frames: when communicating with Jupyter we have to deal with the fact that it takes time to send frames from the server to the client. And the time this takes can differ a lot depending on the situation (e.g. localhost vs a server on the other side of the world).

If the server just sends every frame as soon as it can, then from the pov of the server things look fast, but it will put a lot of strain on the connection, which causes lag at the client, can even reduce the fps at the client, and eventually clogs the connection.

So what jupyter_rfb does is: for each frame the client receives, it sends a confirmation to the server. max_buffered_frames says how many frames the server may send (i.e. have in flight) since the last confirmed frame. Of course it also takes time for the confirmation to arrive. If the server only sent a new frame once it had confirmation of the last one (the equivalent of max_buffered_frames=0), you'd have a low fps and the connection would be unused most of the time. With max_buffered_frames=1 you basically send one extra frame; in theory this frame is sent to the client just as the confirmation of the previous frame returns. Having one more frame in the buffer seems to offer a bit better performance, but beyond that it simply means more frames are in flight, resulting in a larger delay from rendering to presentation, and a higher risk of straining the connection, which again increases the delay.
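
A minimal sketch of this flow-control idea (not jupyter_rfb's actual implementation; the class and method names are illustrative only):

class FrameThrottler:
    """Illustrative flow control: only send while few frames are unconfirmed."""

    def __init__(self, max_buffered_frames=1):
        self.max_buffered_frames = max_buffered_frames
        self.sent = 0        # frames sent to the client
        self.confirmed = 0   # frames the client has acknowledged

    @property
    def in_flight(self):
        return self.sent - self.confirmed

    def maybe_send(self, frame, send):
        # send() is whatever actually pushes the frame over the comm;
        # max_buffered_frames=0 means: only send when everything is confirmed
        if self.in_flight <= self.max_buffered_frames:
            send(frame)
            self.sent += 1
            return True
        return False  # skip this frame for now; wait for confirmations

    def on_confirmation(self, last_received):
        # the client reports the index of the last frame it received
        self.confirmed = max(self.confirmed, last_received)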

That dip with max_buffered_frames=7 is weird, though I find 7 already quite a lot.

To benchmark this stuff effectively, I think we'd need three measures:

  • FPS obviously gives a certain sense of smoothness. Should be measured at the client! Should probably measure min-max fps, because straining a connection may cause a variable FPS, which probably feels worse than a steady FPS at a lower average.
  • Presentation delay: time between rendering and the user seeing the result on screen.
  • Reaction delay: time between an action (e.g. clicking/moving the mouse) and seeing the result. This is actually a combination of the above. The FPS limit imposes a delay in the rendering, but higher FPS can increase presentation delay.

I've never done a thorough benchmark myself. I've only tried several values on different types of connections. I suppose the FPS limit is the major knob to turn. I expect reasonable values for max_buffered_frames to lie between 1 and 4. Other knobs include time to serialize vs time to send (i.e. does spending more time on image compression pay off by sending the image quicker).
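
As a rough sketch of how the reaction delay could be approximated from the kernel side only (the plot.renderer.add_event_handler usage and the event name are assumptions here, and this misses the client-side half of the round trip):

import time

import fastplotlib as fpl
import numpy as np

plot = fpl.Plot(size=(500, 500))
plot.add_image(np.random.rand(100, 100))

last_event_time = None
kernel_delays = []  # seconds from pointer event to the next draw, kernel side only

def on_pointer_move(event):
    # remember when the kernel received the interaction event
    global last_event_time
    last_event_time = time.perf_counter()

def record_delay(p):
    # runs right before each draw
    global last_event_time
    if last_event_time is not None:
        kernel_delays.append(time.perf_counter() - last_event_time)
        last_event_time = None

plot.renderer.add_event_handler(on_pointer_move, "pointer_move")
plot.add_animations(record_delay)
plot.show(sidecar=False)

# after interacting for a while, inspect e.g. np.mean(kernel_delays)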

almarklein (Member) commented

I created pygfx/rendercanvas#40. I also included a list of steps with some details / options.

almarklein (Member) commented

Let's continue the discussion there (even though it may require changes in the canvas context here).
