Timer hanging and high CPU load when using MultiThreadedExecutor #1223
Comments
I have the same issue, +1 for this.
From our weekly issue triage meeting: assigning @clalancette so he can make a high-level tracking issue about known issues with Python's executors. Also, we think this issue will not make progress without staffing from an engineer, which we don't currently have, or a dedicated community member to investigate and make a good suggestion on what to change and why. This is because executor-related issues tend to be very nuanced and complicated to work on. So we're also assigning this the "help wanted" label with that in mind.
Hi @JasperTan97, I recommend running py-spy to see where the high CPU load comes from. This helps immensely. Thanks.
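For anyone wanting to reproduce that kind of profile, a typical py-spy invocation looks roughly like this; the PID and output file name are placeholders:

```bash
pip install py-spy
# Attach to the running rclpy process and record a flame graph.
py-spy record -o profile.svg --pid <PID>
# Or watch a live, top-like view of where CPU time goes.
py-spy top --pid <PID>
```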
@KKSTB, it looks like waiting for ready callbacks is the issue. Do you get the same on your end?
@JasperTan97 I ran py-spy on my Ubuntu ROS Iron machine and got the following flame graphs:
This shows the 4 worker threads were mostly idle, which makes sense because there is basically nothing to do inside the subscriber and timer nodes. The main thread was instead very busy retrieving tasks for the 4 worker threads to do. It seems this workload of retrieving and distributing tasks and gathering results at a high frequency (500 Hz) is marginal for one core, so the rate slows down considerably. Although I have no clue why your single core is much slower (I can achieve 3xx-4xx Hz on my i7-9750H). As for the single-threaded executor, I can achieve 500 Hz at half the CPU utilization of the multi-threaded case, and I have to push to 2500 Hz before the rate starts to drop. I believe the problem has to do with the efficiency of transferring tasks from the main thread to the worker threads, relative to the actual useful work done in the worker threads.
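To make that hand-off overhead concrete, here is a small toy benchmark in plain Python (not rclpy internals); the thread count and iteration count are arbitrary choices. It compares running a near-empty callback inline with pushing every call through a worker pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def callback():
    pass  # stand-in for a timer/subscriber callback that does almost nothing


def rate(run, n=5000):
    """Return how many callbacks per second `run` can push through."""
    start = time.perf_counter()
    for _ in range(n):
        run(callback)
    return n / (time.perf_counter() - start)


# Run callbacks inline, roughly what a single-threaded executor does.
print(f"inline: {rate(lambda cb: cb()):.0f} Hz")

# Hand every callback to a worker pool and wait for it, mimicking the
# per-callback coordination cost a multi-threaded executor has to pay.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(f"pooled: {rate(lambda cb: pool.submit(cb).result()):.0f} Hz")
```

On most machines the pooled variant is dramatically slower, which matches the observation that distributing trivial tasks, rather than executing them, is what limits the rate.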
I experience the same issue, +1
As MultiThreadedExecutor is broken, see ros2/rclpy#1223
But it doesn't show the files and line numbers correctly. Therefore I have included another screenshot, which does show this. You should check the line numbers in this revision.

Update: I have looked at the code a bit.
This does indeed not work for child classes, but the __repr__ was also not working for child classes. And currently there are no child classes, so performance is more important. See ros2#1223

Signed-off-by: Matthijs van der Burgh <[email protected]>
Yep, you are correct. And this flame graph is the kind of thing I was waiting to see to confirm. So thanks for spending the time to do that. We spent a bunch of time over the last year optimizing the
@clalancette just let me know whether you need a more detailed graph, etc.
To be totally transparent: we don't (currently) have plans to work on this. While that could change in the future, if it is something you are interested in, I'd suggest trying to make some improvements and opening PRs. We are more than happy to review them.
@clalancette I don't have all the knowledge to fix this, but I do think it needs to be fixed, as it is baked into many ROS applications, e.g. rqt, yet it works much worse than the SingleThreadedExecutor. I want to help, though I am not able to figure out its design; it is very complex. So it would be helpful if someone could explain in enough detail what should be happening. Then other people and I could check the code.
@MatthijsBurgh can you tell us what frequency you can achieve? I can achieve 2xx-3xx Hz on an i7-9750H (9th gen Intel), but @JasperTan97 only managed 2x Hz on an i7-1355U (13th gen Intel). I think the problem lies in why there is such a big difference.
SingleThreadedExecutor: 500 Hz

It is really weird that newer hardware is much slower. But still, the MultiThreadedExecutor should only be slower than the SingleThreadedExecutor with very few threads.

Correction: the results above are with a
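For anyone comparing these rates, one quick way to measure the delivered frequency from outside the process is the ros2 CLI; the topic name here is a placeholder:

```bash
ros2 topic hz /chatter
```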
I have no idea. Maybe try these:
Commands to use CycloneDDS:
and ddsconfig.xml:
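As a rough sketch of what switching to CycloneDDS usually involves (the apt package name assumes ROS Iron, and the config path and network interface are placeholders):

```bash
sudo apt install ros-iron-rmw-cyclonedds-cpp
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export CYCLONEDDS_URI=file:///path/to/ddsconfig.xml
ros2 run <your_package> <your_node>
```

and a minimal ddsconfig.xml could look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <Interfaces>
        <!-- Restrict CycloneDDS to one interface; "lo" is just an example. -->
        <NetworkInterface name="lo"/>
      </Interfaces>
    </General>
  </Domain>
</CycloneDDS>
```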
@KKSTB Using CycloneDDS did fix the issue for me. All MultiThreadedExecutor configurations are now able to reach 500 Hz. Now I am still wondering how a bad network/RMW makes the distribution of tasks to the worker threads so slow.
* (NumberOfEntities) add __iadd__
* (NumberOfEntities) improve __add__ performance

This does indeed not work for child classes, but the __repr__ was also not working for child classes. And currently there are no child classes, so performance is more important. See #1223

Signed-off-by: Matthijs van der Burgh <[email protected]>
My guesses at the reasons:
I have no clue; I can only observe that there is a big difference in the distribution of the calls. I have run py-spy with both SVG and speedscope output (the SVG is not a result of the same execution as the speedscope). I have run all combinations such that each has around 30,000 samples (at a sampling rate of 500 Hz, that is around 60 s of runtime). The results can be found at https://gist.github.com/MatthijsBurgh/cd7871c6597c4cdc196526b658693a18
@MatthijsBurgh thank you for your analysis. In fact py-spy has an -n option to also sample non-Python libraries, to get a better look at what takes the most CPU time, similar to the recording in #1223 (comment) where rcl_wait and rmw_wait can be seen in the graph. But this option is very slow, so the py-spy sampling frequency should be lowered too. Still, I think that even if you rerun py-spy with this option, it will be difficult to dig into a C++ problem from py-spy. You could probably also try setting the ROS 2 log level to debug.
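For reference, those two suggestions usually look something like this; the sampling rate, package, and node names are placeholders:

```bash
# Sample native (C/C++) frames as well, at a reduced sampling rate.
py-spy record --native --rate 100 -o profile.svg --pid <PID>

# Run the node with debug logging enabled.
ros2 run <your_package> <your_node> --ros-args --log-level debug
```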
I seem to have a related issue with heavy CPU usage (100%), even when running the sender and receiver in separate processes. In one program, I run a sender that sends strings at a frequency of 10 Hz. In another program, I run a receiver with a MultiThreadedExecutor. Initially, I used the
@astroseger are you using fastDDS? Would you also try CycloneDDS?
Yes, I was using fastrtps (the default in the ros:jazzy docker image). With CycloneDDS I don't have this issue with 100% CPU usage. (I'm referring to the test where I run a sender in a separate process, which sends short strings at a frequency of 10 Hz, while I run a receiver in another separate process using a MultiThreadedExecutor.)
Thanks @astroseger. It seems I'm able to repeat the high CPU load issue when using fastrtps and separating the pub and sub nodes. The sub node has an unusually high CPU load. I ran the sub node and slowed the publisher down to 1 Hz so that the sequence of events can be observed from the timestamps. The publisher uses the now() timestamp as the message content, and the subscriber prints the message data, so that we can see when each message was published.

Pub:

```python
def timer_callback(self):
    msg = std_msgs.msg.String()
    time = self.get_clock().now().seconds_nanoseconds()
    msg.data = str(time[0]) + '.' + str(time[1])
    self.publisher_.publish(msg)
```

Sub:

```python
def listener_callback(self, msg):
    print(msg.data)
```

Outputs:

FastRTPS multithreaded executor (high CPU load, independent of reliability=reliable / best effort. Num of threads=8):
For comparison, cycloneDDS multithreaded executor (normal):
FastRTPS / cycloneDDS singlethreaded executor (normal):
Observations:
ROS2 version: Iron

Identical issue:

Possibly similar issue:
* (NumberOfEntities) add __iadd__
* (NumberOfEntities) improve __add__ performance

This does indeed not work for child classes, but the __repr__ was also not working for child classes. And currently there are no child classes, so performance is more important. See ros2#1223

Signed-off-by: Matthijs van der Burgh <[email protected]>
(cherry picked from commit 786c464)
Bug report
Required Info:
Operating System: Ubuntu 22.04
Installation type: Binaries
Version or commit hash: Iron
DDS implementation: eProsima’s Fast DDS (the default)
Client library (if applicable): rclpy
CPU info (if needed):
Steps to reproduce issue
My publisher:
My subscriber:
And my main function:
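For readers trying to reproduce this, a minimal setup along the lines discussed in the thread might look like the following sketch; the topic name, the 500 Hz timer rate, and the thread count are assumptions, not necessarily the exact code used here:

```python
import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String


class Talker(Node):
    def __init__(self):
        super().__init__('talker')
        self.pub = self.create_publisher(String, 'chatter', 10)
        # 500 Hz timer, the kind of rate discussed in the comments above.
        self.create_timer(1.0 / 500.0, self.timer_callback)

    def timer_callback(self):
        msg = String()
        msg.data = 'hello'
        self.pub.publish(msg)


class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        self.create_subscription(String, 'chatter', self.listener_callback, 10)

    def listener_callback(self, msg):
        pass  # intentionally trivial; the issue shows up even with empty callbacks


def main():
    rclpy.init()
    talker, listener = Talker(), Listener()
    executor = MultiThreadedExecutor(num_threads=4)
    executor.add_node(talker)
    executor.add_node(listener)
    try:
        executor.spin()
    finally:
        executor.shutdown()
        rclpy.shutdown()


if __name__ == '__main__':
    main()
```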
Expected behavior (which is what I get when using the SingleThreadedExecutor)
Actual behavior
Additional information
Similar issues have been brought up with rclcpp, but I have not seen any comments made about rclpy. I found an issue here: ros2/rclcpp#1487, with other people also reporting something similar: ros2/rclcpp#1618, and the fix is ros2/rclcpp#1516, followed by ros2/rclcpp#1692.
Aside from the timer callback hanging (which I assume it is, after following it with some tic-toc timing), my CPU load becomes really high when using the MultiThreadedExecutor, while the SingleThreadedExecutor does not cause any noticeable CPU load. I have also tried using both the MutuallyExclusiveCallbackGroup and the ReentrantCallbackGroup, with no change in behaviour. I am not sure if my QoS settings are the problem, or whether this is an issue intrinsic to Python (because of the GIL, etc.), but either a more suitable example of how to use the MultiThreadedExecutor could be provided (if my usage is wrong), or the ROS wiki pages should reflect that this significant problem exists (if no fix is possible).
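For context, attaching those callback groups typically looks something like this small sketch (topic name and timer period are placeholders):

```python
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup, ReentrantCallbackGroup
from rclpy.node import Node
from std_msgs.msg import String


class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        # Swap in MutuallyExclusiveCallbackGroup() to compare the two behaviours.
        group = ReentrantCallbackGroup()
        self.create_subscription(String, 'chatter', self.listener_callback, 10,
                                 callback_group=group)
        self.create_timer(0.002, self.timer_callback, callback_group=group)

    def listener_callback(self, msg):
        pass

    def timer_callback(self):
        pass
```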
Thank you for helping!