Fix constantly increasing memory in std::list #636
When someone keeps publishing with the same tf2 timestamp (an application error, I know), the storage_ of the tf2::TimeCache grows without bound, causing system-wide memory leaks. In a ROS system, each tf2 listener spawns one of these TimeCache objects, so every ROS node that has any sort of tf2 listener will slowly allocate more and more memory.
Just highlighting that this leak cost us a lot of time. We are open to contributing PRs to fix it, and unit tests to make sure it won't happen again.
Thank you for the catch! Some of the tests are failing, I assume because of this new condition. Do you mind fixing it?
I also fixed the situation where two different transforms need to be inserted at the same timestamp. Since the tests were failing, I realized that I had to dive a bit deeper than the simple solution and extend the TransformStorage class to be comparable. Now, instead of just looking at the timestamps, the implementation avoids inserting duplicates into the buffer. If, for some reason, someone is publishing a different Transform (xyz, rpy) at the same timestamp, that case is not being captured now, but I'd say that's another problem.
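As a rough illustration of the idea described above (not the actual tf2 code), the sketch below uses a hypothetical `SimpleTransform` struct standing in for TransformStorage and a hypothetical `insert_unique` helper. It assumes the list is kept newest-first and that "duplicate" means every field compares equal, including the floating-point values.

```cpp
#include <cstdint>
#include <list>
#include <tuple>

// Hypothetical, simplified stand-in for tf2::TransformStorage.
struct SimpleTransform
{
  std::int64_t stamp_ns;   // timestamp in nanoseconds
  double x, y, z;          // translation
  double qx, qy, qz, qw;   // rotation quaternion
  int frame_id;            // parent frame
  int child_frame_id;      // child frame

  // Two entries are duplicates only if every field matches exactly.
  bool operator==(const SimpleTransform & o) const
  {
    return std::tie(stamp_ns, x, y, z, qx, qy, qz, qw, frame_id, child_frame_id) ==
           std::tie(o.stamp_ns, o.x, o.y, o.z, o.qx, o.qy, o.qz, o.qw,
                    o.frame_id, o.child_frame_id);
  }
};

// Inserts new_data into a newest-first list unless an identical entry
// already exists at the same timestamp. Returns true if it was inserted.
inline bool insert_unique(std::list<SimpleTransform> & storage,
                          const SimpleTransform & new_data)
{
  // Find the first entry that is not newer than the incoming one.
  auto it = storage.begin();
  while (it != storage.end() && it->stamp_ns > new_data.stamp_ns) {
    ++it;
  }
  // Scan the entries sharing this timestamp for an exact duplicate.
  for (auto dup = it; dup != storage.end() && dup->stamp_ns == new_data.stamp_ns; ++dup) {
    if (*dup == new_data) {
      return false;  // already stored, so the list does not grow
    }
  }
  storage.insert(it, new_data);
  return true;
}
```

With this sketch, republishing the same transform at the same timestamp leaves the list size unchanged, while a different transform at that timestamp is still stored, which matches the limitation mentioned in the comment.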
Done, I had to complicate the solution slightly in 2f2bedd, but I think it's still worth the fix. I'm entirely open to other ideas as well.
^ Your refactor doesn't look that nutty, so I think folks would be open to it in a follow-up PR after this is done. Make sure to track CI and make it green. The rolling build has some test failures. That way, once someone has a chance to look at this, it can be merged quickly if they agree, without waiting for you to cycle through fixes.
Done 😎 Then I'll try to make some room to open a follow-up that makes more use of the STL inside the TimeCache module, aiming at improving it. Thanks for the positive feedback :)
@nachovizzo what about adding a unit test ?
@ahcorde sure thing. It will take me some time since I'm currently overloaded :) I'd add a simple test that just guarantees that the application code cannot insert the same transformation multiple times. This will indirectly test that the Would that work?
Sounds good to me, thank you.
Sorry for the delay, I just pushed some unit tests for this particular corner case.
I've left a few very minor things to fix.
Co-authored-by: Chris Lalancette <[email protected]> Signed-off-by: Ignacio Vizzo <[email protected]>
Thank you for tracking the issue and contributing back @nachovizzo
Should we backport this to
I think that seems reasonable; it shouldn't break API or ABI. I'll ask the bot to do it.
@Mergifyio backport humble iron
✅ Backports have been created
* Fix constantly increasing memory in std::list
When someone keeps publishing with the same tf2 timestamp (an application error, I know), the storage_ of the tf2::TimeCache grows without bound, causing system-wide memory leaks. In a ROS system, each tf2 listener spawns one of these TimeCache objects, so every ROS node that has any sort of tf2 listener will slowly allocate more and more memory.
Also fixed the situation where two different transforms need to be inserted at the same timestamp. Since the tests were failing, I realized that I had to dive a bit deeper than the simple solution and extend the TransformStorage class to be comparable. Now, instead of just looking at the timestamps, the implementation avoids inserting duplicates into the buffer. If, for some reason, someone is publishing a different Transform (xyz, rpy) at the same timestamp, that case is not being captured now, but I'd say that's another problem.
Signed-off-by: Ignacio Vizzo <[email protected]> (cherry picked from commit 1621942)
Description
This PR solves this issue for us: #630
When someone keeps publishing with the same tf2 timestamp (an application error, I know), the storage_ of the tf2::TimeCache grows without bound, causing system-wide memory leaks. In a ROS system, each tf2 listener spawns one of these TimeCache objects, so every ROS node that has any sort of tf2 listener will slowly allocate more and more memory.
How to test it
I would like to add a unit test for this, but I'm putting that on my to-do list since I have no more time to work on it right now.
I've created a small repo where the code used for doing an isolation test of this problem can be found: https://github.com/nachovizzo/cyclone_dds_leak (ignore the name)
In that repo, there are 2 nodes. The offender node publishes a tf over the wire non-stop and never changes its timestamp. The second node is not doing anything; it just creates a tf2 listener. Because of the offender, though, the storage_ member of the TimeCache gets populated without any sort of bound and thus keeps allocating more and more memory for that list. This is just 1 node that does nothing in particular, so you can imagine the massive amount of leaked memory we were experiencing in our system, with dozens of nodes that have a tf2_ros::TransformListener.
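To make the growth mechanism concrete, here is a minimal model of a cache whose only bound is a time window. `ToyTimeCache` and its members are hypothetical names for illustration, and the sketch assumes pruning only drops entries older than the newest timestamp by more than the window; it is not the tf2 implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical toy cache that stores only timestamps, newest first.
class ToyTimeCache
{
public:
  explicit ToyTimeCache(std::int64_t max_storage_time)
  : max_storage_time_(max_storage_time) {}

  // Inserts a timestamp at its sorted position, then prunes old entries.
  void insert(std::int64_t stamp)
  {
    auto it = storage_.begin();
    while (it != storage_.end() && *it > stamp) {
      ++it;
    }
    storage_.insert(it, stamp);
    prune();
  }

  std::size_t size() const { return storage_.size(); }

private:
  // Drops entries that fall outside the window measured from the newest
  // timestamp. If the newest timestamp never advances, nothing is dropped.
  void prune()
  {
    assert(!storage_.empty());
    const std::int64_t latest = storage_.front();
    while (!storage_.empty() && storage_.back() + max_storage_time_ < latest) {
      storage_.pop_back();
    }
  }

  std::list<std::int64_t> storage_;
  std::int64_t max_storage_time_;
};
```

In this model, inserting a timestamp that advances on every call keeps the list at roughly one window's worth of entries, whereas inserting the same timestamp over and over adds one entry per call, because no entry ever becomes older than the window.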
Offender node:
Consumer node, the poor guy who shows the memory leak symptoms:
Running just those 2 nodes is sufficient for these tests.
Heaptrack memory analysis (massif can also be used)
Originally this is how we detected the problem after many hours of debugging, by running the nodes that had a tf2 listener through valgrind/massif/heaptrack.
I let the leaky nodes run for 7 minutes without touching my system and these are the results:
Heaptrack analysis after 7 minutes from rolling
In this case, the tf2_node had already consumed 45 MiB, all of it duplicates of the dummy transform emitted by the offender node.
Heaptrack analysis after 7 minutes from rolling + this PR
By avoiding inserting repeated elements into the list, the results are as expected.
Open questions
geometry2/tf2/include/tf2/time_cache.h
Line 113 in e07648f
pruneList() method, and this indeed also solved our leak problems, but it would slightly degrade performance. Are you also interested in adding that, just as a sanity check?
Possible related issues