
Fix constantly increasing memory in std::list #636

Merged: 9 commits into ros2:rolling on Jan 10, 2024

Conversation

nachovizzo
Contributor

@nachovizzo nachovizzo commented Nov 21, 2023

Description

This PR solves this issue for us: #630

When someone is constantly publishing with the same tf2 timestamp (application error, I know), the storage_ of the tf2::TimeCache grows unbounded, causing system-wide memory leaks. In a ROS system, each tf2 Listener spawns one of these TimeCache objects, so every ROS node with any sort of tf2 listener will slowly allocate more and more memory.

How to test it

I would like to add a unit test for this, but I'll put it on my to-do list since right now I have no more time for it.

I've created a small repo with the code used for an isolated test of this problem: https://github.com/nachovizzo/cyclone_dds_leak (ignore the name)

In that repo there are 2 nodes. The offender publishes a tf non-stop over the wire without ever changing its timestamp. The second node does nothing but create a tf2 listener; because of the offender, the storage_ member of the TimeCache gets populated without any bound and thus keeps allocating more and more memory in that list. This is just one node doing nothing in particular; you can imagine the massive amount of memory leaked in our system, with dozens of nodes that have a tf2_ros::TransformListener.

Offender node:

#include <chrono>
#include <functional>
#include <memory>
#include <string>

#include "geometry_msgs/msg/transform_stamped.hpp"
#include "rclcpp/rclcpp.hpp"
#include "tf2_ros/transform_broadcaster.h"

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        // Capture the moment when the node is created
        fixed_timestamp = this->get_clock()->now();
        // Create a tf broadcaster to send dummy transformations over the wire
        tf_broadcaster_ = std::make_unique<tf2_ros::TransformBroadcaster>(*this);
        // Control the publishing with a timer; do it fast to observe the "leak" right away
        timer_ = create_wall_timer(std::chrono::duration<double>(1.0 / publish_rate_),
                                   std::bind(&LeakyNode::timerCallback, this));
    }

    void timerCallback() {
        geometry_msgs::msg::TransformStamped dummy_pose;
        dummy_pose.header.stamp = fixed_timestamp;  // never changes (application error)
        dummy_pose.header.frame_id = "pepe";
        dummy_pose.child_frame_id = "jose";
        tf_broadcaster_->sendTransform(dummy_pose);
    }

    std::unique_ptr<tf2_ros::TransformBroadcaster> tf_broadcaster_;
    rclcpp::TimerBase::SharedPtr timer_;
    double publish_rate_ = 1000.0;
    rclcpp::Time fixed_timestamp;
};

Consumer node, the poor guy that shows the memory-leak symptoms:

#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"

// A separate executable; the name clash with the offender node is incidental.
class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        tf_buffer = std::make_unique<tf2_ros::Buffer>(this->get_clock());
        tf_listener = std::make_unique<tf2_ros::TransformListener>(*tf_buffer);
    }
    std::unique_ptr<tf2_ros::TransformListener> tf_listener;
    std::unique_ptr<tf2_ros::Buffer> tf_buffer;
};

Just running those 2 nodes is sufficient for these tests.

Heaptrack memory analysis (massif can also be used)

Originally, this is how we detected the problem after many hours of debugging: by running the nodes that had a tf2 listener through valgrind/massif/heaptrack.

I let the leaky nodes run for 7 minutes without touching my system, and these are the results:

Heaptrack analysis after 7 minutes on rolling

In this case, the tf2_node already consumed 45 MiB, all of it duplicates of the dummy transform spat out by the offender node:

[heaptrack screenshots]

Heaptrack analysis after 7 minutes on rolling + this PR

By avoiding inserting repeated elements into the list, the results are as expected:

[heaptrack screenshots]

Open questions

  • I'd like to carry out a bit of improvement on this module (I already did on a private branch), improving the readability of the implementation without sacrificing performance. Are you guys interested in me doing this? Branch: https://github.com/nachovizzo/geometry2/tree/nacho/improve_tf2_cache
  • I also found this constant lying in the header but not consumed anywhere; I'd love to open another PR to prevent the linked list from growing (beyond this PR):
    static const unsigned int MAX_LENGTH_LINKED_LIST = 1000000;
  • I also tried using std::list::unique at the end of the pruneList() method, and this also solved our leak problems, but it degraded performance a bit. Are you guys also interested in adding that just as a sanity check?

@doisyg

doisyg commented Nov 21, 2023

Just highlighting that this leak cost us a lot of time. We are open to contributing PRs to fix it, and unit tests to make sure it won't happen again.

@ahcorde
Contributor

ahcorde commented Nov 21, 2023

Thank you for the catch! Some of the tests are failing; I assume it's because of this new condition. Do you mind fixing it?

And also fixed the situation where two different transforms need to be
inserted at the same timestamp.

Since the tests were failing, I realized I had to dive a bit deeper
than the simple solution and extend the TransformStorage class to
be comparable; now, instead of just looking at the timestamps, the
implementation avoids inserting duplicates in the buffer.

If for some reason someone publishes a different Transform
(xyz, rpy) at the same timestamp, that case is not captured
now, but I'd say that's another problem.
@nachovizzo
Contributor Author

> Thank you for the catch! Some of the tests are failing; I assume it's because of this new condition. Do you mind fixing it?

Done, I had to complicate the solution slightly more in 2f2bedd, but I think it's still worth the fix.

I'm entirely open to other ideas as well.

@SteveMacenski
Contributor

SteveMacenski commented Nov 21, 2023

^ Your refactor doesn't look that nutty, so I think folks would be open to it in a follow-up PR after this is done.

Make sure to track CI and make it green; the rolling build has some test failures. That way, once someone has a chance to look at this, it can be merged quickly if they agree, without waiting for you to cycle through fixes.

@nachovizzo
Contributor Author

> ^ Your refactor doesn't look that nutty, so I think folks would be open to it in a follow-up PR after this is done.
>
> Make sure to track CI and make it green; the rolling build has some test failures. That way, once someone has a chance to look at this, it can be merged quickly if they agree, without waiting for you to cycle through fixes.

Done 😎

Then I'll try to make some room to open a follow-up PR with more usage of the STL inside the TimeCache module, aiming to improve it. Thanks for the positive feedback :)

Contributor

@ahcorde ahcorde left a comment


@nachovizzo what about adding a unit test?

@nachovizzo
Contributor Author

@ahcorde sure thing. It will take me some time since I'm currently overloaded :)

I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).

Would that work?

@ahcorde
Contributor

ahcorde commented Nov 27, 2023

> @ahcorde sure thing. It will take me some time since I'm currently overloaded :)
>
> I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).
>
> Would that work?

sounds good to me, thank you

@nachovizzo nachovizzo requested a review from ahcorde January 9, 2024 12:10
@nachovizzo
Contributor Author

> > @ahcorde sure thing. It will take me some time since I'm currently overloaded :)
> > I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).
> > Would that work?
>
> sounds good to me, thank you

Sorry for the delay; I just pushed some unit tests for this particular corner case.

tf2/test/test_storage.cpp
Contributor

@clalancette clalancette left a comment


I've left a few very minor things to fix.

tf2/src/cache.cpp
tf2/test/cache_unittest.cpp (3 review threads)
nachovizzo and others added 4 commits January 10, 2024 09:55
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Contributor

@ahcorde ahcorde left a comment


  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ahcorde ahcorde merged commit 1621942 into ros2:rolling Jan 10, 2024
2 checks passed
@ahcorde
Contributor

ahcorde commented Jan 10, 2024

Thank you for tracking the issue and contributing back @nachovizzo

@nachovizzo
Contributor Author

Should we backport this to iron and humble?

@clalancette
Contributor

> Should we backport this to iron and humble?

I think that seems reasonable; it shouldn't break API or ABI. I'll ask the bot to do it.

@clalancette
Contributor

@Mergifyio backport humble iron

Contributor

mergify bot commented Feb 12, 2024

> backport humble iron

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 12, 2024
* Fix constantly increasing memory in std::list

When someone is constantly publishing with the same tf2 timestamp
(application error, I know), the storage_ of the tf2::TimeCache grows
unbounded, causing system-wide memory leaks. In a ROS system, each tf2
Listener spawns one of these TimeCache objects, so every ROS node with
any sort of tf2 listener will slowly allocate more and more memory.

And also fixed the situation where two different transforms need to be
inserted at the same timestamp.

Since the tests were failing, I realized I had to dive a bit deeper
than the simple solution and extend the TransformStorage class to
be comparable; now, instead of just looking at the timestamps, the
implementation avoids inserting duplicates in the buffer.

If for some reason someone publishes a different Transform
(xyz, rpy) at the same timestamp, that case is not captured
now, but I'd say that's another problem.

Signed-off-by: Ignacio Vizzo <[email protected]>
(cherry picked from commit 1621942)
mergify bot pushed a commit that referenced this pull request Feb 12, 2024
ahcorde pushed a commit that referenced this pull request Feb 13, 2024
ahcorde pushed a commit that referenced this pull request Feb 13, 2024