
Fix constantly increasing memory in std::list #636

Merged: 9 commits into ros2:rolling on Jan 10, 2024

Conversation

nachovizzo
Contributor

@nachovizzo nachovizzo commented Nov 21, 2023

Description

This PR solves this issue for us: #630

When someone is constantly publishing with the same tf2 timestamp (application error, I know), the storage_ of the tf2::TimeCache grows unbounded, causing system-wide memory leaks. In a ROS system, each tf2 Listener spawns one of these TimeCache objects, so every ROS node with any sort of tf2 listener will slowly allocate more and more memory.

How to test it

I would like to add a unit test for this, but I'll put it on my to-do list since right now I have no more time for it.

I've created a small repo with the code used for an isolated test of this problem: https://github.com/nachovizzo/cyclone_dds_leak (ignore the name)

In that repo there are 2 nodes. The offender publishes a tf non-stop over the wire without ever changing its timestamp. The second node does nothing but create a tf2 listener; because of the offender, the storage_ member of the TimeCache gets populated without any bound and thus keeps allocating more and more memory in that list. This is just one node doing nothing in particular; you can imagine the massive amount of memory leaked in our system, with dozens of nodes that have a tf2_ros::TransformListener.

Offender node:

#include <chrono>
#include <functional>
#include <memory>
#include <string>

#include "geometry_msgs/msg/transform_stamped.hpp"
#include "rclcpp/rclcpp.hpp"
#include "tf2_ros/transform_broadcaster.h"

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        // Capture the moment when the node is created
        fixed_timestamp = this->get_clock()->now();
        // Create a tf broadcaster to send dummy transformations over the wire
        tf_broadcaster_ = std::make_unique<tf2_ros::TransformBroadcaster>(*this);
        // Control the publishing with a timer; do it fast to observe the "leak" right away
        timer_ = create_wall_timer(std::chrono::duration<double>(1.0 / publish_rate_),
                                   std::bind(&LeakyNode::timerCallback, this));
    }

    void timerCallback() {
        geometry_msgs::msg::TransformStamped dummy_pose;
        dummy_pose.header.stamp = fixed_timestamp;  // never changes (application error)
        dummy_pose.header.frame_id = "pepe";
        dummy_pose.child_frame_id = "jose";
        tf_broadcaster_->sendTransform(dummy_pose);
    }

    std::unique_ptr<tf2_ros::TransformBroadcaster> tf_broadcaster_;
    rclcpp::TimerBase::SharedPtr timer_;
    double publish_rate_ = 1000.0;
    rclcpp::Time fixed_timestamp;
};

Consumer node, the poor guy that shows the memory-leak symptoms:

#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"

// A separate executable; the name clash with the offender node is incidental.
class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        tf_buffer = std::make_unique<tf2_ros::Buffer>(this->get_clock());
        tf_listener = std::make_unique<tf2_ros::TransformListener>(*tf_buffer);
    }
    std::unique_ptr<tf2_ros::TransformListener> tf_listener;
    std::unique_ptr<tf2_ros::Buffer> tf_buffer;
};

Just running those 2 nodes is sufficient for these tests.

Heaptrack memory analysis (massif can also be used)

Originally, this is how we detected the problem after many hours of debugging: by running the nodes that had a tf2 listener through valgrind/massif/heaptrack.

I let the leaky nodes run for 7 minutes without touching my system, and these are the results:

Heaptrack analysis after 7 minutes on rolling

In this case, the tf2_node already consumed 45 MiB, all of it duplicates of the dummy transform spat out by the offender node:

[heaptrack screenshots]

Heaptrack analysis after 7 minutes on rolling + this PR

By avoiding inserting repeated elements into the list, the results are as expected:

[heaptrack screenshots]

Open questions

  • I'd like to carry out a bit of improvement on this module (I already did on a private branch), improving the readability of the implementation without sacrificing performance. Are you guys interested in me doing this? Branch: https://github.com/nachovizzo/geometry2/tree/nacho/improve_tf2_cache
  • I also found this constant lying in the header but not consumed anywhere; I'd love to open another PR to prevent the linked list from growing (beyond this PR):
    static const unsigned int MAX_LENGTH_LINKED_LIST = 1000000;
  • I also tried using std::list::unique at the end of the pruneList() method, and this also solved our leak problems, but it degraded performance a bit. Are you guys also interested in adding that just as a sanity check?

@doisyg

doisyg commented Nov 21, 2023

Just highlighting that this leak cost us a lot of time. We are open to contributing PRs to fix it, and unit tests to make sure it won't happen again.

@ahcorde
Contributor

ahcorde commented Nov 21, 2023

Thank you for the catch! Some of the tests are failing; I assume it's because of this new condition. Do you mind fixing it?

And also fixed the situation where two different transforms need to be
inserted at the same timestamp.

Since the tests were failing, I realized I had to dive a bit deeper
than the simple solution and extend the TransformStorage class to
be comparable; now, instead of just looking at the timestamps, the
implementation avoids inserting duplicates in the buffer.

If for some reason someone publishes a different Transform
(xyz, rpy) at the same timestamp, that case is not captured
now, but I'd say that's another problem.
@nachovizzo
Contributor Author

> Thank you for the catch! Some of the tests are failing; I assume it's because of this new condition. Do you mind fixing it?

Done, I had to complicate the solution slightly more in 2f2bedd, but I think it's still worth the fix.

I'm entirely open to other ideas as well.

@SteveMacenski
Contributor

SteveMacenski commented Nov 21, 2023

^ Your refactor doesn't look that nutty, so I think folks would be open to it in a follow-up PR after this is done.

Make sure to track CI and make it green; the rolling build has some test failures. That way, once someone has a chance to look at this, it can be merged quickly if they agree, without waiting for you to cycle through fixes.

@nachovizzo
Contributor Author

> ^ Your refactor doesn't look that nutty, so I think folks would be open to it in a follow-up PR after this is done.
>
> Make sure to track CI and make it green; the rolling build has some test failures. That way, once someone has a chance to look at this, it can be merged quickly if they agree, without waiting for you to cycle through fixes.

Done 😎

Then I'll try to make some room to open a follow-up PR with more usage of the STL inside the TimeCache module, aiming to improve it. Thanks for the positive feedback :)

Contributor

@ahcorde ahcorde left a comment


@nachovizzo what about adding a unit test?

@nachovizzo
Contributor Author

@ahcorde sure thing. It will take me some time since I'm currently overloaded :)

I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).

Would that work?

@ahcorde
Contributor

ahcorde commented Nov 27, 2023

> @ahcorde sure thing. It will take me some time since I'm currently overloaded :)
>
> I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).
>
> Would that work?

sounds good to me, thank you

@nachovizzo nachovizzo requested a review from ahcorde January 9, 2024 12:10
@nachovizzo
Contributor Author

> > @ahcorde sure thing. It will take me some time since I'm currently overloaded :)
> > I'd add a simple test that guarantees that application code cannot insert the same transformation multiple times. This will indirectly test that the std::list does not grow unbounded, and therefore that memory stays bounded (at least from that point of view).
> > Would that work?
>
> sounds good to me, thank you

Sorry for the delay; I just pushed some unit tests for this particular corner case.

tf2/test/test_storage.cpp
Contributor

@clalancette clalancette left a comment


I've left a few very minor things to fix.

tf2/src/cache.cpp
tf2/test/cache_unittest.cpp (3 review threads)
nachovizzo and others added 4 commits January 10, 2024 09:55
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Ignacio Vizzo <[email protected]>
Contributor

@ahcorde ahcorde left a comment


  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ahcorde ahcorde merged commit 1621942 into ros2:rolling Jan 10, 2024
2 checks passed
@ahcorde
Contributor

ahcorde commented Jan 10, 2024

Thank you for tracking the issue and contributing back @nachovizzo

@nachovizzo
Contributor Author

Should we backport this to iron and humble?

@clalancette
Contributor

> Should we backport this to iron and humble?

I think that seems reasonable; it shouldn't break API or ABI. I'll ask the bot to do it.

@clalancette
Contributor

@Mergifyio backport humble iron

Contributor

mergify bot commented Feb 12, 2024

> backport humble iron

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 12, 2024
* Fix constantly increasing memory in std::list

When someone is constantly publishing with the same tf2 timestamp
(application error, I know), the storage_ of the tf2::TimeCache grows
unbounded, causing system-wide memory leaks. In a ROS system, each tf2
Listener spawns one of these TimeCache objects, so every ROS node with
any sort of tf2 listener will slowly allocate more and more memory.

And also fixed the situation where two different transforms need to be
inserted at the same timestamp.

Since the tests were failing, I realized I had to dive a bit deeper
than the simple solution and extend the TransformStorage class to
be comparable; now, instead of just looking at the timestamps, the
implementation avoids inserting duplicates in the buffer.

If for some reason someone publishes a different Transform
(xyz, rpy) at the same timestamp, that case is not captured
now, but I'd say that's another problem.

Signed-off-by: Ignacio Vizzo <[email protected]>
(cherry picked from commit 1621942)
mergify bot pushed a commit that referenced this pull request Feb 12, 2024
ahcorde pushed a commit that referenced this pull request Feb 13, 2024
ahcorde pushed a commit that referenced this pull request Feb 13, 2024