fix: Fixed race condition in action server between is_ready and take #2250

jmachowinski · 2023-07-28T17:06:29Z

…data and execute

Some background information: is_ready and take_data are guaranteed to be
called in sequence without interruption from another thread.
while execute is running, another thread may also call is_ready.

The problem was, that goal_request_ready_, cancel_request_ready_, result_request_ready_
and goal_expired_ were accessed and written from is_ready and execute.

This commit fixed this by only using the mentioned variables in is_ready and take_data.
execute now only accesses the given pointer and works on this.

Signed-off-by: Janosch Machowinski [email protected]

Note, this fixes #2242

jmachowinski · 2023-07-31T15:03:11Z

I'm still a bit puzzled about the is_ready() function. When will it be called in relation to the take_data function ?
From the code it looks like it should be a sequence of is_ready, take_data, is_ready etc. but this seems not to be the case...

jmachowinski · 2023-08-01T09:29:18Z

Looks like is_ready may be called multiple times before take_data is called.Therefore this patch needs a rewrite.

jmachowinski · 2023-09-05T13:29:56Z

@clalancette can you do a review of this ? The exact description of the issue this is fixing can be found here #2242 (comment)

jmachowinski · 2023-09-14T15:21:27Z

@ros2/team this patch is hanging here since July, is there anything that I can do to speed this up ?

clalancette

First of all, I'm sorry that this took a long time to get to.

Second, thank you for the excellent explanation of the issue in #2242 , along with the reproducer. That helped me to understand the problem.

I've left a few questions inline, please take a look.

rclcpp_action/src/client.cpp

clalancette · 2023-09-19T13:53:45Z

rclcpp_action/src/client.cpp

+      ret = rcl_action_take_feedback(
+        pimpl_->client_handle.get(), feedback_message.get());


I'm not really sure that this is the direction we want to go in. This change makes ClientBase::is_ready no longer idempotent, and that seems like a huge change in semantics. Further, while the documentation for the is_ready virtual method doesn't explicitly say so, it implies that this should be idempotent.

Can you explain further why you moved all of this here, rather than just adding the lock? We have to take the lock both here and in ClientBase::take_data anyway.

The reason for this is that I saw cases, were two threads would call is_ready, and then go on to take_data, but the second thread would trigger the exception, as there would only be one data item to take.
I am still puzzled which code path triggers this, as the multithreaded executor protects the sequence of both calls by holding a mutex.

I've been working on reviewing this pr and I think I agree with @clalancette at the very least in that this changes the meaning/expectation for the is_ready function.

I do think that your approach is probably on the right track (I'm also working on a test to narrow down the core issue, like you I'm a bit puzzled by the root cause), but I may propose a patch that changes how the code is laid out.

clalancette · 2023-09-19T13:54:27Z

rclcpp_action/include/rclcpp_action/server.hpp

+  execute_goal_request_received(
+    rcl_ret_t ret, rcl_action_goal_info_t goal_info, rmw_request_id_t request_header,
+    std::shared_ptr<void> message);


Do we really need these server changes to fix the crash on the client? Put another way, could we split these server fixes to another PR?

In theory they are independent, and we could split them, would there be any advantage of doing this ?

fmauch · 2023-10-18T08:52:54Z

Hi, I've run into #2242 on my test setup, as well. I have a (intentionally) not too capable PC running a robot driver plus MoveIt and have it execute random collision-free motions forever (or as long as possible). Basically, I want to try (end demonstrate) whether a ROS 2 based application can reliably work in a 24/7 application.

We use a low-latency kernel and have our robot driver using FIFO-scheduling with a very high priority, so this could definitively be one of the reasons why we stumbled upon this.

Without the fix from this PR the driver crashed after about 1-2 hours, with this fix applied (merged into current rolling) we did not have any crashes for 4 days now. I did not have an in-depth look into this since I am not very familiar with rclcpp's code base, but I hope that a "Solves a severe problem for me" voice helps in bringing this PR forward.

jmachowinski · 2023-10-24T11:19:02Z

@clalancette @wjwwood I'm back from vacation, lets try to get this MR sorted next ;-)

tommy-erko · 2023-11-03T11:51:57Z

Is there any review going on?
@jmachowinski can you meanwhile fix the DCO check (sign off in one commit message)?

jmachowinski · 2023-11-03T12:00:49Z

@tommy-erko fixed

jmachowinski · 2023-11-16T12:32:39Z

Putting a queue there could work, but would result in a weird mix of behaviors of the events.
I also fear, that rmw_wait would not wake up / wake up delayed if we get the triggering wrong, in
cases that waitables are ready.

As far as I understand the code, (@wjwwood Correct me if this assumption is wrong !) for everything
except the conditionals, we have some sort of interrupt system going on, were the flag in the wait set will
be set on consecutive calls to rmw_wait, until the data was actually taken. Taking the data, implicitly clears the flag.

Therefor I would suggest, to change the behavior of the conditionals in this ways:

Make them counting
Add an API to decrease the count
Don't touch them in RMW

If we would decrease the count of an associated conditional every time take_data is called, this should solve all of our problems.

tommy-erko · 2023-11-29T11:14:43Z

@jmachowinski I just tested the latest patch version with our system and I got an error in the first run, instead of the usual "Taking data from action client...", this time it is the "Taking data from action server but nothing is ready" version. I see it happens if
next_ready_event == std::numeric_limits<uint32_t>::max() (line 654 in client.cpp). What are the circumstances that can lead to this, can there be an issue on our side? We never got this error with the original "summer-version" of this PR, or with the official unpatched version.

jmachowinski · 2023-12-04T18:19:17Z

"Taking data from action client...", this time it is the "Taking data from action server but nothing is ready"

I messed up the error message in client.cpp:654

Interesting that you still get this error though, this means there is another way to trigger it, that I am not aware of yet.
Anyway, there is no error on your side, the current state of the patch is known to not fully fix the issue. You should stay on the "summer version" for now.

jmachowinski · 2023-12-06T16:51:01Z

@tommy-erko I just did a test with the current version ontop of Iron, and it works without any issues for me.

Can you try this version ? https://github.com/cellumation/rclcpp/tree/action_fix_final
Its the same as this one, just as two commits.

Note, this version of the patch will expose errors in waitables, and may introduce problems with interprocess communication.

jmachowinski · 2023-12-06T17:23:30Z

@Barry-Xu-2018 FYI I had a longer Q&A session with @wjwwood on discord. The outcome of it is, that by design it is the responsibility of a waitable, to trigger wakeups as long as it has data for processing. Therefore the patch from #2109 seems to be the wrong way to go, as it shifts the responsibility to the executioner.

Barry-Xu-2018 · 2023-12-07T07:53:03Z

@jmachowinski

Thank you for sharing the results of the discussion.

The outcome of it is, that by design it is the responsibility of a waitable, to trigger wakeups as long as it has data for processing. Therefore the patch from #2109 seems to be the wrong way to go, as it shifts the responsibility to the executioner.

While fixing this issue, I also had doubts about which module rclcpp or rmw should be responsible for addressing it.

In RMW, the criteria for determining whether guard_conditions/events and subscriptions/clients/services have trigger condition are currently different. For subscriptions/clients/services, having data available in buffer is considered a trigger condition. But guard_conditions/events have no 'data'.
Refer to
https://github.com/ros2/rmw_fastrtps/blob/5b96503556302b47b4e4af96a44f6189355edf29/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L56-L116

For guard_conditions, trigger condition will be cleared before leaving __rmw_wait(). The issue is that currently, the RMW layer cannot know whether the trigger has been handled by the upper layers.

https://github.com/ros2/rmw_fastrtps/blob/5b96503556302b47b4e4af96a44f6189355edf29/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L306-L315

If we fix this issue in the RMW layer, we would need to design a new mechanism that allows the upper layers to inform RMW that a guard condition has been handled and can be cleared.

jmachowinski · 2023-12-14T13:26:40Z

@Barry-Xu-2018 Is there another use case except the waitables that I am not aware of ?
For the waitables you can figure out if you got processed or not. E.g. if take_data() was not called your were not processed, and you need to retrigger your internal guard condition on the next call to add_to_wait_set.

jmachowinski · 2023-12-14T13:45:51Z

@clalancette @fujitatomoya @wjwwood I now pushed the version, that I would consider the correct fix for this issue.

If you are not in favor of the revert of " Avoid losing waitable handles while using MultiThreadedExecut…" before a fix for the issue #2012 is in place, I would remove it from this MR and just merge the first commit.

I'm going to be in vacation now and might not be responsive for the rest of the year.

Barry-Xu-2018 · 2023-12-20T00:50:04Z

@jmachowinski

Is there another use case except the waitables that I am not aware of ?
For the waitables you can figure out if you got processed or not. E.g. if take_data() was not called your were not processed, and you need to retrigger your internal guard condition on the next call to add_to_wait_set.

Apologies for the delayed response.
From the information I known (mentioned in #2250 (comment)), there shouldn't be anymore.

jmachowinski · 2024-01-29T08:26:43Z

@wjwwood @mjcarroll @fujitatomoya Ping, I would like to see this merged for jazzy

firesurfer · 2024-04-02T07:07:39Z

Is there a reason for this PR not getting merged? If there is anything a user from the community can do to help getting this merged. Let me know!
Especially in combination with ros2control (which uses a multithreaded executor) and the default JointTrajectoryController - which is used as a basic building block for many other components this issue can be encountered quite regularly.
So fixing this should be quite beneficial for many applications!

jmachowinski · 2024-04-02T10:20:32Z

@firesurfer This is now in the pipeline for merge. The plan is to merge #2420 and afterwards this PR. If you want to help out, you could retry this patch as soon as 2420 was merged, to verify, that it is still working, and we got no regressions.

jmachowinski · 2024-04-04T20:49:18Z

@firesurfer this version should be ready to merge now. Can you run it on your system an verify it with the latest version of rolling ?

firesurfer · 2024-04-05T05:44:26Z

As our stack is iron based and has never been built with rolling this might proove a bit tricky. I would be happy to test any backport of this PR as soon as it is available (I am not sure about ABI compatiblity between rolling and iron at the moment - if its compatible I could just recompile rclcpp from rolling)

jmachowinski · 2024-04-05T12:47:55Z

@firesurfer our stack is also iron based, but compiles without problems ontop of rolling. Note, if you still use gazebo11 (classic) it gets tricky.

fmauch · 2024-04-05T13:17:28Z

Uh,I'll try to setup my demonstrator with that version as well. Can't promise though.

mjcarroll · 2024-04-05T14:50:41Z

Linux
Linux-aarch64
Windows

wjwwood · 2024-04-09T13:04:02Z

I have no idea what happened, but I tried pushing one commit to this branch and it reset rolling branch on jmachowinski:rolling to be the same as ros2:rolling, and now it's not letting me push, and it closed this pull request 🙃. So I'm opening a new one: #2495

jmachowinski requested review from ivanpauno, hidmic and wjwwood as code owners July 28, 2023 17:06

jmachowinski force-pushed the rolling branch 4 times, most recently from 7efa763 to 91fb1dd Compare July 31, 2023 11:16

jmachowinski marked this pull request as draft August 1, 2023 09:28

jmachowinski force-pushed the rolling branch from 91fb1dd to 76ac33e Compare August 1, 2023 11:45

jmachowinski marked this pull request as ready for review August 1, 2023 12:52

jmachowinski force-pushed the rolling branch 2 times, most recently from 1a7cf77 to 302771d Compare August 4, 2023 10:02

This was referenced Aug 21, 2023

ActionClient Crash: Taking data from action client but nothing is ready #2242

Closed

Executing action server but nothing is ready #2226

Open

clalancette assigned mjcarroll, quarkytale and clalancette Sep 14, 2023

clalancette reviewed Sep 19, 2023

View reviewed changes

wjwwood self-assigned this Oct 19, 2023

jmachowinski force-pushed the rolling branch from 10e2941 to 4d8cdee Compare November 3, 2023 12:00

jmachowinski force-pushed the rolling branch from 4d8cdee to 3dd8981 Compare November 3, 2023 13:49

fmauch mentioned this pull request Nov 7, 2023

Occasional crash with SJTC UniversalRobots/Universal_Robots_ROS2_Driver#380

Open

jmachowinski mentioned this pull request Nov 17, 2023

Added test cased for missing events & double call to take_data #2371

Closed

wjwwood mentioned this pull request Nov 30, 2023

Utilize rclcpp::WaitSet as part of the executors #2142

Merged

3 tasks

jmachowinski force-pushed the rolling branch 2 times, most recently from 3912404 to 9f2ec0c Compare December 14, 2023 13:34

jmachowinski force-pushed the rolling branch from 9f2ec0c to 6c4c6a2 Compare January 29, 2024 15:46

firesurfer mentioned this pull request Apr 2, 2024

Aborted during moving UniversalRobots/Universal_Robots_ROS2_Driver#954

Closed

1 task

jmachowinski mentioned this pull request Apr 3, 2024

rclcpp_action: take and execute service entities in priority. #2471

Closed

jmachowinski force-pushed the rolling branch 2 times, most recently from 9b0d6de to 34ec478 Compare April 4, 2024 20:47

wjwwood closed this Apr 9, 2024

wjwwood force-pushed the rolling branch from 34ec478 to bfa7efa Compare April 9, 2024 12:42

wjwwood mentioned this pull request Apr 9, 2024

Remake of "fix: Fixed race condition in action server between is_ready and take" #2495

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fixed race condition in action server between is_ready and take #2250

fix: Fixed race condition in action server between is_ready and take #2250

jmachowinski commented Jul 28, 2023 •

edited

Loading

jmachowinski commented Jul 31, 2023

jmachowinski commented Aug 1, 2023 •

edited

Loading

jmachowinski commented Sep 5, 2023

jmachowinski commented Sep 14, 2023

clalancette left a comment

clalancette Sep 19, 2023

jmachowinski Sep 19, 2023

wjwwood Nov 9, 2023

clalancette Sep 19, 2023

jmachowinski Sep 19, 2023

fmauch commented Oct 18, 2023

jmachowinski commented Oct 24, 2023

tommy-erko commented Nov 3, 2023

jmachowinski commented Nov 3, 2023

jmachowinski commented Nov 16, 2023

tommy-erko commented Nov 29, 2023

jmachowinski commented Dec 4, 2023

jmachowinski commented Dec 6, 2023

jmachowinski commented Dec 6, 2023

Barry-Xu-2018 commented Dec 7, 2023

jmachowinski commented Dec 14, 2023

jmachowinski commented Dec 14, 2023 •

edited

Loading

Barry-Xu-2018 commented Dec 20, 2023

jmachowinski commented Jan 29, 2024

firesurfer commented Apr 2, 2024

jmachowinski commented Apr 2, 2024 •

edited

Loading

jmachowinski commented Apr 4, 2024

firesurfer commented Apr 5, 2024

jmachowinski commented Apr 5, 2024

fmauch commented Apr 5, 2024

mjcarroll commented Apr 5, 2024

wjwwood commented Apr 9, 2024 •

edited

Loading

		ret = rcl_action_take_feedback(
		pimpl_->client_handle.get(), feedback_message.get());

fix: Fixed race condition in action server between is_ready and take #2250

fix: Fixed race condition in action server between is_ready and take #2250

Conversation

jmachowinski commented Jul 28, 2023 • edited Loading

jmachowinski commented Jul 31, 2023

jmachowinski commented Aug 1, 2023 • edited Loading

jmachowinski commented Sep 5, 2023

jmachowinski commented Sep 14, 2023

clalancette left a comment

Choose a reason for hiding this comment

clalancette Sep 19, 2023

Choose a reason for hiding this comment

jmachowinski Sep 19, 2023

Choose a reason for hiding this comment

wjwwood Nov 9, 2023

Choose a reason for hiding this comment

clalancette Sep 19, 2023

Choose a reason for hiding this comment

jmachowinski Sep 19, 2023

Choose a reason for hiding this comment

fmauch commented Oct 18, 2023

jmachowinski commented Oct 24, 2023

tommy-erko commented Nov 3, 2023

jmachowinski commented Nov 3, 2023

jmachowinski commented Nov 16, 2023

tommy-erko commented Nov 29, 2023

jmachowinski commented Dec 4, 2023

jmachowinski commented Dec 6, 2023

jmachowinski commented Dec 6, 2023

Barry-Xu-2018 commented Dec 7, 2023

jmachowinski commented Dec 14, 2023

jmachowinski commented Dec 14, 2023 • edited Loading

Barry-Xu-2018 commented Dec 20, 2023

jmachowinski commented Jan 29, 2024

firesurfer commented Apr 2, 2024

jmachowinski commented Apr 2, 2024 • edited Loading

jmachowinski commented Apr 4, 2024

firesurfer commented Apr 5, 2024

jmachowinski commented Apr 5, 2024

fmauch commented Apr 5, 2024

mjcarroll commented Apr 5, 2024

wjwwood commented Apr 9, 2024 • edited Loading

jmachowinski commented Jul 28, 2023 •

edited

Loading

jmachowinski commented Aug 1, 2023 •

edited

Loading

jmachowinski commented Dec 14, 2023 •

edited

Loading

jmachowinski commented Apr 2, 2024 •

edited

Loading

wjwwood commented Apr 9, 2024 •

edited

Loading