Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove suspicious wait set locking code #119

Open
wants to merge 3 commits into
base: rolling
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions rmw_cyclonedds_cpp/src/rmw_node.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2221,8 +2221,10 @@ static void clean_waitset_caches()
used ... */
std::lock_guard<std::mutex> lock(gcdds.lock);
for (auto && ws : gcdds.waitsets) {
std::lock_guard<std::mutex> lock2(ws->lock);
waitset_detach(ws);
std::unique_lock<std::mutex> lock2(ws->lock, std::try_to_lock);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"try lock" is generally a tricky game: all a failure to lock proves is that the lock was held at the time of trying. By the time you start executing the next instruction it may already have been released. And I would argue that this gives you essentially the same behaviour as the original code.

In the original code there is the issue that the schedule:
T0: rmw_wait / set busy / do attach / wait
T1: delete X / clean_waitset_caches / skip waitset_detach
T0: wakeup / erase slots for non-triggering entities / clear busy
ends up with the (rmw) waitset entry cache containing a dangling pointer a deleted rmw entity [*]. What will happen most of the time is that a subsequent call to rmw_wait will notice a change in the set of pointers (because it does a memcmp, the fact that it is dangling is not really an issue, and by accident more than design, the events will squeeze through without dereferencing a dangling pointer, too).

The interesting bit is when we then continue as follows:

  • T1: create X', and it happens to get the same address as X
  • T0: rmw_wait
    because the cache hasn't been invalidated and X is at the same address as X' (and presumably may moreover be in the same position in the arguments to rmw_wait) the cache checking logic can end up concluding that the cache is correct for the set being waited for, skipping the attach calls, and therefore not waiting for X' to trigger.

Your change to unconditionally call waitset_detach, blocking until a concurrent call to rmw_wait woke up, would fix that problem. But with this change, you're back to the original behaviour ... and thus get back the original problem.

But there may be a saving grace: it may (as my old comment noted) well be that ROS2 will never delete an entity while another thread is using it in rmw_wait. In that case, all this is a moot point. I think it is likely that this is the case, because at the RMW layer they're all simply pointers, not, e.g., weak references. (Cyclone doesn't make such assumptions in its API, but it exposes 31-bit random numbers as handles on its API instead of pointers. It doesn't guarantee anything about entity handle reuse, but it does use a decent PRNG and with the typical application having a few thousands handles at most, the likelihood of aliasing is pretty small.)

The other thing is that (as I mentioned before) the entire caching code exists because of a microbenchmark I once did. Perhaps it isn't worth it at all. If one simply rips it out, the problem disappears along with it.

(What would also work, and what I would consider doing if the caching is worth keeping, is to retain the busy flag, and depending on whether it is clear or set, either invalidate the cache or mark it as invalid/cleanup-deferred. And then, in rmw_wait, always do the "require reattach" path if it was invalid/cleanup-deferred.)

[*] It has been detached from the DDS waitset, that bit is not an issue.

Copy link
Collaborator Author

@rotu rotu Mar 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is not a mistake with how I'm locking here, and the code is correct as-is.

unique_lock and lock_guard are both raii objects. That is they continue to hold the lock until the object is destroyed. In the case of clean_waitset_caches, that's one iteration of the loop (no worry that the lock might be snatched from us during waitset_detach) In the case of rmw_wait, that's until the function returns. Again, no worry that someone else might acquire the lock.

I do think I can make this code much better by redoing how we cache, but that's a bigger PR.

if (lock2.owns_lock()) {
waitset_detach(ws);
}
}
}

Expand Down