Fix infinite loop in SubprocessSet::DoWork() #2484

dendy · 2024-08-28T07:47:48Z

In case fd is invalid loop will be infinite because iterator is not incremented.

Make cycle more obvious by adding regular iterator into for() loop and move Subprocess from running_ into finished_ in separate loop.

digit-google

Technically, this changes the order of calls to OnPipeReady() and Done() called at runtime, making them inconsistent with the !USE_PPOLL implementation. This may bite us in the future when refactoring this code.

Can you instead just add the missing increment in the if (fd < 0) continue statement (e.g. if (fd < 0) { ++i; continue }), which will have the same effect with a simpler change.

In case fd is invalid loop will be infinite because iterator is not incremented. Make loop consistent between both variants of SubprocessSet::DoWork() (with and withoug USE_PPOLL) by removing 'continue' and always advancing iterator in the end of the cycle.

digit-google · 2024-08-28T16:48:24Z

You are making the code more complicated for now good reason. Please simplify as suggested instead. @jhasse, wdyt?

dendy · 2024-08-28T16:55:04Z

Code should be more clear now, since there is no continue statement, and consistent between two variants of DoWork(). I prepared it in this way to simplify the subsequent patch for stderr implementation, which adds handling of stderr fd to the loops.

digit-google · 2024-08-28T17:08:10Z

No. You are changing the behavior of the code by calling the Done() method in cases where it wasn't called before. This is wasteful and potentially dangerous.

This is not what this patch is supposed to do, which would be to avoid fixing a potential infinite loop, and which can be trivially done by adding a missing increment in one line of source file.

Please keep it simple.

dendy · 2024-08-28T17:24:59Z

To be honest, I do not see how this condition should be met in first place:

if (fd < 0) continue;

I believe that is a leftover from some older implementation. And correct fix would be to replace this condition with assert. As long as Subprocess in in the running_ list its fd_ must always be valid. Please correct me if I am wrong, because I cannot see where else it might be changing outside of OnPipeReady().

If the above is true, then behavior was not changed, and Done() call will return false when OnPipeReady() was not called.

The only reason I moved Done() to the end of the loop, is to prepare for the subsequent addition of the second error fd to handle. If you prefer not to do that in within this patch, then I believe the correct fix for this would be to replace the condition above with:

assert(fd >= 0)

dendy · 2024-08-28T17:27:52Z

To clarify, when second error fd will be added, Subprocess will stay in the running_ list until both fds are cleared. And the condition check above in runtime makes sense. But with a single fd the condition will always fail. That is why you never see ninja to freeze in the current implementation, because fd is always valid.

digit-google · 2024-08-28T19:02:20Z

First, do not sneak unrelated behavior-changing code into a patch that is supposed to address a trivial issue. Not only this is a very easy way to introduce subtle bugs, or worse security issues, but your are also wasting time for you and reviewers, as well as lower the chances that your future contributions will be considered of appropriate quality. That being said, mistakes happen.

Second, looking at the code, it looks like that given the current implementation, the fd_ value would always be valid when these loop run. However, these details are pretty subtle, as well as the fact that the DoWork() function mixes low-level file event handling and management of the running_ and finished_ sub-process queues. It is easy to imagine an unrelated change elsewhere in this source file that would trigger the bug flakily, so fixing it has value.

assert() statements are no-ops in production code, so would not solve a potential infinite loop issue. It is better to call Fatal() if the condition is detected, with a hint explaining what's going on instead, or simply ignore such file descriptors in the loop.

If you intend to modify the behavior of this function in non-trivial way, I suggest you do that in a dedicated PR that clearly explain the rationale for it, and hopefully provides separate commits to address changes in a logical order. And don't forget unit or regression tests for the new feature / behavior.

Regarding the topic of supporting two separate event descriptors per sub-process, I would recommend first decoupling the file event wait logic from the queue management one (e.g. provide a PosixFileWatcher class that provides clients the ability to register callbacks to be invoked when an event is detected, with the ability to add / remove registrations within the callback function itself for safety). That will make the code much clearer to understand and maintain over time, while allowing unit-testing for different types of conditions that are not possible with the current code.

Apart from that, I recommend you either drop this PR completely, or just fix the actual bug by adding a missing increment.

dendy · 2024-08-28T19:31:41Z

There are no behavior changes within this patch.

I believe the biggest confusion is in title, it should rather say "Cleanup" than "Fix", because there was no runtime issue within the current code. I already put that description into the commit body, but I would also change the commit title to make it clear.

Justification for this code cleanup is to prepare the DoWork() implementation for the 2 fds handling simultaneously.

I believe current implementation of the DoWork() is fine with regards to multiple fd event handling, and extending it with 2 fds support is trivial, here is how such a change will look after this patch: f8ce671#diff-e70adfab80664f71832232d71b3055c78e054dd75fbeec69d019cb0e9dce6b2dL283

Since fd_ is supposed to be always valid inside the running_ list, assert() should be the correct way of handling it. Fatal() is for catching runtime issues, assert() is for source code bugs.

I will make appropriate changes, and will rephrase this PR description.

mcprat · 2024-08-29T03:49:53Z

the explanation doesn't need to be so complex

the solution is to add instances of ++i wherever they are missing

blocking of error conditions is preferred over the acceptance of non-error conditions, because otherwise the rest of the function is subject to fallthrough conditions

mcprat · 2024-08-29T04:08:44Z

src/subprocess-posix.cc

  for (vector<Subprocess*>::iterator i = running_.begin();
       i != running_.end(); ++i) {


by the way this one has the increment anyway

mcprat · 2024-08-29T04:24:30Z

src/subprocess-posix.cc

    if (fd >= 0 && FD_ISSET(fd, &set)) {
      (*i)->OnPipeReady();


this is existing already, changes below have nothing to do with the fd value...

if an infinite loop is still possible after the fd value is handled then continue should cause increment of i, like how a normal for loop is written...

are we sure that omitting the increment in the loop itself isn't the actual mistake and all you need to do is put it back?

dendy · 2024-08-29T04:26:26Z

by the way this one has the increment anyway

In that loop running_ list is immutable, thus it increments iterator with ++i on each cycle.

The code we are talking about is unreachable:
if (fd < 0) continue;

This "fixup" patch was originally created in 2015, at that time I assumed that there may be more conditions when Subprocess fd_ is closed, but Subprocess itself may be still running. That thought also matched the feature I was working on: adding second error_fd_ for tracking the error output, which could be still open.

By checking the code now I may claim that by design there is no condition when running_ list may contain Subprocess, which Done() == true. There is also a test case that validates this: SubprocessTest.SetWithMulti. Thus that check for the if (fd < 0) is unreachable and dead, and need to be removed or replaced with an assert().

mcprat · 2024-08-29T05:10:28Z

by the way this one has the increment anyway

In that loop running_ list is immutable

doesn't matter, the specific loop i'm referring to only reads from running_

By checking the code now I may claim that by design there is no condition when running_ list may contain Subprocess, which Done() == true. There is also a test case that validates this: SubprocessTest.SetWithMulti. Thus that check for the if (fd < 0) is unreachable and dead, and need to be removed or replaced with an assert().

This is likely true in real-life cases, but perhaps the reason that the fd check was put there to begin with was for the crazy unrealistic scenario of the system running out of available FDs to open. Without a supercomputer it would take a patched kernel to verify that.

dendy · 2024-08-29T05:17:16Z

crazy unrealistic scenario of the system running out of available FDs to open

There is validation for that condition in code in 2 places: pipe() syscall in Subprocess::Start() and SubprocessTest.SetWithMulti. So it is still not possible to have undone Subprocess in the running_ list.

mcprat · 2024-08-29T06:07:48Z

simply removing the fd checks from the loops is likely acceptable then

jhasse · 2024-08-30T20:34:02Z

puh, sorry just skimmed over most of it. When exactly is this infinite loop triggered?

dendy · 2024-08-30T20:41:52Z

When exactly is this infinite loop triggered?

It is never triggered with current implementation, and per my understanding that is by design that running_ list may contain finished Subprocess instances. Thus this is not a fix, but rather cleanup of the old dead code. I will change the description and update the commit.

digit-google suggested changes Aug 28, 2024

View reviewed changes

dendy force-pushed the fix-infinite-loop-in-SubprocessSet-DoWork branch from ad18d33 to a6da182 Compare August 28, 2024 16:42

mcprat reviewed Aug 29, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix infinite loop in SubprocessSet::DoWork() #2484

Fix infinite loop in SubprocessSet::DoWork() #2484

dendy commented Aug 28, 2024

digit-google left a comment

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

dendy commented Aug 28, 2024

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

mcprat commented Aug 29, 2024

mcprat Aug 29, 2024

mcprat Aug 29, 2024

dendy commented Aug 29, 2024 •

edited

Loading

mcprat commented Aug 29, 2024

dendy commented Aug 29, 2024

mcprat commented Aug 29, 2024

jhasse commented Aug 30, 2024

dendy commented Aug 30, 2024

		for (vector<Subprocess*>::iterator i = running_.begin();
		i != running_.end(); ++i) {

Fix infinite loop in SubprocessSet::DoWork() #2484

Are you sure you want to change the base?

Fix infinite loop in SubprocessSet::DoWork() #2484

Conversation

dendy commented Aug 28, 2024

digit-google left a comment

Choose a reason for hiding this comment

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

dendy commented Aug 28, 2024

digit-google commented Aug 28, 2024

dendy commented Aug 28, 2024

mcprat commented Aug 29, 2024

mcprat Aug 29, 2024

Choose a reason for hiding this comment

mcprat Aug 29, 2024

Choose a reason for hiding this comment

dendy commented Aug 29, 2024 • edited Loading

mcprat commented Aug 29, 2024

dendy commented Aug 29, 2024

mcprat commented Aug 29, 2024

jhasse commented Aug 30, 2024

dendy commented Aug 30, 2024

dendy commented Aug 29, 2024 •

edited

Loading