
Implementation of a Concurrent Indexing Algorithm #279

Open
muhammadharis wants to merge 2 commits into master
Conversation

@muhammadharis (Author)

Aim

This change modifies Duc's indexing algorithm to add concurrency by implementing a concurrent topological sort. This also closes #161.

Context/Motivation

Duc naturally gets slower as the number of dirents in a filesystem increases: if the filesystem is an n-node hierarchy, the topological sort takes O(n) time, since each element has to be processed independently. For very large filesystems, Duc can take a long time to finish indexing. However, if the indexing algorithm is made concurrent with t threads (t being a parameter supplied by the user), indexing performance can improve significantly, as users have noted (#161).

Currently, Duc's indexing algorithm is a topological sort. For any given node N, the scanning algorithm is run recursively on all of N's children. As each child terminates, it "frees": it aggregates its data onto N, writes its own record to a buffer, and then releases its resources.

The crux of parallelizing Duc's indexing algorithm then lies in implementing a concurrent topological sort that does exactly what Duc's current algorithm does, using a provided number of threads.

This motivates the concurrent indexing algorithm, which will now be described.

Algorithm

DFS is an interesting algorithm: it creates a callstack of recursive calls to return to as it traverses a tree. Viewed from the perspective of a main thread T, each element in T's callstack can be worked on independently by another thread T1. If we can represent this callstack concretely (by making the algorithm iterative), then a worker pool can be used to make DFS concurrent. This approach was inspired by a paper from the Dept. of Computer Science at the University of Texas.
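To make that callstack concrete, the recursion is replaced by an explicit stack. A minimal sketch of the iterative form in C (illustrative types and names, not Duc's actual code):

    #include <stddef.h>

    struct node {
        struct node **children;
        size_t n_children;
    };

    /* Iterative DFS: the explicit stack plays the role of the callstack,
     * which is what makes pending work visible to other threads. */
    static void dfs_iterative(struct node *root)
    {
        struct node *stack[1024];    /* fixed size, for illustration only */
        size_t top = 0;

        stack[top++] = root;
        while (top > 0) {
            struct node *n = stack[--top];
            /* "process" n here, then push its children as pending work */
            for (size_t i = 0; i < n->n_children; i++)
                stack[top++] = n->children[i];
        }
    }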

The algorithm is as follows:

  • Initialize a worker pool of t workers (t is a user-supplied parameter).
  • An initial worker is chosen arbitrarily, the root of the tree is pushed onto its stack, and it begins its DFS there.
  • As a worker progresses through its DFS, it builds a stack of nodes that it needs to return to.
  • When a worker runs out of work (its stack is empty), it polls the other workers' stacks. If the number of elements on another worker's stack exceeds the cutoff depth (a configured parameter), it takes tasks off that worker's stack.

In this way, all workers contribute to traversing the tree. A sketch of the worker loop is shown below.
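A rough sketch of that loop in C, assuming each worker owns a lock-protected stack (all types and helpers here, including indexing_done() and process(), are illustrative stand-ins, not Duc's actual code):

    #include <pthread.h>
    #include <stddef.h>

    #define STACK_MAX 4096

    struct worker {
        pthread_mutex_t lock;            /* guards stack and top */
        struct node *stack[STACK_MAX];
        size_t top;
    };

    /* Pop from a worker's stack; min_left implements the cutoff: a thief
     * only takes work if more than min_left entries are on the stack. */
    static struct node *take(struct worker *w, size_t min_left)
    {
        struct node *n = NULL;

        pthread_mutex_lock(&w->lock);
        if (w->top > min_left)
            n = w->stack[--w->top];
        pthread_mutex_unlock(&w->lock);
        return n;
    }

    static void worker_loop(struct worker *self, struct worker *peers,
                            size_t nworkers, size_t cutoff)
    {
        while (!indexing_done()) {               /* hypothetical flag    */
            struct node *n = take(self, 0);      /* own stack first      */
            for (size_t i = 0; n == NULL && i < nworkers; i++)
                if (&peers[i] != self)
                    n = take(&peers[i], cutoff); /* steal above cutoff   */
            if (n != NULL)
                process(self, n);                /* pushes n's subdirs   */
        }
    }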

But how do we guarantee topological order of visiting the nodes?

This point is subtle, and the original paper does not mention how to do it. However, I have constructed an algorithm that does:

Let us define "processing" a node N to mean the step in which a worker puts N's relevant children (the directories under N) onto its DFS stack. Each node maintains a count of the children it has pushed, num_children; a boolean indicating whether it has finished processing, done_processing; and a count of its children that have finished, completed_children.

When a node N is being processed by a worker:
	Each child of N is placed onto the DFS stack, and N's <num_children> is incremented.
After processing, the worker checks whether either of two conditions holds for N:
	- If N is a leaf, it has no dependencies in the topological sort, so it can be freed.
	- If N's <completed_children> count equals <num_children>, then all of its children have
	  already finished processing (possible due to the concurrency of the algorithm).
	  Topological order is then guaranteed, so the node can be freed.
Otherwise:
	- N cannot yet be freed, or topological order would be violated: it has children whose
	  processing has not finished. The worker therefore marks N as <done_processing> and
	  leaves the freeing of N as a duty of the LAST CHILD of N to finish processing.
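In C, the per-node bookkeeping and the post-processing check might look roughly like this (a sketch that mirrors the naming above; Duc's actual fields may differ, and free_node() is the freeing routine sketched further below):

    #include <pthread.h>
    #include <stdbool.h>

    struct node {
        struct node *parent;
        int num_children;           /* children pushed onto a DFS stack */
        int completed_children;     /* children that have been freed    */
        bool done_processing;       /* all children have been pushed    */
        pthread_mutex_t lock;       /* guards the three fields above    */
    };

    /* Defined in the freeing sketch below. */
    static void free_node(struct node *n);

    /* Run by a worker right after it has pushed all of N's children. */
    static void after_processing(struct node *n)
    {
        bool freeable;

        pthread_mutex_lock(&n->lock);
        n->done_processing = true;
        freeable = (n->num_children == 0) ||                    /* leaf */
                   (n->completed_children == n->num_children);  /* done */
        pthread_mutex_unlock(&n->lock);

        if (freeable)
            free_node(n);
    }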

The algorithm above leaves "the freeing of N as a duty to the last child". How does this happen? It requires additional logic whenever a node is freed: the worker frees the node and then walks up the tree, ensuring topological order is maintained. The freeing algorithm is as follows:

Given a node N:

LOOP: The worker frees N and keeps track of its parent.
The parent's <completed_children> count is incremented (since a child was just freed).
If the parent is not done processing:
	The parent obviously cannot be freed yet, because it has not finished processing,
	so the loop is broken.
Otherwise:
	The freeing of the parent was left as a duty of the LAST CHILD to finish, so the
	worker checks whether this node is the last child. If it is, LOOP is repeated with
	the parent as the new node; otherwise, the loop terminates.
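A sketch of that walk in C, continuing the struct above (release() is a hypothetical helper standing in for Duc's aggregate/write/free step; note that the increment and the check happen under one lock, which matters for the race conditions discussed later in this thread):

    /* Free n, then walk up the tree freeing every ancestor whose last
     * child has now finished, as described above. */
    static void free_node(struct node *n)
    {
        while (n != NULL) {
            struct node *parent = n->parent;
            bool last_child;

            release(n);          /* aggregate into parent, write, free */
            if (parent == NULL)
                break;                        /* freed the root: done  */

            pthread_mutex_lock(&parent->lock);
            parent->completed_children++;
            /* Only the last child of a fully processed parent frees it. */
            last_child = parent->done_processing &&
                         parent->completed_children == parent->num_children;
            pthread_mutex_unlock(&parent->lock);

            if (!last_child)
                break;
            n = parent;                       /* repeat LOOP upward    */
        }
    }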

Usage

This change adds two new command line options to duc index: --thread-count=VAL (-t) and --cutoff-depth=VAL (-c).

The --thread-count=VAL option makes the indexing algorithm use the given number of worker threads. When this option is provided, duc multithreads the indexing algorithm and uses VAL workers to traverse the filesystem. The default value is 1.

The --cutoff-depth=VAL option ensures that a worker has at least VAL tasks on its stack before other workers start taking tasks from it. In general, the lower this value, the higher the concurrency. The default value is 2.

Note: The cutoff-depth

Examples:

Thread count (specified by -t or --thread-count):
To test with 4 threads, do:
./duc index /path/to/directory -t 4
or, equivalently:
./duc index /path/to/directory --thread-count=4

Cutoff depth (specified by -c or --cutoff-depth):
To test with a cutoff depth of 3, do:
./duc index /path/to/directory -c 3
or, equivalently:
./duc index /path/to/directory --cutoff-depth=3

Commits: "minor fix", "fix comment in configure.ac"
@zevv (Owner) commented Aug 6, 2021

Hey Muhammad,

Thanks for giving this a go! We did some threading prototypes in the past but never got this to a state where we were happy with the results in terms of complexity and performance. I'm currently packing my suitcases to leave for my holidays, so unfortunately I'll not be able to take a proper look at your code in the short term, sorry for that.

I briefly read your description of the algorithm; I'll go over it with some more scrutiny to properly understand what you're doing, but at first look it seems backed by some proper theory - nice.

A few random notes from my first findings (note I spent only 10 minutes on this, so not very in-depth!)

  • Running your code with 16 threads on my laptop (i5 + SSD storage) shows a speedup of approx. 1.8x. Past experiments with parallelizing indexers have shown that it usually does not make sense to index one physical device with a lot of separate threads: reading is typically I/O bound on that particular device, not CPU bound on the machine itself. On physical spinning hard disks, indexing with more threads has been shown to decrease performance in some cases, most likely caused by the increased seek times introduced by more random access of the disk. I believe the right way to approach this is to run X threads per physical device only. When indexing a large number of different file systems, threading can give a huge boost, but on one device the expected speedup is limited. The scanner algo should probably take this into account.

  • We should take care that 1-thread runs do not lose performance compared to the original code - at this time I see a 25% drop, which might be caused by the added locking in buffer.c.

  • Some of the runs crashed for me with a double free message.

Let's see if @l8gravely has anything to say on this first - we have not done any significant Duc development or maintenance over the last few years so the project is kind of dormant currently.

Thanks again!

zevv closed this Aug 6, 2021
zevv reopened this Aug 6, 2021
@muhammadharis (Author) commented Aug 6, 2021

Hey Ico,

These comments are interesting. My team has recently run this with 16 threads on very large directory structures that take days to index, and we have seen an approx. 6x performance improvement with zero memory errors. Do you have the test cases/directory structure that you ran this test on, as well as details about the operating system and architecture? I want to be able to replicate the issues you're describing.

@l8gravely (Collaborator) commented Aug 10, 2021 via email

@muhammadharis (Author)

The commit that I just made fixes two possible race conditions that would result in the double free that @zevv described. The reasoning is as follows:

For the examples below, imagine the following tree structure:

(Screenshot: an example tree, root node 1 with two children, 2 and 3.)

Imagine the following two race conditions that are very rare in practice:

Race condition 1:

  • 1 starts to process, puts both 2 and 3 onto the DFS stack, finishes processing, and sets done_processing to true. num_children is now 2
  • 2 and 3 start processing in parallel.
  • 2 increments completed_children on 1, so now completed_children = 1
  • 3 increments completed_children on 1, so now completed_children = 2
  • Both 2 and 3 see that completed_children == num_children is True.
  • Both try to free 1.
  • Double free occurs

Race condition 2:

  • 1 starts to process, puts both 2 and 3 onto the DFS stack, but does not get to the part of the code which checks if it needs to be freed yet
  • 2 finishes processing, increments completed_children=1 when it frees, and exits
  • 3 finishes processing, increments completed_children=2, but does not check the condition that 1 is done processing yet
  • 1 gets to the part of the code where it checks whether it needs to free. It marks itself as done_processing=True, and sees that completed_children == num_children is True. 1 frees itself.
  • 3 checks the condition that 1 is done processing, which is True. It also sees that completed_children == num_children is True.
  • 3 also tries to free 1.
  • Double free occurs

Solution
Both issues result from a counter being incremented (or a flag being set) and a condition then being checked non-atomically. By locking the code around the increment and the condition check, these race conditions become impossible.
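Schematically (this is not the literal patch), the difference is whether the increment and the check form one critical section:

    /* Racy: between the increment and the check, another child can do
     * the same, so two children may both see the final count and both
     * free the parent - a double free. */
    parent->completed_children++;
    if (parent->done_processing &&
        parent->completed_children == parent->num_children)
        free_node(parent);

    /* Fixed: increment and check under one lock, so exactly one child
     * can observe the condition as true and perform the free. */
    pthread_mutex_lock(&parent->lock);
    parent->completed_children++;
    bool last = parent->done_processing &&
                parent->completed_children == parent->num_children;
    pthread_mutex_unlock(&parent->lock);
    if (last)
        free_node(parent);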

@l8gravely (Collaborator) commented Aug 15, 2021 via email
