
Implementation of a Concurrent Indexing Algorithm #279

Open
muhammadharis wants to merge 2 commits into master
Conversation

@muhammadharis (Author)

Aim

This change modifies Duc's indexing algorithm to add concurrency by implementing a concurrent topological sort. This also closes #161.

Context/Motivation

Duc naturally gets slower as the number of dirents in a filesystem increases: if the filesystem is an n-node hierarchy, the topological sort takes O(n) time, since each element has to be processed independently. For very large filesystems, Duc can take a long time to finish indexing. However, if the indexing algorithm is made concurrent with t threads (t being a parameter supplied by the user), indexing performance can improve significantly, as users have noted (#161).

Currently, Duc's indexing algorithm is a topological sort. For any given node N, the scanning algorithm is run recursively on all of N's children. As each child terminates, it "frees": it aggregates its data onto N, writes its own record to a buffer, and then releases its resources.

The crux of parallelizing Duc's indexing algorithm then lies in implementing a concurrent topological sort that does exactly what Duc's current algorithm does, using a provided number of threads.

This motivates the concurrent indexing algorithm, which will now be described.

Algorithm

DFS is an interesting algorithm: it creates a callstack of recursive calls to return to as it traverses a tree. Viewed from the perspective of a main thread T, each element in T's callstack can be worked on independently by another thread T1. If we can represent this callstack concretely (by making the algorithm iterative), then a worker pool can be used to make DFS concurrent. This approach was inspired by a paper from the Dept. of Computer Science at the University of Texas.
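To make that callstack concrete, the recursion is replaced by an explicit stack. A minimal sketch of the iterative form in C (illustrative types and names, not Duc's actual code):

    #include <stddef.h>

    struct node {
        struct node **children;
        size_t n_children;
    };

    /* Iterative DFS: the explicit stack plays the role of the callstack,
     * which is what makes pending work visible to other threads. */
    static void dfs_iterative(struct node *root)
    {
        struct node *stack[1024];    /* fixed size, for illustration only */
        size_t top = 0;

        stack[top++] = root;
        while (top > 0) {
            struct node *n = stack[--top];
            /* "process" n here, then push its children as pending work */
            for (size_t i = 0; i < n->n_children; i++)
                stack[top++] = n->children[i];
        }
    }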

The algorithm is as follows:

  • Initialize a worker pool of t workers (t is a user-supplied parameter).
  • An initial worker is chosen arbitrarily, the root of the tree is pushed onto its stack, and it begins its DFS there.
  • As a worker progresses through its DFS, it builds a stack of nodes that it needs to return to.
  • When a worker runs out of work (its stack is empty), it polls the other workers' stacks. If the number of elements on another worker's stack exceeds the cutoff depth (a configured parameter), it takes tasks off that worker's stack.

In this way, all workers contribute to traversing the tree. A sketch of the worker loop is shown below.
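A rough sketch of that loop in C, assuming each worker owns a lock-protected stack (all types and helpers here, including indexing_done() and process(), are illustrative stand-ins, not Duc's actual code):

    #include <pthread.h>
    #include <stddef.h>

    #define STACK_MAX 4096

    struct worker {
        pthread_mutex_t lock;            /* guards stack and top */
        struct node *stack[STACK_MAX];
        size_t top;
    };

    /* Pop from a worker's stack; min_left implements the cutoff: a thief
     * only takes work if more than min_left entries are on the stack. */
    static struct node *take(struct worker *w, size_t min_left)
    {
        struct node *n = NULL;

        pthread_mutex_lock(&w->lock);
        if (w->top > min_left)
            n = w->stack[--w->top];
        pthread_mutex_unlock(&w->lock);
        return n;
    }

    static void worker_loop(struct worker *self, struct worker *peers,
                            size_t nworkers, size_t cutoff)
    {
        while (!indexing_done()) {               /* hypothetical flag    */
            struct node *n = take(self, 0);      /* own stack first      */
            for (size_t i = 0; n == NULL && i < nworkers; i++)
                if (&peers[i] != self)
                    n = take(&peers[i], cutoff); /* steal above cutoff   */
            if (n != NULL)
                process(self, n);                /* pushes n's subdirs   */
        }
    }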

But how do we guarantee topological order of visiting the nodes?

This point is subtle, and the original paper does not mention how to do it. However, I have constructed an algorithm that does:

Let us define "processing" a node N to mean the step in which a worker puts N's relevant children (the directories under N) onto its DFS stack. Each node maintains a count of the children it has pushed, num_children; a boolean indicating whether it has finished processing, done_processing; and a count of its children that have finished, completed_children.

When a node N is being processed by a worker:
	Each child of N is placed onto the DFS stack, and N's <num_children> is incremented.
After processing, the worker checks whether either of two conditions holds for N:
	- If N is a leaf, it has no dependencies in the topological sort, so it can be freed.
	- If N's <completed_children> count equals <num_children>, then all of its children have
	  already finished processing (possible due to the concurrency of the algorithm).
	  Topological order is then guaranteed, so the node can be freed.
Otherwise:
	- N cannot yet be freed, or topological order would be violated: it has children whose
	  processing has not finished. The worker therefore marks N as <done_processing> and
	  leaves the freeing of N as a duty of the LAST CHILD of N to finish processing.
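In C, the per-node bookkeeping and the post-processing check might look roughly like this (a sketch that mirrors the naming above; Duc's actual fields may differ, and free_node() is the freeing routine sketched further below):

    #include <pthread.h>
    #include <stdbool.h>

    struct node {
        struct node *parent;
        int num_children;           /* children pushed onto a DFS stack */
        int completed_children;     /* children that have been freed    */
        bool done_processing;       /* all children have been pushed    */
        pthread_mutex_t lock;       /* guards the three fields above    */
    };

    /* Defined in the freeing sketch below. */
    static void free_node(struct node *n);

    /* Run by a worker right after it has pushed all of N's children. */
    static void after_processing(struct node *n)
    {
        bool freeable;

        pthread_mutex_lock(&n->lock);
        n->done_processing = true;
        freeable = (n->num_children == 0) ||                    /* leaf */
                   (n->completed_children == n->num_children);  /* done */
        pthread_mutex_unlock(&n->lock);

        if (freeable)
            free_node(n);
    }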

The algorithm above leaves "the freeing of N as a duty to the last child". How does this happen? It requires additional logic whenever a node is freed: the worker frees the node and then walks up the tree, ensuring topological order is maintained. The freeing algorithm is as follows:

Given a node N:

LOOP: The worker frees N and keeps track of its parent.
The parent's <completed_children> count is incremented (since a child was just freed).
If the parent is not done processing:
	The parent obviously cannot be freed yet, because it has not finished processing,
	so the loop is broken.
Otherwise:
	The freeing of the parent was left as a duty of the LAST CHILD to finish, so the
	worker checks whether this node is the last child. If it is, LOOP is repeated with
	the parent as the new node; otherwise, the loop terminates.
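A sketch of that walk in C, continuing the struct above (release() is a hypothetical helper standing in for Duc's aggregate/write/free step; note that the increment and the check happen under one lock, which matters for the race conditions discussed later in this thread):

    /* Free n, then walk up the tree freeing every ancestor whose last
     * child has now finished, as described above. */
    static void free_node(struct node *n)
    {
        while (n != NULL) {
            struct node *parent = n->parent;
            bool last_child;

            release(n);          /* aggregate into parent, write, free */
            if (parent == NULL)
                break;                        /* freed the root: done  */

            pthread_mutex_lock(&parent->lock);
            parent->completed_children++;
            /* Only the last child of a fully processed parent frees it. */
            last_child = parent->done_processing &&
                         parent->completed_children == parent->num_children;
            pthread_mutex_unlock(&parent->lock);

            if (!last_child)
                break;
            n = parent;                       /* repeat LOOP upward    */
        }
    }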

Usage

This change adds two new command line options to duc index: --thread-count=VAL (-t) and --cutoff-depth=VAL (-c).

The --thread-count=VAL option makes the indexing algorithm use the given number of worker threads. When this option is provided, duc multithreads the indexing algorithm and uses VAL workers to traverse the filesystem. The default value is 1.

The --cutoff-depth=VAL option ensures that a worker has at least VAL tasks on its stack before other workers start taking tasks from it. In general, the lower this value, the higher the concurrency. The default value is 2.

Note: The cutoff-depth

Examples:

Thread count (specified by -t or --thread-count):
To test with 4 threads, do:
./duc index /path/to/directory -t 4
or, equivalently:
./duc index /path/to/directory --thread-count=4

Cutoff depth (specified by -c or --cutoff-depth):
To test with a cutoff depth of 3, do:
./duc index /path/to/directory -c 3
or, equivalently:
./duc index /path/to/directory --cutoff-depth=3

Commits: "minor fix", "fix comment in configure.ac"
@zevv (Owner) commented Aug 6, 2021

Hey Muhammad,

Thanks for giving this a go! We did some threading prototypes in the past but never got this to a state where we were happy with the results in terms of complexity and performance. I'm currently packing my suitcases to leave for my holidays, so unfortunately I'll not be able to take a proper look at your code in the short term, sorry for that.

I briefly read your description of the algorithm; I'll go over it with some more scrutiny to properly understand what you're doing, but at first look it seems backed by some proper theory - nice.

A few random notes from my first findings (note I spent only 10 minutes on this, so not very in-depth!)

  • Running your code with 16 threads on my laptop (i5 + SSD storage) shows a speedup of approx. 1.8x. Past experiments with parallelizing indexers have shown that it usually does not make sense to index one physical device with a lot of separate threads: reading is typically I/O bound on that particular device, not CPU bound on the machine itself. On physical spinning hard disks, indexing with more threads has been shown to decrease performance in some cases, most likely caused by the increased seek times introduced by more random access of the disk. I believe the right way to approach this is to run X threads per physical device only. When indexing a large number of different file systems, threading can give a huge boost, but on one device the expected speedup is limited. The scanner algo should probably take this into account.

  • We should take care that 1-thread runs do not lose performance compared to the original code - at this time I see a 25% drop, which might be caused by the added locking in buffer.c.

  • Some of the runs crashed for me with a double free message.

Let's see if @l8gravely has anything to say on this first - we have not done any significant Duc development or maintenance over the last few years so the project is kind of dormant currently.

Thanks again!

zevv closed this Aug 6, 2021
zevv reopened this Aug 6, 2021
@muhammadharis (Author) commented Aug 6, 2021

Hey Ico,

These comments are interesting. My team has recently run this with 16 threads on very large directory structures that take days to index, and we have seen an approx. 6x performance improvement with zero memory errors. Do you have the test cases/directory structure that you ran this test on, as well as details about the operating system and architecture? I want to be able to replicate the issues you're describing.

@l8gravely (Collaborator) commented Aug 10, 2021 via email

@muhammadharis (Author)

The commit that I just made fixes two possible race conditions that would result in the double free that @zevv described. The reasoning is as follows:

For the examples below, imagine the following tree structure:

(Screenshot: an example tree, root node 1 with two children, 2 and 3.)

Imagine the following two race conditions that are very rare in practice:

Race condition 1:

  • 1 starts to process, puts both 2 and 3 onto the DFS stack, finishes processing, and sets done_processing to true. num_children is now 2
  • 2 and 3 start processing in parallel.
  • 2 increments completed_children on 1, so now completed_children = 1
  • 3 increments completed_children on 1, so now completed_children = 2
  • Both 2 and 3 see that completed_children == num_children is True.
  • Both try to free 1.
  • Double free occurs

Race condition 2:

  • 1 starts to process, puts both 2 and 3 onto the DFS stack, but does not get to the part of the code which checks if it needs to be freed yet
  • 2 finishes processing, increments completed_children=1 when it frees, and exits
  • 3 finishes processing, increments completed_children=2, but does not check the condition that 1 is done processing yet
  • 1 gets to the part of the code where it checks whether it needs to free. It marks itself as done_processing=True, and sees that completed_children == num_children is True. 1 frees itself.
  • 3 checks the condition that 1 is done processing, which is True. It also sees that completed_children == num_children is True.
  • 3 also tries to free 1.
  • Double free occurs

Solution
Both issues result from a counter being incremented (or a flag being set) and a condition then being checked non-atomically. By locking the code around the increment and the condition check, these race conditions become impossible.
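Schematically (this is not the literal patch), the difference is whether the increment and the check form one critical section:

    /* Racy: between the increment and the check, another child can do
     * the same, so two children may both see the final count and both
     * free the parent - a double free. */
    parent->completed_children++;
    if (parent->done_processing &&
        parent->completed_children == parent->num_children)
        free_node(parent);

    /* Fixed: increment and check under one lock, so exactly one child
     * can observe the condition as true and perform the free. */
    pthread_mutex_lock(&parent->lock);
    parent->completed_children++;
    bool last = parent->done_processing &&
                parent->completed_children == parent->num_children;
    pthread_mutex_unlock(&parent->lock);
    if (last)
        free_node(parent);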

@l8gravely (Collaborator) commented Aug 15, 2021 via email
