Implementation of a Concurrent Indexing Algorithm #279
base: master
Conversation
minor fix
fix comment in configure.ac
Hey Muhammad,

Thanks for giving this a go! We did some threading prototypes in the past but never got this to a state where we were happy with the results in terms of complexity and performance. I'm currently packing my suitcases to leave for my holidays, so unfortunately I'll not be able to take a proper look at your code in the short term, sorry for that.

I briefly read your description of the algorithm, I'll go over that with some more scrutiny to properly understand what you're doing, but from the first look it seems backed by some proper theory - nice.

A few random notes from my first findings (note I spent only 10 minutes on this, so not very in-depth!):

* Running your code with 16 threads on my local laptop (i5+SSD storage) shows a speedup of approx x1.8. Experiments from the past with parallelizing indexers have shown that it usually does not make sense to index one physical device with a lot of separate threads: reading is typically I/O bound on that particular device, not CPU bound on the machine itself. On physical spinning hard disks, indexing with more threads has been shown to decrease performance in some cases, most likely caused by increased seek times introduced by more random access of the disk. I believe the right way to approach this is to run X threads per physical device only. When indexing a large number of different file systems, threading can give a huge boost, but on one device the expected speedup is limited. The scanner algo should probably take this into account.
* We should take care that the 1-thread results should not decrease performance over the original code - at this time I see a 25% drop in performance, which might be caused by the added locking in buffer.c.
* Some of the runs crashed for me with a double free message.

Let's see if @l8gravely has anything to say on this first - we have not done any significant Duc development or maintenance over the last few years so the project is kind of dormant currently.

Thanks again!
Hey Ico,

These comments are interesting. My team has run this recently with 16 threads on very large directory structures that take days to complete, and we have seen approximately a 6x performance improvement with zero memory errors.

Do you have the test cases/directory structure that you ran this test on, as well as details about the operating system and architecture that you ran it on? I want to be able to replicate some of the issues that you're talking about.
Muhammad> Thanks for giving this a go! We did some threading
Muhammad> prototypes in the past but never got this to a state
Muhammad> where we were happy with the results in terms of
Muhammad> complexity and performance. I'm currently packing my
Muhammad> suitcases to leave for my holidays, so unfortunately
Muhammad> I'll not be able to take a proper look at your code in
Muhammad> the short term, sorry for that.
And I'm busy with a bunch of personal stuff myself right now, so I
won't have major time to look into this.
Muhammad> I briefly read your description of the algorithm, I'll
Muhammad> go over that with some more scrutiny to properly
Muhammad> understand what you're doing, but from the first look it
Muhammad> seems backed by some proper theory - nice.
Muhammad> A few random notes from my first findings (note I spent
Muhammad> only 10 minutes on this, so not very in-depth!)
Muhammad> + Running your code with 16 threads on my local laptop
Muhammad>   (i5+SSD storage) shows a speedup of approx x1.8.
Muhammad>   Experiments from the past with parallelizing indexers
Muhammad>   have shown that it usually does not make sense to
Muhammad>   index one physical device with a lot of separate
Muhammad>   threads: reading is typically I/O bound on that
Muhammad>   particular device, not CPU bound on the machine
Muhammad>   itself. On physical spinning hard disks, indexing
Muhammad>   with more threads has been shown to decrease
Muhammad>   performance in some cases, most likely caused by
Muhammad>   increased seek times introduced by more random access
Muhammad>   of the disk. I believe the right way to approach this
Muhammad>   is to run X threads per physical device only. When
Muhammad>   indexing a large number of different file systems,
Muhammad>   threading can give a huge boost, but on one device
Muhammad>   the expected speedup is limited. The scanner algo
Muhammad>   should probably take this into account.
Muhammad> + We should take care that the 1-thread results should
Muhammad>   not decrease performance over the original code - at
Muhammad>   this time I see a 25% drop in performance, which
Muhammad>   might be caused by the added locking in buffer.c.
Muhammad> + Some of the runs crashed for me with a double free
Muhammad>   message.
Muhammad> Let's see if @l8gravely has anything to say on this
Muhammad> first - we have not done any significant Duc development
Muhammad> or maintenance over the last few years so the project is
Muhammad> kind of dormant currently.
I've been meaning to get a release out with various patches and
updates... but life got in the way. There are a bunch of patches
queued up in the master branch past the 1.4.4 tag, so maybe build your
patches on there?
I haven't had a chance to read the proposed patch up front, but
not losing performance in the default case is a big issue for me. I
remember trying to parallelize philesight way back when, but ran into
the Global Lock problems of Ruby, which didn't lead to much speedup at
all.
So I'd really be interested in seeing if -t 2 shows any speedup at
all. I'll try to build your code and scan some of my big NFS
filesystems to see how it works, since large backend NFS file servers
should be a good target for this code, but not local disks. I'm not
even sure if an SSD will work better here.
But again, numbers talk, so we'll have to see what we find.
Thank you for doing this work!
John
The commit that I just made fixes two possible race conditions that would result in the double free that @zevv described. The reasoning is as follows:

For the following examples, imagine the following tree structure:

[image: Screen Shot 2021-08-13 at 3 23 16 PM — a root node 1 with two children, 2 and 3]

Imagine the following two race conditions that are very rare in practice:

Race condition 1:

* 1 starts to process, puts both 2 and 3 onto the DFS stack, finishes processing, and sets done_processing to true. num_children is now 2.
* 2 and 3 start processing in parallel.
* 2 increments completed_children on 1, so now completed_children = 1.
* 3 increments completed_children on 1, so now completed_children = 2.
* Both 2 and 3 see that completed_children == num_children is True.
* Both try to free 1.
* Double free occurs.

Race condition 2:

* 1 starts to process, puts both 2 and 3 onto the DFS stack, but does not get to the part of the code which checks if it needs to be freed yet.
* 2 finishes processing, increments completed_children=1 when it frees, and exits.
* 3 finishes processing, increments completed_children=2, but does not check the condition that 1 is done processing yet.
* 1 gets to the part of the code where it checks whether it needs to free. It marks itself as done_processing=True, and sees that completed_children == num_children is True. 1 frees itself.
* 3 checks the condition that 1 is done processing, which is True. It also sees that completed_children == num_children is True.
* 3 also tries to free 1.
* Double free occurs.

Solution

Both issues result from a flag or a counter being incremented and then a condition being checked non-atomically. By locking the code around incrementing the counter and checking the condition, these race conditions are rendered impossible.
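A minimal C sketch of the fixed pattern, assuming pthreads; the counter names follow the description above, while the `lock` member and the `child_done` helper are hypothetical, not duc's actual code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical node layout; counter names follow the description. */
struct node {
        pthread_mutex_t lock;
        int  num_children;
        int  completed_children;
        bool done_processing;
};

/* Called when a child of `parent` has finished and freed itself.
 * The increment and the check happen as one step under the lock,
 * so at most one caller can ever observe the "last child" state. */
static void child_done(struct node *parent)
{
        pthread_mutex_lock(&parent->lock);
        parent->completed_children++;
        bool last = parent->done_processing &&
                    parent->completed_children == parent->num_children;
        pthread_mutex_unlock(&parent->lock);

        if (last)
                free(parent);   /* exactly one thread can reach this */
}
```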
>>>> "Muhammad" == Muhammad Haris ***@***.***> writes:
Muhammad> The commit that I just made fixes two possible race
Muhammad> conditions that would result in the double free that @zevv
Muhammad> described. The reasoning is as follows:
Do you have a test case for this in your commit? And do you have more
performance numbers of your code? I'll try to run some more tests
myself this week as I get a chance across some big NFS servers.
Muhammad> For the following examples, imagine the following tree structure:
Muhammad> [image: Screen Shot 2021-08-13 at 3 23 16 PM]
Muhammad> Imagine the following two race conditions that are very rare in practice:
Muhammad> Race condition 1:
Muhammad> * 1 starts to process, puts both 2 and 3 onto the DFS stack, finishes processing, and sets
Muhammad> done_processing to true. num_children is now 2
Muhammad> * 2 and 3 start processing in parallel.
Muhammad> * 2 increments completed_children on 1, so now completed_children = 1
Muhammad> * 3 increments completed_children on 1, so now completed_children = 2
Muhammad> * Both 2 and 3 see that completed_children == num_children is True.
Muhammad> * Both try to free 1.
Muhammad> * Double free occurs
Muhammad> Race condition 2:
Muhammad> * 1 starts to process, puts both 2 and 3 onto the DFS stack, but does not get to the part of the
Muhammad> code which checks if it needs to be freed yet
Muhammad> * 2 finishes processing, increments completed_children=1 when it frees, and exits
Muhammad> * 3 finishes processing, increments completed_children=2, but does not check the condition that
Muhammad> 1 is done processing yet
Muhammad> * 1 gets to the part of the code where it checks whether it needs to free. It marks itself as
Muhammad> done_processing=True, and sees that completed_children == num_children is True. 1 frees
Muhammad> itself.
Muhammad> * 3 checks the condition that 1 is done processing, which is True. It also sees that
Muhammad> completed_children == num_children is True
Muhammad> * 3 also tries to free 1.
Muhammad> * Double free occurs
Muhammad> Solution
Muhammad> Both issues result from a flag or a counter being incremented and then a condition being checked
Muhammad> non-atomically. By locking the code around incrementing the counter and checking the condition,
Muhammad> these race conditions are rendered impossible to happen.
Aim
This change modifies Duc's indexing algorithm to add concurrency by implementing a concurrent topological sort. It also closes #161.
Context/Motivation
Duc's performance naturally degrades as the number of dirents in a filesystem increases. If the filesystem consists of an `n`-node hierarchy, then the topological sort will take O(n) time, since each element has to be independently processed. For very large filesystems, Duc can take a long time to finish indexing. However, if Duc's indexing algorithm can be made concurrent with `t` threads (`t` being a parameter provided by a Duc user), then Duc's performance will significantly increase, as noted by users (#161).

Currently, Duc's indexing algorithm consists of a topological sort. For any given node `N`, the scanning algorithm is recursively run on all of its children. As each child terminates, it "frees", which consists of aggregating its data onto `N`, writing its own data to a buffer, and then freeing its resources.

The crux of parallelizing Duc's indexing algorithm therefore lies in implementing a concurrent topological sort that does exactly what Duc's current algorithm does, using a provided number of threads. This motivates the concurrent indexing algorithm, which will now be described.
Algorithm
DFS is an interesting algorithm. It creates a callstack of recursive calls to return to as it traverses a tree. If we view it from the perspective of a main thread
T
, Each element inT
's callstack can be worked independently on by another threadT1
. If we can concretely represent this callstack (by making the algorithm iterative), then a worker-pool can be used to make DFS concurrrent. This was inspired by a paper written by the Dept. of CS at the University of Texas.The algorithm is as follows:
In this way, all workers are doing work to traverse the tree.
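As a rough illustration only, the worker loop of such a pool might look like the following C sketch. For simplicity it uses a single shared stack (the change itself uses per-worker stacks with stealing, governed by the cutoff-depth option described under Usage); `struct node`, `process()`, and the stack layout are assumed stand-ins, not duc's internals:

```c
#include <pthread.h>
#include <stddef.h>

struct node;                            /* per-directory state */
struct task_stack;
/* Scans `n`'s dirents and pushes its subdirectories back onto `st`;
 * a real push would also signal `ready` so idle workers wake early. */
void process(struct task_stack *st, struct node *n);

struct task_stack {
        pthread_mutex_t  lock;
        pthread_cond_t   ready;
        struct node    **items;  /* the explicit DFS "callstack" */
        int              count;
        int              active; /* workers currently processing */
};

/* One worker: pop a directory, process it (which may push more
 * directories), and repeat until no tasks remain and no worker
 * can still produce new ones. */
static void *worker(void *arg)
{
        struct task_stack *st = arg;

        for (;;) {
                pthread_mutex_lock(&st->lock);
                while (st->count == 0 && st->active > 0)
                        pthread_cond_wait(&st->ready, &st->lock);
                if (st->count == 0 && st->active == 0) {
                        /* Traversal finished: wake the others so
                         * they can exit too. */
                        pthread_cond_broadcast(&st->ready);
                        pthread_mutex_unlock(&st->lock);
                        return NULL;
                }
                struct node *n = st->items[--st->count];
                st->active++;
                pthread_mutex_unlock(&st->lock);

                process(st, n);

                pthread_mutex_lock(&st->lock);
                st->active--;
                pthread_cond_broadcast(&st->ready);
                pthread_mutex_unlock(&st->lock);
        }
}
```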
But how do we guarantee topological order of visiting the nodes?
This point is subtle, and the original paper does not mention how to do it. However, I have constructed an algorithm that can do this:
Let us define a node `N` "processing" to mean the process whereby a worker puts `N`'s relevant children (directories under `N`) onto its DFS stack. Each node maintains a count of the number of children it possesses (`num_children`), a boolean to indicate whether it is finished processing (`done_processing`), as well as a count of its children which have finished processing (`completed_children`).

When a worker finishes processing `N`, it sets `done_processing`. If all of `N`'s children have already completed (`completed_children == num_children`), the worker frees `N` itself; otherwise it leaves the freeing of `N` as a duty to the last child to complete.

This algorithm makes mention of "leaves the freeing of `N` as a duty to the last child". How does this happen? This requires additional logic when any given node is freed. Any time a node is freed, the worker frees it and then walks up the tree, ensuring topological order is maintained. The algorithm for freeing is as follows: after freeing a node, increment `completed_children` on its parent; if the parent is done processing and all of its children have completed, free the parent as well, and repeat up the tree. The increment and the check happen atomically under the parent's lock, which is exactly what the race-condition fix above enforces.
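A hedged C sketch of that freeing walk, using the per-node fields named above; the struct layout, `aggregate_into`, and `write_to_buffer` are illustrative assumptions, not duc's actual code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Per-node bookkeeping as described above; only the three counters
 * and the parent pointer matter for the freeing walk. */
struct node {
        struct node    *parent;
        pthread_mutex_t lock;
        int             num_children;
        int             completed_children;
        bool            done_processing;
};

/* Assumed helpers: both take the relevant locks internally. */
void aggregate_into(struct node *parent, struct node *child);
void write_to_buffer(struct node *n);

/* Free `n`, then walk up the tree: each freed node counts as a
 * completed child of its parent, and the last child to complete a
 * fully-processed parent frees that parent too. The increment and
 * the check form one atomic step under the parent's lock. */
static void free_node(struct node *n)
{
        while (n != NULL) {
                struct node *parent = n->parent;

                if (parent != NULL)
                        aggregate_into(parent, n);
                write_to_buffer(n);
                free(n);

                if (parent == NULL)
                        return;             /* freed the root */

                pthread_mutex_lock(&parent->lock);
                parent->completed_children++;
                bool last = parent->done_processing &&
                            parent->completed_children ==
                            parent->num_children;
                pthread_mutex_unlock(&parent->lock);

                if (!last)
                        return;  /* another thread still owns parent */
                n = parent;      /* we are the last child: free it   */
        }
}
```

The processing side performs the symmetric check: after setting `done_processing` it frees the node itself if all children have already completed, again under the same lock, so exactly one thread ever frees each node.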
Usage

This change adds two new command line options to `duc index`. These two options are `--thread-count=VAL` (`-t`) and `--cutoff-depth=VAL` (`-c`).

The `--thread-count=VAL` option stipulates that the indexing algorithm use a given number of threads corresponding to workers. When this option is provided, duc will multithread the indexing algorithm and use `VAL` workers to traverse the filesystem. The default value is 1.
The `--cutoff-depth=VAL` option ensures that each worker has at least `VAL` tasks before other workers start taking its tasks. In general, the lower this is, the higher the concurrency. The default value is 2.
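To make the cutoff concrete, the stealing condition might look roughly like the following C sketch (the `task_stack` shape matches the earlier worker sketch; `try_steal` and all names here are illustrative, not duc's actual code):

```c
#include <pthread.h>
#include <stddef.h>

struct node;
struct task_stack {               /* same shape as the worker sketch */
        pthread_mutex_t  lock;
        struct node    **items;
        int              count;
};

/* An idle worker only steals from a victim whose stack holds more
 * than `cutoff_depth` pending tasks, so every worker keeps at least
 * that many tasks for itself. Returns NULL if nothing can be taken. */
struct node *try_steal(struct task_stack *victim, int cutoff_depth)
{
        struct node *n = NULL;

        pthread_mutex_lock(&victim->lock);
        if (victim->count > cutoff_depth)
                n = victim->items[--victim->count];
        pthread_mutex_unlock(&victim->lock);

        return n;
}
```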
Note: The `cutoff-depth`
Example:
Thread count (specified by -t or --thread-count):
To test with 4 threads, do:
./duc index /path/to/directory -t 4
or, equivalently:
./duc index /path/to/directory --thread-count=4
Cutoff depth (specified by -c or --cutoff-depth):
To test with a cutoff depth of 3, do:
./duc index /path/to/directory -c 3
or, equivalently:
./duc index /path/to/directory --cutoff-depth=3