Parallel Scan with multiple processes #63
mkaruza
commented
Jul 2, 2024
- We can now start multiple worker processes that read relation blocks and write buffers to shared memory, where they are consumed by DuckDB reader threads. The problem can be viewed as a variant of multi-producer / multi-consumer in which the producers are responsible for releasing buffers after they have been read.
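To make the handoff concrete, here is a minimal single-process sketch of the protocol described above, reduced to one worker and one reader sharing one slot. Everything in it is an assumption for illustration: `SharedSlot`, `Worker`, `Reader` and `RunScan` are hypothetical names, `std::atomic<bool>` stands in for the shared-memory flag, and an `int` stands in for a pinned buffer. It is not the PR's actual code.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared-memory state (not the PR's struct).
struct SharedSlot {
    std::atomic<bool> buffer_ready{false}; // true: reader owns the slot; false: worker owns it
    std::atomic<bool> worker_done{false};  // set once the worker has no more blocks
    int block = -1;                        // stand-in for a pinned buffer
};

// Producer: publishes one block at a time and releases it only after
// the reader has signalled that it is finished with it.
void Worker(SharedSlot &slot, int nblocks, std::vector<int> &released) {
    for (int b = 0; b < nblocks; ++b) {
        slot.block = b;
        slot.buffer_ready.store(true, std::memory_order_release);
        while (slot.buffer_ready.load(std::memory_order_acquire)) {
            std::this_thread::yield(); // busy-wait until the reader is done
        }
        released.push_back(b); // the producer, not the consumer, releases the buffer
    }
    slot.worker_done.store(true, std::memory_order_release);
}

// Consumer: reads each published block and hands the slot back.
std::vector<int> Reader(SharedSlot &slot) {
    std::vector<int> seen;
    for (;;) {
        if (slot.buffer_ready.load(std::memory_order_acquire)) {
            seen.push_back(slot.block);
            slot.buffer_ready.store(false, std::memory_order_release);
        } else if (slot.worker_done.load(std::memory_order_acquire)) {
            break;
        } else {
            std::this_thread::yield();
        }
    }
    return seen;
}

struct ScanResult {
    std::vector<int> seen;
    std::vector<int> released;
};

// Runs one worker thread against a reader on the calling thread.
ScanResult RunScan(int nblocks) {
    SharedSlot slot;
    ScanResult result;
    std::thread worker(Worker, std::ref(slot), nblocks, std::ref(result.released));
    result.seen = Reader(slot);
    worker.join();
    return result;
}
```

Calling `RunScan(5)` hands five blocks through the slot one by one. With several workers and readers, each pair would need its own slot or a queue of slots.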
@mkaruza what's the deal with this? I guess it's not critical for 0.1.0. Given the amount of merge conflicts it has now, I think it probably makes the most sense to simply close this and maybe create an issue for it instead.
@JelteF agreed, not critical, but it could be an approach to get better performance for heap table scans (specifically to fetch table pages in parallel; currently there is a bottleneck there, as only one thread in the process can request page fetching).
2a8c461
to
48b9ae6
High level looks reasonable (and perf improvements are promising).
I wonder if we could implement communication between the thread and the worker through signals rather than busy-spinning on atomic variables, and whether this would improve performance.
src/scan/heap_reader_worker.cpp
Outdated
LockBuffer(buffer, BUFFER_LOCK_SHARE);

/* is previous buffer done */
while (!pg_atomic_unlocked_test_flag(&thread_worker_shared_state->buffer_ready)) {
Did you try adding a very small "sleep" in the loop rather than busy-spinning? I wonder if it would help performance.
Also, could we use signals instead?
Should we have a timeout to handle the situation where the thread dies without ever resetting its thread_running or buffer_ready flags?
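A rough sketch of what a shared wait helper with a short sleep and a timeout could look like. This is illustrative only: `WaitForFlag` is a hypothetical name, `std::atomic<bool>` stands in for the `pg_atomic_flag` used in the PR, and the spin count and nap length are arbitrary assumptions, not measured values.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper: waits until `flag` equals `wanted`.
// Returns true on success and false if `timeout` elapsed first, so the
// caller can react to a peer that died without resetting its flags.
bool WaitForFlag(const std::atomic<bool> &flag, bool wanted,
                 std::chrono::milliseconds timeout,
                 std::chrono::microseconds nap = std::chrono::microseconds(50)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int spins = 0;
    while (flag.load(std::memory_order_acquire) != wanted) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // give up instead of spinning forever
        }
        if (++spins < 100) {
            continue; // spin briefly first: the common case is a short wait
        }
        std::this_thread::sleep_for(nap); // then back off to free the CPU
    }
    return true;
}
```

Both worker loops and the reader loop could call a helper of this shape with the flag value they are waiting for. Whether the sleep actually helps would have to be benchmarked.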
src/scan/heap_reader_worker.cpp
Outdated
if (thread_running) {
/* We are out of blocks for reading so wait for last buffer to be done */
while (!pg_atomic_unlocked_test_flag(&thread_worker_shared_state->buffer_ready)) {
I guess the same applies here, and we could factor the two loops into a "wait" function?
src/scan/heap_reader.cpp
Outdated
}

/* Is buffer ready for reading */
while (pg_atomic_unlocked_test_flag(&m_thread_worker_shared_state->buffer_ready)) {
I wonder if we could use signals here rather than busy-spinning?
@@ -1,6 +1,6 @@
# Configuration

-shared_preload_libraries = 'pg_duckdb'
+shared_preload_libraries = 'pg_duckdb.so'
This shouldn't be necessary, and actually probably breaks OSX
Doing a parallel-thread seq scan on a heap table is slower than a single-threaded one, because of the global lock that needs to be taken for each buffer fetch and tuple visibility check. To speed up execution, we start a parallel worker dedicated to the thread; the worker fetches buffers and passes them to the thread. The thread works with the page directly, and once the scan of that buffer is done the worker releases it. The HeapTupleSatisfiesVisibility call is also problematic because in some situations it will try to use SetHintBits on the same page, which requires holding a lock on the page (the thread does not hold one). For this purpose HeapTupleSatisfiesVisibilityNoHintBits was added, which has the same logic but doesn't use SetHintBits. Preliminary testing showed only a small difference between 3, 4, ... parallel workers, so to simplify the logic we use a hardcoded rule that will spawn 1 parallel process if the number of blocks in the thread is bigger than 2024, and if bigger, 2 parallel workers (threads) will be created.
48b9ae6
to
59f781d
Closing