Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel Scan with multiple processes #63

Closed
wants to merge 2 commits into from
Closed

Parallel Scan with multiple processes #63

wants to merge 2 commits into from

Conversation

mkaruza
Copy link
Collaborator

@mkaruza mkaruza commented Jul 2, 2024

  • We can now start multiple worker process that can read relation blocks and write buffer to shared memory to be consumed by duckdb reader threads. Problem can be viewed as variant of multi producer / multi consumer where producer are responsible for releasing buffer after they are read.

@mkaruza mkaruza marked this pull request as draft July 3, 2024 09:16
Base automatically changed from index-scan to main July 3, 2024 16:37
@JelteF
Copy link
Collaborator

JelteF commented Sep 30, 2024

@mkaruza what's the deal with this? I guess it's not critical for 0.1.0. Given the amount of merge conflicts it has now, I'm think it probably makes the most sense to simply close this and maybe create an issue for it instead.

@JelteF JelteF added the enhancement New feature or request label Sep 30, 2024
@mkaruza
Copy link
Collaborator Author

mkaruza commented Sep 30, 2024

@JelteF agree not critical, but could be approach to get better performances of heap table scans (specifically to fetch table pages in parallel; currently there is bottleneck there as only one thread in process can request page fetching).

@JelteF JelteF added performance We need more speed and removed enhancement New feature or request labels Sep 30, 2024
@JelteF JelteF added this to the 0.2.0 milestone Sep 30, 2024
@mkaruza mkaruza requested review from JelteF and Y-- and removed request for JelteF November 4, 2024 11:45
Copy link
Collaborator

@Y-- Y-- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level looks reasonable (and perf improvements are promising).
Wonder if we could implement communication between thread and worker through signals rather than busy-spin on atomic variables, and if this would improve perfs.

LockBuffer(buffer, BUFFER_LOCK_SHARE);

/* is previous buffer done */
while (!pg_atomic_unlocked_test_flag(&thread_worker_shared_state->buffer_ready)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try adding a very small "sleep" in the loop rather than busy-spinning? I wonder if it would help performances.
Also could we use signals instead?
Should we have a timeout to handle situation where thread dies without ever resetting its thread_running or buffer_ready flags?


if (thread_running) {
/* We are out of blocks fo reading so wait for last buffer to be done */
while (!pg_atomic_unlocked_test_flag(&thread_worker_shared_state->buffer_ready)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess same apply here - and we could factorize the two loops in a "wait" function?

}

/* Is buffer ready for reading */
while (pg_atomic_unlocked_test_flag(&m_thread_worker_shared_state->buffer_ready)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use signals here rather than busy-spinning?

@@ -1,6 +1,6 @@
# Configuration

shared_preload_libraries = 'pg_duckdb'
shared_preload_libraries = 'pg_duckdb.so'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary, and actually probably breaks OSX

Doing parallel thread seq scan on heap table is slower than on single
thread and that is because of need for global lock that needs to be
taken after each fetching of buffer, checking tuple visibility.

To speed up execution, we start parallel worker dedicated for thread
that will fetch buffer and pass them to thread. Thread works with this
page directly and once scan for buffer is done worker will relase it.

HeapTupleSatisfiesVisibility call is also problematic because on some
situations it will try to use SetHintBits on same page and that requires
to have lock on page (which is not true for thread). For this purpose
HeapTupleSatisfiesVisibilityNoHintBits was added which has same logic
but doesn't use SetHintBits.

Preliminatory testing showed that there is small difference between
3,4,.. parallel works so to simplfy logic we use hardcoded rule that
will spawn 1 parallel process if number of blocks in thread is bigger
than 2024 and if bigger 2 parallel workers (threads) will be created.
@JelteF
Copy link
Collaborator

JelteF commented Dec 9, 2024

@mkaruza do you want to close this one in favor of #477

@mkaruza
Copy link
Collaborator Author

mkaruza commented Dec 9, 2024

Closing

@mkaruza mkaruza closed this Dec 9, 2024
@mkaruza mkaruza deleted the parallel-scan branch December 14, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance We need more speed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants