Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added documentation for SortMergeJoin #13469

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

athultr1997
Copy link
Contributor

Which issue does this PR close?

Closes #10357

Rationale for this change

What changes are included in this PR?

Added documentation for SortMergeJoin

Are these changes tested?

No tests needed since only documentation is added.

Are there any user-facing changes?

No

@athultr1997 athultr1997 marked this pull request as draft November 18, 2024 15:55
@github-actions github-actions bot added the physical-expr Physical Expressions label Nov 18, 2024
@athultr1997 athultr1997 marked this pull request as ready for review November 18, 2024 16:59
@comphead
Copy link
Contributor

I would be adding this short video to make the dev/user familiar with SMJ concepts https://www.youtube.com/watch?v=jiWCPJtDE2c

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @athultr1997 this is good start
@korowa @viirya do you have anything to add on it?

/// partitions.
/// Join execution plan that executes equi-join predicates on multiple partitions using Sort-Merge
/// join algorithm and applies an optional filter post join. Can be used to join arbitrarily large
/// inputs where one or both of the inputs don't fit in the available memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where one or both of the inputs don't fit in the available memory.

Hmm, is this true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streamed - Always taken one batch at a time.
buffered - Has spilling support.
Hence, the inputs don't have to fit in memory.

Also, I think this was the vision behind SMJ: #1599 (comment).

///
/// Buffered input is buffered for all record batches having the same value of join key.
/// If the memory limit increases beyond the specified value and spilling is enabled,
/// buffered batches could be spilled to disk. If spilling is disabled, the execution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a config for spilling? Shall we mention it here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One has to enable the disk manager when creating the RuntimeEnv during the creation of TaskContext.

let runtime = RuntimeEnvBuilder::new()
            .with_memory_limit(100, 1.0)
            .with_disk_manager(DiskManagerConfig::NewOs)
            .build_arc()?;

I am not sure if we should mention it here, since this is all the way over in RuntimeEnv and there are multiple strategies to enable the disk manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can point to DiskManager documentation here

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @athultr1997
its a good start, however its many more things to document there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Sort Merge Join algorithm
3 participants