Add example for using a separate threadpool for CPU bound work #13424

Draft: wants to merge 5 commits into main

@alamb (Contributor) commented Nov 14, 2024

TODOs:

  • Add an example of attempting IO on an object store (it should error with no IO registered)
  • Contemplate adding the DedicatedExecutor code to make the example simple
  • Add wrapper over object store
  • Add wrapper over streams on the DedicatedExecutor
  • Complete example
  • Split the DedicatedExecutor / etc into its own PR

Which issue does this PR close?

Rationale for this change

I added documentation that explains the problem here:

But now we need to show people how to fix it.

Trying to do this reminds me how non-obvious it is.

What changes are included in this PR?

Add a well-commented example of how to use multiple runtimes

The DedicatedExecutor code is originally from

  1. InfluxDB 3.0 (todo link), largely written by @tustvold and @crepererum
  2. Largely based on Add DedicatedExecutor to FlightSQL Server datafusion-contrib/datafusion-dft#247 from @matthewmturner

The XXX object store code is also based on work from @matthewmturner in

Are these changes tested?

By CI

Are there any user-facing changes?

@github-actions bot added the common label (Related to common crate) on Nov 20, 2024
Comment on lines +126 to +129
// Calling `next()` to drive the plan on the different threadpool
while let Some(batch) = stream.next().await {
    println!("{}", pretty_format_batches(&[batch?]).unwrap());
}
Contributor

I think this is the tricky bit and shouldn't be underestimated. It means that for streaming responses you need to buffer data somewhere or accept higher latency. Also note that if you use ANY form of IO within DF (e.g. to talk to the object store), you need to isolate that as well.

So to me that mostly looks like a hack/workaround for DF not handling this properly by default.
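
To make the buffering trade-off concrete, here is a minimal std-only sketch. It is an analogy, not DataFusion or tokio code: `compute_batch` and `run_on_dedicated_thread` are hypothetical names, and a plain OS thread plus a bounded channel stands in for the separate threadpool. The channel capacity is the "buffer somewhere": a larger value lets the CPU side run ahead of the consumer, while a capacity of 0 makes every batch wait for the consumer.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for CPU-bound work that produces one "batch".
fn compute_batch(i: u64) -> u64 {
    (0..=i * 1_000).sum()
}

/// Computes `n` batches on a dedicated thread and returns them in order.
/// Each batch crosses a bounded buffer holding at most `capacity` items.
fn run_on_dedicated_thread(n: u64, capacity: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<u64>(capacity);

    let producer = thread::Builder::new()
        .name("cpu-worker".to_string())
        .spawn(move || {
            for i in 0..n {
                // `send` blocks while the buffer is full, so the CPU side
                // is throttled by how fast the consumer drains batches.
                if tx.send(compute_batch(i)).is_err() {
                    break; // the consumer went away; stop producing
                }
            }
            // `tx` is dropped here, which ends the consumer's iteration.
        })
        .expect("failed to spawn worker thread");

    // The consumer side: in a real server this would be the IO-facing
    // task forwarding each batch to a client.
    let batches: Vec<u64> = rx.iter().collect();
    producer.join().expect("worker thread panicked");
    batches
}

fn main() {
    let batches = run_on_dedicated_thread(4, 2);
    println!("{:?}", batches);
}
```

The same shape applies when the producer is a stream driven on another runtime: something between the two sides has to hold batches, and its size decides how memory use is traded against latency.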

Contributor Author

I agree -- I hope that the different_runtime_advanced example will show how to do it "right" -- I haven't yet figured out how to do it.

@tustvold and @matthewmturner and I have been discussing the same issue here: datafusion-contrib/datafusion-dft#248 (comment)

Development

Successfully merging this pull request may close these issues.

Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work)