Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393
Open
Tracked by
#13456
Labels
enhancement
New feature or request
Comments
I recommend two things:
alamb changed the title from "Document DataFusion Threading" to "Document DataFusion Threading (and how to separate IO and CPU bound work)" on Sep 9, 2024
I think it'd be great to have good documentation on this.
100% agree -- @itsjunetime and @tustvold are working on a bit of it in apache/arrow-rs#6612. I'll try and help with the documentation as well.
alamb changed the title from "Document DataFusion Threading (and how to separate IO and CPU bound work)" to "Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work)" on Nov 11, 2024
This was referenced Nov 14, 2024
Documentation I hope to work on the example a bit more shortly
This was referenced Nov 17, 2024
Is your feature request related to a problem or challenge?
DataFusion performs CPU-bound work within async closures. This causes problems when IO runs on the same async runtime: such schedulers are cooperative, so CPU-bound work that does not yield can starve the tasks servicing IO. This leads to errors such as apache/arrow-rs#5882.
Describe the solution you'd like
I think at the very least this needs to be better documented; I couldn't find any mention of it in the DataFusion documentation after a cursory search.
I also think a more holistic approach would be valuable here. As it stands, the use of async within DataFusion acts as a massive footgun that encourages users to intermix IO and CPU work in a way that is at best inefficient. That can be tracked as a separate follow-on task.
Describe alternatives you've considered
No response
Additional context
No response