
Support S3 transfers using "ProcessPoolExecutor" with s3transfer #8852

Open
2 tasks
kimsehwan96 opened this issue Aug 7, 2024 · 2 comments
Assignees: tim-finnigan
Labels: feature-request (A feature should be added or improved.), needs-review (This issue or pull request needs review from a core team member.), p2 (This is a standard priority issue), s3transfer, s3

Comments


kimsehwan96 commented Aug 7, 2024

Describe the feature

As far as I know, aws-cli uses "s3transfer" for its S3 commands, and s3transfer submits transfer work to a ThreadPoolExecutor, as seen here:
https://github.com/boto/s3transfer/blob/da68b50bb5a6b0c342ad0d87f9b1f80ab81dffce/s3transfer/futures.py#L402-L403

In some environments, where there is plenty of available network bandwidth, enough CPU cores, and a large number of files to download, using a process pool would perform better, and s3transfer already implements a process-pool interface:

https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py

So I think it would be useful to add a feature flag, chosen by the user, that selects between thread-based and process-based execution for S3 transfers.
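
For illustration, the existing process-pool interface can already be used directly from Python. A minimal sketch, based on my reading of s3transfer/processpool.py (the bucket and key names are placeholders):

```python
from s3transfer.processpool import ProcessPoolDownloader, ProcessTransferConfig

# Spread download work across worker processes instead of threads.
config = ProcessTransferConfig(max_request_processes=16)

with ProcessPoolDownloader(config=config) as downloader:
    # Placeholder bucket/keys; each call queues a download and returns a future.
    for key in ('data/file1.bin', 'data/file2.bin'):
        downloader.download_file('my-bucket', key, key.rsplit('/', 1)[-1])
# Leaving the "with" block waits for all queued downloads to complete.
```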

Use Case

If we have to download many files, and the environment running aws-cli has enough resources (CPU, memory, network bandwidth), we could choose to use more CPUs to boost S3 throughput.
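
For example, the flag could sit next to the existing s3 settings in ~/.aws/config. The `executor` key below is purely hypothetical, only to illustrate the kind of option being requested; no such setting exists today:

```
[default]
s3 =
  max_concurrent_requests = 64
  # hypothetical key for this feature request; not currently supported
  executor = process
```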

Proposed Solution

No response

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

CLI version used

2.15.30

Environment details (OS name and version, etc.)

Amazon Linux 2023

@kimsehwan96 kimsehwan96 added feature-request A feature should be added or improved. needs-triage This issue or PR still needs to be triaged. labels Aug 7, 2024
@kimsehwan96 kimsehwan96 changed the title Support S3 transfering with "ProcessPoolExecutor" in s3tranfer Support S3 transferring with "ProcessPoolExecutor" in s3tranfer Aug 7, 2024
@kimsehwan96 kimsehwan96 changed the title Support S3 transferring with "ProcessPoolExecutor" in s3tranfer Support S3 transferring use "ProcessPoolExecutor" with s3tranfer Aug 7, 2024
@tim-finnigan tim-finnigan self-assigned this Aug 14, 2024
tim-finnigan (Contributor) commented:

Thanks for the feature request; we can review it with the team. In the meantime, can you provide any more details on your use case and the results you're seeing? Have you tried setting any of the S3 configurations documented here to optimize downloads? https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html
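
For reference, those settings go under the s3 key in ~/.aws/config, for example (values shown are only examples, not the defaults):

```
[default]
s3 =
  max_concurrent_requests = 64
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB
```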

@tim-finnigan tim-finnigan added s3 p2 This is a standard priority issue s3transfer response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. needs-review This issue or pull request needs review from a core team member. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 14, 2024
kimsehwan96 (Author) commented:

@tim-finnigan

I reviewed the documentation and tried increasing the max_concurrent_requests to improve performance.

For example, I tested this on a c7g.16xlarge instance, which has a network interface capable of 30Gbps bandwidth. I set max_concurrent_requests to 64, matching the number of vCPUs, but the download speed didn’t improve as much as I expected.

Since s3transfer uses ThreadPoolExecutor by default, it might be helpful to give users the option to use ProcessPoolExecutor. This way, users with more CPU resources available could potentially speed up their downloads.

In my tests, using ProcessPoolExecutor for parallel downloads from S3 with boto3, I was almost able to fully use the 30Gbps bandwidth—something that wasn’t possible with ThreadPoolExecutor.
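
Roughly, the approach was along these lines (a simplified sketch, not the exact benchmark script; bucket and key names are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

import boto3

_s3 = None  # one client per worker process, created lazily


def download_one(key):
    global _s3
    if _s3 is None:
        _s3 = boto3.client('s3')
    # Placeholder bucket; write each object to the current directory.
    _s3.download_file('my-bucket', key, key.replace('/', '_'))


if __name__ == '__main__':
    keys = ['data/part-%05d' % i for i in range(1000)]  # placeholder keys
    with ProcessPoolExecutor(max_workers=64) as pool:
        # Each worker process downloads objects independently, so throughput
        # is not limited by the GIL of a single Python process.
        list(pool.map(download_one, keys))
```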

I think adding an option for ProcessPoolExecutor could help achieve download speeds similar to tools like s5cmd.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 15, 2024