Support S3 transfers using "ProcessPoolExecutor" with s3transfer #8852
Comments
Thanks for the feature request; we can review it with the team. In the meantime, can you provide any more details on your use case and the results you're seeing? Have you tried setting any of the S3 configurations documented here to optimize downloads: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html ?
I reviewed the documentation and tried increasing max_concurrent_requests to improve performance. For example, I tested this on a c7g.16xlarge instance, which has a network interface capable of 30Gbps bandwidth. I set max_concurrent_requests to 64, matching the number of vCPUs, but the download speed didn't improve as much as I expected. Since s3transfer uses ThreadPoolExecutor by default, it might be helpful to give users the option to use ProcessPoolExecutor. This way, users with more CPU resources available could potentially speed up their downloads. In my tests, using ProcessPoolExecutor for parallel downloads from S3 with boto3, I was almost able to fully use the 30Gbps bandwidth—something that wasn't possible with ThreadPoolExecutor. I think adding an option for ProcessPoolExecutor could help achieve download speeds similar to tools like s5cmd.
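To illustrate the pattern described above: a minimal sketch of dispatching downloads with ProcessPoolExecutor instead of ThreadPoolExecutor. Here fetch_object and download_all are hypothetical names for illustration, and fetch_object is a stand-in for a real boto3 download call — this is not s3transfer's or the CLI's actual code.

```python
# Sketch: parallel "downloads" dispatched via a process pool instead of a
# thread pool. fetch_object stands in for a real boto3 s3.download_file call.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fetch_object(key: str) -> str:
    # In a real worker each process would create its own boto3 client
    # (clients should not be shared across processes) and download the key.
    return f"downloaded:{key}"

def download_all(keys, use_processes=False, workers=4):
    # Each process has its own GIL, so CPU-heavy per-object work
    # (checksumming, decryption, decompression) can scale across cores,
    # which threads in one interpreter cannot do.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(fetch_object, keys))

if __name__ == "__main__":
    print(download_all(["a", "b"], use_processes=True))
```

The function must be defined at module top level so it is picklable for the process pool; that constraint is one reason a process-based executor is not a drop-in replacement for the thread-based one.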
Describe the feature
As far as I know, aws-cli uses "s3transfer" for S3 operations, and s3transfer uses ThreadPoolExecutor, as seen here: https://github.com/boto/s3transfer/blob/da68b50bb5a6b0c342ad0d87f9b1f80ab81dffce/s3transfer/futures.py#L402-L403
In some environments — enough available network bandwidth, enough CPU cores, and lots of files to download — using ProcessPoolExecutor would be better. And s3transfer has already implemented an interface that uses ProcessPoolExecutor:
https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py
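A hedged sketch of how the process-pool interface in the linked module could be driven. ProcessPoolDownloader and ProcessTransferConfig are classes defined in s3transfer/processpool.py; download_with_processes is a hypothetical wrapper, and the import is done inside the function only so this sketch loads even where s3transfer is not installed.

```python
# Sketch (not the CLI's actual wiring): drive s3transfer's existing
# process-pool downloader for a batch of keys.
def download_with_processes(bucket, keys, dest_dir, max_processes=8):
    # Imported lazily so the sketch can be loaded without s3transfer present.
    from s3transfer.processpool import (
        ProcessPoolDownloader,
        ProcessTransferConfig,
    )

    config = ProcessTransferConfig(max_request_processes=max_processes)
    with ProcessPoolDownloader(config=config) as downloader:
        for key in keys:
            # download_file queues the transfer; the context manager waits
            # for all queued downloads to complete on exit.
            downloader.download_file(bucket, key, f"{dest_dir}/{key}")
```

A feature flag in aws-cli could simply route S3 download commands through this downloader instead of the thread-based transfer manager when the user opts in.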
So I think adding a feature flag, settable by the user, for selecting thread or process execution for S3 operations would be better.
Use Case
If we have to download many files and the environment running aws-cli has enough resources (CPU, memory, network bandwidth), then we could choose to use more CPUs to boost S3 throughput.
Proposed Solution
No response
Other Information
No response
Acknowledgements
CLI version used
2.15.30
Environment details (OS name and version, etc.)
Amazon Linux 2023