Support S3 transfers using "ProcessPoolExecutor" with s3transfer #8852
Comments
Thanks for the feature request; we can review it with the team. In the meantime, can you provide any more details on your use case and the results you're seeing? Have you tried setting any of the S3 configurations documented here to optimize downloads: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html ?
I reviewed the documentation and tried increasing max_concurrent_requests to improve performance. For example, I tested this on a c7g.16xlarge instance, which has a network interface capable of 30Gbps bandwidth. I set max_concurrent_requests to 64, matching the number of vCPUs, but the download speed didn't improve as much as I expected. Since s3transfer uses ThreadPoolExecutor by default, it might be helpful to give users the option to use ProcessPoolExecutor. This way, users with more CPU resources available could potentially speed up their downloads. In my tests, using ProcessPoolExecutor for parallel downloads from S3 with boto3, I was almost able to fully use the 30Gbps bandwidth—something that wasn't possible with ThreadPoolExecutor. I think adding an option for ProcessPoolExecutor could help achieve download speeds similar to tools like s5cmd.
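To illustrate the pattern described above: a minimal sketch of dispatching downloads with ProcessPoolExecutor instead of ThreadPoolExecutor. Here fetch_object and download_all are hypothetical names for illustration, and fetch_object is a stand-in for a real boto3 download call — this is not s3transfer's or the CLI's actual code.

```python
# Sketch: parallel "downloads" dispatched via a process pool instead of a
# thread pool. fetch_object stands in for a real boto3 s3.download_file call.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fetch_object(key: str) -> str:
    # In a real worker each process would create its own boto3 client
    # (clients should not be shared across processes) and download the key.
    return f"downloaded:{key}"

def download_all(keys, use_processes=False, workers=4):
    # Each process has its own GIL, so CPU-heavy per-object work
    # (checksumming, decryption, decompression) can scale across cores,
    # which threads in one interpreter cannot do.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(fetch_object, keys))

if __name__ == "__main__":
    print(download_all(["a", "b"], use_processes=True))
```

The function must be defined at module top level so it is picklable for the process pool; that constraint is one reason a process-based executor is not a drop-in replacement for the thread-based one.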
Describe the feature
As far as I know, aws-cli uses "s3transfer" for S3 operations, and s3transfer uses ThreadPoolExecutor, as seen here: https://github.com/boto/s3transfer/blob/da68b50bb5a6b0c342ad0d87f9b1f80ab81dffce/s3transfer/futures.py#L402-L403
In some environments — enough available network bandwidth, enough CPU cores, and lots of files to download — using ProcessPoolExecutor would be better. And s3transfer has already implemented an interface that uses ProcessPoolExecutor:
https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py
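A hedged sketch of how the process-pool interface in the linked module could be driven. ProcessPoolDownloader and ProcessTransferConfig are classes defined in s3transfer/processpool.py; download_with_processes is a hypothetical wrapper, and the import is done inside the function only so this sketch loads even where s3transfer is not installed.

```python
# Sketch (not the CLI's actual wiring): drive s3transfer's existing
# process-pool downloader for a batch of keys.
def download_with_processes(bucket, keys, dest_dir, max_processes=8):
    # Imported lazily so the sketch can be loaded without s3transfer present.
    from s3transfer.processpool import (
        ProcessPoolDownloader,
        ProcessTransferConfig,
    )

    config = ProcessTransferConfig(max_request_processes=max_processes)
    with ProcessPoolDownloader(config=config) as downloader:
        for key in keys:
            # download_file queues the transfer; the context manager waits
            # for all queued downloads to complete on exit.
            downloader.download_file(bucket, key, f"{dest_dir}/{key}")
```

A feature flag in aws-cli could simply route S3 download commands through this downloader instead of the thread-based transfer manager when the user opts in.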
So I think adding a feature flag, settable by the user, for selecting thread or process execution for S3 operations would be better.
Use Case
If we have to download many files and the environment running aws-cli has enough resources (CPU, memory, network bandwidth), then we could choose to use more CPUs to boost S3 throughput.
Proposed Solution
No response
Other Information
No response
Acknowledgements
CLI version used
2.15.30
Environment details (OS name and version, etc.)
Amazon Linux 2023