Optimize S3 read writes on transcription flow #14

Open
wants to merge 1 commit into
base: main
from

Conversation

manjunatha-ds

Description of changes:
In the existing flow, the text for each voice chunk is stored in its own S3 file, so time is spent on an upload inside the loop.
Similarly, the Lambda downloads a file from S3 on each iteration while processing.
If any error occurs during the transcribe flow, we restart from that step anyway.

Hence, this change stores the information for each chunk in a structured format inside a single file, which is gzip-compressed and uploaded to S3. This provides the following advantages:

  • Removes S3 I/O calls from inside the loop
  • Compressed JSON reduces the final file size and the network transfer time for upload and download
  • Since the output is now structured, we can add additional metadata from the ML model for the client to use if required.
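
For reviewers, here is a minimal sketch of the idea in Python. It is an illustration under assumptions, not the code in this PR: the function names, the `index`/`text` record fields, and the top-level `chunks` key are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (only its `put_object` call is used).

```python
import gzip
import json


def pack_chunks(chunks):
    """Serialize all chunk records into one gzip-compressed JSON payload."""
    payload = {"chunks": chunks}  # hypothetical layout; extra metadata could sit alongside
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def unpack_chunks(blob):
    """Inverse of pack_chunks: decompress and parse back into the list of records."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["chunks"]


def upload_transcript(s3_client, bucket, key, chunks):
    """Perform a single S3 write after the loop instead of one write per chunk."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=pack_chunks(chunks))


# Accumulate results in memory inside the loop; no S3 call per chunk.
records = []
for index, text in enumerate(["hello", "world"]):  # stand-in for transcribed chunks
    records.append({"index": index, "text": text})

# One compressed object holds every chunk and round-trips losslessly.
blob = pack_chunks(records)
assert unpack_chunks(blob) == records
```

The consumer side would then make one download and call `unpack_chunks` on the body, rather than fetching a file per iteration.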

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@duttaanup
Contributor

Thanks @manjunatha-ds for the update. We will review the changes and merge them into main.

@duttaanup duttaanup requested a review from kousikraj September 23, 2024 07:12

2 participants