Optimize S3 read writes on transcription flow #14

manjunatha-ds · 2024-09-01T09:16:15Z

Description of changes:
In the existing flow, the text for each voice chunk is stored in its own S3 file, so time is spent uploading inside the loop.
Similarly, the Lambda downloads a file from S3 on each iteration while processing.
If an error occurs during the transcribe flow, we restart from that step anyway, so the per-chunk files do not help with recovery.

Hence, this change stores the information for each chunk in a structured format inside a single file, which is gzip-compressed and uploaded to S3. This provides the following advantages:

Removes the I/O calls to and from S3 inside the loop.
Compressed JSON reduces the final file size and the time needed for upload and download.
Since the output is now structured, additional metadata from the ML model can be added for the client to use if required.
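
For reviewers, here is a minimal sketch of the idea in Python. It is not the code in this PR. The function names, the `chunks` payload layout, and the `confidence` metadata field are assumptions made for illustration. The `s3_client` argument is expected to be a boto3 S3 client, which is passed in so the sketch itself only needs the standard library.

```python
import gzip
import json
from typing import Any


def pack_chunks(chunks: list[dict[str, Any]]) -> bytes:
    """Serialize all chunk results to one JSON document and gzip it."""
    payload = {"chunks": chunks}  # hypothetical layout, not the PR's schema
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def unpack_chunks(blob: bytes) -> list[dict[str, Any]]:
    """Reverse of pack_chunks: gunzip, parse JSON, return the chunk list."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))["chunks"]


def upload_transcript(s3_client, bucket: str, key: str,
                      chunks: list[dict[str, Any]]) -> None:
    """Single S3 write after the loop, instead of one write per chunk."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=pack_chunks(chunks),
        ContentType="application/gzip",
    )


def download_transcript(s3_client, bucket: str, key: str) -> list[dict[str, Any]]:
    """Single S3 read before processing, instead of one read per iteration."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return unpack_chunks(response["Body"].read())


# Accumulate results in memory inside the loop; no S3 call per chunk.
results: list[dict[str, Any]] = []
for index, text in enumerate(["first chunk text", "second chunk text"]):
    # "confidence" stands in for any extra model metadata (assumed field).
    results.append({"index": index, "text": text, "confidence": None})

blob = pack_chunks(results)
```

With this shape, the loop only appends to an in-memory list, and `upload_transcript` is called once after it finishes. The trade-off is that all chunk results are held in memory until the upload.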

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

duttaanup · 2024-09-23T07:11:57Z

Thanks @manjunatha-ds for the update. We will review the changes and merge them into main.

Optimize S3 read writes on transcription flow

7382eb8

duttaanup requested a review from kousikraj September 23, 2024 07:12
