Skip to content

Commit

Permalink
configuration changes for cli options & update README
Browse files Browse the repository at this point in the history
- `tags`, `speakers` and `category` must now be used one time per each
item (tag, speaker, category) that we want to add to the metadata of the
transcript
- Better wording for the help text of cli options
- Update README
  • Loading branch information
kouloumos committed Nov 16, 2023
1 parent 7159d95 commit 0c30d15
Show file tree
Hide file tree
Showing 5 changed files with 183 additions and 170 deletions.
156 changes: 77 additions & 79 deletions Readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,36 +2,63 @@

This cli app transcribes audio and video for submission to the [bitcointranscripts](https://github.com/bitcointranscripts/bitcointranscripts) repo.

**Available transcription models and services**

- (local) Whisper `--model xxx [default: tiny.en]`
- (remote) Deepgram (whisper-large) `--deepgram [default: False]`
- summarization `--summarize`
- diarization `--diarize`

**Features**:

- Transcription using [`openai-whisper`](https://github.com/openai/whisper) or [Deepgram](https://deepgram.com/)
- Collection of video's metadata when sourcing from YouTube
- Collection of video's metadata when sourcing from YouTube.
- Open Pull Request on the [bitcointranscripts](https://github.com/bitcointranscripts/bitcointranscripts) repo for the resulting transcript.
- Save the resulting transcript to a markdown format supported by bitcointranscripts.
- Upload the resulting transcript to an AWS S3 Bucket
repo.
- Push the resulting transcript to [a Queuer backend](https://github.com/bitcointranscripts/transcription-review-backend)
- Push the resulting transcript to [a Queuer backend](https://github.com/bitcointranscripts/transcription-review-backend), or save the payload in a json for later use.

## Steps:
## Prerequisites

The step-by-step flow for the scripts are:
- To use [deepgram](https://deepgram.com/) as a transcription service,
you must have a valid `DEEPGRAM_API_KEY` in the `.env` file.

- transcribe given video and generate the output file
- To push the resulting transcript to a Queuer backend, you must have a
valid `QUEUE_ENDPOINT` in the `.env` file. If not, you can instead save
the payload in a json file using the `--noqueue` flag.

- authenticate the user to GitHub
- To enable us fork bitcointranscript repo and open a PR, we require you to
login into your GitHub account. Kindly install `GITHUB CLI` using the
instructions on their repo [here](https://github.com/cli/cli#installation).
Following the prompt, please select the below options from the prompt to
login:

- fork the transcript repo/use their existing fork, clone it and branch out
- what account do you want to log into? `Github.com`

- what is your preferred protocol for Git operations? `SSH`

- copy the transcript file to the new transcript repo
- Upload your SSH public key to your GitHub account? `skip`

- commit new file and push
- How would you like to authenticate GitHub CLI? `Login with a web browser`

- copy the generated one-time pass-code and paste in the browser to
authenticate if you have enabled 2FA

- then open a PR
- To enable pushing the models to a S3 bucket,
- [Install](https://aws.amazon.com/cli/) aws-cli to your system.
- [Configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
aws-cli by first generating IAM credentials (if not already present) and
using `aws configure` to set them.
- To verify proper configuration, run `aws s3 ls` to show the list of S3
buckets. Don't forget to set a valid `S3_BUCKET` in the `.env` file.

or
- To be able to convert the intermediary media files to mp3, install `FFmpeg`

- add the backend url to a `.env` file as `QUEUE_ENDPOINT`. Optionally,
specify `S3_BUCKET` in `.env` for uploading model files.
- for Mac Os users, run `brew install ffmpeg`

- send the transcript data to the backend queue
- for other users, follow the instruction on
their [site](https://ffmpeg.org/) to install

## Install/Uninstall

Expand All @@ -52,51 +79,55 @@ To check the version:

## Usage

`tstbtc {video_id} {directory}` create video transcript supplying the id of the
YouTube video and the associated directory bitcointranscripts destination folder
`tstbtc {source_file/url} {directory}` transcribe the given source

Note: The https links need to be wrapped in quotes when running the command on
zsh
Suported sources:
- YouTube videos
- YouTube playlists
- Local and remote audio files

`tstbtc {audio_url} {directory} --title {title}` create audio transcript
supplying the url of the audio, the source/year and the title of the audio
Note:
- The `directory` is the bitcointranscripts directory that you want to associate the transcript with
- The https links need to be wrapped in quotes when running the command on zsh

To include optional metadata in your transcript, you can add the following
parameters:

- `-t` or `--title`: Supply transcribed file title in 'quotes'
- `-d` or `--date`: Supply the event date in format 'yyyy-mm-dd'
- `-T` or `--tags`: Supply the tags for the transcript in 'quotes' and separated
by commas
- `-s` or `--speakers`: Supply the speakers for the transcript in 'quotes' and
separated by commas
- `-c` or `--category`: Supply the category for the transcript in 'quotes' and
separated by commas
- `-C` or `--chapters`: Split the transcript into chapters based on the supplied
timestamps in the youtube video.
- `-t` or `--title`: Add the title for the resulting transcript (required for audio files)
- `-d` or `--date`: Add the event date to transcript's metadata in format 'yyyy-mm-dd'
- can be used multiple times:
- `-T` or `--tags`: Add a tag to transcript's metadata
- `-s` or `--speakers`: Add a speaker to the transcript's metadata
- `-c` or `--category`: Add a category to the transcript's metadata

To configure the transcription process, you can use the following flags:

- `-m` or `--model`: Select which whisper model to use for the transcription [default: tiny.en]
- `-D` or `--deepgram`: Use deepgram for transcription, instead of using the whisper model [default: False]
- `-M` or `--diarize`: Supply this flag if you have multiple speakers AKA want to diarize the content [only available with deepgram]
- `-S` or `--summarize`: Summarize the transcript [only available with deepgram]
- `-C` or `--chapters`: For YouTube videos, include the YouTube chapters and timestamps in the resulting transcript.
- `-p` or `--pr`: Open a PR on the bitcointranscripts repo
- `-m` or `model`: Supply optional whisper model
- `-u` or `--upload`: Specify if you want to upload the generated model files in
AWS S3.
- `-u` or `--upload`: Upload processed model files to AWS S3
- `--markdown`: Save the resulting transcript to a markdown format supported by bitcointranscripts
- `--noqueue`: Do not push the resulting transcript to the Queuer, instead store the payload in a json file
- `--nocleanup`: Do not remove temp files on exit

#### Examples
### Examples

To
transcribe [this podcast episode](https://www.youtube.com/watch?v=Nq6WxJ0PgJ4)
from Stephan Livera's podcast with the associated metadata, we would run either
To transcribe [this podcast episode](https://www.youtube.com/watch?v=Nq6WxJ0PgJ4) from YouTube
from Stephan Livera's podcast and add the associated metadata, we would run either
of the below commands. The first uses short argument tags, while the second uses
long argument tags. The result is the same.

- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast -t 'OP_Vault - A New Way to HODL?' -d '2023-01-30' -T 'op_vault' -s 'Stephan Livera, James O’Beirn' -c ‘podcast’`
- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast --title 'OP_Vault - A New Way to HODL?' --date '2023-01-30' --tags 'op_vault' --speakers 'Stephan Livera, James O’Beirn' --category ‘podcast’`
- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast -t 'OP_Vault - A New Way to HODL?' -d '2023-01-30' -T 'script' -T 'op_vault' -s 'James O’Beirne' -s 'Stephan Livera' -c ‘podcast’`
- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast --title 'OP_Vault - A New Way to HODL?' --date '2023-01-30' --tags 'script' --tags 'op_vault' --speakers 'James O’Beirne' --speakers 'Stephan Livera' --category ‘podcast’`

You can also transcribe a mp3 link, such as the following from Stephan Livera's
podcast: https://anchor.fm/s/7d083a4/podcast/play/64348045/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-1-1%2Ff7fafb12-9441-7d85-d557-e9e5d18ab788.mp3

For demonstration purposes, let's substitute the link above with the following:
websitelink.mp3. In this scenario, we would run the below command.

- `tstbtc websitelink.mp3 bitcointranscripts/stephan-livera-podcast --title 'SLP455 Anant Tapadia - Single Sig or Multi Sig?' --date '2023-02-01' --tags 'multisig' --speakers 'Stephan Livera, Anant Tapadia' --category 'podcast'`
You can also transcribe a remote audio/mp3 link, such as the following from Stephan Livera's podcast:
```shell
mp3_link="https://anchor.fm/s/7d083a4/podcast/play/64348045/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-1-1%2Ff7fafb12-9441-7d85-d557-e9e5d18ab788.mp3"
tstbtc $mp3_link bitcointranscripts/stephan-livera-podcast --title 'SLP455 Anant Tapadia - Single Sig or Multi Sig?' --date '2023-02-01' --tags 'multisig' --speakers 'Anant Tapadia' --speakers 'Stephan Livera' --category 'podcast'
```

## Testing

Expand All @@ -112,39 +143,6 @@ To run the full test suite

`pytest -v -s`

## OTHER REQUIREMENTS

- To enable us fork bitcointranscript repo and open a PR, we require you to
login into your GitHub account. Kindly install `GITHUB CLI` using the
instructions on their repo [here](https://github.com/cli/cli#installation).
Following the prompt, please select the below options from the prompt to
login:

- what account do you want to log into? `Github.com`

- what is your preferred protocol for Git operations? `SSH`

- Upload your SSH public key to your GitHub account? `skip`

- How would you like to authenticate GitHub CLI? `Login with a web browser`

- copy the generated one-time pass-code and paste in the browser to
authenticate if you have enabled 2FA

- To enable pushing the models to a S3 bucket,
- [Install](https://aws.amazon.com/cli/) aws-cli to your system.
- [Configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
aws-cli by first generating IAM credentials (if not already present) and
using `aws configure` to set them.
- To verify proper configuration, run `aws s3 ls` to show the list of S3
buckets. Set a valid bucket in the `.env` file.

- Install `FFmpeg`

- for Mac Os users, run `brew install ffmpeg`

- for other users, follow the instruction on
their [site](https://ffmpeg.org/) to install

## License

Expand Down
2 changes: 1 addition & 1 deletion app/transcription.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,7 +111,7 @@ def check_if_youtube(source: Source):
except Exception as e:
raise Exception(f"Error from assigning source: {e}")

def add_transcription_source(self, source_file, title=None, date=None, tags=None, category=None, speakers=None, preprocess=True, youtube_metadata=None, chapters=None):
def add_transcription_source(self, source_file, title=None, date=None, tags=[], category=[], speakers=[], preprocess=True, youtube_metadata=None, chapters=None):
"""Calculates the type of the source based on its file name
"""
transcription_sources = {"added": [], "exist": []}
Expand Down
2 changes: 1 addition & 1 deletion test/testAssets/payload.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
"title": "test_title",
"transcript_by": "username via TBTBTC v1.0.0",
"categories": ["category1", "category2"],
"tags": ["tag1", "tag2"],
"tags": [],
"speakers": ["speaker1", "speaker2"],
"date": "2020-01-31",
"loc": "yada/yada",
Expand Down
2 changes: 1 addition & 1 deletion test/test_video.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,7 @@ def test_generate_payload():
username = "username"
title = "test_title"
speakers = ["speaker1", "speaker2"]
tags = ["tag1", "tag2"]
tags = []
category = ["category1", "category2"]
date = "2020-01-31"
loc = "yada/yada"
Expand Down
Loading

0 comments on commit 0c30d15

Please sign in to comment.