diff --git a/Readme.md b/Readme.md
index 11748f7..35d8be5 100644
--- a/Readme.md
+++ b/Readme.md
@@ -2,36 +2,63 @@
 This cli app transcribes audio and video for submission to the
 [bitcointranscripts](https://github.com/bitcointranscripts/bitcointranscripts) repo.

+**Available transcription models and services**
+
+- (local) Whisper `--model xxx [default: tiny.en]`
+- (remote) Deepgram (whisper-large) `--deepgram [default: False]`
+  - summarization `--summarize`
+  - diarization `--diarize`
+
 **Features**:
+
 - Transcription using [`openai-whisper`](https://github.com/openai/whisper) or [Deepgram](https://deepgram.com/)
-- Collection of video's metadata when sourcing from YouTube
+- Collection of the video's metadata when sourcing from YouTube.
 - Open Pull Request on the [bitcointranscripts](https://github.com/bitcointranscripts/bitcointranscripts) repo for the resulting transcript.
+- Save the resulting transcript to a markdown format supported by bitcointranscripts.
 - Upload the resulting transcript to an AWS S3 Bucket repo.
-- Push the resulting transcript to [a Queuer backend](https://github.com/bitcointranscripts/transcription-review-backend)
+- Push the resulting transcript to [a Queuer backend](https://github.com/bitcointranscripts/transcription-review-backend), or save the payload in a JSON file for later use.

-## Steps:
+## Prerequisites

-The step-by-step flow for the scripts are:
+- To use [deepgram](https://deepgram.com/) as a transcription service,
+  you must have a valid `DEEPGRAM_API_KEY` in the `.env` file.

-- transcribe given video and generate the output file
+- To push the resulting transcript to a Queuer backend, you must have a
+  valid `QUEUE_ENDPOINT` in the `.env` file. If not, you can instead save
+  the payload in a JSON file using the `--noqueue` flag.

-- authenticate the user to GitHub
+- To allow the app to fork the bitcointranscripts repo and open a PR, you
+  need to be logged in to your GitHub account. Install the GitHub CLI using
+  the instructions on its repo [here](https://github.com/cli/cli#installation).
+  When prompted during login, select the following options:

-- fork the transcript repo/use their existing fork, clone it and branch out
+  - What account do you want to log into? `GitHub.com`
+
+  - What is your preferred protocol for Git operations? `SSH`

-- copy the transcript file to the new transcript repo
+  - Upload your SSH public key to your GitHub account? `skip`

-- commit new file and push
+  - How would you like to authenticate GitHub CLI? `Login with a web browser`
+
+  - Copy the generated one-time passcode and paste it in the browser to
+    authenticate if you have enabled 2FA.

-- then open a PR
+- To enable pushing the models to an S3 bucket,
+  - [Install](https://aws.amazon.com/cli/) aws-cli on your system.
+  - [Configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
+    aws-cli by first generating IAM credentials (if not already present) and
+    using `aws configure` to set them.
+  - To verify proper configuration, run `aws s3 ls` to show the list of S3
+    buckets. Don't forget to set a valid `S3_BUCKET` in the `.env` file.

- or
+- To convert the intermediary media files to mp3, install `FFmpeg`
-- add the backend url to a `.env` file as `QUEUE_ENDPOINT`. Optionally,
-  specify `S3_BUCKET` in `.env` for uploading model files.
+  - for macOS users, run `brew install ffmpeg`
-- send the transcript data to the backend queue
+  - for other users, follow the instructions on
+    their [site](https://ffmpeg.org/) to install
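The prerequisites above boil down to a handful of `.env` keys. As an illustrative sketch (not a feature of the CLI itself), they can be sanity-checked with `python-dotenv`, the same helper the app uses internally:

```python
from dotenv import dotenv_values

# Illustrative check of the .env keys referenced in this README.
config = dotenv_values(".env")

if "DEEPGRAM_API_KEY" not in config:
    print("Deepgram transcription (--deepgram) needs DEEPGRAM_API_KEY")
if "QUEUE_ENDPOINT" not in config:
    print("Pushing to the Queuer backend needs QUEUE_ENDPOINT (or use --noqueue)")
if "S3_BUCKET" not in config:
    print("Uploading model files (--upload) needs S3_BUCKET")
```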

 ## Install/Uninstall

@@ -52,51 +79,55 @@ To check the version:

 ## Usage

-`tstbtc {video_id} {directory}` create video transcript supplying the id of the
-YouTube video and the associated directory bitcointranscripts destination folder
+`tstbtc {source_file/url} {directory}` transcribes the given source

-Note: The https links need to be wrapped in quotes when running the command on
-zsh
+Supported sources:
+  - YouTube videos
+  - YouTube playlists
+  - Local and remote audio files

-`tstbtc {audio_url} {directory} --title {title}` create audio transcript
-supplying the url of the audio, the source/year and the title of the audio
+Note:
+- The `directory` is the bitcointranscripts directory that you want to associate the transcript with
+- The https links need to be wrapped in quotes when running the command on zsh

 To include optional metadata in your transcript, you can add the following
 parameters:

-- `-t` or `--title`: Supply transcribed file title in 'quotes'
-- `-d` or `--date`: Supply the event date in format 'yyyy-mm-dd'
-- `-T` or `--tags`: Supply the tags for the transcript in 'quotes' and separated
-  by commas
-- `-s` or `--speakers`: Supply the speakers for the transcript in 'quotes' and
-  separated by commas
-- `-c` or `--category`: Supply the category for the transcript in 'quotes' and
-  separated by commas
-- `-C` or `--chapters`: Split the transcript into chapters based on the supplied
-  timestamps in the youtube video.
+- `-t` or `--title`: Add the title for the resulting transcript (required for audio files)
+- `-d` or `--date`: Add the event date to the transcript's metadata in format 'yyyy-mm-dd'
+- The following can be used multiple times:
+  - `-T` or `--tags`: Add a tag to the transcript's metadata
+  - `-s` or `--speakers`: Add a speaker to the transcript's metadata
+  - `-c` or `--category`: Add a category to the transcript's metadata
+
+To configure the transcription process, you can use the following flags:
+
+- `-m` or `--model`: Select which whisper model to use for the transcription [default: tiny.en]
+- `-D` or `--deepgram`: Use deepgram for transcription instead of the whisper model [default: False]
+- `-M` or `--diarize`: Diarize the content, i.e. label the different speakers [only available with deepgram]
+- `-S` or `--summarize`: Summarize the transcript [only available with deepgram]
+- `-C` or `--chapters`: For YouTube videos, include the YouTube chapters and timestamps in the resulting transcript.
 - `-p` or `--pr`: Open a PR on the bitcointranscripts repo
-- `-m` or `model`: Supply optional whisper model
-- `-u` or `--upload`: Specify if you want to upload the generated model files in
-  AWS S3.
+- `-u` or `--upload`: Upload processed model files to AWS S3 +- `--markdown`: Save the resulting transcript to a markdown format supported by bitcointranscripts +- `--noqueue`: Do not push the resulting transcript to the Queuer, instead store the payload in a json file +- `--nocleanup`: Do not remove temp files on exit -#### Examples +### Examples -To -transcribe [this podcast episode](https://www.youtube.com/watch?v=Nq6WxJ0PgJ4) -from Stephan Livera's podcast with the associated metadata, we would run either +To transcribe [this podcast episode](https://www.youtube.com/watch?v=Nq6WxJ0PgJ4) from YouTube +from Stephan Livera's podcast and add the associated metadata, we would run either of the below commands. The first uses short argument tags, while the second uses long argument tags. The result is the same. -- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast -t 'OP_Vault - A New Way to HODL?' -d '2023-01-30' -T 'op_vault' -s 'Stephan Livera, James O’Beirn' -c ‘podcast’` -- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast --title 'OP_Vault - A New Way to HODL?' --date '2023-01-30' --tags 'op_vault' --speakers 'Stephan Livera, James O’Beirn' --category ‘podcast’` +- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast -t 'OP_Vault - A New Way to HODL?' -d '2023-01-30' -T 'script' -T 'op_vault' -s 'James O’Beirne' -s 'Stephan Livera' -c ‘podcast’` +- `tstbtc Nq6WxJ0PgJ4 bitcointranscripts/stephan-livera-podcast --title 'OP_Vault - A New Way to HODL?' --date '2023-01-30' --tags 'script' --tags 'op_vault' --speakers 'James O’Beirne' --speakers 'Stephan Livera' --category ‘podcast’` -You can also transcribe a mp3 link, such as the following from Stephan Livera's -podcast: https://anchor.fm/s/7d083a4/podcast/play/64348045/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-1-1%2Ff7fafb12-9441-7d85-d557-e9e5d18ab788.mp3 - -For demonstration purposes, let's substitute the link above with the following: -websitelink.mp3. In this scenario, we would run the below command. - -- `tstbtc websitelink.mp3 bitcointranscripts/stephan-livera-podcast --title 'SLP455 Anant Tapadia - Single Sig or Multi Sig?' --date '2023-02-01' --tags 'multisig' --speakers 'Stephan Livera, Anant Tapadia' --category 'podcast'` +You can also transcribe a remote audio/mp3 link, such as the following from Stephan Livera's podcast: +```shell +mp3_link="https://anchor.fm/s/7d083a4/podcast/play/64348045/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2023-1-1%2Ff7fafb12-9441-7d85-d557-e9e5d18ab788.mp3" +tstbtc $mp3_link bitcointranscripts/stephan-livera-podcast --title 'SLP455 Anant Tapadia - Single Sig or Multi Sig?' --date '2023-02-01' --tags 'multisig' --speakers 'Anant Tapadia' --speakers 'Stephan Livera' --category 'podcast' +``` ## Testing @@ -112,39 +143,6 @@ To run the full test suite `pytest -v -s` -## OTHER REQUIREMENTS - -- To enable us fork bitcointranscript repo and open a PR, we require you to - login into your GitHub account. Kindly install `GITHUB CLI` using the - instructions on their repo [here](https://github.com/cli/cli#installation). - Following the prompt, please select the below options from the prompt to - login: - - - what account do you want to log into? `Github.com` - - - what is your preferred protocol for Git operations? `SSH` - - - Upload your SSH public key to your GitHub account? `skip` - - - How would you like to authenticate GitHub CLI? 
`Login with a web browser` - - - copy the generated one-time pass-code and paste in the browser to - authenticate if you have enabled 2FA - -- To enable pushing the models to a S3 bucket, - - [Install](https://aws.amazon.com/cli/) aws-cli to your system. - - [Configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) - aws-cli by first generating IAM credentials (if not already present) and - using `aws configure` to set them. - - To verify proper configuration, run `aws s3 ls` to show the list of S3 - buckets. Set a valid bucket in the `.env` file. - -- Install `FFmpeg` - - - for Mac Os users, run `brew install ffmpeg` - - - for other users, follow the instruction on - their [site](https://ffmpeg.org/) to install ## License diff --git a/app/application.py b/app/application.py index c5fe6a3..1362930 100644 --- a/app/application.py +++ b/app/application.py @@ -25,97 +25,15 @@ from pytube.exceptions import PytubeError from app import __app_name__, __version__ +from app.utils import write_to_json +from app.logging import get_logger - -def download_video(url, working_dir="tmp/"): - logger = logging.getLogger(__app_name__) - try: - logger.info("URL: " + url) - logger.info("Downloading video... Please wait.") - - ydl_opts = { - "format": "18", - "outtmpl": os.path.join(working_dir, "videoFile.%(ext)s"), - "nopart": True, - "writeinfojson": True, - } - with yt_dlp.YoutubeDL(ydl_opts) as ytdl: - ytdl.download([url]) - - with open(os.path.join(working_dir, "videoFile.info.json")) as file: - info = ytdl.sanitize_info(json.load(file)) - name = info["title"].replace("/", "-") - file.close() - - os.rename( - os.path.join(working_dir, "videoFile.mp4"), - os.path.join(working_dir, name + ".mp4"), - ) - - return os.path.abspath(os.path.join(working_dir, name + ".mp4")) - except Exception as e: - logger.error(f"Error downloading video: {e}") - shutil.rmtree(working_dir) - return - - -def read_description(prefix): - logger = logging.getLogger(__app_name__) - try: - list_of_chapters = [] - with open(prefix + "videoFile.info.json", "r") as f: - info = json.load(f) - if "chapters" not in info: - logger.info("No chapters found in description") - return list_of_chapters - for index, x in enumerate(info["chapters"]): - name = x["title"] - start = x["start_time"] - list_of_chapters.append((str(index), start, str(name))) - - return list_of_chapters - except Exception as e: - logger.error(f"Error reading description: {e}") - return [] - - -def write_chapters_file(chapter_file: str, chapter_list: list) -> None: - # Write out the chapter file based on simple MP4 format (OGM) - logger = logging.getLogger(__app_name__) - try: - with open(chapter_file, "w") as fo: - for current_chapter in chapter_list: - fo.write( - f"CHAPTER{current_chapter[0]}=" - f"{current_chapter[1]}\n" - f"CHAPTER{current_chapter[0]}NAME=" - f"{current_chapter[2]}\n" - ) - fo.close() - except Exception as e: - logger.error("Error writing chapter file") - logger.error(e) - - -def convert_video_to_mp3(filename, working_dir="tmp/"): - logger = logging.getLogger(__app_name__) - try: - clip = VideoFileClip(filename) - logger.info("Converting video to mp3... 
Please wait.") - logger.info(filename[:-4] + ".mp3") - clip.audio.write_audiofile( - os.path.join(working_dir, filename.split("/")[-1][:-4] + ".mp3") - ) - clip.close() - logger.info("Converted video to mp3") - return os.path.join(working_dir, filename.split("/")[-1][:-4] + ".mp3") - except Exception as e: - logger.error(f"Error converting video to mp3: {e}") - return None +logger = get_logger() def convert_wav_to_mp3(abs_path, filename, working_dir="tmp/"): logger = logging.getLogger(__app_name__) + logger.info(f"Converting {abs_path} to mp3...") op = subprocess.run( ["ffmpeg", "-i", abs_path, filename[:-4] + ".mp3"], cwd=working_dir, @@ -127,90 +45,6 @@ def convert_wav_to_mp3(abs_path, filename, working_dir="tmp/"): return os.path.abspath(os.path.join(working_dir, filename[:-4] + ".mp3")) -def check_if_playlist(media): - logger = logging.getLogger(__app_name__) - try: - if ( - media.startswith("PL") - or media.startswith("UU") - or media.startswith("FL") - or media.startswith("RD") - ): - return True - playlists = list(pytube.Playlist(media).video_urls) - if type(playlists) is not list: - return False - return True - except Exception as e: - logger.error(f"Pytube Error: {e}") - return False - - -def check_if_video(media): - logger = logging.getLogger(__app_name__) - if re.search(r"^([\dA-Za-z_-]{11})$", media): - return True - try: - pytube.YouTube(media) - return True - except PytubeError as e: - logger.error(f"Pytube Error: {e}") - return False - - -def get_playlist_videos(url): - logger = logging.getLogger(__app_name__) - try: - videos = pytube.Playlist(url) - return videos - except Exception as e: - logger.error("Error getting playlist videos") - logger.error(e) - return - - -def get_audio_file(url, title, working_dir="tmp/"): - logger = logging.getLogger(__app_name__) - logger.info("URL: " + url) - logger.info("downloading audio file") - try: - audio = requests.get(url, stream=True) - with open(os.path.join(working_dir, title + ".mp3"), "wb") as f: - total_length = int(audio.headers.get("content-length")) - for chunk in progress.bar( - audio.iter_content(chunk_size=1024), - expected_size=(total_length / 1024) + 1, - ): - if chunk: - f.write(chunk) - f.flush() - return os.path.join(working_dir, title + ".mp3") - except Exception as e: - logger.error("Error downloading audio file") - logger.error(e) - return - - -def process_mp3(filename, model, upload, model_output_dir): - logger = logging.getLogger(__app_name__) - logger.info("Transcribing audio to text using whisper ...") - try: - my_model = whisper.load_model(model) - result = my_model.transcribe(filename) - data = [] - for x in result["segments"]: - data.append(tuple((x["start"], x["end"], x["text"]))) - data_path = generate_srt(data, filename, model_output_dir) - if upload: - upload_file_to_s3(data_path) - logger.info("Removed video and audio files") - return data - except Exception as e: - logger.error("Error transcribing audio to text") - logger.error(e) - return - - def decimal_to_sexagesimal(dec): sec = int(dec % 60) minu = int((dec // 60) % 60) @@ -254,7 +88,7 @@ def combine_chapter(chapters, transcript, working_dir="tmp/"): def combine_deepgram_chapters_with_diarization(deepgram_data, chapters): - logger = logging.getLogger(__app_name__) + logger.info("(deepgram) Combining transcript with detected chapters...") try: para = "" string = "" @@ -311,42 +145,56 @@ def combine_deepgram_chapters_with_diarization(deepgram_data, chapters): logger.error(e) -def get_deepgram_transcript( - deepgram_data, diarize, title, upload, 
model_output_dir -): - if diarize: - para = "" - string = "" - curr_speaker = None - data_path = save_local_json(deepgram_data, title, model_output_dir) - if upload: - upload_file_to_s3(data_path) - for word in deepgram_data["results"]["channels"][0]["alternatives"][0][ - "words" - ]: - if word["speaker"] != curr_speaker: - if para != "": - para = para.strip(" ") - string = string + para + "\n\n" - para = "" - string = ( - string + f'Speaker {word["speaker"]}: ' - f'{decimal_to_sexagesimal(word["start"])}' - ) - curr_speaker = word["speaker"] - string = string + "\n\n" +def get_deepgram_transcript(deepgram_data, diarize, title, upload, model_output_dir): + logger = logging.getLogger(__app_name__) - para = para + " " + word["punctuated_word"] - para = para.strip(" ") - string = string + para - return string - else: - data_path = save_local_json(deepgram_data, title, model_output_dir) + def save_local_json(json_data, title, model_output_dir): + time_in_str = datetime.now().strftime("%Y-%m-%d-%H-%M-%S") + if not os.path.isdir(model_output_dir): + os.makedirs(model_output_dir) + file_path = os.path.join( + model_output_dir, title + "_" + time_in_str + ".json" + ) + with open(file_path, "w") as json_file: + json.dump(json_data, json_file, indent=4) + logger.info(f"(deepgram) Model stored at: {file_path}") + return file_path + try: + data_path = write_to_json( + deepgram_data, model_output_dir, title) + logger.info(f"(deepgram) Model stored at: {data_path}") if upload: upload_file_to_s3(data_path) - return deepgram_data["results"]["channels"][0]["alternatives"][0][ - "transcript" - ] + if diarize: + logger.info(f"(deepgram) Processing diarization...") + para = "" + string = "" + curr_speaker = None + for word in deepgram_data["results"]["channels"][0]["alternatives"][0][ + "words" + ]: + if word["speaker"] != curr_speaker: + if para != "": + para = para.strip(" ") + string = string + para + "\n\n" + para = "" + string = ( + string + f'Speaker {word["speaker"]}: ' + f'{decimal_to_sexagesimal(word["start"])}' + ) + curr_speaker = word["speaker"] + string = string + "\n\n" + + para = para + " " + word["punctuated_word"] + para = para.strip(" ") + string = string + para + return string + else: + return deepgram_data["results"]["channels"][0]["alternatives"][0][ + "transcript" + ] + except Exception as e: + raise Exception(f"Error while getting deepgram transcript: {e}") def get_deepgram_summary(deepgram_data): @@ -365,6 +213,7 @@ def get_deepgram_summary(deepgram_data): def process_mp3_deepgram(filename, summarize, diarize): + """using deepgram""" logger = logging.getLogger(__app_name__) logger.info("Transcribing audio to text using deepgram...") try: @@ -388,160 +237,7 @@ def process_mp3_deepgram(filename, summarize, diarize): audio.close() return response except Exception as e: - logger.error("Error transcribing audio to text") - logger.error(e) - return - - -def create_transcript(data): - result = "" - for x in data: - result = result + x[2] + " " - - return result - - -def initialize(): - logger = logging.getLogger(__app_name__) - try: - # FFMPEG installed on first use. 
- logger.debug("Initializing FFMPEG...") - static_ffmpeg.add_paths() - logger.debug("Initialized FFMPEG") - except Exception as e: - logger.error("Error initializing") - logger.error(e) - - -def write_to_file( - result, - loc, - url, - title, - date, - tags, - category, - speakers, - video_title, - username, - local, - test, - pr, - summary, - working_dir="tmp/", -): - logger = logging.getLogger(__app_name__) - try: - transcribed_text = result - if title: - file_title = title - else: - file_title = video_title - meta_data = ( - "---\n" - f"title: {file_title}\n" - f"transcript_by: {username} via TBTBTC v{__version__}\n" - ) - if not local: - meta_data += f"media: {url}\n" - if tags: - tags = tags.strip() - tags = tags.split(",") - for i in range(len(tags)): - tags[i] = tags[i].strip() - meta_data += f"tags: {tags}\n" - if speakers: - speakers = speakers.strip() - speakers = speakers.split(",") - for i in range(len(speakers)): - speakers[i] = speakers[i].strip() - meta_data += f"speakers: {speakers}\n" - if category: - category = category.strip() - category = category.split(",") - for i in range(len(category)): - category[i] = category[i].strip() - meta_data += f"categories: {category}\n" - if summary: - meta_data += f"summary: {summary}\n" - - file_name = video_title.replace(" ", "-") - file_name_with_ext = os.path.join(working_dir, file_name + ".md") - - if date: - meta_data += f"date: {date}\n" - - meta_data += "---\n" - if test is not None or pr: - with open(file_name_with_ext, "a") as opf: - opf.write(meta_data + "\n") - opf.write(transcribed_text + "\n") - opf.close() - if local: - url = None - if not pr: - generate_payload( - loc=loc, - title=file_title, - transcript=transcribed_text, - media=url, - tags=tags, - category=category, - speakers=speakers, - username=username, - event_date=date, - test=test, - ) - return os.path.abspath(file_name_with_ext) - except Exception as e: - logger.error("Error writing to file") - logger.error(e) - - -def get_md_file_path( - result, - loc, - video, - title, - event_date, - tags, - category, - speakers, - username, - local, - video_title, - test, - pr, - summary="", - working_dir="tmp/", -): - logger = logging.getLogger(__app_name__) - try: - logger.info("writing .md file") - file_name_with_ext = write_to_file( - result, - loc, - video, - title, - event_date, - tags, - category, - speakers, - video_title, - username, - local, - test, - pr, - summary, - working_dir=working_dir, - ) - logger.info("wrote .md file") - - absolute_path = os.path.abspath(file_name_with_ext) - return absolute_path - except Exception as e: - logger.error("Error getting markdown file path") - logger.error(e) + raise Exception(f"(deepgram) Error transcribing audio to text: {e}") def create_pr(absolute_path, loc, username, curr_time, title): @@ -564,210 +260,8 @@ def create_pr(absolute_path, loc, username, curr_time, title): logger.info("Please check the PR for the transcription.") -def get_username(): - logger = logging.getLogger(__app_name__) - try: - if os.path.isfile(".username"): - with open(".username", "r") as f: - username = f.read() - f.close() - else: - print("What is your github username?") - username = input() - with open(".username", "w") as f: - f.write(username) - f.close() - return username - except Exception as e: - logger.error("Error getting username") - logger.error(e) - - -def check_source_type(source): - """Returns the type of source based on the file name - """ - source_type = None - local = False - if source.endswith(".mp3") or source.endswith(".wav"): 
- source_type = "audio" - elif check_if_playlist(source): - source_type = "playlist" - elif check_if_video(source): - source_type = "video" - # check if source is a local file - if os.path.isfile(source): - local = True - return (source_type, local) - - -def process_audio( - source, - title, - event_date, - tags, - category, - speakers, - loc, - model, - username, - local, - test, - pr, - deepgram, - summarize, - diarize, - upload=False, - model_output_dir="local_models/", - working_dir="tmp/", -): - logger = logging.getLogger(__app_name__) - try: - logger.info("audio file detected") - curr_time = str(round(time.time() * 1000)) - - # check if title is supplied if not, return None - if title is None: - logger.error("Error: Please supply a title for the audio file") - return None - # process audio file - summary = None - result = None - if not local: - filename = get_audio_file( - url=source, title=title, working_dir=working_dir - ) - abs_path = os.path.abspath(path=filename) - logger.info(f"filename: {filename}") - logger.info(f"abs_path: {abs_path}") - else: - filename = source.split("/")[-1] - abs_path = os.path.abspath(source) - logger.info(f"processing audio file: {abs_path}") - if filename is None: - logger.info("File not found") - return - if filename.endswith("wav"): - initialize() - abs_path = convert_wav_to_mp3( - abs_path=abs_path, filename=filename, working_dir=working_dir - ) - if test: - result = test - else: - if deepgram or summarize: - deepgram_resp = process_mp3_deepgram( - filename=abs_path, summarize=summarize, diarize=diarize - ) - result = get_deepgram_transcript( - deepgram_data=deepgram_resp, - diarize=diarize, - title=title, - model_output_dir=model_output_dir, - upload=upload, - ) - if summarize: - summary = get_deepgram_summary(deepgram_data=deepgram_resp) - if not deepgram: - result = process_mp3(abs_path, model, upload, model_output_dir) - result = create_transcript(result) - absolute_path = get_md_file_path( - result=result, - loc=loc, - video=source, - title=title, - event_date=event_date, - tags=tags, - category=category, - speakers=speakers, - username=username, - local=local, - video_title=filename[:-4], - test=test, - pr=pr, - summary=summary, - working_dir=working_dir, - ) - - if pr: - create_pr( - absolute_path=absolute_path, - loc=loc, - username=username, - curr_time=curr_time, - title=title, - ) - return absolute_path - except Exception as e: - logger.error("Error processing audio file") - logger.error(e) - - -def process_videos( - source, - title, - event_date, - tags, - category, - speakers, - loc, - model, - username, - chapters, - pr, - deepgram, - summarize, - diarize, - upload=False, - model_output_dir="local_models", - working_dir="tmp/", -): - logger = logging.getLogger(__app_name__) - try: - logger.info("Playlist detected") - if source.startswith("http") or source.startswith("www"): - parsed_url = urlparse(source) - source = parse_qs(parsed_url.query)["list"][0] - url = "https://www.youtube.com/playlist?list=" + source - logger.info(url) - videos = get_playlist_videos(url) - if videos is None: - logger.info("Playlist is empty") - return - - selected_model = model + ".en" - filename = "" - - for video in videos: - filename = process_video( - video=video, - title=title, - event_date=event_date, - tags=tags, - category=category, - speakers=speakers, - loc=loc, - model=selected_model, - username=username, - pr=pr, - chapters=chapters, - test=False, - diarize=diarize, - deepgram=deepgram, - summarize=summarize, - upload=upload, - 
working_dir=working_dir, - model_output_dir=model_output_dir, - ) - if filename is None: - return None - return filename - except Exception as e: - logger.error("Error processing playlist") - logger.error(e) - - def combine_deepgram_with_chapters(deepgram_data, chapters): - logger = logging.getLogger(__app_name__) + logger.info("(deepgram) Combining transcript with detected chapters...") try: chapters_pointer = 0 words_pointer = 0 @@ -801,240 +295,6 @@ def combine_deepgram_with_chapters(deepgram_data, chapters): logger.error(e) -def process_video( - video, - title, - event_date, - tags, - category, - speakers, - loc, - model, - username, - chapters, - test, - pr, - local=False, - deepgram=False, - summarize=False, - diarize=False, - upload=False, - model_output_dir="local_models", - working_dir="tmp/", -): - logger = logging.getLogger(__app_name__) - try: - curr_time = str(round(time.time() * 1000)) - if not local: - if "watch?v=" in video: - parsed_url = urlparse(video) - video = parse_qs(parsed_url.query)["v"][0] - elif "youtu.be" in video or "embed" in video: - video = video.split("/")[-1] - video = "https://www.youtube.com/watch?v=" + video - logger.info("Transcribing video: " + video) - if event_date is None: - event_date = get_date(video) - abs_path = download_video(url=video, working_dir=working_dir) - if abs_path is None: - logger.info("File not found") - return None - filename = abs_path.split("/")[-1] - else: - filename = video.split("/")[-1] - logger.info("Transcribing video: " + filename) - abs_path = os.path.abspath(video) - - if not title: - title = filename[:-4] - initialize() - summary = None - result = "" - deepgram_data = None - if chapters and not test: - chapters = read_description(working_dir) - elif test: - chapters = read_description("test/testAssets/") - mp3_path = convert_video_to_mp3(abs_path, working_dir) - if deepgram or summarize: - deepgram_data = process_mp3_deepgram( - filename=mp3_path, summarize=summarize, diarize=diarize - ) - result = get_deepgram_transcript( - deepgram_data=deepgram_data, - diarize=diarize, - title=title, - model_output_dir=model_output_dir, - upload=upload, - ) - if summarize: - logger.info("Summarizing") - summary = get_deepgram_summary(deepgram_data=deepgram_data) - if not deepgram: - result = process_mp3(mp3_path, model, upload, model_output_dir) - if chapters and len(chapters) > 0: - logger.info("Chapters detected") - write_chapters_file( - os.path.join(working_dir, filename[:-4] + ".chapters"), chapters - ) - if deepgram: - if diarize: - result = combine_deepgram_chapters_with_diarization( - deepgram_data=deepgram_data, chapters=chapters - ) - else: - result = combine_deepgram_with_chapters( - deepgram_data=deepgram_data, chapters=chapters - ) - else: - result = combine_chapter( - chapters=chapters, - transcript=result, - working_dir=working_dir, - ) - else: - if not test and not deepgram: - result = create_transcript(result) - elif not deepgram: - result = "" - logger.info("Creating markdown file") - absolute_path = get_md_file_path( - result=result, - loc=loc, - video=video, - title=title, - event_date=event_date, - tags=tags, - summary=summary, - category=category, - speakers=speakers, - username=username, - video_title=filename[:-4], - local=local, - pr=pr, - test=test, - working_dir=working_dir, - ) - if not test: - if pr: - create_pr( - absolute_path=absolute_path, - loc=loc, - username=username, - curr_time=curr_time, - title=title, - ) - return absolute_path - except Exception as e: - logger.error("Error processing video") 
- logger.error(e) - - - -def process_source( - source, - title, - event_date, - tags, - category, - speakers, - loc, - model, - username, - source_type, - chapters, - local, - test=None, - pr=False, - deepgram=False, - summarize=False, - diarize=False, - upload=False, - model_output_dir=None, - verbose=False, -): - tmp_dir = tempfile.mkdtemp() - model_output_dir = ( - "local_models/" if model_output_dir is None else model_output_dir - ) - - try: - if source_type == "audio": - filename = process_audio( - source=source, - title=title, - event_date=event_date, - tags=tags, - category=category, - speakers=speakers, - loc=loc, - model=model, - username=username, - summarize=summarize, - local=local, - test=test, - pr=pr, - deepgram=deepgram, - diarize=diarize, - upload=upload, - model_output_dir=model_output_dir, - working_dir=tmp_dir, - ) - elif source_type == "playlist": - filename = process_videos( - source=source, - title=title, - event_date=event_date, - tags=tags, - category=category, - speakers=speakers, - loc=loc, - model=model, - username=username, - summarize=summarize, - chapters=chapters, - pr=pr, - deepgram=deepgram, - diarize=diarize, - upload=upload, - model_output_dir=model_output_dir, - working_dir=tmp_dir, - ) - elif source_type == "video": - filename = process_video( - video=source, - title=title, - event_date=event_date, - summarize=summarize, - tags=tags, - category=category, - speakers=speakers, - loc=loc, - model=model, - username=username, - local=local, - diarize=diarize, - chapters=chapters, - test=test, - pr=pr, - deepgram=deepgram, - upload=upload, - model_output_dir=model_output_dir, - working_dir=tmp_dir, - ) - else: - raise Exception(f"{source_type} is not a valid source type") - return filename, tmp_dir - except Exception as e: - logger.error("Error processing source") - logger.error(e) - - -def get_date(url): - video = pytube.YouTube(url) - return str(video.publish_date).split(" ")[0] - - def clean_up(tmp_dir): try: shutil.rmtree(tmp_dir) @@ -1043,32 +303,14 @@ def clean_up(tmp_dir): raise -def save_local_json(json_data, title, model_output_dir): - logger = logging.getLogger(__app_name__) - logger.info(f"Saving Locally...") - time_in_str = datetime.now().strftime("%Y-%m-%d-%H-%M-%S") - if not os.path.isdir(model_output_dir): - os.makedirs(model_output_dir) - file_path = os.path.join( - model_output_dir, title + "_" + time_in_str + ".json" - ) - with open(file_path, "w") as json_file: - json.dump(json_data, json_file, indent=4) - logger.info(f"Model stored at path {file_path}") - return file_path - - def generate_srt(data, filename, model_output_dir): - logger = logging.getLogger(__app_name__) - logger.info("Saving Locally...") time_in_str = datetime.now().strftime("%Y-%m-%d-%H-%M-%S") - base_filename, _ = os.path.splitext(filename) if not os.path.isdir(model_output_dir): os.makedirs(model_output_dir) output_file = os.path.join( - model_output_dir, base_filename + "_" + time_in_str + ".srt" + model_output_dir, filename + "_" + time_in_str + ".srt" ) - logger.debug(f"writing srt to {output_file}") + logger.info(f"Writing srt to {output_file}...") with open(output_file, "w") as f: for index, segment in enumerate(data): start_time, end_time, text = segment @@ -1099,49 +341,3 @@ def upload_file_to_s3(file_path): logger.info(f"File uploaded to S3 bucket : {bucket}") except Exception as e: logger.error(f"Error uploading file to S3 bucket: {e}") - - -def generate_payload( - loc, - title, - event_date, - tags, - category, - speakers, - username, - media, - transcript, - 
test, -): - logger = logging.getLogger(__app_name__) - try: - event_date = ( - event_date - if event_date is None - else event_date - if type(event_date) is str - else event_date.strftime("%Y-%m-%d") - ) - data = { - "title": title, - "transcript_by": f"{username} via TBTBTC v{__version__}", - "categories": category, - "tags": tags, - "speakers": speakers, - "date": event_date, - "media": media, - "loc": loc, - "body": transcript, - } - content = {"content": data} - if test: - return content - else: - config = dotenv_values(".env") - url = config["QUEUE_ENDPOINT"] + "/api/transcripts" - resp = requests.post(url, json=content) - if resp.status_code == 200: - logger.info("Transcript added to queue") - return resp - except Exception as e: - logger.error(e) diff --git a/app/logging.py b/app/logging.py new file mode 100644 index 0000000..c4cb077 --- /dev/null +++ b/app/logging.py @@ -0,0 +1,28 @@ +import logging +from pathlib import Path +import sys + +from app import __app_name__ + + +def configure_logger(log_level, working_dir=None): + logger = get_logger() + sh = logging.StreamHandler() + sh_log_fmt = '%(asctime)s [%(levelname)s] %(message)s' + sh.setLevel(log_level) + sh.setFormatter(logging.Formatter(sh_log_fmt)) + + # Always log debug out to a file in the workdir + if working_dir is not None: + filehandler = logging.FileHandler(Path(working_dir) / "tstbtc.log") + filehandler.setLevel(logging.DEBUG) + file_log_fmt = '%(asctime)s %(name)s [%(levelname)s] %(message)s' + filehandler.setFormatter(logging.Formatter(file_log_fmt)) + logger.addHandler(filehandler) + + logger.addHandler(sh) + logger.setLevel(logging.DEBUG) + + +def get_logger(): + return logging.getLogger(__app_name__) diff --git a/app/transcript.py b/app/transcript.py new file mode 100644 index 0000000..3cd8cd0 --- /dev/null +++ b/app/transcript.py @@ -0,0 +1,433 @@ +import json +import logging +import os +import shutil +import tempfile +from datetime import datetime +from urllib.parse import parse_qs, urlparse + +import pytube +import requests +import static_ffmpeg +import whisper +import yt_dlp +from clint.textui import progress +from moviepy.editor import VideoFileClip + +from app import __app_name__, __version__, application +from app.logging import get_logger +from app.utils import slugify + +logger = get_logger() + + +class Transcript: + def __init__(self, source, test_mode=False): + self.source = source + self.test_mode = test_mode + self.logger = get_logger() + + def create_transcript(self): + result = "" + for x in self.result: + result = result + x[2] + " " + + return result + + def process_source(self, tmp_dir=None): + tmp_dir = tmp_dir if tmp_dir is not None else tempfile.mkdtemp() + self.audio_file = self.source.process(tmp_dir) + self.title = self.source.title if self.source.title else os.path.basename( + self.audio_file)[:-4] + return self.audio_file, tmp_dir + + def transcribe(self, working_dir, generate_chapters, summarize_transcript, service, diarize, upload, model_output_dir, test_transcript=None): + + def process_mp3(): + """using whisper""" + self.logger.info("Transcribing audio to text using whisper ...") + try: + my_model = whisper.load_model(service) + result = my_model.transcribe(self.audio_file) + data = [] + for x in result["segments"]: + data.append(tuple((x["start"], x["end"], x["text"]))) + data_path = application.generate_srt( + data, self.title, model_output_dir) + if upload: + application.upload_file_to_s3(data_path) + return data + except Exception as e: + self.logger.error( + 
f"(wisper,{service}) Error transcribing audio to text: {e}") + return + + def write_chapters_file(): + """Write out the chapter file based on simple MP4 format (OGM)""" + try: + if generate_chapters and len(self.source.chapters) > 0: + self.logger.info("Chapters detected") + chapters_file = os.path.join(working_dir, os.path.basename( + self.audio_file)[:-4] + ".chapters") + + with open(chapters_file, "w") as fo: + for current_chapter in self.source.chapters: + fo.write( + f"CHAPTER{current_chapter[0]}=" + f"{current_chapter[1]}\n" + f"CHAPTER{current_chapter[0]}NAME=" + f"{current_chapter[2]}\n" + ) + fo.close() + return True + else: + return False + except Exception as e: + raise Exception(f"Error writing chapters file: {e}") + + try: + self.summary = None + if self.test_mode: + self.result = test_transcript if test_transcript is not None else "test-mode" + return self.result + if not self.audio_file: + # TODO give audio file path as argument + raise Exception( + "audio file is missing, you need to process_source() first") + + has_chapters = len(self.source.chapters) > 0 + self.result = None + if service == "deepgram" or summarize_transcript: + deepgram_resp = application.process_mp3_deepgram( + self.audio_file, summarize_transcript, diarize) + self.result = application.get_deepgram_transcript( + deepgram_resp, diarize, self.title, upload, model_output_dir) + + if summarize_transcript: + self.summary = application.get_deepgram_summary( + deepgram_resp) + + if service == "deepgram" and has_chapters: + if diarize: + self.result = application.combine_deepgram_chapters_with_diarization( + deepgram_data=deepgram_resp, chapters=self.source.chapters + ) + else: + self.result = application.combine_deepgram_with_chapters( + deepgram_data=deepgram_resp, chapters=self.source.chapters + ) + + if not service == "deepgram": + # whisper + self.result = process_mp3() + if has_chapters: + # this is only available for videos, for now + self.result = application.combine_chapter( + chapters=self.source.chapters, + transcript=self.result, + working_dir=working_dir + ) + else: + # finalize transcript + self.result = self.create_transcript() + + return self.result + + except Exception as e: + raise Exception(f"Error while transcribing audio source: {e}") + + def write_to_file(self, working_dir, transcript_by): + """Writes transcript to a markdown file and returns its absolute path + This file is submitted as part of the Pull Request to the + bitcointranscripts repo + """ + + def process_metadata(key, value): + if value: + value = value.strip() + value = [item.strip() for item in value.split(",")] + return f"{key}: {value}\n" + return "" + + self.logger.info("Creating markdown file with transcription...") + try: + # Add metadata prefix + meta_data = ( + "---\n" + f"title: {self.title}\n" + f"transcript_by: {transcript_by} via TBTBTC v{__version__}\n" + ) + if not self.source.local: + meta_data += f"media: {self.source.source_file}\n" + meta_data += process_metadata("tags", self.source.tags) + meta_data += process_metadata("speakers", self.source.speakers) + meta_data += process_metadata("categories", + self.source.category) + if self.summary: + meta_data += f"summary: {self.summary}\n" + if self.source.event_date: + meta_data += f"date: {self.source.event_date}\n" + meta_data += "---\n" + # Write to file + output_file = os.path.join( + working_dir, f"{slugify(self.title)}.md") + with open(output_file, "a") as opf: + opf.write(meta_data + "\n") + opf.write(self.result + "\n") + opf.close() + 
self.logger.info(f"Markdown file stored at: {output_file}") + return os.path.abspath(output_file) + except Exception as e: + self.logger.error(f"Error writing to file: {e}") + + def __str__(self): + excluded_fields = ['test_mode', 'logger'] + fields = {key: value for key, value in self.__dict__.items() + if key not in excluded_fields} + fields['source'] = str(self.source) + return f"Transcript:{str(fields)}" + + +class Source: + def __init__(self, source_file, local, title, date, tags, category, speakers, preprocess): + # initialize source with arguments + self.save_source(source_file, local, title, date, + tags, category, speakers, preprocess) + self.__config_event_date(date) + self.logger = get_logger() + + def save_source(self, source_file, local, title, date, tags, category, speakers, preprocess): + self.source_file = source_file + self.local = local + self.title = title + self.tags = tags + self.category = category + self.speakers = speakers + self.logger = get_logger() + self.preprocess = preprocess + + def __config_event_date(self, date): + self.event_date = None + if date: + try: + if type(date) is str: + self.event_date = datetime.strptime(date, "%Y-%m-%d").date() + else: + self.event_date = date + except ValueError as e: + raise ValueError(f"Supplied date is invalid: {e}") + return + + def initialize(self): + try: + # FFMPEG installed on first use. + self.logger.debug("Initializing FFMPEG...") + static_ffmpeg.add_paths() + self.logger.debug("Initialized FFMPEG") + except Exception as e: + raise Exception("Error initializing") + + +class Audio(Source): + def __init__(self, source): + try: + # initialize source using a base Source + super().__init__(source.source_file, source.local, source.title, source.event_date, + source.tags, source.category, source.speakers, source.preprocess) + self.type = "audio" + self.__config_source() + except Exception as e: + raise Exception(f"Error during Audio creation: {e}") + + def __config_source(self): + if self.title is None: + raise Exception("Please supply a title for the audio file") + + def process(self, working_dir): + """Process audio""" + + def download_audio(): + """Helper method to download an audio file and return its absolute path""" + # sanity checks + if self.local: + raise Exception(f"{self.source_file} is a local file") + if self.title is None: + raise Exception("Please supply a title for the audio file") + self.logger.info(f"Downloading audio file: {self.source_file}") + try: + audio = requests.get(self.source_file, stream=True) + output_file = os.path.join( + working_dir, f"{slugify(self.title)}.mp3") + with open(output_file, "wb") as f: + total_length = int(audio.headers.get("content-length")) + for chunk in progress.bar( + audio.iter_content(chunk_size=1024), + expected_size=(total_length / 1024) + 1, + ): + if chunk: + f.write(chunk) + f.flush() + return os.path.abspath(output_file) + except Exception as e: + raise Exception(f"Error downloading audio file: {e}") + + try: + self.logger.info(f"Audio processing: '{self.source_file}'") + if not self.local: + # download audio file from the internet + abs_path = download_audio() + self.logger.info(f"Audio file stored in: {abs_path}") + else: + # calculate the absolute path of the local audio file + filename = self.source_file.split("/")[-1] + abs_path = os.path.abspath(self.source_file) + filename = os.path.basename(abs_path) + if filename.endswith("wav"): + self.initialize() + abs_path = application.convert_wav_to_mp3( + abs_path=abs_path, filename=filename, working_dir=working_dir 
+ ) + # return the audio file that is now ready for transcription + return abs_path + + except Exception as e: + raise Exception(f"Error processing audio file: {e}") + + + +class Video(Source): + def __init__(self, source, youtube_metadata=None, chapters=None): + try: + # initialize source using a base Source + super().__init__(source.source_file, source.local, source.title, source.event_date, + source.tags, source.category, source.speakers, source.preprocess) + self.type = "video" + self.youtube_metadata = youtube_metadata + self.chapters = chapters + + if self.youtube_metadata is None: + # importing from json, metadata exist + if not self.local and self.preprocess: + self.download_video_metadata() + except Exception as e: + raise Exception(f"Error during Video creation: {e}") + + def download_video_metadata(self): + self.logger.info(f"Downloading metadata from: {self.source_file}") + ydl_opts = { + 'quiet': True, # Suppress console output + 'extract_flat': True, # Extract only metadata without downloading + } + try: + with yt_dlp.YoutubeDL(ydl_opts) as ydl: + yt_info = ydl.extract_info(self.source_file, download=False) + self.title = yt_info.get('title', 'N/A') + self.youtube_metadata = { + "description": yt_info.get('description', 'N/A'), + "tags": yt_info.get('tags', 'N/A'), + "categories": yt_info.get('categories', 'N/A') + } + self.event_date = datetime.strptime(yt_info.get( + 'upload_date', None), "%Y%m%d").date() if yt_info.get('upload_date', None) else None + # Extract chapters from video's metadata + self.chapters = [] + has_chapters = yt_info.get('chapters', None) + if has_chapters: + for index, x in enumerate(yt_info["chapters"]): + name = x["title"] + start = x["start_time"] + self.chapters.append((str(index), start, str(name))) + except yt_dlp.DownloadError as e: + raise Exception(f"Error with downloading YouTube metadata: {e}") + + def process(self, working_dir): + """Process video""" + + def download_video(): + """Helper method to download a YouTube video and return its absolute path""" + # sanity checks + if self.local: + raise Exception(f"{self.source_file} is a local file") + try: + self.logger.info(f"Downloading video: {self.source_file}") + ydl_opts = { + "format": "18", + "outtmpl": os.path.join(working_dir, "videoFile.%(ext)s"), + "nopart": True, + } + with yt_dlp.YoutubeDL(ydl_opts) as ytdl: + ytdl.download([self.source_file]) + + output_file = os.path.join(working_dir, "videoFile.mp4") + return os.path.abspath(output_file) + except Exception as e: + self.logger.error(e) + raise Exception(f"Error downloading video: {e}") + + def convert_video_to_mp3(video_file): + try: + self.logger.info(f"Converting {video_file} to mp3...") + clip = VideoFileClip(video_file) + output_file = os.path.join( + working_dir, os.path.basename(video_file)[:-4] + ".mp3") + clip.audio.write_audiofile(output_file) + clip.close() + self.logger.info("Video converted to mp3") + return output_file + except Exception as e: + raise Exception(f"Error converting video to mp3: {e}") + + def extract_chapters_from_downloaded_video_metadata(): + try: + list_of_chapters = [] + with open(f"{working_dir}/videoFile.info.json", "r") as f: + info = json.load(f) + if "chapters" not in info: + self.logger.info("No chapters found for downloaded video") + return list_of_chapters + for index, x in enumerate(info["chapters"]): + name = x["title"] + start = x["start_time"] + list_of_chapters.append((str(index), start, str(name))) + + return list_of_chapters + except Exception as e: + self.logger.error( + f"Error 
reading downloaded video's metadata: {e}") + return [] + + try: + self.logger.info(f"Video processing: '{self.source_file}'") + if not self.local: + abs_path = download_video() + if self.chapters is None: + self.chapters = extract_chapters_from_downloaded_video_metadata() + else: + abs_path = os.path.abspath(self.source_file) + + self.initialize() + audio_file = convert_video_to_mp3(abs_path) + return audio_file + + except Exception as e: + raise Exception(f"Error processing video file: {e}") + + +class Playlist(Source): + def __init__(self, source, entries, preprocess=False): + try: + # initialize source using a base Source + super().__init__(source.source_file, source.local, source.title, source.event_date, + source.tags, source.category, source.speakers, source.preprocess) + self.__config_source(entries) + except Exception as e: + raise Exception(f"Error during Playlist creation: {e}") + + def __config_source(self, entries): + self.type = "playlist" + self.videos = [] + for entry in entries: + if entry["title"] != '[Private video]': + source = Video(source=Source(entry["url"], self.local, entry["title"], self.event_date, + self.tags, self.category, self.speakers, self.preprocess)) + self.videos.append(source) diff --git a/app/transcription.py b/app/transcription.py new file mode 100644 index 0000000..47da2a0 --- /dev/null +++ b/app/transcription.py @@ -0,0 +1,249 @@ +import json +import logging +import os +import re +import tempfile +import time +from datetime import datetime + +from dotenv import dotenv_values +import pytube +from pytube.exceptions import PytubeError +import requests +import yt_dlp + +from app.transcript import Transcript, Source, Audio, Video, Playlist +from app import __app_name__, __version__, application +from app.utils import write_to_json +from app.logging import get_logger + + +class Transcription: + def __init__(self, loc="test/test", model="tiny", chapters=False, pr=False, summarize=False, deepgram=False, diarize=False, upload=False, model_output_dir="local_models/", nocleanup=False, queue=True, markdown=False, username=None, test_mode=False, working_dir=None): + self.model = model + self.transcript_by = "username" if test_mode else self.__get_username() + # location in the bitcointranscripts hierarchy + self.loc = loc.strip("/") + self.generate_chapters = chapters + self.open_pr = pr + self.summarize_transcript = summarize + self.service = "deepgram" if deepgram else model + self.diarize = diarize + self.upload = upload + self.model_output_dir = f"{model_output_dir}/{self.loc}" + self.transcripts = [] + self.nocleanup = nocleanup + # during testing we do not have/need a queuer backend + self.queue = queue if not test_mode else False + # during testing we need to create the markdown for validation purposes + self.markdown = markdown or test_mode + self.test_mode = test_mode + self.logger = get_logger() + self.tmp_dir = working_dir if working_dir is not None else tempfile.mkdtemp() + + self.logger.info(f"Temp directory: {self.tmp_dir}") + + def _create_subdirectory(self, subdir_name): + """Helper method to create subdirectories within the central temp director""" + subdir_path = os.path.join(self.tmp_dir, subdir_name) + os.makedirs(subdir_path) + return subdir_path + + def __get_username(self): + try: + if os.path.isfile(".username"): + with open(".username", "r") as f: + username = f.read() + f.close() + else: + print("What is your github username?") + username = input() + with open(".username", "w") as f: + f.write(username) + f.close() + return username + 
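+            # NOTE: the GitHub username is cached in a local ".username" file
+            # so that subsequent runs do not prompt for it again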
except Exception as e: + raise Exception("Error getting username") + + def _initialize_source(self, source: Source, youtube_metadata, chapters): + """Initialize transcription source based on metadata + Returns the initialized source (Audio, Video, Playlist)""" + + def check_if_youtube(source: Source): + """Helper method to check and assign a valid source for + a YouTube playlist or YouTube video by requesting its metadata + Does not support video-ids, only urls""" + try: + ydl_opts = { + 'quiet': False, # Suppress console output + 'extract_flat': True, # Extract only metadata without downloading + } + with yt_dlp.YoutubeDL(ydl_opts) as ydl: + info_dict = ydl.extract_info( + source.source_file, download=False) + if 'entries' in info_dict: + # Playlist URL, not a single video + # source.title = info_dict["title"] + return Playlist(source=source, entries=info_dict["entries"]) + elif 'title' in info_dict: + # Single video URL + return Video(source=source) + else: + raise Exception(source.source_file) + + except Exception as e: + # Invalid URL or video not found + raise Exception(f"Invalid source: {e}") + try: + if source.source_file.endswith(".mp3") or source.source_file.endswith(".wav"): + return Audio(source=source) + + if youtube_metadata is not None: + # we have youtube metadata, this can only be true for videos + source.preprocess = False + return Video(source=source, youtube_metadata=youtube_metadata, chapters=chapters) + if source.source_file.endswith(".mp4"): + # regular remote video, not youtube + source.preprocess = False + return Video(source=source) + youtube_source = check_if_youtube(source) + if youtube_source == "unknown": + raise Exception(f"Invalid source: {source}") + return youtube_source + except Exception as e: + raise Exception(f"Error from assigning source: {e}") + + def add_transcription_source(self, source_file, title=None, date=None, tags=[], category=[], speakers=[], preprocess=True, youtube_metadata=None, chapters=None): + """Add a source for transcription""" + transcription_sources = {"added": [], "exist": []} + # check if source is a local file + local = False + if os.path.isfile(source_file): + local = True + # initialize source + source = self._initialize_source( + source=Source(source_file, local, title, date, + tags, category, speakers, preprocess), + youtube_metadata=youtube_metadata, + chapters=chapters) + self.logger.info(f"Detected source: {source}") + if source.type == "playlist": + # add a transcript for each source/video in the playlist + for video in source.videos: + transcription_sources['added'].append(video) + self.transcripts.append(Transcript(video, self.test_mode)) + elif source.type in ['audio', 'video']: + transcription_sources['added'].append(source) + self.transcripts.append(Transcript(source, self.test_mode)) + else: + raise Exception(f"Invalid source: {source_file}") + return transcription_sources + + def push_to_queue(self, transcript: Transcript, payload=None): + """Push the resulting transcript to a Queuer backend""" + def construct_payload(): + """Helper method to construct the payload for the request to the Queuer backend""" + payload = { + "content": { + "title": transcript.title, + "transcript_by": f"{self.transcript_by} via TBTBTC v{__version__}", + "categories": transcript.source.category, + "tags": transcript.source.tags, + "speakers": transcript.source.speakers, + "loc": self.loc, + "body": transcript.result, + } + } + # Handle optional metadata fields + if transcript.source.event_date: + payload["content"]["date"] = 
transcript.source.event_date if type( + transcript.source.event_date) is str else transcript.source.event_date.strftime("%Y-%m-%d") + if not transcript.source.local: + payload["content"]["media"] = transcript.source.source_file + return payload + + try: + if payload is None: + # No payload has been given directly + payload = construct_payload() + # Check if the user opt-out from sending the payload to the Queuer + if not self.queue: + # payload will not be send to the Queuer backend + if self.test_mode: + # queuer is disabled by default when testing but we still + # return the payload to be used for testing purposes + return payload + else: + # store payload in case the user wants to manually send it to the queuer + payload_json_file = write_to_json( + payload, self.model_output_dir, f"{transcript.title}_payload") + self.logger.info( + f"Transcript not added to the queue, payload stored at: {payload_json_file}") + return payload_json_file + # Push the payload with the resulting transcript to the Queuer backend + config = dotenv_values(".env") + if "QUEUE_ENDPOINT" not in config: + raise Exception( + "To push to a queue you need to define a 'QUEUE_ENDPOINT' in your .env file") + if "BEARER_TOKEN" not in config: + raise Exception( + "To push to a queue you need to define a 'BEARER_TOKEN' in your .env file") + url = config["QUEUE_ENDPOINT"] + "/api/transcripts" + headers = { + 'Authorization': f'Bearer {config["BEARER_TOKEN"]}', + 'Content-Type': 'application/json' + } + response = requests.post(url, json=payload, headers=headers) + if response.status_code == 200: + self.logger.info( + f"Transcript added to queue with id={response.json()['id']}") + else: + self.logger.error( + f"Transcript not added to queue: ({response.status_code}) {response.text}") + return response + except Exception as e: + self.logger.error(f"Transcript not added to queue: {e}") + + def start(self, test_transcript=None): + self.result = [] + try: + for transcript in self.transcripts: + self.logger.info( + f"Processing source: {transcript.source.source_file}") + tmp_dir = self._create_subdirectory( + f"transcript{len(self.result) + 1}") + transcript.process_source(tmp_dir) + result = transcript.transcribe( + tmp_dir, + self.generate_chapters, + self.summarize_transcript, + self.service, + self.diarize, + self.upload, + self.model_output_dir, + test_transcript=test_transcript + ) + if self.markdown: + transcription_md_file = transcript.write_to_file( + self.model_output_dir if not self.test_mode else tmp_dir, + self.transcript_by) + self.result.append(transcription_md_file) + else: + self.result.append(result) + if self.open_pr: + application.create_pr( + absolute_path=transcription_md_file, + loc=self.loc, + username=self.transcript_by, + curr_time=str(round(time.time() * 1000)), + title=transcript.title, + ) + else: + self.push_to_queue(transcript) + return self.result + except Exception as e: + raise Exception(f"Error with the transcription: {e}") from e + + def clean_up(self): + self.logger.info("Cleaning up...") + application.clean_up(self.tmp_dir) diff --git a/app/utils.py b/app/utils.py new file mode 100644 index 0000000..87a19ec --- /dev/null +++ b/app/utils.py @@ -0,0 +1,19 @@ +import json +import os +import re +from datetime import datetime + +def slugify(text): + return re.sub(r'\W+', '-', text).strip('-').lower() + + +def write_to_json(json_data, output_dir, filename, add_timestamp=True): + if not os.path.isdir(output_dir): + os.makedirs(output_dir) + time_in_str = 
f'_{datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}' if add_timestamp else "" + file_path = os.path.join( + output_dir, f"{slugify(filename)}{time_in_str}.json" + ) + with open(file_path, "w") as json_file: + json.dump(json_data, json_file, indent=4) + return file_path diff --git a/test/testAssets/payload.json b/test/testAssets/payload.json index be39d68..c008213 100644 --- a/test/testAssets/payload.json +++ b/test/testAssets/payload.json @@ -3,10 +3,9 @@ "title": "test_title", "transcript_by": "username via TBTBTC v1.0.0", "categories": ["category1", "category2"], - "tags": ["tag1", "tag2"], + "tags": [], "speakers": ["speaker1", "speaker2"], "date": "2020-01-31", - "media": "test/testAssets/test_video.mp4", "loc": "yada/yada", "body": "Welcome to the Jankoid podcast. I'm here with merch hi there. Today we're gonna jump into the temple and That's a pun if you didn't get it. Welcome to Jankoid decoded the temple The temple an area you are more than familiar with yeah a Nampool whispery in the call. Yeah, let's maybe start with what's the relationship between the Mimpool and fees? We often talk about the Mimpool, but there is no such thing as a global Mimpool Every full-known has its own Mimpool and the Mimpool is basically just the queue of transactions waiting to get Confred where Confred means included in a block. So by default Block template builders will just sort the waiting transactions by the highest effective fee rate Then pick from the top the juicier transaction the quicker gets confirmed now Especially in the last few months we've seen that there was a very large queues because we had a huge run up in the price I haven't checked but I think it's now about a hundred and twenty days that We haven't cleared the Mimpool maybe a hundred and ten and since 15th of December So Mimpools are limited and By default they are limited to 300 megabytes of De-serialized data So that includes all day overhead structure the previous U-tix O's maybe even the whole transaction that created U-tix O's and so forth So roughly at about 80 blocks worth of data the default of 300 megabyte gets exceeded and at that point a full node will automatically start Perching the lowest fee rate transactions data stop them and tell all their neighboring peers Hey, don't send me anything under this period. They they start raising up their min fee rate So the problem that gets introduced here is if a parent transaction is no longer in the Mimpool You cannot bump it because if you try to do a CPP and the pair doesn't there the child is gonna be invalid CFF just for the initiated child place for parent Some things that are being done in the context of that is that people are working on Package relay where you can send more than one transaction to appear as a package that they evaluate as a whole together Instead of looking at the parent and saying okay you're out and this child doesn't have a parent Okay, you're out to And maybe you can just talk a little bit more about the mechanics of how CPS fee actually works to get into a block You bid on block space transactions get serialized in an apartment Where inputs are fair we big outputs are smaller there's a little bit of a transaction header that includes like how many inputs there output there and Lock time inversion So we already found out that when miners build blocks days Sort transactions by the highest period so they first considered the transactions that paid the most set-touchies Per byte of serialized data. 
So what's the mechanic where the mechanics of CPS fee when you try to Get a transaction through sometimes they have a Firit that is to low for it to be considered quickly and you can reprioritize your transaction by Increasing its effective Firit now you cannot edit a transaction after you submitted it to the network because the Transaction itself is immutable But what you can do is you can spend one of their outputs of the transactions with a Another child transaction that has a very high fee and Now the child transaction can only be valid by the parent getting included in the block So miners will look at transaction packages actually they sort the weight list by the M sister fee rate of transactions not just by transactions in the singular So when you have a child that is super juicy it basically pays for the parent to get included at low as well So literally tell pace for parent got every parents dream to have their children pay for You said that when miners evaluate these Fee rates is that built in the Bitcoin core are they writing custom software for that Bitcoin core has a get black template corn which allows you to exactly do that just generate a black template But I believe that most miners are probably running custom code because for example They accept out of band payments to reprioritize transactions or they run their own wallet service on this side and always prioritize their own transactions or They might have some sort of other solver that optimizes block template building further So I think that I haven't looked at this in detail, but I think that at least they're not running default values because By default blocks created by Bitcoin core would leave a little space. I think about six kilo bytes and blocks are full if you look at them. So they must have at least treated a little bit and we're not when we say miners We're talking about pools. Yes, right so most miners as in the People running a six or whatever They just join our pool who does the coordination of the work and They basically the pool operator picks the block template that is being worked on and the miner just gets a separate workspace that they iterate over in order to try to fund the This problem sounds hard. Why is it hard to estimate periods? So block discovery is a random process think of like Decay of radio active isotopes What we do there is we can give you a half time It usually takes around this much of time for half of the atoms to Disappade But we can't tell you if we look at a single atom when it's actually gonna Disappade it might be immediately it might be at the half time it might take decades Right with blocks that's the same thing there in average coming in at I think about 9.7 minutes But when the next block is gonna be found is up to this random press on process Actually it is such that since there is no memory to the process It's every draw just has a chance to succeed at every point in time The next block is about 10 minutes away in the average. 
Yeah, it's really intuitive to think about that Right if even if you're 18 minutes into not finding a block the next block will be found in 10 minutes Yes, exactly you don't know when the next block is gonna be found So you don't know what transactions you will be competing against you might be competing against the transactions that I Translate in the man pool plus the transactions that get added in the next one minute You might be competing against the transactions in the man pool plus 10 minutes or plus 60 minutes Because about once a day There's a block that takes 60 minutes really you have this one shot to pick exactly the right view To slide in at the bottom of the block that you want to be in because if you don't slide in at the bottom of the block You're overpay and if you underestimate you're not gonna get confirmed in the time that you were aiming to be confident And so how do exchanges usually do this are they overpaying? Are they just estimating the the upper end? Maybe like who's paying those fees? Right, there's different scenarios some exchanges have different tiers like low-time preference and high-time preference or whatever and they treat those differently But generally most exchanges by now batch their withdrawals Which gives them a way to leverage their scale So if you're sending to 20 people every minute Making one transaction out of that is a lot cheaper than making 20 separate payments It's also much easier to manage your due-to-to-pull that way and And Then they just tend to very conservatively estimate their fees just Be in the next two blocks and maybe rather overpay slightly because it's so much less work To deal with all the customer compliance over step-transactions than to to pay like sure we're overpaying by 30% to be in the next block But it's not them that's overpaying Is they usually that gives passed on the customer? There's different models. I think in most actually the exchange pays But they take a flat fee for a withdraw or really yeah, so like it's time for a very long time for example I'd like I think a 90 cent 90 Euro cent flat restraw fee But then they'd bet every few minutes only you said that the member who hasn't really been empty for almost four months Yeah, that's correct. Is the ever gonna empty again as we go to the moon does the what happens to the man pool? Yeah, that's a great question. 
I think we'll eventually see a man pool empty again But there should probably be a long tail end to it emptying Because now in this for months a lot of the exchanges that usually would do consolidations to keep their you take so pool sized Manageable they haven't been able to get any of those through so when the fee rates go down now I think that we'll see more people put in their consolidation transactions had like three to five such as per bite And that I think we might not see an empty man pool for multiple months So even if the top fee rates get a lot more relaxed now Generally the competition to be in blocks seems to correlate with volatility and especially price rises when when the market Heats up and and people are more excited to trade There's more transaction volume on the network and Now we've seen in the past four weeks or so the price has been going more sideways There might have been even a small dips here and there and the top fee rates have come down On the on the weekends that's dropped first to seven set of sheet per bite then six and now last weekend Six was clear completely I don't think that getting a one set of super by transaction a true will be possible at any time soon But it'll be very possible to wait to the weekend to get a ten set of super by transactions Maybe from like a more met-of-you know the miners like this don't they like having high fees because One is revenue for them but also As we sort of zoom out we think about the decreasing block reward over time Don't we have to have a high fee environment in order for this this is the work under one hand You have to also consider that the exchange rate 10x in the last year So the same fee rates represent a 10x purchasing value in cost for Getting a sense to the same service a transaction into a block so while the fee rates are similar The cost of getting a transaction through has actually increased there miners do love it because I think he rates make about 17% or so of the block reward right now So I'm not sure yeah, that's that's a nice little tip right But there's definitely a concern that when we continue to reduce the blocks subsidy in the every four year having rewards schedule that eventually the system will have to subside just transaction fees and if the transaction fees are to low it will Basically not be Economic for miners to provide security to the bit-ten system so there's a good argument for not Increasing the block space To our degree where it's always gonna be empty if you want to do that you essentially have to Also switch to an endless block subsidy otherwise there is no economic incentive for miners to continue mining if there's Not enough fees unless unless your minimum fee rate at some point becomes So valuable that even at minimum fee rate any transactions are Some sort of sufficient revenue for miners to continue their business Maybe we can sort of circle back to what happens when transactions are elected from the mumpul and to talk about like what problems Like it introduced especially for fee bumping and and lightning channel closing right When a mumpul fills up as we said earlier the node will start dropping the lowest fee rate transactions And especially for people or services that use Unconfirmed inputs that can be a problem at times because you cannot Spend an input that is unknown to other nodes Right, so if all other nodes on a network have dropped a transaction Your polar option that spends the output from that drop transaction will not be able to relay on in that work So you 
Cannot only not spend your hands, but you can also not Repair or ties the prior transaction One thing that this solves basically is RBF because you can just rewrite a replaceable transaction and submit a transaction with a higher fee rate All right, so we went over CPSP can we go over our BS? Sure So dip one 25 introduces rules by which you are allowed to replace transactions You have to explicitly signal data transaction is replaceable and In that case before a transaction is confirmed the sender may issue an updated version of the transaction Which can completely change the outputs the only restriction is that it has to use one of the same inputs Otherwise it wouldn't be a replacement And wouldn't be so it has to be a conflicting transaction essentially and Additionally, it has to pay enough fees to replace the prior transaction and all the transactions that changed off of them In the mimpul so if you had like three transactions you have to pay more fees and the replacement than those three transactions together All right, so blast double spending It's over site. I do not like to term double spending in that context So the problem with that is a successful double spend means that Either you actually got two transactions that were in conflict confirmed Which could basically only happen if you have two competing blocks where one Block had a prior version and a second block had a netty and then the second block eventually becomes part of the best chain Or when you at least convince somebody that they had been paid But then actually managed to spend the funds somewhere else But here in this case are the f transactions are Explicitly labeling themselves as replaceable basically they're running around with a red lettered sign on front of their chest Do not trust me, right and most wallets are for just doesn't show your rb f transactions until they are confirmed Once confirmed in the blockchain they're exactly the same and same reliability as any other transaction But while tuing they are explicitly saying look, I could be replaced do not consider yourself paid So calling this a double spend is really just saying that well Somebody made extremely unreasonable assumptions about the reliability of a transaction that explicitly warned them that It's not reliable so I like conflicting transactions more in this context and maybe why do we need two ways to bump fees? 
Why do we need RBF and CPFB right so they have slightly different traders CPFP allows any recipient of a transaction to bump it Right that could be a recipient in the sense that the person that got paid or The sender if there was a change I'd put on the transaction it also doesn't change the tx ID because you're just training Advait transactions on it and it it takes more blocks this right because you now have to send a second transaction in order to Increase the effective fee rate of the first so more blocks this Easier to keep track off and more flexibility as in there's more parties than can interact with it RBF on the other hand allows you to completely replace the transaction Which means that it is more flexible But you potentially have to pay more fee use especially if somebody else changed off of your transaction already It changes the tx ID and a lot of wallets and services have been tracking payments by the tx ID Rather than looking at like what address got paid what the amount was or Whatever as in like treating the current addresses as in voices as they should be used they built a whole system around tx IDs so our BF transactions that They changed the inputs are outputs right otherwise they couldn't change the fees and that means that they have a new tx ID And it it is not trivial to keep proper track of that and to update your UX and UI to make that Easily accessible to your users right then also only the sender can run per transaction Think that because they have to reissue the updated variant of the transaction Given that it is a little more difficult to interact with our BF transactions a lot of services Only see them once they're confirmed once they're reliable so if you're trying to get a service To give you something very quickly You might want to choose to not do an RBS transaction the first place though that they can Reasonably assume oh this has a high enough fee read and we know the user We can trust them that they're sending us these three dollars actually and give them Existed it so I don't know what I thought it okay So we asked the question like what problems do does an employee eviction Cause for fee bumping and and also maybe the lightning Channel closing use case we talked a bit generically about how parent transactions being gone Staps you from being able to spend those unconfermed output But does this especially a problem in the context of lightning because when you close a lightning channel It's either the collaborative case where you have no problem because you can really go see it the closing transaction with your partner But where you really needed you're trying to unilatory close because your partner has Channel partner hasn't shown up in all then if you you have to fall back to the transaction that you had negotiated Sometime in the past when you last updated the commitment transaction So let's say that was in a low fee read time and now the fee read suffix loaded and you can't actually even Broadcast the commitment transaction to the network because it's too low fee read Now the problem is the parted that is closing the lightning channel under the LN penalty update mechanism their funds are actually locked with a csv So they can't do cp because The output is only spendable after the transaction is confirmed for multiple blocks So you can't obtain a transaction to a output that is not spendable while it's still in the income Especially for enlightening this introduces the volatility in the the block space martin introduces a headache because You can 
literally come into the situation where you can close your lightning channel due to the fee read So one approach I've heard about is to introduce anchor outputs Which are depending on the proposal either spendable by either side or spendable on certain conditions But they immediately spendable so they can be cp of peat or Another idea is to have package relay right because if the Channel closing transaction has a low fee read and you can then relay it together with a second transaction That'll work except if you're in the naturally closing because the csv issue still Pretends to that but either way if you get package relay you would be able to do away with the the estimate and Commitment transactions altogether because we talked about how fee estimation is hard for regular transactions The estimation for a commitment transactions is even much harder because you have no clue when you will want to use the transaction Yeah, that depends if they seem Very scary right you you have absolutely no clue what the fee rates will be like when when you actually try to use it So having Packetry lay would in combination with anchor outputs would allow you to always have a zero fee on the commitment transaction And then basically always bring your own fee when you broadcast it in the Cp P P Touch Transaction got it okay, so we sort of talked about specific but maybe we can zoom out and you know What are some ways that we could be using our box based more efficiently? What are some things that make us optimistic about the future? We still have Only about I think 40% or so Segwood inputs now about 50% percent of all transactions use Segwood inputs, but the majority of inputs is still non Segwood Once more people start using Segwood or even tap root once tap root comes out The input sizes will be smaller. So naturally there will be more space for more transactions So recently a major service wall service provider and How I'm first of April nonetheless that they would be switching to native Segwood addresses and they they had been a long holdout So blockchain.com has Probably around 33% of all transactions creations among their use of this yeah, I mean that dependency is We're shaking our heads simultaneously Great Segwood activated on 24th of August in 2017 Right, that's three and a half years ago and until recently I think they they weren't even able to send to native Segwood addresses and now they announced that not only They'll actually default native Segwood addresses altogether I think they claimed this month, but I'm hoping that they'll come true with that because we have a huge backlock of all use outputs that they created over the years It has been one of the most popular bit carnwallas for yeah almost a decade And it will take forever for all of these non Segwood outputs to eventually get spent But the observation is that most Inputs are Consuming just very young outputs so funds that got moved are much more likely to move again soon so Seeing that Chanda.com will hopefully switch to native Segwood output soon I would assume that even while the U.T.s O set will have a lot of non Segwood outputs living there for a very long time The transactions that get built very well much quicker become Segwood transactions to a high degree If 33% of all transactions let's say 80% of them become Segwood inputs and Literally more than half their input sides that would be I want to say like 15% of the current blocks based demand Going away overnight. Yeah, yeah That's right. 
I think they calculate more more certainly, but other other holdouts Bit max recently switched to native Segwood I think for deposits there's still Quite a few services that use rap Segwood rather than native Segwood which already gets most of the efficiency but clearly not all it was actually expecting that the high fee rates might get more people moving I think that the tap root roll out might get a huge Block space efficiency gain because Tap root introduces a bunch of new features that are only available fruit Tap root and tap root outputs and inputs are about the size of pay-to-avitness public key hash in total So smaller than a lot of the multi-state constructions these days even in native Segwood and Definitely smaller than everything non Segwood so any any wallet that switched to tap root roll Bring down the blocks based use a lot quickly. Yeah, the multi-state savings are pretty significant yeah And local little bring in a new era of Multi-state being more standard. I think that'd be that's the system work setting thing it'll take quite some time because to do the Public key aggregation that will bring the biggest efficiency gain people will actually have to implement new Segwood or another aggregation algorithm and until that gets into regular wallets Will be a while. I think maybe the first it gets into libraries and and Especially for services with multi Segwood wallets There would be a huge efficiency gain there and they they should have Great incentives to roll it out very quickly Great Thanks for listening to another episode of chain code decoded and we're gonna keep it rolling we'll have another one next week Yeah, let's talk about maybe how the blockchain works go back to basics in a time\n" } diff --git a/test/test_audio.py b/test/test_audio.py index 28aaaed..27721bd 100644 --- a/test/test_audio.py +++ b/test/test_audio.py @@ -4,6 +4,7 @@ import pytest from app import application +from app.transcription import Transcription def rel_path(path): @@ -67,32 +68,21 @@ def test_audio_with_title(): source = rel_path("testAssets/audio.mp3") title = "title" username = "username" - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=None, - tags=None, - category=None, - speakers=None, - loc=rel_path("yada/yada"), - model="tiny", + transcription = Transcription( username=username, - source_type="audio", - local=True, - test=result, - chapters=False, - pr=False, + test_mode=True, ) - filename = os.path.join(tmp_dir, filename) - assert os.path.isfile(filename) + transcription.add_transcription_source(source, title) + transcripts = transcription.start() + assert os.path.isfile(transcripts[0]) assert check_md_file( - path=filename, + path=transcripts[0], transcript_by=username, media=source, title=title, local=True, ) - application.clean_up(tmp_dir) + transcription.clean_up() @pytest.mark.feature @@ -102,29 +92,14 @@ def test_audio_without_title(): file.close() source = rel_path("test/testAssets/audio.mp3") - username = "username" title = None - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=None, - tags=None, - category=None, - speakers=None, - loc=rel_path("yada/yada"), - model="tiny", - username=username, - pr=False, - source_type="audio", - local=True, - test=result, - chapters=False, + transcription = Transcription( + test_mode=True ) - assert filename is None - assert not check_md_file( - path=filename, transcript_by=username, media=source, title=title - ) - application.clean_up(tmp_dir) + with 
pytest.raises(Exception) as error: + transcription.add_transcription_source(source, title) + assert "Please supply a title for the audio file" in str(error) + transcription.clean_up() @pytest.mark.feature @@ -139,31 +114,20 @@ def test_audio_with_all_data(): tags = "tag1, tag2" category = "category" date = "2020-01-31" - date = datetime.strptime(date, "%Y-%m-%d").date() - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=date, - tags=tags, - category=category, - speakers=speakers, - loc=rel_path("yada/yada"), - model="tiny", + transcription = Transcription( username=username, - source_type="audio", - local=True, - test=result, - chapters=False, - pr=False, + test_mode=True, ) + transcription.add_transcription_source( + source, title, date, tags, category, speakers) + transcripts = transcription.start() + category = [cat.strip() for cat in category.split(",")] tags = [tag.strip() for tag in tags.split(",")] speakers = [speaker.strip() for speaker in speakers.split(",")] - date = date.strftime("%Y-%m-%d") - filename = os.path.join(tmp_dir, filename) - assert os.path.isfile(filename) + assert os.path.isfile(transcripts[0]) assert check_md_file( - path=filename, + path=transcripts[0], transcript_by=username, media=source, title=title, @@ -173,4 +137,4 @@ def test_audio_with_all_data(): speakers=speakers, local=True, ) - application.clean_up(tmp_dir) + transcription.clean_up() diff --git a/test/test_cli.py b/test/test_cli.py index f59a404..2b42de0 100644 --- a/test/test_cli.py +++ b/test/test_cli.py @@ -4,6 +4,7 @@ import pytest from app import application +from app.transcription import Transcription def rel_path(path): @@ -21,53 +22,33 @@ def test_initialize_repo(): assert False -@pytest.mark.feature -def test_find_source_type(): - assert application.check_source_type("B0HW_sJ503Y")[0] == "video" - assert application.check_source_type("https://www.youtube.com/watch?v=B0HW_sJ503Y")[0] == "video" - assert application.check_source_type("https://youtu.be/B0HW_sJ503Y")[0] == "video" - assert application.check_source_type("https://youtube.com/embed/B0HW_sJ503Y")[0] == "video" - assert application.check_source_type("youtube.com/watch?v=B0HW_sJ503Y")[0] == "video" - assert application.check_source_type("www.youtube.com/watch?v=B0HW_sJ503Y&list")[0] == "video" - assert application.check_source_type("https://youtube.com/watch?v=B0HW_sJ503Y")[0] == "video" - - assert application.check_source_type("PLPQwGV1aLnTuN6kdNWlElfr2tzigB9Nnj")[0] == "playlist" - assert application.check_source_type("https://www.youtube.com/playlist?list=PLPQwGV1aLnTuN6kdNWlElfr2tzigB9Nnj")[0] == "playlist" - assert application.check_source_type("www.youtube.com/playlist?list=PLPQwGV1aLnTuN6kdNWlElfr2tzigB9Nnj")[0] == "playlist" - assert application.check_source_type("https://youtube.com/playlist?list=PLPQwGV1aLnTuN6kdNWlElfr2tzigB9Nnj")[0] == "playlist" - assert application.check_source_type("https://www.youtube.com/watch?v=B0HW_sJ503Y&list=PLPQwGV1aLnTuN6kdNWlElfr2tzigB9Nnj")[0] == "playlist" - - assert application.check_source_type("https://anchor.fm/s/12fe0620/podcast/play/32260353/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2021-3-26%2Fdc6f12e7-a547-d872-6ef6-7acfe755a692.mp3")[0] == "audio" - +# @pytest.mark.main +# def test_find_source_type(): +# @TODO rewwrite +@pytest.mark.feature def test_download_audio_file(): - if not os.path.isdir("tmp"): - os.mkdir("tmp") - audio = application.get_audio_file( - "https://dcs.megaphone.fm/FPMN6776580946.mp3", "test" + transcription 
= Transcription( + test_mode=True, ) - print("audio", audio) - assert os.path.isfile(audio) - os.remove(audio) + transcription.add_transcription_source( + "https://dcs.megaphone.fm/FPMN6776580946.mp3", "test") + audio_file, tmp_dir = transcription.transcripts[0].process_source( + transcription.tmp_dir) + assert os.path.isfile(audio_file) + application.clean_up(tmp_dir) +@pytest.mark.feature def test_download_video_file(): - if not os.path.isdir("tmp"): - os.mkdir("tmp") - url = "https://www.youtube.com/watch?v=B0HW_sJ503Y" - video = application.download_video(url) - assert os.path.isfile(video) and os.path.isfile("tmp/videoFile.info.json") - print() - os.remove(video) - os.remove("tmp/videoFile.info.json") - shutil.rmtree("tmp") - - -@pytest.mark.main -def test_convert_video_to_audio(): - if not os.path.isdir("tmp/"): - os.makedirs("tmp/") - application.convert_video_to_mp3(rel_path("testAssets/test_video.mp4")) - assert os.path.isfile("tmp/test_video.mp3") - shutil.rmtree("tmp/") + transcription = Transcription( + test_mode=True, + ) + transcription.add_transcription_source( + "https://www.youtube.com/watch?v=B0HW_sJ503Y") + audio_file, tmp_dir = transcription.transcripts[0].process_source( + transcription.tmp_dir) + assert os.path.isfile(f"{audio_file[:-4]}.mp4") # video download + assert os.path.isfile(audio_file) # mp3 convert + application.clean_up(tmp_dir) diff --git a/test/test_video.py b/test/test_video.py index a4b0406..2d07576 100644 --- a/test/test_video.py +++ b/test/test_video.py @@ -6,6 +6,7 @@ import pytest from app import application +from app.transcription import Transcription def rel_path(path): @@ -90,9 +91,6 @@ def check_md_file( @pytest.mark.feature def test_video_with_title(): - with open(rel_path("testAssets/transcript.txt"), "r") as file: - result = file.read() - file.close() source = os.path.abspath(rel_path("testAssets/test_video.mp4")) username = "username" title = "test_video" @@ -100,26 +98,17 @@ def test_video_with_title(): tags = None category = None date = None - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=date, - tags=tags, - category=category, - speakers=speakers, - loc="yada/yada", - model="tiny", + transcription = Transcription( username=username, - source_type="video", - local=True, - test=result, - chapters=False, + test_mode=True, ) - assert tmp_dir is not None - filename = os.path.join(tmp_dir, filename) - assert os.path.isfile(filename) + transcription.add_transcription_source( + source, title, date, tags, category, speakers) + transcripts = transcription.start() + + assert os.path.isfile(transcripts[0]) assert check_md_file( - path=filename, + path=transcripts[0], transcript_by=username, media=source, title=title, @@ -129,7 +118,7 @@ def test_video_with_title(): speakers=speakers, local=True, ) - application.clean_up(tmp_dir) + transcription.clean_up() @pytest.mark.feature @@ -141,32 +130,21 @@ def test_video_with_all_options(): tags = "tag1, tag2" category = "category" date = "2020-01-31" - date = datetime.strptime(date, "%Y-%m-%d").date() - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=date, - tags=tags, - category=category, - speakers=speakers, - loc="yada/yada", - model="tiny", + + transcription = Transcription( username=username, - source_type="video", - local=True, - test=True, - chapters=False, + test_mode=True, ) - assert tmp_dir is not None - filename = os.path.join(tmp_dir, filename) - assert os.path.isfile(filename) + 
transcription.add_transcription_source( + source, title, date, tags, category, speakers) + transcripts = transcription.start() + assert os.path.isfile(transcripts[0]) category = [cat.strip() for cat in category.split(",")] tags = [tag.strip() for tag in tags.split(",")] speakers = [speaker.strip() for speaker in speakers.split(",")] - date = date.strftime("%Y-%m-%d") assert check_md_file( - path=filename, + path=transcripts[0], transcript_by=username, media=source, title=title, @@ -176,7 +154,7 @@ def test_video_with_all_options(): speakers=speakers, local=True, ) - application.clean_up(tmp_dir) + transcription.clean_up() @pytest.mark.feature @@ -191,25 +169,16 @@ def test_video_with_chapters(): tags = "tag1, tag2" category = "category" date = "2020-01-31" - date = datetime.strptime(date, "%Y-%m-%d").date() - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=date, - tags=tags, - category=category, - speakers=speakers, - loc="yada/yada", - model="tiny", + + transcription = Transcription( username=username, - source_type="video", - local=True, - test=result, chapters=True, - pr=True, + test_mode=True, ) - assert tmp_dir is not None - filename = os.path.join(tmp_dir, filename) + transcription.add_transcription_source( + source, title, date, tags, category, speakers) + transcripts = transcription.start(result) + chapter_names = [] with open(rel_path("testAssets/test_video_chapters.chapters"), "r") as file: result = file.read() @@ -218,14 +187,12 @@ def test_video_with_chapters(): chapter_names.append(x.split("= ")[1].strip()) file.close() - print(filename) - assert os.path.isfile(filename) + assert os.path.isfile(transcripts[0]) category = [cat.strip() for cat in category.split(",")] tags = [tag.strip() for tag in tags.split(",")] speakers = [speaker.strip() for speaker in speakers.split(",")] - date = date.strftime("%Y-%m-%d") assert check_md_file( - path=filename, + path=transcripts[0], transcript_by=username, media=source, title=title, @@ -236,28 +203,34 @@ def test_video_with_chapters(): chapters=chapter_names, local=True, ) - application.clean_up(tmp_dir) + transcription.clean_up() @pytest.mark.feature def test_generate_payload(): - date = "2020-01-31" - date = datetime.strptime(date, "%Y-%m-%d").date() with open(rel_path("testAssets/transcript.txt"), "r") as file: transcript = file.read() file.close() - payload = application.generate_payload( - loc=rel_path("yada/yada"), - title="test_title", - event_date=date, - tags=["tag1", "tag2"], - test=True, - category=["category1", "category2"], - speakers=["speaker1", "speaker2"], + + source = rel_path("testAssets/test_video.mp4") + username = "username" + title = "test_title" + speakers = ["speaker1", "speaker2"] + tags = [] + category = ["category1", "category2"] + date = "2020-01-31" + loc = "yada/yada" + + transcription = Transcription( + loc=loc, username="username", - media=rel_path("testAssets/test_video.mp4"), - transcript=transcript, + test_mode=True, ) + transcription.add_transcription_source( + source, title, date, tags, category, speakers) + transcription.start(transcript) + payload = transcription.push_to_queue(transcription.transcripts[0]) + transcription.clean_up() with open(rel_path("testAssets/payload.json"), "r") as outfile: content = json.load(outfile) outfile.close() diff --git a/transcriber.py b/transcriber.py index 063d962..18827f6 100644 --- a/transcriber.py +++ b/transcriber.py @@ -1,22 +1,14 @@ import logging -from datetime import datetime +import tempfile import click from app 
import __app_name__, __version__, application +from app.transcript import Transcript +from app.transcription import Transcription +from app.logging import configure_logger, get_logger - -def setup_logger(): - logger = logging.getLogger(__app_name__) - console_handler = logging.StreamHandler() - console_handler.setLevel( - logging.DEBUG - ) # Set the desired log level for console output in the submodule - formatter = logging.Formatter( - "%(asctime)s - %(name)s - %(levelname)s - %(message)s" - ) - console_handler.setFormatter(formatter) - logger.addHandler(console_handler) +logger = get_logger() @click.group() @@ -38,10 +30,7 @@ def print_help(ctx, param, value): ctx.exit() -@click.command() -@click.argument("source", nargs=1) -@click.argument("loc", nargs=1) -@click.option( +whisper = click.option( "-m", "--model", type=click.Choice( @@ -58,118 +47,143 @@ def print_help(ctx, param, value): ] ), default="tiny.en", - help="Options for transcription model", -) -@click.option( - "-t", - "--title", - type=str, - help="Supply transcribed file title in 'quotes', title is mandatory in case" - " of audio files", -) -@click.option( - "-d", - "--date", - type=str, - help="Supply the event date in format 'yyyy-mm-dd'", + show_default=True, + help="Select which whisper model to use for the transcription", ) -@click.option( - "-T", - "--tags", - type=str, - help="Supply the tags for the transcript in 'quotes' and separated by " - "commas", -) -@click.option( - "-s", - "--speakers", - type=str, - help="Supply the speakers for the transcript in 'quotes' and separated by " - "commas", +deepgram = click.option( + "-D", + "--deepgram", + is_flag=True, + default=False, + help="Use deepgram for transcription", ) -@click.option( - "-c", - "--category", - type=str, - help="Supply the category for the transcript in 'quotes' and separated by " - "commas", +diarize = click.option( + "-M", + "--diarize", + is_flag=True, + default=False, + help="Supply this flag if you have multiple speakers AKA " + "want to diarize the content", ) -@click.option( - "-v", - "--version", +summarize = click.option( + "-S", + "--summarize", is_flag=True, - callback=print_version, - expose_value=False, - is_eager=True, - help="Show the application's version and exit.", + default=False, + help="Summarize the transcript [only available with deepgram]", ) -@click.option( +use_youtube_chapters = click.option( "-C", "--chapters", is_flag=True, default=False, - help="Supply this flag if you want to generate chapters for the transcript", -) -@click.option( - "-h", - "--help", - is_flag=True, - callback=print_help, - expose_value=False, - is_eager=True, - help="Show the application's help and exit.", + help="For YouTube videos, include the YouTube chapters and timestamps in the resulting transcript.", ) -@click.option( +open_pr = click.option( "-p", "--PR", is_flag=True, default=False, - help="Supply this flag if you want to generate a payload", + help="Open a PR on the bitcointranscripts repo", ) -@click.option( - "-D", - "--deepgram", +upload_to_s3 = click.option( + "-u", + "--upload", is_flag=True, default=False, - help="Supply this flag if you want to use deepgram", + help="Upload processed model files to AWS S3", ) -@click.option( - "-S", - "--summarize", +save_to_markdown = click.option( + "--markdown", is_flag=True, default=False, - help="Supply this flag if you want to summarize the content", + help="Save the resulting transcript to a markdown format supported by bitcointranscripts", ) -@click.option( - "-M", - "--diarize", +noqueue = 
click.option( + "--noqueue", is_flag=True, default=False, - help="Supply this flag if you have multiple speakers AKA " - "want to diarize the content", + help="Do not push the resulting transcript to the Queuer backend", ) -@click.option( +model_output_dir = click.option( + "-o", + "--model_output_dir", + type=str, + default="local_models/", + show_default=True, + help="Set the directory for saving model outputs", +) +nocleanup = click.option( + "--nocleanup", + is_flag=True, + default=False, + help="Do not remove temp files on exit", +) +verbose_logging = click.option( "-V", "--verbose", is_flag=True, default=False, help="Supply this flag to enable verbose logging", ) + + +@cli.command() +@click.argument("source", nargs=1) +@click.argument("loc", nargs=1) +# Available transcription models and services +@whisper +@deepgram +# Options for adding metadata @click.option( - "-o", - "--model_output_dir", + "-t", + "--title", type=str, - default="local_models/", - help="Supply this flag if you want to change the directory for saving " - "model outputs", + help="Add the title for the resulting transcript (required for audio files)", ) @click.option( - "-u", - "--upload", + "-d", + "--date", + type=str, + help="Add the event date to transcript's metadata in format 'yyyy-mm-dd'", +) +@click.option( + "-T", + "--tags", + multiple=True, + help="Add a tag to transcript's metadata (can be used multiple times)", +) +@click.option( + "-s", + "--speakers", + multiple=True, + help="Add a speaker to the transcript's metadata (can be used multiple times)", +) +@click.option( + "-c", + "--category", + multiple=True, + help="Add a category to the transcript's metadata (can be used multiple times)", +) +# Options for configuring the transcription process +@diarize +@summarize +@use_youtube_chapters +@open_pr +@upload_to_s3 +@save_to_markdown +@noqueue +@model_output_dir +@nocleanup +@verbose_logging +@click.option( + "-v", + "--version", is_flag=True, - default=False, - help="Supply this flag if you want to upload processed model files to AWS " - "S3", + callback=print_version, + expose_value=False, + is_eager=True, + help="Show the application's version and exit.", ) def add( source: str, @@ -177,9 +191,9 @@ def add( model: str, title: str, date: str, - tags: str, - speakers: str, - category: str, + tags: list, + speakers: list, + category: list, chapters: bool, pr: bool, deepgram: bool, @@ -188,62 +202,47 @@ def add( upload: bool, verbose: bool, model_output_dir: str, + nocleanup: bool, + noqueue: bool, + markdown: bool ) -> None: - """Supply a YouTube video id and directory for transcription. \n + """Transcribe the given source. Suported sources: + YouTube videos, YouTube playlists, Local and remote audio files + Note: The https links need to be wrapped in quotes when running the command on zsh """ - setup_logger() - logger = logging.getLogger(__app_name__) - if verbose: - logger.setLevel(logging.DEBUG) - else: - logger.setLevel(logging.WARNING) + tmp_dir = tempfile.mkdtemp() + configure_logger(logging.DEBUG if verbose else logging.INFO, tmp_dir) logger.info( "This tool will convert Youtube videos to mp3 files and then " "transcribe them to text using Whisper. 
" ) try: - username = application.get_username() - loc = loc.strip("/") - event_date = None - if date: - try: - event_date = datetime.strptime(date, "%Y-%m-%d").date() - except ValueError as e: - logger.error("Supplied date is invalid: ", e) - return - (source_type, local) = application.check_source_type(source=source) - if source_type is None: - logger.error("Invalid source") - return - filename, tmp_dir = application.process_source( - source=source, - title=title, - event_date=event_date, - tags=tags, - category=category, - speakers=speakers, + transcription = Transcription( loc=loc, model=model, - username=username, chapters=chapters, pr=pr, summarize=summarize, - source_type=source_type, deepgram=deepgram, diarize=diarize, upload=upload, model_output_dir=model_output_dir, - verbose=verbose, - local=local + nocleanup=nocleanup, + queue=not noqueue, + markdown=markdown, + working_dir=tmp_dir + ) + transcription.add_transcription_source( + source_file=source, title=title, date=date, tags=tags, category=category, speakers=speakers, ) - if filename: - """INITIALIZE GIT AND OPEN A PR""" - logger.info("Transcription complete") - logger.info("Cleaning up...") - application.clean_up(tmp_dir) + transcription.start() + if nocleanup: + logger.info("Not cleaning up temp files...") + else: + transcription.clean_up() except Exception as e: logger.error(e) - logger.error("Cleaning up...") + logger.info(f"Exited with error, not cleaning up temp files: {tmp_dir}")