kaieberl/paper2speech

Paper2Speech

Tip

You can now convert documents to a web page with explanation buttons and have specific sections explained. The provided script takes care of retrieving the required background information from your document.

Motivation

As a student in applied mathematics and machine learning, I often read scientific books, lecture notes and papers. I usually prefer listening to a professor's lecture and following the visual explanations on the blackboard, because I take in much of the information by ear and don't have to do the "heavy lifting" through reading alone. So far, this has not been available for books and papers.
So I thought: why not let software read the text out for you? What if you just had to click a button in the Finder, and the book or paper was converted to speech automatically?
This script uses the Meta Nougat package to extract formatted text from a PDF and then converts it to audio or a web page.

Sample output for the paper Large Language Models for Compiler Optimization:
output audio

Features

The aim of this package is to make papers more accessible by converting them to audio, or to an easy-to-read web page.

Audio

  • pause before and after headings
  • skip references like [1], (1, 2), [Feynman et al., 1965], [AAKA23, SKNM23]
  • spell out abbreviations like e.g., i.e., w.r.t., Fig., Eq.
  • read out inline math (work in progress)
  • do not read out block math, instead pause
  • do not read out table contents
  • read out figure, table captions
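
The reference-skipping and abbreviation rules above can be illustrated with a minimal sketch. The regular expression, the abbreviation table and the function name `prepare_for_speech` are assumptions made for illustration; they are not taken from the package's source, which may handle these cases differently.

```python
import re

# Hypothetical table: abbreviation -> spoken form (not the package's own list).
ABBREVIATIONS = {
    "e.g.": "for example",
    "i.e.": "that is",
    "w.r.t.": "with respect to",
    "Fig.": "Figure",
    "Eq.": "Equation",
}

# Bracketed citations, together with the whitespace in front of them.
CITATION = re.compile(
    r"\s*\[(?:\d+(?:\s*[,-]\s*\d+)*"      # numeric: [1], [1, 2], [3-5]
    r"|[A-Z][^\[\]]*\d{2,4}[a-z]?)\]"     # keyed: [Feynman et al., 1965], [AAKA23, SKNM23]
)


def prepare_for_speech(text: str) -> str:
    """Drop bracketed citations, then spell out common abbreviations."""
    text = CITATION.sub("", text)
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    return text


print(prepare_for_speech("As shown in Fig. 2 [1], attention helps, e.g. on long inputs."))
```

A pattern this simple would also remove a numeric interval such as [0, 1] in running text, so a real implementation needs more context than this sketch uses.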

HTML

  • fetch explanations for a specific section from GPT-4o using retrieval-augmented generation and read them out
  • explanation buttons next to each section, subsection, theorem, etc.

Installation

Replace the GEMMA_CPP_PATH variable in src/markdown_to_html.py with the build path of your gemma executable. The tokenizer and model weights should be in the same directory. If no weights are found, this functionality is disabled.

git clone git@github.com:kaieberl/paper2speech.git
cd paper2speech
pip install .

For conversion to html, additionally install:

brew install node
npm install -g @mathpix/mpx-cli
sudo port install latexml

Usage

Files can be converted from pdf, mmd and tex to mp3 and html.

paper2speech <input_file.pdf> -o <output_file.mp3>
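
To convert a whole folder of papers, the command above can be wrapped in a short script. This is a sketch that assumes `paper2speech` is on your PATH; the helper names `build_command` and `convert_folder` are made up for this example and are not part of the package.

```python
import subprocess
from pathlib import Path


def build_command(input_file: Path, suffix: str = ".mp3") -> list[str]:
    """Build the paper2speech call that writes the output next to the input file."""
    output_file = input_file.with_suffix(suffix)
    return ["paper2speech", str(input_file), "-o", str(output_file)]


def convert_folder(folder: Path) -> None:
    """Convert every pdf in the folder, one after another."""
    for pdf in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError at the first failed conversion
        subprocess.run(build_command(pdf), check=True)
```

Calling `convert_folder(Path("papers"))` would then run one conversion per pdf file in `papers/`.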

If an error occurs at a later stage, you can rerun the command on an intermediate file (e.g. the mmd file) instead of starting from the pdf. When converting to html, the output directory should be out/ so that the css file is linked correctly.
There are two scripts in the out/ directory: script_latexml.js and script_latexml_clipboard.js. If you prefer to use your own ChatGPT or other LLM subscription, you can rename the second script to script_latexml.js. This will just copy the retrieved section text to the clipboard, with an added instruction.
Depending on your particular use case and LLM, you can customize the instruction that will be sent, e.g.

const prompt = 'Explain this intuitively:\n' + getSectionText(button.parentElement);

or

const prompt = 'Explain this in detail and without simplification:\n' + getSectionText(button.parentElement);

Currently, the mp3 output uses Google Cloud TTS, while the html output uses the OpenAI gpt-4o and speech APIs.
The Google Cloud authentication json file should be in the src directory. It can be downloaded from the Google Cloud Console, as described here.
TLDR: On https://cloud.google.com, create a new project. In your project, in the upper right corner, click on the 3 dots > project settings > service accounts > choose one or create service account > create key > json > create. The resulting json file should be downloaded automatically. Google TTS Neural2 and Wavenet voices are free for the first 1 million characters per month, after that $16 per 1M characters for the Neural2 voices and $4 per 1M characters for the Wavenet voices.
The OpenAI API key should be added in out/script_latexml.js.
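
For a rough idea of the Google TTS cost, the pricing quoted above can be turned into a small calculation. This sketch uses only the figures stated in this README (which may have changed since) and assumes the free allowance is counted per voice type; `estimate_monthly_cost` is a hypothetical helper, not part of the package.

```python
FREE_CHARS_PER_MONTH = 1_000_000  # free allowance quoted above
USD_PER_MILLION_CHARS = {"Neural2": 16.0, "Wavenet": 4.0}  # prices quoted above


def estimate_monthly_cost(num_chars: int, voice_type: str = "Neural2") -> float:
    """Estimate the monthly cost in USD for a given number of synthesized characters."""
    billable = max(0, num_chars - FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * USD_PER_MILLION_CHARS[voice_type]


print(estimate_monthly_cost(1_500_000))             # 8.0
print(estimate_monthly_cost(1_500_000, "Wavenet"))  # 2.0
```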

You can customize the voice in the definition of the voice variable.

voice = texttospeech.VoiceSelectionParams(
    language_code='en-GB',
    name='en-GB-Neural2-B',
)

Go to https://cloud.google.com/text-to-speech to try out different voices and languages. Below the text box, there is a button to show the json request. E.g. to use an American English voice, replace 'en': ('en-GB', 'en-GB-Neural2-B'), with 'en': ('en-US', 'en-US-Neural2-J'),. Also change the fallback Wavenet voice, a few lines further down, to a voice for the same language:

voice = texttospeech.VoiceSelectionParams(
    language_code='en-GB',
    name='en-GB-Wavenet-B',
)

This voice is used if the Neural voice returns an error, e.g. because a sentence is too long.
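
The fallback behaviour described above can be sketched as follows. The function `synthesize_with_fallback` and its `synthesize` callback are assumptions made for illustration; the package's actual code calls the Google client directly and may be structured differently.

```python
from typing import Callable


def synthesize_with_fallback(
    text: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical callback: (text, voice name) -> audio bytes
    primary_voice: str = "en-GB-Neural2-B",
    fallback_voice: str = "en-GB-Wavenet-B",
) -> bytes:
    """Try the Neural2 voice first and retry once with the Wavenet voice on error."""
    try:
        return synthesize(text, primary_voice)
    except Exception:
        # Broad on purpose for this sketch; real code would catch the
        # specific error type raised by the TTS client.
        return synthesize(text, fallback_voice)
```

Passing the synthesis call in as a parameter keeps the retry logic independent of the TTS client, so it can be tried out without credentials.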

On macOS, you can create a shortcut in the Finder with the following steps:

  1. In Automator, create a new Quick Action.
  2. At the top, set the input to "PDF files" in "Finder".
  3. Add a "Run Shell Script" action. Set the shell to /bin/zsh and pass input as arguments.
  4. Add the following code.

For mp3 output:

source ~/opt/miniconda3/etc/profile.d/conda.sh
conda activate paper2audio
paper2speech "$1" -o "${1%.*}.mp3"

For creating an html page:

export PATH=/opt/homebrew/bin:/opt/local/bin:$PATH
source ~/opt/miniconda3/etc/profile.d/conda.sh
conda activate paper2audio
file_name=${1##*/}
paper2speech "$1" -o "/path/to/paper2speech/out/${file_name%.*}.html"

The two paths in the first line should be the locations of node and latexmlc.

  5. Save the action and give it a name, e.g. "Paper2Speech" or "PaperAI", respectively.

FAQ

What should I do if I get the error "Mathpix CLI conversion failed"?

There is likely an unsupported LaTeX command in your mmd file.

  1. Please go to snip.mathpix.com and paste the content of your mmd file into a new note. You will get a preview on the right. Any command that is unsupported in Mathpix Markdown will show up as a yellow warning.
  2. Inside text_to_speech.py, add a replacement to the refine_mmd() function at the bottom. Please also create a PR or an issue, so that I can fix the bug. Alternatively, if you can live with the error, you can export the note as tex from Mathpix and then run paper2speech on the tex file.
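
A replacement of the kind mentioned in step 2 could look like the sketch below. The actual signature and contents of refine_mmd() are not shown in this README, so `REPLACEMENTS` and `apply_replacements` are hypothetical names, and the two LaTeX commands are only examples; check the Mathpix preview to see which commands are actually flagged in your file.

```python
import re

# Hypothetical table: (pattern for a flagged command, supported replacement).
REPLACEMENTS = [
    (re.compile(r"\\mathbbm\{"), r"\\mathbb{"),
    (re.compile(r"\\bm\{"), r"\\boldsymbol{"),
]


def apply_replacements(mmd: str) -> str:
    """Rewrite flagged LaTeX commands before the mmd text is converted."""
    for pattern, replacement in REPLACEMENTS:
        mmd = pattern.sub(replacement, mmd)
    return mmd


print(apply_replacements(r"$\mathbbm{1}_A + \bm{x}$"))
```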

Limitations (for PDFs)

  • currently does not support images in PDFs (PRs welcome!)
  • only works for English

Roadmap

  • create a Dockerfile for easy installation