What's new?

Word-level timestamps for Whisper automatic-speech-recognition 🤯

This release adds the ability to predict word-level timestamps for our Whisper automatic-speech-recognition models by analyzing the cross-attentions and applying dynamic time warping. Our implementation is adapted from this PR, which added this functionality to the 🤗 Transformers Python library.
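For readers unfamiliar with dynamic time warping, here is a minimal, generic sketch of the alignment step. It is not the code used in this library: the function name `dynamicTimeWarping` and the toy cost matrix are made up for illustration, and I'm assuming the cost of pairing a token with an audio frame has already been derived somehow (for example, from negated cross-attention weights). The real implementation also does extra processing on the attentions that is not shown here.

```javascript
// Generic dynamic time warping over a cost matrix (illustrative sketch only).
// cost[i][j] is the (hypothetical) cost of aligning token i with audio frame j;
// lower means a better match.
function dynamicTimeWarping(cost) {
  const n = cost.length;
  const m = cost[0].length;

  // acc[i][j] holds the minimum accumulated cost of any monotonic path
  // that ends at cell (i - 1, j - 1) of the cost matrix.
  const acc = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  acc[0][0] = 0;

  for (let i = 1; i <= n; ++i) {
    for (let j = 1; j <= m; ++j) {
      acc[i][j] = cost[i - 1][j - 1] +
        Math.min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]);
    }
  }

  // Backtrack from the bottom-right corner to recover the cheapest path.
  const path = [];
  let i = n;
  let j = m;
  while (i > 0 && j > 0) {
    path.push([i - 1, j - 1]);
    const diag = acc[i - 1][j - 1];
    const up = acc[i - 1][j];
    const left = acc[i][j - 1];
    if (diag <= up && diag <= left) { --i; --j; }
    else if (up <= left) { --i; }
    else { --j; }
  }
  return path.reverse();
}

// Toy example: 2 tokens, 4 frames. Token 0 matches frames 0-1, token 1 matches frames 2-3.
const cost = [
  [0, 0, 1, 1],
  [1, 1, 0, 0],
];
const path = dynamicTimeWarping(cost);

// The first frame at which each token appears on the path marks where that token starts.
const startFrames = [];
for (const [token, frame] of path) {
  if (startFrames[token] === undefined) startFrames[token] = frame;
}
console.log(path, startFrames);
```

Multiplying each start frame by the duration of one frame then gives a start time in seconds for each token.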

Example usage: (see docs)

import { pipeline } from '@xenova/transformers';

let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en', {
    revision: 'output_attentions',
});
let output = await transcriber(url, { return_timestamps: 'word' });
// {
//   "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
//   "chunks": [
//     { "text": " And", "timestamp": [0, 0.78] },
//     { "text": " so", "timestamp": [0.78, 1.06] },
//     { "text": " my", "timestamp": [1.06, 1.46] },
//     ...
//     { "text": " for", "timestamp": [9.72, 9.92] },
//     { "text": " your", "timestamp": [9.92, 10.22] },
//     { "text": " country.", "timestamp": [10.22, 13.5] }
//   ]
// }
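One thing you might do with this output is turn the word chunks into caption cues. Below is a small sketch that converts chunks shaped like the output above into WebVTT text. The helpers `formatTimestamp` and `chunksToVtt` are hypothetical names I made up for this example (they are not part of the library), and the sample `chunks` array is just the first few entries from the output above.

```javascript
// Format a time in seconds as a WebVTT timestamp: HH:MM:SS.mmm
function formatTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Build a WebVTT document with one cue per word chunk.
function chunksToVtt(chunks) {
  const cues = chunks.map(({ text, timestamp: [start, end] }) =>
    `${formatTimestamp(start)} --> ${formatTimestamp(end)}\n${text.trim()}`
  );
  return ['WEBVTT', ...cues].join('\n\n') + '\n';
}

// Sample data copied from the example output above.
const chunks = [
  { text: ' And', timestamp: [0, 0.78] },
  { text: ' so', timestamp: [0.78, 1.06] },
  { text: ' my', timestamp: [1.06, 1.46] },
];

console.log(chunksToVtt(chunks));
```

In a real project you would pass `output.chunks` from the transcriber instead of the hard-coded sample, and probably group several words into each cue rather than emitting one cue per word.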

Note: For now, you need to choose the output_attentions revision (see above). In the future, we may merge these models into the main branch. Also, we do not currently have exports for the medium and large models, simply because I don't have enough RAM to do the export myself (>25GB needed) 😅. If you would like to use our conversion script to do the conversion yourself, please make a PR on the Hub with these new models (under a new output_attentions branch)!

From our testing, the JS implementation exactly matches the output produced by the Python implementation (when using the same model, of course)! 🥳

Python (left) vs. JavaScript (right)

I'm excited to see what you all build with this! Please tag me on Twitter if you use it in your project; I'd love to see it! I'm also planning to add this as an option to whisper-web, so stay tuned! 🚀

Misc bug fixes and improvements

Fix loading of grayscale images in node.js (#178)

2.4.0