
Add duration bounds checking to word-for-word audio migration #351

Open · wants to merge 1 commit into main

Conversation

CharlieMcVicker (Collaborator)
Goal

The purpose of this PR is to catch mismatched annotations and audio files at migration time.

Checks

  • Annotations do not reference audio outside of the bounds of the audio file's duration
    • Sadly this involved downloading the audio to temp files and doing the duration check on disk, which is not ideal in the long run. I'd like to find a Rust crate that can read the metadata over the network, but I couldn't find one quickly and moved on. I consider this a quick fix and diagnostic while we are still getting to the bottom of things.
  • Some kind of check that can handle when a smaller annotation file is used, e.g. more than 10 seconds between the last annotation's end_time and the duration of the file. For false positives, we would probably want to trim the end of the audio file anyway; removing silence at the end of a track wouldn't affect alignment.
  • There are the same number of annotations and words in a document
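The first two checks could be sketched roughly like this. This is a minimal sketch, not the PR's actual code: the function name, the `(start, end)` tuple representation of annotations, and the hard-coded 10-second threshold are all assumptions for illustration.

```rust
// Sketch of the duration checks described above. Annotation times are
// (start_secs, end_secs) pairs; `duration_secs` is the decoded audio length.
// All names here are illustrative, not the migration's real types.
fn check_annotation_bounds(
    annotations: &[(f64, f64)],
    duration_secs: f64,
) -> Result<(), String> {
    // Check 1: no annotation may reference audio past the end of the file.
    for (i, (start, end)) in annotations.iter().enumerate() {
        if *start > duration_secs || *end > duration_secs {
            return Err(format!(
                "annotation {i} ({start}s..{end}s) exceeds audio duration {duration_secs}s"
            ));
        }
    }
    // Check 2: flag suspiciously short annotation coverage, e.g. more than
    // 10 seconds of audio remaining after the last annotation ends.
    if let Some((_, last_end)) = annotations.last() {
        if duration_secs - last_end > 10.0 {
            return Err(format!(
                "last annotation ends at {last_end}s but audio runs to {duration_secs}s"
            ));
        }
    }
    Ok(())
}
```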

Counting words / annotations and ensuring they match

I couldn't find a clean place to add a sanity check that num_words == num_annotations, but here is the code for it. I originally wrote this inside AnnotatedLine::many_from_semantic, but since that goes line by line, the check would typically fail: there are fewer words in a line than in the whole document.

                // Our code will error if there are fewer annotations than
                // words, but continue silently if there are leftover
                // annotations.
                if let (Some(num_annotations), Ok(num_words)) = (
                    meta.audio_recording
                        .as_ref()
                        .and_then(|audio| audio.annotations.as_ref())
                        .map(|annotations| annotations.len()),
                    words.as_ref().map(|w| w.len()),
                ) {
                    anyhow::ensure!(
                        num_annotations == num_words,
                        "If we have word-for-word annotations, we must have one for each word."
                    )
                }

@GracefulLemming (Contributor)

If we don't want to download each audio file, two things come to mind:

  1. Track audio length in a custom metadata field on S3
  2. Implement a flag recording whether the length of an audio file has been verified since its last update, allowing us to skip downloading some audio

Both of these come with their own overhead, though, and I'm not certain either is a direct improvement on this process.
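If option 1 were adopted, the migration could learn the duration from a HEAD request instead of downloading the file. S3 exposes user-defined metadata (`x-amz-meta-*` headers) as a string map, so the parsing side might look like this sketch; the `duration-ms` key name is an assumption, not an existing convention in this project.

```rust
use std::collections::HashMap;
use std::time::Duration;

// S3 user-defined metadata arrives as string key/value pairs; this pulls a
// hypothetical `duration-ms` key out of that map. The key name is made up
// for illustration.
fn duration_from_metadata(metadata: &HashMap<String, String>) -> Option<Duration> {
    metadata
        .get("duration-ms")
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
}
```

A missing or unparseable key yields `None`, which the caller could treat as "fall back to downloading and measuring on disk".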

                if let Some(slices) = &annotations {
                    let last_slice = slices
                        .last()
                        .ok_or_else(|| anyhow::format_err!("Expected there to be at least one slice"))?;
I've noticed that our logging sometimes appears in unpredictable orders because of async, so I think adding info identifying the problematic document, annotations, and/or audio file, as well as the particular word index and/or annotation time/index, would be helpful in resolving errors thrown here.
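Concretely, the error message could carry its own identifying context along these lines. This is a sketch only; the field names and message wording are placeholders, and the real code would likely build the message through anyhow rather than a plain `String`.

```rust
// Build an error message that stays meaningful even when async logging
// interleaves output, by embedding its own identifying context.
// All parameter names here are illustrative placeholders.
fn slice_error(document_id: &str, audio_url: &str, slice_index: usize) -> String {
    format!(
        "document {document_id}: audio {audio_url}: \
         expected at least one slice, failed at slice index {slice_index}"
    )
}
```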

@GracefulLemming (Contributor) left a comment
Overall this makes sense, but I think we could add some more details to our error messages to make resolving errors easier
