
Add duration bounds checking to word-for-word audio migration #351

Open · wants to merge 1 commit into main

Conversation

CharlieMcVicker (Collaborator)
Goal

The purpose of this PR is to catch mismatched annotations and audio files at migration time.

Checks

  • Annotations do not reference audio outside of the bounds of the audio file's duration
    • Sadly this involved downloading the audio to temp files and doing the duration check on disk, which is not ideal in the long run. I'd like to find a Rust crate that can read the metadata over the network, but I couldn't find one quickly and moved on. I consider this a quick fix and diagnostic while we are still getting to the bottom of things.
  • Some kind of check that can handle when a smaller annotation file is used, e.g. more than 10 seconds between the last annotation's end_time and the duration of the file. For false positives, we would probably want to trim the end of the audio file anyway; removing silence at the end of a track wouldn't affect alignment.
  • There are the same number of annotations and words in a document
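The first two checks could be sketched roughly like this. This is a minimal sketch, not the PR's actual code: the function name, the `(start, end)` tuple representation of annotations, and the hard-coded 10-second threshold are all assumptions for illustration.

```rust
// Sketch of the duration checks described above. Annotation times are
// (start_secs, end_secs) pairs; `duration_secs` is the decoded audio length.
// All names here are illustrative, not the migration's real types.
fn check_annotation_bounds(
    annotations: &[(f64, f64)],
    duration_secs: f64,
) -> Result<(), String> {
    // Check 1: no annotation may reference audio past the end of the file.
    for (i, (start, end)) in annotations.iter().enumerate() {
        if *start > duration_secs || *end > duration_secs {
            return Err(format!(
                "annotation {i} ({start}s..{end}s) exceeds audio duration {duration_secs}s"
            ));
        }
    }
    // Check 2: flag suspiciously short annotation coverage, e.g. more than
    // 10 seconds of audio remaining after the last annotation ends.
    if let Some((_, last_end)) = annotations.last() {
        if duration_secs - last_end > 10.0 {
            return Err(format!(
                "last annotation ends at {last_end}s but audio runs to {duration_secs}s"
            ));
        }
    }
    Ok(())
}
```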

Counting words / annotations and ensuring they match

I couldn't find a clean place to add a sanity check that num_words == num_annotations, but here is the code for it. I originally wrote this inside AnnotatedLine::many_from_semantic, but since that goes line by line, the check would typically fail: there are fewer words in a line than in the whole document.

                // Our code will error if there are fewer annotations than
                // words, but continue silently if there are leftover
                // annotations.
                if let (Some(num_annotations), Ok(num_words)) = (
                    meta.audio_recording
                        .as_ref()
                        .and_then(|audio| audio.annotations.as_ref())
                        .map(|annotations| annotations.len()),
                    words.as_ref().map(|w| w.len()),
                ) {
                    anyhow::ensure!(
                        num_annotations == num_words,
                        "If we have word-for-word annotations, we must have one for each word."
                    )
                }

@GracefulLemming (Contributor)

If we don't want to download each audio file, two things come to mind:

  1. Track audio length in a custom metadata field on S3
  2. Implement a flag recording whether the length of an audio file has been verified since its last update, allowing us to skip downloading some audio

Both of these come with their own overhead, though, and I'm not certain either is a direct improvement on this process.
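If option 1 were adopted, the migration could learn the duration from a HEAD request instead of downloading the file. S3 exposes user-defined metadata (`x-amz-meta-*` headers) as a string map, so the parsing side might look like this sketch; the `duration-ms` key name is an assumption, not an existing convention in this project.

```rust
use std::collections::HashMap;
use std::time::Duration;

// S3 user-defined metadata arrives as string key/value pairs; this pulls a
// hypothetical `duration-ms` key out of that map. The key name is made up
// for illustration.
fn duration_from_metadata(metadata: &HashMap<String, String>) -> Option<Duration> {
    metadata
        .get("duration-ms")
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
}
```

A missing or unparseable key yields `None`, which the caller could treat as "fall back to downloading and measuring on disk".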

                if let Some(slices) = &annotations {
                    let last_slice = slices
                        .last()
                        .ok_or_else(|| anyhow::format_err!("Expected there to be at least one slice"))?;
I've noticed that our logging sometimes appears in unpredictable orders because of async, so I think adding info identifying the problematic document, annotations, and/or audio file, as well as the particular word index and/or annotation time/index, would be helpful in resolving errors thrown here.
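Concretely, the error message could carry its own identifying context along these lines. This is a sketch only; the field names and message wording are placeholders, and the real code would likely build the message through anyhow rather than a plain `String`.

```rust
// Build an error message that stays meaningful even when async logging
// interleaves output, by embedding its own identifying context.
// All parameter names here are illustrative placeholders.
fn slice_error(document_id: &str, audio_url: &str, slice_index: usize) -> String {
    format!(
        "document {document_id}: audio {audio_url}: \
         expected at least one slice, failed at slice index {slice_index}"
    )
}
```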

@GracefulLemming (Contributor) left a comment
Overall this makes sense, but I think we could add some more details to our error messages to make resolving errors easier
