Can somebody explain me what the _preprocess() function is doing? #33

Open
samarth12 opened this issue Oct 20, 2020 · 2 comments
Comments

@samarth12

I ran the standard example and walked through it in the debugger.

from jiwer import wer

ground_truth = "hello world"
hypothesis = "hello duck"

error = wer(ground_truth, hypothesis)

Along the way, it uses the _preprocess() function to convert the words/tokens into integer representations.

hello world is converted to \x00\x02 and hello duck is converted to \x00\x01. Then the Levenshtein distance is calculated on these strings rather than the original words. I am not sure how that maps to the original definition of Word Error Rate.

Can somebody explain to me what is happening?

@nikvaessen
Collaborator

nikvaessen commented Oct 21, 2020

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
(https://en.wikipedia.org/wiki/Levenshtein_distance)

Because we convert each word in a sentence to a unique integer such as \x00, the ground truth and hypothesis sentences become two "words" of integer tokens. We can then use the Levenshtein distance to calculate the insertions (I), deletions (D) or substitutions (S) required to change the hypothesis "integer word" into the ground truth "integer word". We can then simply plug the S, D, and I values we found into the WER formula.
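To make the idea concrete, here is a minimal sketch of the word-to-integer trick. It is not jiwer's actual code: the names `words_to_chars`, `levenshtein` and `sketch_wer` are hypothetical, the edit distance is a plain dynamic-programming version rather than the library jiwer calls, and the sketch assumes whitespace-separated words and a non-empty ground truth. The exact IDs assigned may also differ from the ones seen in the debugger; only their uniqueness matters.

```python
def words_to_chars(reference, hypothesis):
    """Map each distinct word to a unique character (hypothetical helper)."""
    vocab = {}

    def encode(sentence):
        # The first time a word is seen it gets the next free integer ID;
        # chr() turns that ID into a single character such as "\x00".
        return "".join(chr(vocab.setdefault(word, len(vocab)))
                       for word in sentence.split())

    return encode(reference), encode(hypothesis)


def levenshtein(a, b):
    """Character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b))) # substitution
        prev = curr
    return prev[-1]


def sketch_wer(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of ground-truth words."""
    ref_chars, hyp_chars = words_to_chars(reference, hypothesis)
    # One character stands for one word, so the character-level distance
    # equals the word-level distance. Assumes a non-empty reference.
    return levenshtein(ref_chars, hyp_chars) / len(ref_chars)


print(sketch_wer("hello world", "hello duck"))  # 0.5
```

In the example from the question, "world" is substituted by "duck", so S = 1, D = 0, I = 0 and N = 2, giving a WER of 0.5.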

@samarth12
Author

Oh, I realized this a little late: the Levenshtein distance in WER is calculated on words rather than characters. So the conversion is there to map each word to a unique ID. Makes sense now, thank you!
