Incorrect BLEU score calculation #3
Comments
I'm now trying to deal with this problem.
Yes, the current implementation calculates BLEU with UNK tokens (if you are using WordVocabulary).
Thank you for replying. I see. Then what about adding a switch to evaluate the BLEU score with either (1) word IDs or (2) the original sentences? Method (1) is the same as now. If you agree with adding a switch like this, I can implement it. Best regards,
After reading your PR, I think it might be better to just replace the UNK ID in the reference or hypothesis with values that are never used as word IDs (e.g. 0xffffffff). This requires only a few changes in train.cc and has the same effect as distinguishing UNK from other words.
Adding a switch for the UNK treatment is a good idea, I think.
At first, I thought the same thing as you. But in the case of using BPE (it's still not implemented, though) and evaluating the score against the original tokenization, the current implementation of the Corpus class cannot store the original tokens. My PR seems to be unnecessary if we only need to replace UNK tokens.
Using the original texts sometimes cannot give a correct BLEU score when, for example, we are using CharacterVocabulary, since this class does not care about word separators and basically accepts raw (untokenized) sentences. Therefore, we cannot always calculate the desired BLEU only by changing the training configuration. BTW, if we decide to add the original texts, I think the members for this purpose should be integrated directly into the Sample structure, because:
I understand your opinion. I'm going to modify the code accordingly.
I am now reworking some data structures to introduce raw tokens for the evaluation.
Thank you for the notice. I've just made a BPE vocabulary and it seems to be working fine.
Hi,
I've just found that the calculation of the BLEU score during training is incorrect.
The current implementation of the evaluateBLEU function in train.cc
just compares hypothesis and reference word IDs, which include UNK tokens,
without handling these tokens specially.
For example, if I set the source and target vocabulary sizes to 4,
I get a very high BLEU score because almost all of the words are UNK.