
Incorrect BLEU score calculation #3

Open
MorinoseiMorizo opened this issue Jan 7, 2017 · 10 comments

MorinoseiMorizo (Contributor) commented Jan 7, 2017

Hi,

I've just found that the BLEU score calculation during training is incorrect.
The current implementation of the evaluateBLEU function in train.cc
simply compares hypothesis and target word IDs, which include unk,
and does not give these tokens any special treatment.

For example, if I set the source and target vocabulary sizes to 4,
I get a really high BLEU score because almost all of the words become unk.
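To make this concrete, here is a toy sketch (not the actual evaluateBLEU code; it only computes clipped unigram precision over IDs) of how UNK-vs-UNK matches inflate the score:

```cpp
#include <iostream>
#include <map>
#include <vector>

// Clipped unigram precision over word IDs, the first ingredient of BLEU.
double unigramPrecision(const std::vector<unsigned>& ref,
                        const std::vector<unsigned>& hyp) {
  std::map<unsigned, int> ref_counts;
  for (unsigned id : ref) ++ref_counts[id];
  int matched = 0;
  for (unsigned id : hyp) {
    if (ref_counts[id] > 0) { ++matched; --ref_counts[id]; }
  }
  return hyp.empty() ? 0.0 : static_cast<double>(matched) / hyp.size();
}

int main() {
  const unsigned UNK = 0;  // assumption: ID 0 is the unk token
  // Two unrelated sentences, but with a 4-word vocabulary almost every
  // token collapses to UNK, so the ID sequences look nearly identical.
  std::vector<unsigned> ref = {UNK, UNK, UNK, 2, UNK};
  std::vector<unsigned> hyp = {UNK, 3, UNK, UNK, UNK};
  std::cout << unigramPrecision(ref, hyp) << std::endl;  // prints 0.8
}
```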

MorinoseiMorizo (Contributor, Author) commented:

I'm now trying to deal with this problem.
I think I can send a pull request soon.

odashi (Owner) commented Jan 9, 2017

Yes, the current implementation calculates BLEU over UNK tokens (if you are using WordVocabulary).
There are a few reasons for this: calculating BLEU or other metrics properly requires additional modules (e.g. the tokenizers requested by specific evaluation campaigns), and that sometimes requires hacky implementations. During training we cannot know what "the word to evaluate" is without such additional tokenizers, so the BLEU calculator relies only on the word IDs generated by the Vocabulary class (e.g. if we are using CharacterVocabulary, the BLEU score is calculated over characters).
Of course this behavior should be mentioned in the documentation, but I don't think we can say the current implementation is "incorrect."

MorinoseiMorizo (Contributor, Author) commented:

Thank you for replying.

I see. Then what about adding a switch to evaluate the BLEU score with either (1) word IDs or (2) original sentences?

Method (1) is the same as the current behavior.
Method (2) compares the original tokenized reference sentences (which contain no unk) with hypothesis sentences that may contain unk.
For method (2), the hypothesis sentences should reproduce the original tokenization regardless of which Vocabulary class is used, so we need a function that does this. In the current implementation, convertToSentence can do it, so I used that function in the #4 fix.
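A rough sketch of what I mean (every name here except convertToSentence is hypothetical, and its signature is assumed; the scorers are only declared to show the interface):

```cpp
#include <string>
#include <vector>

// Hypothetical switch between the two evaluation methods.
enum class BleuEvalMode {
  WORD_IDS,       // (1) current behavior: compare vocabulary IDs directly
  ORIGINAL_TEXT,  // (2) compare detokenized hypothesis vs. original reference
};

// Stand-in declarations: a real implementation would reuse the existing
// ID-based BLEU for WORD_IDS and score surface strings for ORIGINAL_TEXT.
float bleuOverIds(const std::vector<unsigned>& ref,
                  const std::vector<unsigned>& hyp);
float bleuOverText(const std::string& ref, const std::string& hyp);
std::string convertToSentence(const std::vector<unsigned>& ids);  // existing

float evaluateBLEU(const std::vector<unsigned>& ref_ids,
                   const std::vector<unsigned>& hyp_ids,
                   const std::string& ref_text,  // original tokenization
                   BleuEvalMode mode) {
  if (mode == BleuEvalMode::WORD_IDS) {
    return bleuOverIds(ref_ids, hyp_ids);  // unk may match unk here
  }
  // Detokenize the hypothesis so unk can never match a real reference word.
  return bleuOverText(ref_text, convertToSentence(hyp_ids));
}
```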

If you agree with adding a switch like this, I can implement it.

Best regards,

odashi (Owner) commented Jan 10, 2017

After reading your PR, I think it might be better to simply replace the UNK ID in the ref or hyp with a value that is never used as any other word ID (e.g. 0xffffffff). This requires only a few changes in train.cc and has the same effect as making UNK never match any other word.
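A minimal sketch of this idea (the helper name is hypothetical; only the 0xffffffff sentinel comes from the suggestion above):

```cpp
#include <vector>

// Replace every UNK ID in a sequence with a sentinel that no real word ID
// ever takes, so a hypothesis UNK can never match a reference token.
std::vector<unsigned> maskUnk(const std::vector<unsigned>& ids,
                              unsigned unk_id) {
  const unsigned kNeverAWordId = 0xffffffffu;  // outside any vocabulary range
  std::vector<unsigned> masked = ids;
  for (unsigned& id : masked) {
    if (id == unk_id) id = kNeverAWordId;
  }
  return masked;
}
```

Masking only the reference side would already be enough: the two sides can then never agree on an n-gram that contains UNK.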

odashi (Owner) commented Jan 10, 2017

I think adding a switch for the UNK treatment is a good idea.

MorinoseiMorizo (Contributor, Author) commented Jan 11, 2017

At first, I thought the same thing as you: if we use a word-level vocabulary, that is enough.

But if we use BPE (it's still not implemented, though) and want to evaluate the score against the original tokenization,
we need to store the original sentences with their original tokenization for testing.

The current implementation of the Corpus class cannot store these.
And since the original sentences are not needed for training, we should make separate Corpus and Sampler classes for testing.

My PR may seem unnecessary if we only replace UNK tokens,
but it is needed for tokenization methods like BPE.

odashi (Owner) commented Jan 11, 2017

Using the original texts does not always give a correct BLEU either: for example, CharacterVocabulary does not care about word separators and can basically accept raw (untokenized) sentences. So we cannot always calculate the desired BLEU just by changing training configurations.

BTW, if we decide to add original texts, I think the members for this purpose should be integrated directly into the Sample structure (see the sketch after this list), because:

  • Adding new classes that behave mostly the same as the current implementation may confuse us and increase the management cost of the whole codebase.
  • Some future translation models may require the original surface of a word to calculate their features during training as well.
  • This would increase the samplers' memory usage, but we can easily add an option to samplers to control this behavior.
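A rough sketch of what this integration might look like (the member names here are assumptions, not the actual Sample layout):

```cpp
#include <string>
#include <vector>

struct Sample {
  std::vector<unsigned> source;  // source word IDs (as now)
  std::vector<unsigned> target;  // target word IDs (as now)
  // Hypothetical additions: original surfaces with their original
  // tokenization, filled only when the sampler is configured to keep
  // them, so the extra memory cost stays optional.
  std::vector<std::string> source_raw;
  std::vector<std::string> target_raw;
};
```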

MorinoseiMorizo (Contributor, Author) commented:

I understand your point, and now I think that approach would be better.

I'm going to modify the code accordingly.

odashi (Owner) commented Jan 22, 2017

I am now reworking some data structures in the new-structure branch to introduce raw tokens for evaluation.

MorinoseiMorizo (Contributor, Author) commented:

Thank you for the notice.

I just made a BPE vocabulary and it seems to be working fine.
But it relies on my TestSampler and TestCorpus, so I will send a PR after you have finished the new evaluation structure.
