
Incorrect BLEU score calculation #3

Open
MorinoseiMorizo opened this issue Jan 7, 2017 · 10 comments

MorinoseiMorizo (Contributor) commented Jan 7, 2017

Hi,

I've just found that the BLEU score calculation during training is incorrect.
The current implementation of the evaluateBLEU function in train.cc
simply compares hypothesis and target word IDs, which include unk,
and does not give these tokens any special treatment.

For example, if I set the source and target vocabulary sizes to 4,
I get a really high BLEU score because almost all of the words become unk.
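To make this concrete, here is a toy sketch (not the actual evaluateBLEU code; it only computes clipped unigram precision over IDs) of how UNK-vs-UNK matches inflate the score:

```cpp
#include <iostream>
#include <map>
#include <vector>

// Clipped unigram precision over word IDs, the first ingredient of BLEU.
double unigramPrecision(const std::vector<unsigned>& ref,
                        const std::vector<unsigned>& hyp) {
  std::map<unsigned, int> ref_counts;
  for (unsigned id : ref) ++ref_counts[id];
  int matched = 0;
  for (unsigned id : hyp) {
    if (ref_counts[id] > 0) { ++matched; --ref_counts[id]; }
  }
  return hyp.empty() ? 0.0 : static_cast<double>(matched) / hyp.size();
}

int main() {
  const unsigned UNK = 0;  // assumption: ID 0 is the unk token
  // Two unrelated sentences, but with a 4-word vocabulary almost every
  // token collapses to UNK, so the ID sequences look nearly identical.
  std::vector<unsigned> ref = {UNK, UNK, UNK, 2, UNK};
  std::vector<unsigned> hyp = {UNK, 3, UNK, UNK, UNK};
  std::cout << unigramPrecision(ref, hyp) << std::endl;  // prints 0.8
}
```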

MorinoseiMorizo (Contributor, Author) commented:

I'm now trying to deal with this problem.
I think I can send a pull request soon.

odashi (Owner) commented Jan 9, 2017

Yes, the current implementation calculates BLEU over UNK tokens (if you are using WordVocabulary).
There are a few reasons for this: calculating BLEU or other metrics properly requires additional modules (e.g. the tokenizers requested by specific evaluation campaigns), and that sometimes requires hacky implementations. During training we cannot know what "the word to evaluate" is without such additional tokenizers, so the BLEU calculator relies only on the word IDs generated by the Vocabulary class (e.g. if we are using CharacterVocabulary, the BLEU score is calculated over characters).
Of course this behavior should be mentioned in the documentation, but I don't think we can say the current implementation is "incorrect."

MorinoseiMorizo (Contributor, Author) commented:

Thank you for replying.

I see. Then what about adding a switch to evaluate the BLEU score with either (1) word IDs or (2) original sentences?

Method (1) is the same as the current behavior.
Method (2) compares the original tokenized reference sentences (which contain no unk) with hypothesis sentences that may contain unk.
For method (2), the hypothesis sentences should reproduce the original tokenization regardless of which Vocabulary class is used, so we need a function that does this. In the current implementation, convertToSentence can do it, so I used that function in the #4 fix.
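A rough sketch of what I mean (every name here except convertToSentence is hypothetical, and its signature is assumed; the scorers are only declared to show the interface):

```cpp
#include <string>
#include <vector>

// Hypothetical switch between the two evaluation methods.
enum class BleuEvalMode {
  WORD_IDS,       // (1) current behavior: compare vocabulary IDs directly
  ORIGINAL_TEXT,  // (2) compare detokenized hypothesis vs. original reference
};

// Stand-in declarations: a real implementation would reuse the existing
// ID-based BLEU for WORD_IDS and score surface strings for ORIGINAL_TEXT.
float bleuOverIds(const std::vector<unsigned>& ref,
                  const std::vector<unsigned>& hyp);
float bleuOverText(const std::string& ref, const std::string& hyp);
std::string convertToSentence(const std::vector<unsigned>& ids);  // existing

float evaluateBLEU(const std::vector<unsigned>& ref_ids,
                   const std::vector<unsigned>& hyp_ids,
                   const std::string& ref_text,  // original tokenization
                   BleuEvalMode mode) {
  if (mode == BleuEvalMode::WORD_IDS) {
    return bleuOverIds(ref_ids, hyp_ids);  // unk may match unk here
  }
  // Detokenize the hypothesis so unk can never match a real reference word.
  return bleuOverText(ref_text, convertToSentence(hyp_ids));
}
```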

If you agree with adding a switch like this, I can implement it.

Best regards,

odashi (Owner) commented Jan 10, 2017

After reading your PR, I think it might be better to simply replace the UNK ID in the ref or hyp with a value that is never used as any other word ID (e.g. 0xffffffff). This requires only a few changes in train.cc and has the same effect as making UNK never match any other word.
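A minimal sketch of this idea (the helper name is hypothetical; only the 0xffffffff sentinel comes from the suggestion above):

```cpp
#include <vector>

// Replace every UNK ID in a sequence with a sentinel that no real word ID
// ever takes, so a hypothesis UNK can never match a reference token.
std::vector<unsigned> maskUnk(const std::vector<unsigned>& ids,
                              unsigned unk_id) {
  const unsigned kNeverAWordId = 0xffffffffu;  // outside any vocabulary range
  std::vector<unsigned> masked = ids;
  for (unsigned& id : masked) {
    if (id == unk_id) id = kNeverAWordId;
  }
  return masked;
}
```

Masking only the reference side would already be enough: the two sides can then never agree on an n-gram that contains UNK.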

odashi (Owner) commented Jan 10, 2017

I think adding a switch for the UNK treatment is a good idea.

MorinoseiMorizo (Contributor, Author) commented Jan 11, 2017

At first, I thought the same thing as you: if we use a word-level vocabulary, that is enough.

But if we use BPE (it's still not implemented, though) and want to evaluate the score against the original tokenization,
we need to store the original sentences with their original tokenization for testing.

The current implementation of the Corpus class cannot store these.
And since the original sentences are not needed for training, we should make separate Corpus and Sampler classes for testing.

My PR may seem unnecessary if we only replace UNK tokens,
but it is needed for tokenization methods like BPE.

odashi (Owner) commented Jan 11, 2017

Using the original texts does not always give a correct BLEU either: for example, CharacterVocabulary does not care about word separators and can basically accept raw (untokenized) sentences. So we cannot always calculate the desired BLEU just by changing training configurations.

BTW, if we decide to add original texts, I think the members for this purpose should be integrated directly into the Sample structure (see the sketch after this list), because:

  • Adding new classes that behave mostly the same as the current implementation may confuse us and increase the management cost of the whole codebase.
  • Some future translation models may require the original surface of a word to calculate their features during training as well.
  • This would increase the samplers' memory usage, but we can easily add an option to samplers to control this behavior.
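A rough sketch of what this integration might look like (the member names here are assumptions, not the actual Sample layout):

```cpp
#include <string>
#include <vector>

struct Sample {
  std::vector<unsigned> source;  // source word IDs (as now)
  std::vector<unsigned> target;  // target word IDs (as now)
  // Hypothetical additions: original surfaces with their original
  // tokenization, filled only when the sampler is configured to keep
  // them, so the extra memory cost stays optional.
  std::vector<std::string> source_raw;
  std::vector<std::string> target_raw;
};
```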

MorinoseiMorizo (Contributor, Author) commented:

I understand your point, and now I think that approach would be better.

I'm going to modify the code accordingly.

odashi (Owner) commented Jan 22, 2017

I am now reworking some data structures in the new-structure branch to introduce raw tokens for evaluation.

MorinoseiMorizo (Contributor, Author) commented:

Thank you for the notice.

I just made a BPE vocabulary and it seems to be working fine.
But it relies on my TestSampler and TestCorpus, so I will send a PR after you have finished the new evaluation structure.
