Question about self-BLEU implementation #27

Open
weilinie opened this issue Nov 16, 2018 · 2 comments

Comments

@weilinie

As far as I know, the basic idea of self-BLEU is to compute a BLEU score for each sentence in the set of generated sentences, using that sentence as the hypothesis and all the others as references, and then to average the BLEU scores over all generated sentences.
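
The definition above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the Texygen code: `ngrams`, `sentence_bleu`, and `self_bleu` are hypothetical helper names, the BLEU here is unsmoothed (any n-gram order with no match gives a score of 0), and `max_n=2` is an arbitrary choice for the toy data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(references, hypothesis, max_n=2):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams (assumption)."""
    if not hypothesis or not references:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty against the closest reference length (ties: shorter).
    c = len(hypothesis)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c >= r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precision)


def self_bleu(samples, max_n=2):
    """Average BLEU of each sample against all the *other* samples."""
    scores = []
    for i, hypothesis in enumerate(samples):
        references = samples[:i] + samples[i + 1:]  # leave the hypothesis out
        scores.append(sentence_bleu(references, hypothesis, max_n))
    return sum(scores) / len(scores)


samples = [s.split() for s in (
    "the cat sat on the mat",
    "the cat sat on the mat",
    "a dog ran in the park",
)]
print(self_bleu(samples))
```

In this toy set the two duplicated sentences each score 1 against the rest, while the third shares no bigram with the others and scores 0, so the average lands between the two extremes.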

However, when looking into the self-BLEU implementation at https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py, I found an issue with evaluating self-BLEU during training: the reference and hypothesis come from the same “test data” (i.e. the set of generated sentences) only at the first evaluation. After that, the hypothesis keeps being updated but the reference remains unchanged (due to “is_first=False”), which means the hypothesis and reference no longer come from the same “test data”, and thus the scores obtained under this implementation are not self-BLEU scores.
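
To make the caching behaviour concrete, here is a stripped-down illustration of the pattern described above. It is not the Texygen code: the class names and the `get_reference` signature are hypothetical, and only the `reference` / `is_first` attributes mirror the names mentioned in this issue.

```python
class CachedReferenceMetric:
    """Hypothetical metric that takes its references only once."""

    def __init__(self):
        self.reference = None
        self.is_first = True

    def get_reference(self, current_samples):
        # The reference set is copied from the samples on the first call only.
        if self.is_first:
            self.reference = list(current_samples)
            self.is_first = False
        return self.reference


class FreshReferenceMetric:
    """Hypothetical variant that rebuilds the references on every call."""

    def get_reference(self, current_samples):
        return list(current_samples)


epoch_1 = [["a", "b"], ["c", "d"]]  # samples at the first evaluation
epoch_2 = [["e", "f"], ["g", "h"]]  # samples at a later evaluation

cached = CachedReferenceMetric()
cached.get_reference(epoch_1)
print(cached.get_reference(epoch_2) == epoch_1)  # references are stale

fresh = FreshReferenceMetric()
fresh.get_reference(epoch_1)
print(fresh.get_reference(epoch_2) == epoch_2)  # references track the samples
```

With the cached variant, every evaluation after the first one compares new hypotheses against the first batch of samples.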

To address this, I modified the implementation so that the hypothesis and reference always come from the same “test data” (by simply removing the variables "self.reference" and "self.is_first"), and found that the self-BLEU (2-5) scores are always 1 for every model I evaluated.
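
One possible explanation for the constant score of 1, assuming the modified code still passes the full sample set (including the hypothesis itself) as references: BLEU is exactly 1 whenever the hypothesis is a member of its own reference set. The notation below is introduced here for illustration: $h$ is the hypothesis, $R$ the reference set, $c_x(g)$ the count of n-gram $g$ in sentence $x$, $G_n(h)$ the set of n-grams of $h$, and $w_n$ the BLEU weights (summing to 1).

```latex
p_n \;=\; \frac{\sum_{g \in G_n(h)} \min\bigl(c_h(g),\ \max_{r \in R} c_r(g)\bigr)}
               {\sum_{g \in G_n(h)} c_h(g)}

% If h \in R, then \max_{r \in R} c_r(g) \ge c_h(g) for every g, so the
% clipping never reduces a count and p_n = 1 for every n with |h| \ge n.
% The closest reference length equals |h|, so the brevity penalty is 1.

\mathrm{BLEU}(h, R) \;=\; \mathrm{BP} \cdot \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr)
                    \;=\; 1 \cdot \exp(0) \;=\; 1
```

Under that assumption, the score says nothing about diversity; excluding the hypothesis from its own references (the "others as reference" part of the definition) removes this degenerate case.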

Please let me know whether my concern makes sense, or whether I have misunderstood the definition of self-BLEU.

@MichaelZhouwang

I also found this problem: when the test data and the reference data are the same, self-BLEU is always 1. However, many papers in this domain use it as a diversity metric, which is quite misleading. I think it only measures the extent to which samples generated with GAN training differ from MLE training samples, and a lower score is not necessarily better. What do you think about the Forward-Backward BLEU metric used in the paper "Toward Diverse Text Generation with Inverse Reinforcement Learning"?

@weilinie
Author

Fully agree! Thank you for following up. I haven't tried the Forward-Backward BLEU metric you mentioned, but since it is based on (self-)BLEU scores, I think the issues we observed also apply to it.
