Tracking unexpected predictions with the current model #80
hui xian shi cheng -> 辉县市成 vs 会显示成
This issue seems to be related to the fact that the "unknown penalty" is only applied to "unknown" words. If a word is "almost" unknown, the penalty won't be applied. We may need to interpolate the probability in the model.
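One way to read "interpolation" here (a minimal sketch with made-up numbers and a hypothetical scoring function, not libime's actual API): instead of an all-or-nothing penalty that only fires on the literal `<unk>` token, the penalty could be scaled by how close a word's unigram probability is to the unknown level, so "almost unknown" words are also demoted.

```python
# Hypothetical unigram log10 probabilities (values made up for illustration).
unigram_logp = {
    "会": -2.1,
    "显示": -3.0,
    "成": -2.5,
    "辉县市": -6.8,   # in the model, but barely - "almost unknown"
    "<unk>": -7.5,
}

UNKNOWN_PENALTY = -5.0   # extra log10 cost meant to demote unknown words
COMMON_LOGP = -2.0       # assumed "clearly known" probability level


def score_all_or_nothing(word):
    """The behaviour described above: the penalty only fires when the
    word is literally out of vocabulary."""
    if word in unigram_logp:
        return unigram_logp[word]
    return unigram_logp["<unk>"] + UNKNOWN_PENALTY


def score_interpolated(word):
    """Scale the penalty by how 'unknown-like' the word is: frequent words
    pay almost nothing, words near the <unk> level pay most of it."""
    logp = unigram_logp.get(word, unigram_logp["<unk>"])
    t = (COMMON_LOGP - logp) / (COMMON_LOGP - unigram_logp["<unk>"])
    t = min(max(t, 0.0), 1.0)
    return logp + t * UNKNOWN_PENALTY


for w in ["会", "辉县市", "不存在"]:
    print(w, score_all_or_nothing(w), score_interpolated(w))
```

With these toy numbers, 辉县市 keeps its unpenalized score (-6.8) under the all-or-nothing rule but drops to about -11.2 under the interpolated one, which is the kind of gradual demotion the comment seems to be asking for.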
wanxiao -> 玩笑; not sure where this 完小 is coming from
完小 refers to a school that has both the junior primary grades (grades 1-4) and the senior primary grades (grades 5-6), i.e. a "complete primary school" (完全小学), abbreviated 完小.
Grepping the model, the entries are:
$ grep 玩笑 lm_sc.arpa
It's actually interesting that 玩笑 has so many tri-grams while its uni-gram score is lower than 完小's. From the data, I guess the reason could be that 开玩笑 takes away the frequency from 玩笑.
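To illustrate that guess (a toy example, not the real training data): if the segmenter emits 开玩笑 as a single token, those occurrences never count toward the unigram frequency of 玩笑, so 玩笑's unigram probability can end up lower than that of a genuinely rarer word like 完小.

```python
from collections import Counter

# Toy word-segmented corpus (one sentence per line, words split by spaces).
# The real training data is of course much larger; this only shows how the
# segmentation choice moves counts around.
corpus = [
    "别 开玩笑 了",
    "他 喜欢 开玩笑",
    "这 只是 个 玩笑",
    "我们 村 有 一 所 完小",
]

counts = Counter(word for line in corpus for word in line.split())
print(counts["玩笑"], counts["开玩笑"], counts["完小"])  # -> 1 2 1
# The characters 玩笑 appear three times, but only one occurrence is counted
# as the word 玩笑; the other two were absorbed into the token 开玩笑, which
# deflates 玩笑's unigram probability.
```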
It seems that we should remove 开玩笑 (and probably some others as well).
After further investigation, the way libime uses the model probably has some mismatched expectations. For an IME, it makes more sense to assume the start of the input is not a sentence start but some unknown token. However, the model seems to assume you always start a sentence (i.e. with <s>). Only when order=1 will lmplz produce a unigram score based solely on word frequency. I will try to tune the model to use a different mechanism for the unigram score so we can get a better result for libime.
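A toy backoff example (made-up log10 values and a hypothetical helper, not libime's actual data structures) of why the <s> assumption matters for an IME: a word that happens to start many training sentences gets a strong <s>-conditioned bigram score, while everything else pays a backoff cost, even though the user's input buffer is usually not a sentence beginning.

```python
# Toy backoff bigram model with made-up log10 values, just to contrast the
# two ways of scoring the *first* word of the user's input.
unigram = {"玩笑": -4.5, "完小": -4.0, "会": -2.0}
bigram = {("<s>", "完小"): -2.5}   # pretend 完小 often starts training sentences

BACKOFF = -0.5  # simplified constant backoff weight


def score_first_word(word, assume_sentence_start):
    """Score the first word of the input buffer.

    assume_sentence_start=True  -> condition on <s> (what the LM expects)
    assume_sentence_start=False -> treat the context as unknown and fall
                                   back to the plain unigram score
    """
    if assume_sentence_start:
        if ("<s>", word) in bigram:
            return bigram[("<s>", word)]
        return BACKOFF + unigram[word]      # back off from <s>
    return unigram[word]                    # context-free start


for w in ["玩笑", "完小"]:
    print(w, score_first_word(w, True), score_first_word(w, False))
# With the <s> assumption, 完小 gets a big head start (-2.5 vs -5.0 for 玩笑),
# even though the user's input is rarely an actual sentence beginning.
```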