Tracking the unexpected prediction with current model #80

wengxt · 2024-10-30T16:31:03Z

No description provided.

wengxt · 2024-10-30T17:07:41Z

hui xian shi cheng -> 辉县市成 vs 会显示成

wengxt · 2024-11-16T20:17:20Z

hui xian shi cheng -> 辉县市成 vs 会显示成

This issue seems to be related to the "unknown penalty" applying only to truly "unknown" words. If a word is "almost" unknown, the penalty is not applied.

We may need an interpolation on the probability in the model.

wengxt · 2025-01-12T15:03:51Z

wanxiao -> 玩笑

Not sure where this 完小 is coming from.

oldherl · 2025-01-12T15:10:45Z

Not sure where this 完小 is coming from.

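完小 refers to a school that has both a junior primary section (grades 1 to 4) and a senior primary section (grades 5 and 6). Such a school is called a "complete primary school" (完全小学), abbreviated as 完小.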

wengxt · 2025-01-12T15:18:14Z

Grepping 完小 from the model, the entries are:

-4.9858837      完小    -0.2816716
-1.5378636      完小 的 0
-1.5111926      完小 。 0
-2.174242       完小 一 0
-2.0291164      完小 和 0
-1.2420088      完小 、 0
-1.9291966      完小 教师       0
-1.8168479      完小 校长       0
-1.974151       完小 任教       0
-3.451519       村 完小 -0.20186298
-3.0526175      中心 完小       -0.17031695
-2.510187       小学 完小       -1.287718
-2.8678312      村级 完小       0
-3.3122647      石鼓 完小       -1.257167
-1.6803577      完小 完小       -1.3554685
-0.51793367     完小 ， -1.0182986

$ grep 玩笑 lm_sc.arpa

-5.177468       玩笑    -0.142906
-5.6088424      玩笑话  -0.03664786
-5.2477465      开玩笑  -0.22451825
-5.523891       半开玩笑        -0.4944338
-5.5992584      开开玩笑        -0.0492024
-1.3832151      玩笑 的 -0.26375443
-1.3299513      开玩笑 的       -0.10372763
-0.7034921      半开玩笑 的     0
-2.4452958      玩笑 都 0
-1.524879       玩笑 。 -0.08134223
-1.3579892      玩笑话 。       0
-1.2041603      开玩笑 。       0
-0.942389       开开玩笑 。     0
-2.138802       玩笑 不 0
-2.2080994      玩笑 是 0
-2.5105116      玩笑 来 0
-2.134074       玩笑 也 0
-1.8868014      玩笑话 也       0
-2.036349       玩笑 了 -0.20378523
-1.9976928      开玩笑 了       0
-2.2731628      玩笑 一 0
-2.7238111      玩笑 对 0
-2.5074801      玩笑 吧 -0.31541517
-2.2248611      开玩笑 吧       0
-1.7971009      玩笑 说 -0.049946435
-0.77673805     开玩笑 说       -0.023381311
-2.697913       玩笑 你 0
-3.2035964      玩笑 半 -0.47779813
-0.9213799      半开玩笑 半     0
-2.3195803      玩笑 就 0
-1.9432257      玩笑话 就       0
-2.687656       玩笑 让 0
-2.5954108      玩笑 要 0
-2.1476874      玩笑 和 0
-2.5222363      玩笑 而 0
-2.3647504      玩笑 我 0
-1.9675562      玩笑 、 0
-2.2028408      玩笑 ！ -0.057637207
-1.8541294      开玩笑 ！       0
-2.2178242      玩笑 ？ 0
-2.334132       玩笑 时 0
-2.5619147      玩笑 吗 -0.7606487
-2.3966796      玩笑 而已       -0.20118168
-2.5500877      玩笑 呢 -0.22165687
-2.0654967      开玩笑 呢       0
-2.552788       玩笑 一样       0
-2.7278311      玩笑 还是       0
-2.1468208      玩笑 中 0
-2.694506       玩笑 却 0
-2.2814465      玩笑 之 0
-2.6881113      玩笑 称 0
-2.5064929      玩笑 啊 0
-2.064796       玩笑 着 0
-2.751057       玩笑 的话       0
-2.9366188      玩笑 问 0
-2.017762       玩笑 地 -0.4805562
-1.3870697      开玩笑 地       -0.29777834
-0.43836725     半开玩笑 地     -0.24370837
-2.6313734      玩笑话 讲       0
-3.0486548      玩笑 么 0
-2.9129837      玩笑 表示       0
-2.7134833      玩笑 啦 0
-2.8452988      玩笑 嘛 0
-1.9801606      玩笑 开 -0.1966895
-2.0811002      玩笑 般 -0.36068675
-2.108505       玩笑 道 0
-1.5480953      开玩笑 道       0
-1.8543925      开 玩笑 -0.46564165
-2.7238154      致命 玩笑       0
-1.4088688      开起 玩笑       0
-1.4121019      开开 玩笑       0
-2.0118399      愚人节 玩笑     0
-2.1944165      玩笑 似 -0.63856745
-2.315034       玩笑 归 -1.4576433
-2.7473521      玩笑 罢了       0
-2.3608334      玩笑 似的       0
-2.2871647      玩笑 开大       -1.514774
-3.6721497      句 玩笑话       -0.07122915
-3.089258       半 开玩笑       0
-3.6920066      经常 开玩笑     0
-1.664779       玩笑 ， -0.04606934
-1.5648757      玩笑话 ，       0
-1.4514949      开玩笑 ，       0
-0.21023889     玩笑 般 的
-0.6910439      开 玩笑 的
-0.10056991     玩笑 似 的
-1.5395846      开 玩笑 了
-0.013026016    玩笑 开大 了
-1.1283715      玩笑 地 对
-1.541788       玩笑 的 话
-0.81712407     玩笑 的 说
-0.6720341      开玩笑 的 说
-0.26964918     玩笑 地 说
-0.30194095     开玩笑 地 说
-0.3639918      半开玩笑 地 说
-0.8475342      开 玩笑 说
-0.43401927     玩笑 开 得
-1.418497       玩笑 说 我
-1.5184251      玩笑 的 时候
-1.4479315      玩笑 的 人
-1.5226777      玩笑 说 自己
-1.647881       玩笑 的 方式
-1.2175026      玩笑 地 问
-1.2054269      开 玩笑 地
-0.17564452     玩笑 半 认真
-1.4440264      天大 的 玩笑
-0.89116734     无伤大雅 的 玩笑
-0.87529993     开起 了 玩笑
-0.8270485      开 个 玩笑
-1.1960349      开 我 玩笑
-0.4602168      开 什么 玩笑
-1.703771       开 着 玩笑
-1.2940509      都 开 玩笑
-1.3973694      没 开 玩笑
-0.66365695     是 开 玩笑
-0.9437602      来 开 玩笑
-1.3044887      也 开 玩笑
-1.5991783      能 开 玩笑
-1.1034858      他们 开 玩笑
-1.0246112      你 开 玩笑
-0.40214372     半 开 玩笑
-0.9920661      他 开 玩笑
-1.3738894      就 开 玩笑
-1.0552404      会 开 玩笑
-0.7000808      还 开 玩笑
-1.6508374      要 开 玩笑
-1.4722492      没有 开 玩笑
-0.8004517      我 开 玩笑
-0.53760225     在 开 玩笑
-0.5780612      曾经 开 玩笑
-0.9134966      用 开 玩笑
-0.77597314     人 开 玩笑
-0.5363283      朋友 开 玩笑
-1.5329177      自己 开 玩笑
-0.6396945      大家 开 玩笑
-0.19049528     互相 开 玩笑
-0.8623454      以 开 玩笑
-1.0550332      我们 开 玩笑
-0.7123358      别人 开 玩笑
-0.46898404     喜欢 开 玩笑
-0.52174747     只是 开 玩笑
-0.5909766      老师 开 玩笑
-0.24223        同事 开 玩笑
-0.6160627      别 开 玩笑
-1.3740922      地 开 玩笑
-0.7436857      经常 开 玩笑
-1.5146505      常 开 玩笑
-0.0923033      网友 开 玩笑
-0.86692774     乱 开 玩笑
-0.1309643      拿来 开 玩笑
-0.15655036     爱 开 玩笑
-0.42306006     同学 开 玩笑
-0.89297485     她 开 玩笑
-0.49949118     曾 开 玩笑
-0.049291093    生命 开 玩笑
-0.015410228    玩笑 归 玩笑

It's actually interesting that 玩笑 has so many trigrams while its unigram score is lower than that of 完小.

From the data, I guess the reason could be that 开玩笑 takes frequency away from 玩笑.
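To put numbers on this, here is a minimal sketch that converts the unigram scores from the grep output above (ARPA stores log10 probabilities) back into probabilities. Adding the 玩笑 and 开玩笑 probabilities together is only a rough illustration of the "frequency taken away" guess, not how the model would actually be rebuilt after a segmentation change.

```python
# Unigram entries copied from the grep output above:
# word -> (log10 probability, backoff weight)
unigrams = {
    "完小": (-4.9858837, -0.2816716),
    "玩笑": (-5.177468, -0.142906),
    "开玩笑": (-5.2477465, -0.22451825),
}


def prob(word):
    """Convert the stored log10 unigram score back to a probability."""
    return 10 ** unigrams[word][0]


# How much more likely the model considers 完小 than 玩笑 as a unigram.
ratio = prob("完小") / prob("玩笑")

# Rough illustration only: probability mass if 开玩笑 were counted as 玩笑.
merged = prob("玩笑") + prob("开玩笑")

print(f"P(完小) / P(玩笑) = {ratio:.2f}")
print(f"P(玩笑) + P(开玩笑) > P(完小): {merged > prob('完小')}")
```

With these entries 完小 comes out about 1.5 times as likely as 玩笑, while the merged mass of 玩笑 and 开玩笑 would be above 完小.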

wengxt · 2025-01-12T15:24:48Z

It seems that we should remove 开玩笑 (and probably some other *玩笑* words too), manually update the segmentation, and update the model.

wengxt · 2025-01-13T21:36:08Z

After further investigation, it looks like the way libime uses the model does not match what the model expects.

For an IME, it makes more sense to assume that the start of input is not the start of a sentence, but some unknown token. However, the model seems to assume you always start with <s> and end with </s> to get a more meaningful result. It doesn't produce <unk> in any bigram or trigram. Thus, the unigram score produced by kenlm doesn't reflect word frequency, which causes some surprising results: freq(玩笑) ~ 500k and freq(完小) ~ 100k, yet the unigram score for 完小 is even larger.
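One plausible explanation, assuming the behaviour comes from the modified Kneser-Ney smoothing that lmplz uses: in a higher-order model, the unigram level is estimated from how many distinct words precede a token rather than from how often the token occurs. The toy sketch below (made-up corpus, hypothetical helper `unigram_stats`, not kenlm code) shows how a word that is frequent but almost always follows the same word can end up with a lower continuation count than a rarer word.

```python
from collections import Counter, defaultdict


def unigram_stats(sentences):
    """Return (raw frequency, number of distinct left contexts) per token."""
    freq = Counter()
    left_contexts = defaultdict(set)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            freq[cur] += 1
            left_contexts[cur].add(prev)
    continuation = {w: len(ctx) for w, ctx in left_contexts.items()}
    return freq, continuation


# Made-up toy corpus: 玩笑 is more frequent but always follows 开.
toy_corpus = [
    ["他", "开", "玩笑"],
    ["我", "开", "玩笑"],
    ["别", "开", "玩笑"],
    ["村", "完小"],
    ["中心", "完小"],
]

freq, continuation = unigram_stats(toy_corpus)
for word in ("玩笑", "完小"):
    print(word, "freq:", freq[word], "distinct left contexts:", continuation[word])
```

Here 玩笑 has the higher raw frequency but only one distinct left context, while 完小 has two, so a continuation-count-based unigram estimate would rank 完小 higher.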

Only when order=1 does lmplz produce unigram scores based solely on word frequency.

I will try to tune the model to use a different mechanism for the unigram score, so that we can get a better result for libime.
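As a rough sketch of what a frequency-based unigram score could look like (an assumption for illustration, not the code from the referenced commit): take log10 of each word's relative frequency. The counts below reuse the approximate figures mentioned above, and the `<other>` total is a made-up placeholder for the rest of the corpus.

```python
import math


def frequency_unigram_scores(counts):
    """Map each word to log10(count / total), i.e. a pure frequency-based score."""
    total = sum(counts.values())
    return {word: math.log10(count / total) for word, count in counts.items()}


# Approximate frequencies from the discussion; "<other>" is a hypothetical
# stand-in for all remaining tokens in the corpus.
counts = {"玩笑": 500_000, "完小": 100_000, "<other>": 999_400_000}

scores = frequency_unigram_scores(counts)
gap = scores["玩笑"] - scores["完小"]

print(f"score(玩笑) = {scores['玩笑']:.4f}")
print(f"score(完小) = {scores['完小']:.4f}")
print(f"gap = {gap:.4f}")
```

Whatever the total is, the gap between the two scores is log10(500k / 100k) = log10(5), so 玩笑 would rank above 完小 at the unigram level.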

wengxt added a commit that referenced this issue Jan 13, 2025

Update language model with unigram score based on frequency (#80)

c5b5a83
