accuracy = 0 and the decoded result contains only one character #1

Open

rolai opened this issue Apr 24, 2017 · 25 comments
@rolai

rolai commented Apr 24, 2017

Many thanks for sharing this project. I ran a test, but the training accuracy stays at 0 and decoding produces only a single character. Where might the problem be, and how should I adjust things?
seq 0: origin: [52, 23, 35, 62, 24, 33] decoded:[26]
seq 1: origin: [54, 49, 2, 40, 26, 38] decoded:[26]
seq 2: origin: [62, 48, 10, 42, 12] decoded:[26]
seq 3: origin: [53, 54, 7, 36, 45] decoded:[26]
seq 4: origin: [35, 43, 45, 7] decoded:[26]
seq 5: origin: [44, 56, 50, 2] decoded:[26]
seq 6: origin: [53, 35, 57, 7] decoded:[26]
seq 7: origin: [58, 31, 37, 8, 43] decoded:[26]
seq 8: origin: [10, 30, 53, 38, 20, 12] decoded:[26]
seq 9: origin: [45, 45, 39, 27, 61] decoded:[26]
4-23 22:24:42 Epoch 2/10000, accuracy = 0.000,train_cost = 22.159, lastbatch_err = 0.987, time = 676.276

@ilovin
Owner

ilovin commented Apr 24, 2017

The first 4 or 5 epochs look like this; around epoch 7 there is a clear change you can already feel, and after 10-odd epochs some predictions are correct. If it does not converge, try tuning the lr (e.g. set it to 1e-3 and you will see results quickly).

@rolai
Author

rolai commented Apr 25, 2017

@ilovin You are right: after a dozen or so epochs the accuracy started to climb slowly.
One more question: each epoch takes a very long time. Is LSTM training inherently slow, or is something in my setup preventing GPU training? PS: training a CNN model on the same machine is fast.

@ilovin
Owner

ilovin commented Apr 25, 2017

The time depends on the number of steps, so you can reduce the steps; also, CTC itself is expensive. Turn on the run_metadata option to see where the bottleneck is.
nvidia-smi shows the GPU is being used on my machine. If you find a way to speed things up, please leave a comment.
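
A minimal, self-contained TF 1.x sketch of the run_metadata profiling mentioned above (the tiny graph is a stand-in for the repo's real LSTM+CTC network):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Toy graph standing in for the real network.
    x = tf.random_normal([64, 100, 256])
    y = tf.reduce_sum(tf.layers.dense(x, 128))

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(y, options=run_options, run_metadata=run_metadata)

    # Dump a Chrome trace; load timeline.json at chrome://tracing to see
    # which ops (e.g. the CTC loss in the real model) dominate each step.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())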

@ilovin ilovin closed this as completed Apr 27, 2017
@catmonkeylee

I am using TensorFlow 1.1. After 173 epochs it still does not converge, and I have also tried adjusting learning_rate. Could the TF version I am using be the problem?

10:26:47 Epoch 173/10000, accuracy = 0.000,train_cost = 21.926, lastbatch_err = 0.983, time = 37.416

@ilovin
Owner

ilovin commented May 3, 2017

I see a single epoch only took you 37.416 s. How much data are you training on?

@catmonkeylee

I generated a rather small training set; could that be the cause? I will regenerate it with the original settings and try again. Thanks for the pointer!

@catmonkeylee

@ilovin Thanks, the training set was indeed too small. It converges now.

@lmolhw5252

The cost is already very small now, but the accuracy still will not go up. Do you know why this happens, or have you run into it yourself?
acc is currently 0.591 and cost is 0.045.

@rolai
Author

rolai commented May 23, 2017

The acc here is whole-sequence accuracy; per-character accuracy is quite a bit higher.
Also, try bi-LSTM and dropout; both improve accuracy. You can also extract features with a CNN first and feed those into the LSTM, which works much better.
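
A minimal TF 1.x sketch of the bi-LSTM + dropout suggestion (hidden size and input shapes are illustrative, not the repo's actual configuration):

    import tensorflow as tf

    num_hidden = 128                          # illustrative hidden size
    keep_prob = tf.placeholder(tf.float32)    # feed e.g. 0.5 to train, 1.0 to eval
    inputs = tf.placeholder(tf.float32, [None, None, 64])  # [batch, time, feats]
    seq_len = tf.placeholder(tf.int32, [None])

    def make_cell():
        cell = tf.contrib.rnn.LSTMCell(num_hidden)
        return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

    (fw_out, bw_out), _ = tf.nn.bidirectional_dynamic_rnn(
        make_cell(), make_cell(), inputs,
        sequence_length=seq_len, dtype=tf.float32)

    # Concatenate both directions before the final projection / CTC loss.
    outputs = tf.concat([fw_out, bw_out], axis=2)  # [batch, time, 2*num_hidden]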

@junedgar

junedgar commented Jun 12, 2017

With 128,000 training samples and 10,000 test samples, using the default parameters, it still had not converged after 300 epochs and acc stayed at 0.224:
Epoch 300/10000, accuracy = 0.224,avg_train_cost = 5.262, lastbatch_err = 0.295, time = 583.210,lr=0.00000000
After raising the learning rate to 1e-3 as @ilovin suggested:
Epoch 45/10000, accuracy = 0.036,avg_train_cost = 5.974, lastbatch_err = 0.320, time = 268.441,lr=0.00000002
There was no obvious change after 10 epochs. I also tried lr = 1e-3 with decay_rate = 0.9; after dozens of epochs there was still no obvious change.

@junedgar

The convergence problem is solved. warpctc would not bind to TensorFlow 1.1, so I used the standard CTC instead, and found that its learning rate was decaying too fast; after changing that, it works.
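
For reference, a minimal sketch of that kind of fix, assuming the script builds its schedule with tf.train.exponential_decay (the decay_steps value here is illustrative, not the repo's default):

    import tensorflow as tf

    global_step = tf.Variable(0, trainable=False, name='global_step')

    # If decay_steps is small relative to the batches per epoch, lr underflows
    # to 0.00000000 within a few hundred epochs, as in the log above. Decaying
    # less often (and/or with a gentler decay_rate) keeps training alive.
    learning_rate = tf.train.exponential_decay(
        learning_rate=1e-3,   # the initial lr suggested earlier in the thread
        global_step=global_step,
        decay_steps=10000,    # illustrative: raise this if lr hits 0 too early
        decay_rate=0.9,
        staircase=True)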

@ilovin ilovin reopened this Jun 20, 2017
@ilovin ilovin changed the title from “A question: training accuracy stays at 0 and decoding yields only one character” to “accuracy = 0 and the decoded result contains only one character” Jun 20, 2017
@Embraceee

Embraceee commented Jun 29, 2017

Could someone explain why this program does not speed up when run on the server, even though other programs run there normally? Also, I have already adjusted both learning_rate and decay_rate, yet acc is still 0.
seq 0: origin: [30, 15, 29, 30, 46, 16, 40, 24, 7, 34] decoded:[30, 28, 11, 19, 19, 24, 46, 16, 16, 40, 24, 7, 34]
seq 1: origin: [30, 15, 29, 30, 53, 40, 61, 56, 32] decoded:[30, 28, 11, 19, 19, 24, 53, 40, 61, 56, 32]
seq 2: origin: [30, 15, 29, 30, 6, 2, 22, 1] decoded:[30, 28, 11, 19, 24, 24, 6, 2, 22, 1]
seq 3: origin: [30, 15, 29, 30, 54, 38, 16, 49, 5] decoded:[30, 28, 11, 19, 19, 24, 54, 38, 16, 49, 5]
seq 4: origin: [30, 15, 29, 30, 53, 54, 43, 24] decoded:[30, 28, 11, 19, 19, 24, 53, 54, 43, 43, 24]
seq 5: origin: [30, 15, 29, 30, 21, 35, 10, 7, 62, 27] decoded:[30, 28, 11, 19, 19, 24, 21, 35, 10, 7, 62, 27]
seq 6: origin: [30, 15, 29, 30, 27, 18, 15, 3, 53, 7] decoded:[30, 28, 11, 19, 19, 24, 27, 18, 15, 3, 53, 7]
seq 7: origin: [30, 15, 29, 30, 5, 1, 1, 7, 47, 1] decoded:[30, 28, 11, 19, 19, 24, 5, 25, 1, 7, 47, 1]
seq 8: origin: [30, 15, 29, 30, 21, 38, 11, 43, 53, 43] decoded:[30, 28, 11, 19, 19, 24, 21, 38, 11, 43, 53, 43]
seq 9: origin: [30, 15, 29, 30, 53, 40, 31, 53, 42, 34] decoded:[30, 28, 11, 19, 19, 24, 53, 40, 31, 53, 42, 42, 34]
6/29 18:25:51 Epoch 35/10000, accuracy = 0.000,avg_train_cost = 1.234, lastbatch_err = 0.701, time = 339.922,lr=0.00008863
batch 1099 : time 0.337705850601
batch 1199 : time 0.357650995255
batch 1299 : time 0.329480886459
batch 1399 : time 0.365411996841
batch 1499 : time 0.341113090515
batch 1599 : time 0.330847024918
batch 1699 : time 0.372397899628
batch 1799 : time 0.332052946091
batch 1899 : time 0.350213050842
batch 1999 : time 0.340576887131
seq 0: origin: [30, 15, 29, 30, 46, 16, 40, 24, 7, 34] decoded:[30, 28, 11, 19, 19, 24, 46, 16, 16, 40, 24, 7, 34, 34]
seq 1: origin: [30, 15, 29, 30, 53, 40, 61, 56, 32] decoded:[30, 28, 11, 19, 19, 24, 53, 40, 61, 56, 32]
seq 2: origin: [30, 15, 29, 30, 6, 2, 22, 1] decoded:[30, 28, 11, 19, 19, 24, 6, 2, 22, 22, 25]
seq 3: origin: [30, 15, 29, 30, 54, 38, 16, 49, 5] decoded:[30, 28, 11, 19, 19, 24, 54, 38, 16, 49, 5]
seq 4: origin: [30, 15, 29, 30, 53, 54, 43, 24] decoded:[30, 28, 11, 19, 19, 24, 53, 54, 43, 24]
seq 5: origin: [30, 15, 29, 30, 21, 35, 10, 7, 62, 27] decoded:[30, 28, 11, 19, 19, 24, 21, 35, 10, 7, 62, 27]
seq 6: origin: [30, 15, 29, 30, 27, 18, 15, 3, 53, 7] decoded:[30, 28, 11, 19, 19, 24, 27, 18, 15, 3, 53, 7]
seq 7: origin: [30, 15, 29, 30, 5, 1, 1, 7, 47, 1] decoded:[30, 28, 11, 19, 19, 24, 5, 1, 1, 7, 7, 47, 47, 1]
seq 8: origin: [30, 15, 29, 30, 21, 38, 11, 43, 53, 43] decoded:[30, 28, 11, 19, 19, 24, 21, 38, 11, 43, 53, 43]
seq 9: origin: [30, 15, 29, 30, 53, 40, 31, 53, 42, 34] decoded:[30, 28, 11, 19, 19, 24, 53, 40, 31, 53, 53, 42, 34]
6/29 18:31:32 Epoch 35/10000, accuracy = 0.000,avg_train_cost = 1.205, lastbatch_err = 0.685, time = 681.270,lr=0.00008863

@ilovin
Owner

ilovin commented Jun 30, 2017

---update---
I found it is related to the choice of optimizer: with Adam the model would not converge, and after switching back to RMSProp it behaved normally again, with correct predictions within a few epochs.

I recently regenerated some images and found acc was 0 as well; the network would not converge...
It still works fine for recognizing ID-card numbers, so I am not sure what the problem is.
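
For anyone who wants to try the same swap, a minimal sketch with a toy loss standing in for the repo's CTC cost:

    import tensorflow as tf

    w = tf.Variable(1.0)
    cost = tf.square(w - 3.0)   # toy stand-in for the CTC loss in this repo
    global_step = tf.Variable(0, trainable=False)

    # The change described above: RMSProp converged here where Adam stalled.
    optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-3)
    # optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)  # did not converge
    train_op = optimizer.minimize(cost, global_step=global_step)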

@hsmyy

hsmyy commented Jul 7, 2017

I gave it a try: with the images shrunk to 45 * 120 (originally 60 * 160), initial_lr=1e-3, decay_rate=0.9, decay_steps=1000, the final convergence was:

7/7 8:41:54 Epoch 89/10000, accuracy = 0.864,avg_train_cost = 0.354, lastbatch_err = 0.034, time = 144.714,lr=0.00000000

There are many points in the model that could still be improved, but it is already quite good as it stands. Thanks!
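
For anyone reproducing this, the reported settings written out as config values (the variable names are illustrative; check the repo's config for the real ones):

    # Settings reported above; names are illustrative, not the repo's own.
    image_height = 45            # down from the original 60
    image_width = 120            # down from the original 160
    initial_learning_rate = 1e-3
    decay_rate = 0.9
    decay_steps = 1000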

@ilovin
Owner

ilovin commented Jul 19, 2017

@hsmyy Criticism is welcome; feel free to push your modified version.

@indra215

@hsmyy From which epoch did you start seeing an accuracy greater than 0? I ran the same code with the same parameters as yours, but it is still at 0 accuracy after 34 epochs.

@gnnbest

gnnbest commented Sep 11, 2017

@rolai May I ask whether your code uses the GPU now? I ran into the same problem: the program uses only the CPU, not the GPU. It converges normally but is rather slow.

@rolai
Author

rolai commented Sep 11, 2017

My training uses the GPU now; you can also use warpctc to cut the time per batch.
Also, check whether the TensorFlow you installed is the GPU build.
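
A quick way to verify that, using standard TF 1.x calls (nothing repo-specific):

    import tensorflow as tf
    from tensorflow.python.client import device_lib

    print(tf.test.is_gpu_available())   # True only if a CUDA device is usable
    print([d.name for d in device_lib.list_local_devices()])
    # A GPU build lists something like '/gpu:0' (or '/device:GPU:0')
    # alongside '/cpu:0'; a CPU-only build lists only the CPU.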

@gnnbest

gnnbest commented Sep 11, 2017

@rolai I installed the GPU version of TensorFlow (a simple test program I wrote can use the GPU). I am using the warpctc from the master branch, but by default it does not seem to use the GPU. Do I need to modify the code or configure something extra? And is it much faster once the GPU is enabled?

@dotsonliu

@rolai Can this be trained on Chinese text?

@hsmyy

hsmyy commented Oct 11, 2017

@indra215 It takes some time, maybe an hour or so; once the model finds a good valley, the cost drops and acc improves dramatically.

@PsyDog5hao

@junedgar Could I ask exactly what values you set, and at which epoch it started to converge? I have also run into the non-convergence problem.

@onfdtz

onfdtz commented Nov 20, 2017

@rolai Hello, I have hit the same problem as you: training accuracy stays at 0, and decoding yields only a single, identical character. How did you solve it?
batch 1599 : time 0.5304007530212402
batch 1699 : time 0.5304009914398193
batch 1799 : time 0.49920082092285156
batch 1899 : time 0.49920082092285156
batch 1999 : time 0.49920105934143066
seq 0: origin: [59, 37, 2, 27] decoded:[42]
seq 1: origin: [6, 53, 43, 54, 26] decoded:[42]
seq 2: origin: [5, 12, 37, 27, 3] decoded:[42]
seq 3: origin: [26, 51, 10, 42, 17, 21] decoded:[42]
seq 4: origin: [41, 60, 49, 59, 24] decoded:[42]
seq 5: origin: [53, 26, 1, 7] decoded:[42]
seq 6: origin: [34, 16, 20, 6] decoded:[42]
seq 7: origin: [33, 53, 2, 33, 37, 20] decoded:[42]
seq 8: origin: [12, 21, 54, 46, 62, 17] decoded:[42]
seq 9: origin: [44, 4, 16, 40] decoded:[42]
11/20 16:35:26 Epoch 3/10000, accuracy = 0.000,avg_train_cost = 21.998, lastbatch_err = 0.983, time = 1060.739,lr=0.00053144

@rolai
Author

rolai commented Dec 6, 2017

@dotsonliu Yes, you can train on Chinese; the dictionary is just larger, and the model and training procedure stay the same.

@onfdtz Train for more epochs; the model converges slowly at first, so wait patiently for a dozen or so epochs before results appear. Also, I recommend Baidu warpctc, which speeds up training enormously.
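
For reference, a sketch of what the swap to Baidu's warp-ctc binding looks like, assuming warpctc_tensorflow is built against your TF version (this is the binding that failed on TF 1.1 earlier in this thread); logits, flat_labels, label_lengths, and seq_len stand for the tensors the script already has:

    import tensorflow as tf
    import warpctc_tensorflow

    # logits:        [max_time, batch, num_classes], the layout tf.nn.ctc_loss uses
    # flat_labels:   int32 labels for the whole batch, concatenated into one vector
    # label_lengths: int32 label length per example
    # seq_len:       int32 number of time steps per example
    # Note: warp-ctc treats label 0 as the blank by default, whereas
    # tf.nn.ctc_loss reserves the last class index for the blank.
    costs = warpctc_tensorflow.ctc(activations=logits,
                                   flat_labels=flat_labels,
                                   label_lengths=label_lengths,
                                   input_lengths=seq_len,
                                   blank_label=0)
    cost = tf.reduce_mean(costs)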

@boris-lb

boris-lb commented Sep 1, 2018

@hsmyy I also resized the images as you described, and it quickly converged to 99% accuracy, but testing fails with FailedPreconditionError (see above for traceback): sequence_length(0) <= 29. Have you run into this, and how did you solve it? Thanks.
