1. finish the GAN: the generator and discriminator updates [check]
2. change all LSTM to GRU; now that I know the ins and outs of LSTM (especially LSTMStateTuple) this item is obsolete [check]
3. stopping when decoding in GAN training [check]
4. the loss for GAN training [check]
5. separate the rollout parameters from those of the generator; add a gan mode; delete any mode inside the graph but keep those outside of it [check]
6. review the code carefully from beginning to end
7. make the coverage mechanism work
8. the mechanism to decide when to stop decoding while rolling out
9. add modes outside the graph and delete them inside it [check]
10. add evaluation to training so it can early-stop according to the two rules from chengyu
11. the dictionaries are different: batches of words should be transferred into characters, and since the dictionary is hence different they should be re-encoded [check]
1. keep in mind that, due to the OOVs, ids should be transferred to words and then back to ids
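a minimal sketch of that id -> word -> char-id re-encoding, assuming plain dict/list vocab structures (the names are illustrative, not the repo's actual API):

    def reencode_for_discriminator(id_batch, gen_id2word, article_oovs, dis_char2id, unk_id=0):
        # ids -> words: ids beyond the generator vocab index into this article's OOV list
        vocab_size = len(gen_id2word)
        words = [gen_id2word[i] if i < vocab_size else article_oovs[i - vocab_size]
                 for i in id_batch]
        # words -> characters -> discriminator char-vocab ids
        chars = [c for w in words for c in w]
        return [dis_char2id.get(c, unk_id) for c in chars]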
12. beam search in the pointer decoder outputs only one sample, so either the beam search should be changed or new batches should be created before calculating the rewards [check]
rewriting the beam search needs lots of work; as a workaround I will first just create a new batch
13. make the ending / stop-decoding ids and the other special ids all 0 if possible
for this repo in the decoder:
the input: start_decode_id + sequence
the target: sequence + end_decode_id
while decoding, the words after the stop sign are cut
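a tiny sketch of that convention (function names are illustrative, not the repo's actual code):

    def make_decoder_io(seq_ids, start_id, stop_id, max_len):
        # training: input = start_decode_id + sequence, target = sequence + end_decode_id
        inp = [start_id] + seq_ids[:max_len - 1]
        target = seq_ids[:max_len - 1] + [stop_id]
        return inp, target

    def truncate_at_stop(decoded_ids, stop_id):
        # at decode time, everything after the first stop sign is dropped
        if stop_id in decoded_ids:
            return decoded_ids[:decoded_ids.index(stop_id)]
        return decoded_ids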
15. batch_size and OOVs may cause issues while decoding [check]
16. compatibility between decode and gan in beam_search, or just support batch-sized beam search and delete the decode mode in batcher [check]
14. shuffle the generated samples before feeding them into the discriminator [check]
17. positive and negative samples: the strategy for training the GAN
18. graph compatibility among different modes in chenyu's code
19. process traditional Chinese and non-printable characters in the material; the corpus and the vocabulary should then be updated
20. add a vocabulary-building module to cntk
21. what should the input be at each attention decoding step: the previous word of the ground-truth input or the previous output? During training it is the former and while decoding it is the latter, but what about rollout?
22. choose a conference from the following:
https://mp.weixin.qq.com/s?__biz=MzIxMjAzNDY5Mg==&mid=2650791318&idx=1&sn=d26c946de53ebad482d1e0fbe5dc3b02&chksm=8f47487db830c16b438967f001247553084448f7ee00d1bf8cbfac5a8a0468a1c890d6c95e0a#rd
1. https://2017.icml.cc/
2. https://www.cicling.org/
23. evaluation and checkpoint
add evaluation to the generator and terminate training when necessary
24. rollout in the graph with attention and the copy mechanism? what should the rollout look like?
25. add enc_padding_mask to the attention [check]
26. beam search is too slow; true batch-based beam search should be implemented
27. remove punctuation from the corpus
28. TODO: keep the [unk] and such words in the vocab transform func [check]
29. two inputs but with different scalability
30. support decoding for the non-pointer-generator option in the graph
31. separate the discriminator graph and the generator graph but load them together in order to train them separately [check]
32. train my own vocabulary for the discriminator
already got a pretrained one; the char embedding is fed into the discriminator directly from the generator
33. TODO: rollout weights decay
34. the beam search is too slow, so a multithreaded version should be added
35. it seems that using two dictionaries is less efficient for using the reward
the embedding from the encoder is fed to the discriminator directly
36. what should the validation metric of the generator be during GAN training: cross entropy, ROUGE, etc.?
37. initialize the variables in the graph that cannot be loaded from the checkpoint
38. update the validation dataset as the quality of the generated samples improves
39. while processing the data, I should tokenize the article into sentences first and then segment them into words.
40. the vocabulary should be rebuilt because there were traditional Chinese characters in the corpus when it was built [check]
41. add the ROUGE reward for reinforcement learning
42. collect some keywords from the long content to form a dynamic vocabulary, but how to make use of those keywords remains a problem
43. the rollout can be moved out of the graph to make it less error-prone
44. use the vocab cumulative probability to choose the vocab_size [check]
https://stackoverflow.com/questions/15408371/cumulative-distribution-plots-python
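a small sketch of picking vocab_size from cumulative word-frequency coverage (the 95.5% target echoes the note under # tips; names are illustrative):

    from collections import Counter

    def choose_vocab_size(corpus_tokens, coverage=0.955):
        # sort words by frequency and take the smallest prefix whose cumulative
        # count reaches the target coverage (Zipf's law keeps this prefix small)
        counts = Counter(corpus_tokens)
        total = sum(counts.values())
        cum = 0
        for size, (_, c) in enumerate(counts.most_common(), start=1):
            cum += c
            if cum / total >= coverage:
                return size
        return len(counts)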
45. add the learning rate update [check]
46. add the checkpoint modification: a particular variable loader for the trained graph is created [check]
47. the content can be used to train a value iteration for the reward: if a word can be found in the content, the reward is higher
48. substitute the MCMC with the stochastic MCMC: https://arxiv.org/pdf/1706.05439.pdf
49. add dropout, or better maxout, to the network (Selective Encoding for Abstractive Sentence Summarization)
50. the unknown word can be left untransformed (not mapped to the [unk] token) and be passed to the next step
51. create perturbed samples by deleting some sentences
52. hierarchical attention copy mechanism
53. read-gain mechanism to encode each char in the content and find the span for the copy mechanism
54. the loss of the discriminator should be scaled by the ratio of the positive and negative samples
since the number of negative samples is twice that of the positive ones, the loss on the negatives should be halved
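a minimal sketch of that class-weighted loss, assuming a plain sigmoid discriminator (numpy stand-in, not the repo's TF graph):

    import numpy as np

    def weighted_disc_loss(logits, labels):
        # labels: 1 for real summaries, 0 for generated ones
        labels = np.asarray(labels, dtype=np.float32)
        probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float32)))
        n_pos, n_neg = labels.sum(), (1.0 - labels).sum()
        w_neg = n_pos / max(n_neg, 1.0)  # e.g. 0.5 when negatives are twice the positives
        weights = labels + (1.0 - labels) * w_neg
        ce = -(labels * np.log(probs + 1e-8) + (1.0 - labels) * np.log(1.0 - probs + 1e-8))
        return float(np.mean(weights * ce))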
55. multiple attentions, each for a different part of the content
56. submodularity with DNN, especially attention
57. add the sentence number to each word while embedding: http://arxiv.org/abs/1611.02344
58. mask out the scores of ids in the vocab distribution which can be copied from the encoder: http://arxiv.org/abs/1603.06393
59. equip the model to decide which sentence would be more important, position embedding: https://s3.amazonaws.com/fairseq/papers/convolutional-sequence-to-sequence-learning.pdf
60. dilated CNN for the encoder or decoder
61. position embedding [may not be effective]
62. multi-layer attention
63. dropout: embedding
64. rnn encoder -> cnn decoder [check]
cnn encoder -> rnn decoder
cnn encoder -> cnn decoder
65. attention is not of the same length
66. cnn in decoder
67. decorator practice: https://wiki.python.org/moin/PythonDecoratorLibrary
68. generate all at once, the way the conv encoder decodes, or block by block
69. unsupervised pre-training using a language model
70. Train discriminator using previous checkpoints, not only the latest one
71. TODO: for the CNN dataset, use bags of words separated by any punctuation as the encoder unit
72. reward: the rollout max at each step, with the normal decode as the baseline
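a sketch of one possible reading of that reward, assuming per-step discriminator scores for the rollouts and a scalar score for the normal decode (all names are assumptions):

    import numpy as np

    def step_rewards(rollout_scores, baseline_score):
        # rollout_scores: [num_rollouts, seq_len] discriminator scores for the
        # Monte-Carlo rollouts; baseline_score: score of the normal (greedy) decode
        rollout_scores = np.asarray(rollout_scores, dtype=np.float32)
        return rollout_scores.max(axis=0) - baseline_score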
# problems
1. cannot end
the mask is wrong; the mask should cover the stop token
2. generates repetitive sentences
3. the discriminator assigns high scores to very bad unseen samples
4. trash samples
1) only update one side
2) multiple D and G
3) matchmaking ranking
4) add noise
5. when should the GAN stop
6. test results after training on 770000 * 16 samples, with the evaluation loss converging to 4.462
1). the ROUGE-1 recall is 0.3671 and the precision is 0.4203, so the F-1 is 0.3919
2). the copy mechanism: 40.83% of the words that should be copied are copied or generated
7. the discriminator overrates those bad samples (repetitive ones etc. that the generator would generate in the early stage) which were unseen while training
8. the more the generator is trained with RL, the fewer words it generates
the reward mask rather than the sample mask should be used to mask the loss; if the sample mask is used, the more it samples the lower the reward
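a minimal numpy sketch of masking the policy-gradient loss with the reward mask (array names are illustrative):

    import numpy as np

    def masked_pg_loss(log_probs, rewards, reward_mask):
        # log_probs, rewards, reward_mask: [batch, seq_len]; normalising by the
        # reward mask keeps the loss scale independent of how many tokens are sampled
        log_probs = np.asarray(log_probs, dtype=np.float32)
        rewards = np.asarray(rewards, dtype=np.float32)
        mask = np.asarray(reward_mask, dtype=np.float32)
        per_seq = -(log_probs * rewards * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)
        return float(per_seq.mean())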
9. the dictionary of 100000 is too big; in the paper <Incorporating Copying Mechanism in Sequence-to-Sequence Learning> it is 10000. Shrinking it decreases the variance by decreasing the search space.
when the vocab size is 10k, the loss only decreases from 4.5 and 4.1 (160000 samples) to 3.82 (val) and 3.30 (train) after training on 4640000 samples, and it never decreases during the later 11920000 - 4640000 samples
when the vocab size is 50k, the starting losses are 4.22 (training) and 5.19 (val) (160000 samples)
10. the length problem
the lengths roughly follow a normal distribution
the ROUGE is accumulated, but how to account for the previous influence?
subtracting the previous word's ROUGE value is a good idea
11. the oov problem:
ensure that traditional Chinese characters have been converted into simplified ones
12. the word segmentation tool doesn't work well
change to character-based
train a word segmentation model using kcws
13. the beam search implementation I made is weaker than the vanilla one
14. I can borrow some ideas for conditioning the GAN from image captioning and image completion
15. bidirectional attention: https://github.com/allenai/bilm-tf
16. communicating agents: https://arxiv.org/pdf/1803.10357.pdf
# errors
1. a list of floats cannot be joined
2. two lists cannot be updated element-wise with +=, but it works if the right-hand side is a numpy array
3. a TensorArray should be assigned back to itself after writing
4. in run_decode_onestep the ran_id is not returned
# tips
immediate rewards should be given more importance: discount or impatience
because the summation is finite we just set the discount to 1
if the discount is set to 0.8 it works better in practice
https://pdfs.semanticscholar.org/4814/21dd6b320e9aa2325420cc27cdee22ee36cd.pdf
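a tiny sketch of computing the discounted returns described above (illustrative only):

    def discounted_returns(step_rewards, discount=1.0):
        # accumulate rewards back to front; discount = 1 is fine for a finite sum,
        # while ~0.8 reportedly works better in practice
        returns, running = [], 0.0
        for r in reversed(step_rewards):
            running = r + discount * running
            returns.append(running)
        return list(reversed(returns))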
vocabulary:
about 5000 words cover 95.5% of the tokens (Zipf's law)
length:
min, max = mean - 2 * STD, mean + 2 * STD
covers about 96% of the lengths
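a small sketch of that length heuristic (illustrative helper):

    import numpy as np

    def length_bounds(lengths):
        # mean +/- 2 * std covers roughly 96% of lengths when they are
        # approximately normally distributed
        lengths = np.asarray(lengths, dtype=np.float32)
        mean, std = float(lengths.mean()), float(lengths.std())
        return max(0, int(mean - 2 * std)), int(mean + 2 * std)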
reward is for what to do but not how to do it
unit testing
the copy mechanism
the maximum fraction of chars that can be copied ((in the reference and in the content) / len(reference)):
0.62213374002 (mean)
0.225145828146 (std)
performance ((in the content and in the hypothesis) / len(hypothesis)):
0.969032674681 (mean)
0.0894641280516 (std)
rate of correctly copied ((in the content, in the reference and in the hypothesis) / (in the content, in the reference)):
0.446353624443 (mean)
0.260164242876 (std)
rate of missing ((in the content, in the reference but not in the hypothesis) / (in the content, in the reference)):
0.548669849735 (mean)
0.261140982246 (std)
rate of mistakenly copied ((in the content, not in the reference, but in the hypothesis) / len(hypothesis)):
0.581372847028 (mean)
0.262387083345 (std)
# baseline
The average rouge_1 is 0.344
The average rouge_2 is 0.223
The average rouge_l is 0.313
# experiments