TODOs #13

ghost · 2018-04-26T05:28:58Z

This is an umbrella issue where we can collectively tackled some problems and improve general open source reading comprehension quality.

Goal
The network is already there. We just need to add more features on top of the current model.

Implement full features stated in the original paper
Achieve EM/F1 performance stated in the original paper with a single model settings

Model

Increase the hidden units to 128. Report the results #15 reported performance increase when the hidden units increased from 96 to 128
Increase the number of heads to 8
Add dropouts in better locations to maximize regularization
Train "unknown" word embedding

Data

Implement paraphrasing by back-translation to increase the data size

Contribution to any of these issues is welcome and please comment on this issue and let us know if you want to work on these problems.

ghost · 2018-04-27T23:41:03Z

As of f0c79cc, I have changed the location of dropouts to "after" layer norm from "before" layer norm. It doesn't make sense to drop input channels to layer norm as they normalize across channel dimensions, this will cause distribution mismatch during inference time and training time. We shall see how this improves the model.

alphamupsiomega · 2018-04-29T05:46:02Z

To overcome your GPU memory constraints, what about just decreasing batch size?

On a 1080 Ti (11GB), I'm able to run 128 hidden units, 8 attention heads, 300 glove_dim, 300 char_dim with a batch size of 12. At least 16 and above, CUDA is out of memory. Accuracy seems comparable so far.

ghost · 2018-04-29T05:57:20Z

You have a valid point, and I would like to know how your experiment goes. I would also suggest trying group norm instead of layer norm as they report better performance with lower batch sizes.

alphamupsiomega · 2018-04-29T06:54:12Z

Good suggestion, Min. Since the paper compares against batch norm, have you found that layer norm generally outperforms batch norm lately? One could try batch norm also for comparison. Interestingly the 'break-even' point is about batch size 12 between batch norm and group norm for those paper's conditions. Layer norm is supposedly more robust to small mini batches compared to batch norm.

Also the conditions from the above comment run fine on a 1070 gpu.

Do you have a sense if model parallelization across multiple gpus is worth it for this type of model?

localminimum · 2018-05-05T07:07:43Z

Hi @mikalyoung , I haven't tried parallelisation across multiple GPUs so I wouldn't know what the best way to go about it is. I heard that data parallelism is easier to get working than model parallelisation. It seems that from #15 using bigger hidden size and bigger number of heads in attention improves the performance, so I would try fitting the bigger model with smaller batches into multiple GPUs.

JACKHAHA363 · 2019-02-12T21:20:51Z

Right now what is the status reproducing the paper's result?

ghost changed the title ~~Regularization~~ TODOs Apr 26, 2018

ghost mentioned this issue Apr 26, 2018

Train with M40 card but got OOM message #8

Closed

ghost added the help wanted label Apr 26, 2018

ghost mentioned this issue Apr 26, 2018

about embedding matrix structure #7

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TODOs #13

TODOs #13

ghost commented Apr 26, 2018 •

edited by localminimum

Loading

ghost commented Apr 27, 2018 •

edited by ghost

Loading

alphamupsiomega commented Apr 29, 2018 •

edited

Loading

ghost commented Apr 29, 2018 •

edited by ghost

Loading

alphamupsiomega commented Apr 29, 2018

localminimum commented May 5, 2018

JACKHAHA363 commented Feb 12, 2019

TODOs #13

TODOs #13

Comments

ghost commented Apr 26, 2018 • edited by localminimum Loading

ghost commented Apr 27, 2018 • edited by ghost Loading

alphamupsiomega commented Apr 29, 2018 • edited Loading

ghost commented Apr 29, 2018 • edited by ghost Loading

alphamupsiomega commented Apr 29, 2018

localminimum commented May 5, 2018

JACKHAHA363 commented Feb 12, 2019

ghost commented Apr 26, 2018 •

edited by localminimum

Loading

ghost commented Apr 27, 2018 •

edited by ghost

Loading

alphamupsiomega commented Apr 29, 2018 •

edited

Loading

ghost commented Apr 29, 2018 •

edited by ghost

Loading