Hi, thanks for releasing this code. I'm enjoying playing with this.
Could you please explain what 'context' is in your Attention module, if possible? link.
I guess it plays a similar role to the previous hidden state that is reused in the Transformer-XL paper link (explained in Section 3.2), but the way you implemented it is a little different.
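For reference, my mental model of the Transformer-XL mechanism from Section 3.2 is roughly the sketch below: keys/values cached from the previous segment are prepended before attention so the current segment can attend over them. All names here are my own illustration, not taken from your code:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(q, k, v, mem_k=None, mem_v=None):
    """Single-head attention where keys/values cached from a previous
    segment (mem_k, mem_v) are prepended to the current segment's,
    roughly as in Transformer-XL Section 3.2 (names are hypothetical).
    Shapes: q (Lq, d), k/v (Lk, d), mem_k/mem_v (Lm, d)."""
    if mem_k is not None:
        # extend the attention context with the cached segment
        k = np.concatenate([mem_k, k], axis=0)
        v = np.concatenate([mem_v, v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 4
# segment 1 keys/values, later reused as memory
k1, v1 = rng.normal(size=(3, d)), rng.normal(size=(3, d))
# segment 2 attends over its own keys/values plus the cached ones
q2 = rng.normal(size=(2, d))
k2, v2 = rng.normal(size=(2, d)), rng.normal(size=(2, d))
out = attention_with_memory(q2, k2, v2, mem_k=k1, mem_v=v1)
print(out.shape)  # (2, 4)
```

Is 'context' playing this caching role, or is it something else?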