Attention – Wennroy

From LSTM to Attention

Sentence Representation: an LSTM compresses an entire sentence into a single feature vector.

It’s not ideal to compress the meaning of a sentence with variable length into a single vector.

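A single vector is not enough to represent a long sentence. The natural alternative is to represent the sentence with multiple vectors, and attention is the mechanism that makes this possible.

Encoding: represent each word in the sentence as its own vector.

Decoding: use the attention weights to compute a linear combination (a weighted sum) of these vectors, then use the result to decide the next word.

In seq2seq models, attention from the target hidden vector (the query) to all source vectors (the keys) is often called target-to-source cross attention.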

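The Attention Model

Attention scores are computed with a dot product, and a softmax turns them into attention weights.

$a_l$ is the context vector: a weighted sum of all the original inputs, where the weights are the attention weights $a_{t,l}$.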
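The dot product, softmax and weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attend` and the toy vectors are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """Return (weights, context) for one query over all source positions."""
    scores = keys @ query       # one dot-product score per source position
    weights = softmax(scores)   # attention weights: non-negative, sum to 1
    context = weights @ values  # context vector: weighted sum of the values
    return weights, context

# Toy source sentence: 3 positions with 2-dimensional vectors (made up).
source = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([2.0, 0.0])

weights, context = attend(query, source, source)
print(weights)  # positions 0 and 2 score equally and receive the most weight
print(context)
```

Here the source vectors serve as both keys and values, which matches the description of the context vector as a weighted sum of the original inputs.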

Different Attention Score functions

Multi-layer Perceptron (Bahdanau et al. 2015)
$$a(\boldsymbol q, \boldsymbol k ) = \boldsymbol w_2^T\tanh(W_1[\boldsymbol{q};\boldsymbol{k}])$$
$\tanh$ is a non-linear function. This form is more flexible overall and tends to perform better on large datasets.
Bilinear (Luong et al. 2015)
$$a(\boldsymbol q, \boldsymbol k) = \boldsymbol q^T W \boldsymbol k$$
The matrix $W$ projects $\boldsymbol q$ into the space of $\boldsymbol k$, after which a dot product is taken.
Dot Product (Luong et al. 2015)
$$a(\boldsymbol q, \boldsymbol k) = \boldsymbol q^T \boldsymbol k$$
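Requires the two vectors to have the same size. The magnitude of the score grows as the dimension increases, and unstable score values can make training unstable. The scaled dot product addresses this by dividing by $\sqrt{|\boldsymbol k|}$, where $|\boldsymbol k|$ is the dimension of $\boldsymbol k$.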
Scaled Dot Product
$$a(\boldsymbol q, \boldsymbol k) = \frac{\boldsymbol q^T\boldsymbol k}{\sqrt{|\boldsymbol k|}}$$
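The four score functions above can be written side by side as follows. This is a sketch under stated assumptions: the sizes `d_q`, `d_k`, `d_h` are hypothetical, the parameters `W1`, `w2`, `W` are random stand-ins for learned weights, and `score_variances` is a helper invented here to show empirically why scaling helps for random unit-variance vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_k, d_h = 4, 6, 8  # hypothetical query, key and hidden sizes

# Randomly initialised parameters; in a real model these would be learned.
W1 = rng.normal(size=(d_h, d_q + d_k))
w2 = rng.normal(size=d_h)
W = rng.normal(size=(d_q, d_k))

def mlp_score(q, k):
    # w2^T tanh(W1 [q; k]); q and k may have different sizes.
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k):
    # q^T W k; W maps between the query space and the key space.
    return q @ W @ k

def dot_score(q, k):
    # q^T k; requires q and k to have the same size.
    return q @ k

def scaled_dot_score(q, k):
    # q^T k / sqrt(|k|), where |k| is the dimension of k.
    return (q @ k) / np.sqrt(k.shape[0])

def score_variances(d, n=5000):
    # Empirical variance of raw and scaled dot products for n pairs of
    # random vectors with independent, unit-variance entries.
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    raw = np.sum(q * k, axis=1)
    return raw.var(), (raw / np.sqrt(d)).var()

q, k = rng.normal(size=d_q), rng.normal(size=d_k)
print(mlp_score(q, k), bilinear_score(q, k))
for d in (16, 256):
    print(d, score_variances(d))
```

For vectors like these, the variance of the raw dot product is roughly $d$, while the scaled version stays near 1 regardless of dimension, which is the stability argument made above.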

Attention is all you need

Transformer

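Some characteristics:

Unlike an LSTM, which computes from left to right, the Transformer can compute all positions in parallel, which is more efficient.
It is a seq2seq model, but based entirely on the attention mechanism.
It uses only matrix operations, which makes training fast.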

Some important components:

Self-attention – allows parallel computing of all tokens
Multi-headed attention – allows querying multiple positions at each layer
Position encoding – adds position information to each token
Adding nonlinearities – combines features from a self-attention layer
Masked decoding – prevents attention lookups in the future tokens

Self-Attention

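Attention in which the queries and the keys come from the same sequence is called self-attention.

In general, the input $x_t$ is first mapped to $h_t$ by a feedforward layer or some non-linear function. Three copies of $h_t$ are then mapped to three vectors $k_t, q_t, v_t$, called the keys, queries and values.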
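The mapping from $h_t$ to queries, keys and values can be sketched as three matrix multiplications followed by attention over the same sequence. This is a minimal sketch: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders for learned parameters, `H` is assumed to already hold the vectors $h_t$ as rows, and the scaled dot product is assumed as the score function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    # Three projections of the same hidden states give q_t, k_t and v_t.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    # Every position scores every position of the same sequence at once.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)   # row l holds the weights used by query l
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 5, 8                     # hypothetical sequence length and model size
H = rng.normal(size=(T, d))     # stand-in for the hidden vectors h_t
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_attention(H, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Because the whole computation is a handful of matrix products, all positions are processed together rather than one step at a time.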

Multi-headed Attention

Repeat Attention many times.

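Repeating attention lets the model learn other kinds of relations, for example between a word and the words before and after it. The results are combined into a $d$-dimensional context vector: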

$$\boldsymbol a_l = [a_{l,I},\cdots,a_{l,1}]\in\mathbb{R}^d,\quad a_{l,i}\in\mathbb{R}^{\frac{d}{I}}$$

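where $I$ is the number of heads. Around 8 heads works well in large models.

Even with multi-head self-attention, however, the output is still a linear combination of the previous layer's vectors, which makes it hard to learn complex data. A feedforward layer with a non-linear function is therefore usually added after the multi-head self-attention, and its result is the output of the layer.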
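The split into $I$ heads of size $d/I$ and the concatenation back to $d$ dimensions can be sketched as below. The function name `multi_head_self_attention` and all sizes are assumptions for illustration, the scaled dot product is assumed as the score, and the output projection that implementations usually apply after concatenation is left out to keep the sketch short.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, num_heads):
    T, d = H.shape
    assert d % num_heads == 0, "d must be divisible by the number of heads"
    d_head = d // num_heads

    def split(M):
        # (T, d) -> (num_heads, T, d_head): each head gets a slice of columns.
        return M.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(H @ Wq), split(H @ Wk), split(H @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, T, T)
    heads = softmax(scores) @ V                          # (num_heads, T, d_head)
    # Concatenate the per-head context vectors back into d dimensions.
    return heads.transpose(1, 0, 2).reshape(T, d)

rng = np.random.default_rng(0)
T, d, I = 5, 8, 2               # hypothetical sizes; I is the number of heads
H = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = multi_head_self_attention(H, Wq, Wk, Wv, num_heads=I)
print(out.shape)  # each row is a d-dimensional context vector
```

Each head attends with its own slice of the projections, so different heads are free to pick up different relations between positions.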

With Feedforward (non-linear function applied) Layer

This completes the full Transformer block.
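A feedforward layer of the kind described above can be sketched as two linear maps with a non-linearity in between, applied to each position separately. The ReLU form follows Vaswani et al. (2017); the sizes, the random weights and the name `feed_forward` are assumptions for this example, and `A` is only a stand-in for the multi-head attention output.

```python
import numpy as np

def feed_forward(A, W1, b1, W2, b2):
    # Applied to every position (row of A) independently.
    hidden = np.maximum(0.0, A @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 32            # hypothetical sizes
A = rng.normal(size=(T, d))      # stand-in for the attention output
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

out = feed_forward(A, W1, b1, W2, b2)
print(out.shape)
```

Without the non-linearity the two matrices would collapse into a single linear map, so this step is what lets the block represent more than linear combinations of its inputs.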

Positional Encoding

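Because the attention mechanism by itself brings no position information into the model, we add a positional encoding.

Some non-Transformer models encode position with a naive positional encoding: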

$$\bar{x}_t = [x_t, t]^T$$

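This approach is not effective, because what we need is relative position information rather than absolute position information. A noun, for example, may appear at completely different positions in different sentences. We therefore consider frequency-based representations.

A neat design is the one shown in the figure above. The first few dimensions, such as $\sin(t/10000^{2\cdot 1/d})$, have a high frequency, so they tend to distinguish adjacent words, for example whether a word is at an odd or an even position (the "even-odd" indicator in the figure). The higher dimensions, such as $\sin(t/10000^{2\cdot\frac{d}{2}/d})$, have a low frequency and a long period, so their sign alone can tell whether a word lies in the first or the second half of the sentence. In the first figure, the dimensions that separate the first half of the sentence from the second half are the ones in the middle.

Finally, we concatenate the positional encoding $p_t$ with the original input vector to obtain the new input vector: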

$$\tilde{x}_t = [x_t,p_t]^T$$
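The frequency-based encoding and the concatenation step can be sketched as below. This assumes the sine/cosine layout of Vaswani et al. (2017), with the frequency index starting at 0 and an even encoding size; the name `sinusoidal_encoding` and all sizes are made up for the example.

```python
import numpy as np

def sinusoidal_encoding(max_len, d):
    # Assumes d is even: dimension pairs (2i, 2i+1) share one frequency.
    t = np.arange(max_len)[:, None]            # positions, shape (max_len, 1)
    i = np.arange(d // 2)[None, :]             # frequency index, shape (1, d/2)
    angles = t / np.power(10000.0, 2 * i / d)  # low i: fast, high i: slow
    P = np.zeros((max_len, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

rng = np.random.default_rng(0)
max_len, d_p = 50, 16            # hypothetical maximum length and encoding size
T, d_x = 10, 8                   # hypothetical sentence length and input size

P = sinusoidal_encoding(max_len, d_p)
X = rng.normal(size=(T, d_x))    # stand-in for the input vectors x_t
X_tilde = np.concatenate([X, P[:T]], axis=1)   # rows are [x_t, p_t]
print(X_tilde.shape)
```

The table `P` is built once for `max_len` positions and sliced per sentence.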

Downside: we usually have to fix a maximum sequence length in advance, and we cannot produce position information beyond that length.

Masked attention for target sentence

Unlike BERT (bidirectional).

Self-attention is only allowed to attend to previous tokens, not to future ones. A simple way to achieve this is to set

$$e_{l,t} =\left\{ \begin{array}{ll}q_l\cdot k_t &\text{if }l\geq t\\-\infty &\text{Otherwise}\end{array}\right.$$

In practice, this is usually done directly inside the softmax by replacing $\exp(e_{l,t})$ with 0 whenever $l<t$, that is, by multiplying the attention matrix by a 0-1 masking matrix.
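Both formulations of the mask can be sketched and compared directly. The function names and the random inputs are assumptions for illustration, and the score is the unscaled dot product $q_l\cdot k_t$ used in the formula above.

```python
import numpy as np

def causal_weights_neg_inf(Q, K):
    T = Q.shape[0]
    scores = Q @ K.T                                    # e[l, t] = q_l . k_t
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where t > l
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                  # exp(-inf) is 0
    return e / e.sum(axis=-1, keepdims=True)

def causal_weights_zero_one(Q, K):
    T = Q.shape[0]
    scores = Q @ K.T
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    e = e * np.tril(np.ones((T, T)))                    # 0-1 masking matrix
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8                      # hypothetical sizes
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))

A = causal_weights_neg_inf(Q, K)
B = causal_weights_zero_one(Q, K)
print(np.round(A, 3))  # entries above the diagonal are exactly 0
```

The diagonal is always allowed, so every row keeps at least one finite score and the normalisation never divides by zero.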

Attention Tricks

Self Attention: Each layer combines words with others
Multi-headed Attention: 8 attention heads learned independently
Normalized Dot-product Attention: Remove bias in dot product when using large networks
Positional Encodings: Make sure that even if we don’t have RNN, can still distinguish positions

Training Tricks

Layer Normalization: Help ensure that layers remain in a reasonable range
Specialized Training Schedule: Adjust default learning rate of the Adam optimizer
Label Smoothing: Insert some uncertainty into the training process by softening the ground-truth labels. Better for generalization.
Masking for Efficient Training
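Three of the tricks listed above are short enough to sketch directly. The function names and default values are assumptions for illustration. The label smoothing shown is one common formulation (mix the one-hot target with a uniform distribution), and the learning-rate schedule is the warmup formula from Vaswani et al. (2017).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise each vector to zero mean and unit variance, then rescale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def smooth_labels(targets, num_classes, eps=0.1):
    # Keep 1 - eps on the true class and spread eps uniformly over all classes.
    one_hot = np.eye(num_classes)[targets]
    return (1.0 - eps) * one_hot + eps / num_classes

def warmup_rate(step, d_model=512, warmup=4000):
    # Linear increase for `warmup` steps, then inverse-square-root decay.
    # `step` starts at 1.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))
normed = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
labels = smooth_labels(np.array([2, 0]), num_classes=4)
print(labels)
print(warmup_rate(100), warmup_rate(4000), warmup_rate(40000))
```

With `eps=0.1` and 4 classes, the true class keeps most of the probability mass while every other class receives a small non-zero target.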

Code Walk: https://nlp.seas.harvard.edu/2018/04/03/attention.html

Some Drawbacks

Slow to decode
Don’t necessarily outperform RNNs
Hard to train on small data

Some models better than Attention

Hard Attention. (Xu et al. 2015) (Lei et al. 2016)
Instead of a soft interpolation, make a zero-one decision about where to attend. Requires methods such as reinforcement learning.
Monotonic Attention.
Bidirectional Training.
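The difference between soft interpolation and the zero-one decision of hard attention, mentioned in the list above, can be illustrated with a toy example. The weights and values are made up, and sampling one position from the attention distribution is only one possible way to make the decision.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
weights = np.array([0.7, 0.2, 0.1])      # made-up soft attention weights

# Soft attention: interpolate between all value vectors.
soft_context = weights @ values

# Hard attention: pick exactly one position (here by sampling).
idx = rng.choice(len(weights), p=weights)
hard_weights = np.eye(len(weights))[idx]  # one-hot, a zero-one decision
hard_context = hard_weights @ values

print(soft_context)
print(idx, hard_context)
```

The sampling step is not differentiable, which is consistent with the note that hard attention needs methods such as reinforcement learning.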

References

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances in neural information processing systems 30 (2017).