RNN Attention and Transformer
Some notes on Attention and Transformer
Why Attention? RNNs vs RNN + Attention
Before we talk about Transformers, let’s list some limitations of RNNs:
- The hidden states fail to model long-range dependencies
- They suffer from vanishing and exploding gradients
- Training requires a large number of steps
- The recurrence prevents parallel computation
Compared with RNNs, Transformer networks have the following attributes:
- They can model long-range dependencies
- There is no gradient vanishing or explosion, since the number of layers that gradients pass through is much smaller
- Attention can go back and look at particular inputs to decide the outputs
- Every step of generating the outputs serves as a training sample
Example of RNN + Attention
In this Seq2Seq with Attention example [1], the attention mechanism enables the decoder to focus on the word “étudiant” (“student” in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.
- The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state.
- The RNN processes its inputs, producing an output and a new hidden state vector ($h_4$). The output is discarded.
- Attention step: we use the encoder hidden states and the $h_4$ vector to calculate a context vector ($C_4$) for this time step.
- We concatenate $h_4$ and $C_4$ into one vector.
- We pass this vector through a feedforward neural network (one trained jointly with the model).
- The output of the feedforward neural network indicates the output word of this time step.
- Repeat for the next time steps.
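A minimal NumPy sketch of one such decoding step, assuming a plain dot-product attention score and illustrative helper names (`rnn_cell`, `output_layer`) that are not from the original post:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step_with_attention(prev_token_emb, prev_hidden, encoder_states,
                                rnn_cell, output_layer):
    """One attention-decoder time step (hidden size d is hypothetical).

    encoder_states: (src_len, d) hidden states from the encoder.
    rnn_cell:       callable (input, hidden) -> new hidden state h_t.
    output_layer:   callable on the concatenated [h_t; C_t] vector,
                    returning scores over the output vocabulary.
    """
    # 1. The decoder RNN consumes the previous token embedding and hidden state.
    h_t = rnn_cell(prev_token_emb, prev_hidden)          # (d,)

    # 2. Attention step: score each encoder state against h_t (dot product here),
    #    normalize with softmax, and take the weighted sum as the context vector C_t.
    scores = encoder_states @ h_t                        # (src_len,)
    weights = softmax(scores)                            # (src_len,)
    context = weights @ encoder_states                   # (d,)

    # 3. Concatenate h_t and C_t, then feed the joint vector to the
    #    feedforward output layer to predict this step's output word.
    combined = np.concatenate([h_t, context])            # (2d,)
    vocab_scores = output_layer(combined)
    next_word = int(np.argmax(vocab_scores))
    return next_word, h_t
```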
Multi-Head Attention Module
Three types of input vectors to the module:
- Value
- Key
- Query
The Attention is calculated as $$ \text{Attention}(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
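A rough NumPy sketch of this formula, together with the per-head projections that make it “multi-head” (the matrix names and shapes here are illustrative assumptions, not definitions from the post):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # (n_q, d_v)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, num_heads):
    """Project Q/K/V per head, attend independently, concatenate, project back.

    Wq, Wk, Wv: lists of per-head projection matrices, each (d_model, d_k).
    Wo:         output projection, (num_heads * d_k, d_model).
    """
    heads = []
    for h in range(num_heads):
        heads.append(scaled_dot_product_attention(Q @ Wq[h], K @ Wk[h], V @ Wv[h]))
    return np.concatenate(heads, axis=-1) @ Wo                 # (n_q, d_model)
```

The division by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, so the softmax stays in a range where gradients remain useful.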
The illustrations in this part are from the post [2]; please check out the original post for more details.
[1] Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
[2] The Illustrated Transformer, by Jay Alammar