RNN Attention and Transformer

Some notes on Attention and Transformer

“Transformer” credit: Unsplash

Why Attention? RNNs vs RNN + Attention

Before we talk about Transformers, let’s list some limitations of RNNs:

  • Their hidden states struggle to model long-range dependencies
  • They suffer from vanishing and exploding gradients
  • They require a large number of training steps
  • Their recurrence prevents parallel computation across time steps
An example of an RNN. Attention and the Transformer were introduced to improve on the performance of RNNs.
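
To make the last limitation concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_forward`, the weight names, and the sizes are my own choices for illustration, and the weights are random rather than trained. The point is only that each hidden state depends on the previous one, so the loop over time steps has to run in order.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x in xs:
        # Step t needs the hidden state from step t-1,
        # so the time steps cannot be computed in parallel.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 3, 4
xs = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_hidden, d_in)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

hidden_states = rnn_forward(xs, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 4)
```

Information from the first token reaches the last hidden state only after passing through `W_hh` and `tanh` once per step, which is also where the long-range dependency and gradient problems come from.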

Compared with RNNs, Transformer networks have the following attributes:

  • They handle long-range dependencies more easily
  • Vanishing and exploding gradients are much less of a problem (since the number of layers a signal has to pass through decreases)
  • Attention lets the model go back and look at particular inputs when deciding each output
  • Every step of generating the output is a training sample
Transformer network (from the “Attention Is All You Need” paper)

Example of RNN + Attention

In this Seq2Seq with Attention example [1], the attention mechanism enables the decoder to focus on the word “étudiant” (“student” in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence is why attention models produce better results than models without attention.

Attention in translation
With attention, all of the encoder's embeddings (its hidden states) are fed to the decoder

This scoring exercise is done at each time step on the decoder side.
The attention decoder works in the following order:

  1. The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
  2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
  3. Attention Step: We use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
  4. We concatenate h4 and C4 into one vector.
  5. We pass this vector through a feedforward neural network (one trained jointly with the model).
  6. The output of the feedforward neural network indicates the output word of this time step.
  7. Repeat for the next time steps.
Whole process of Attention scoring
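
Steps 3 to 6 can be sketched in a few lines of NumPy. This is only an illustration under assumptions the notes above do not fix: I use dot-product scoring between the decoder state and each encoder hidden state, a single linear layer stands in for the feedforward network, and all names (`attention_step`, `W_out`, `b_out`) and sizes are hypothetical, with random weights instead of trained ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attention_step(encoder_states, h_dec, W_out, b_out):
    """One decoder time step: score, weight, concatenate, project."""
    # Step 3: score each encoder hidden state against the decoder state
    # (dot-product scoring is an assumption), then build the context vector.
    scores = encoder_states @ h_dec       # shape (src_len,)
    weights = softmax(scores)             # attention weights, sum to 1
    context = weights @ encoder_states    # shape (d_hidden,), e.g. C4
    # Step 4: concatenate the decoder hidden state and the context vector.
    combined = np.concatenate([h_dec, context])
    # Step 5: a single linear layer stands in for the feedforward network.
    logits = W_out @ combined + b_out
    # Step 6: the highest-scoring entry is the output word of this time step.
    return int(np.argmax(logits)), weights

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
src_len, d_hidden, vocab_size = 3, 4, 6
encoder_states = rng.normal(size=(src_len, d_hidden))
h4 = rng.normal(size=d_hidden)
W_out = rng.normal(size=(vocab_size, 2 * d_hidden))
b_out = np.zeros(vocab_size)

word_id, weights = attention_step(encoder_states, h4, W_out, b_out)
print(word_id, weights)
```

The `weights` vector is what gets visualised in attention heatmaps: one weight per source word, recomputed at every decoder time step.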

Multi-Head Attention Module

Multi-Head Attention

Three types of input vectors to the module:

  • Value
  • Key
  • Query

The attention is calculated as $$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ where d_k is the dimension of the key vectors.
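
As a sanity check on the formula, here is a minimal NumPy sketch of scaled dot-product attention for a single head, without masking or batching. The function name and the example shapes are my own choices. The softmax is taken row by row, so each query gets its own distribution over the keys.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # shape (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V, weights

# Hypothetical shapes: 2 queries, 3 key/value pairs, d_k = 4, d_v = 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (2, 5) (2, 3)
```

Each output row is a weighted average of the rows of `V`. Dividing by the square root of d_k keeps the scores from growing with the key dimension, which would otherwise push the softmax towards a one-hot distribution.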

The illustration in this part is taken from the post [2]; please check out the original post for more details.

Multi-headed self-attention boils down to one figure
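
For readers who prefer code to figures, here is a rough NumPy sketch of multi-headed self-attention. It is a simplified version under my own assumptions: no masking, no biases, no batching, random weights, and hypothetical names (`multi_head_self_attention`, `W_q`, `W_k`, `W_v`, `W_o`). In self-attention the queries, keys and values are all projections of the same input `X`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention with n_heads heads over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Queries, keys and values all come from the same input X.
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Scaled dot-product attention, computed for every head at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                           # (n_heads, seq_len, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (4, 8)
```

Note that nothing here loops over time steps: every position attends to every other position in one matrix product, which is what makes the computation parallel.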
Chengkun (Charlie) Li