seq2seq networks

hamish

seq2seq

  • still building up to transformers
  • last time talked about transfer learning
  • this time about seq2seq models
  • going to just talk about sequences in general today
  • next time apply to an NLP problem

Sequence problems

"Here, have potato"

???

???

"fold" over this building up an understanding of patterns

"unfold" until some condition is met

Natural approach
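
As a rough illustration of the fold/unfold idea, here is a minimal pure-Python sketch. The helper names `fold` and `unfold` and the toy running-sum "model" are assumptions made for illustration; a real model would learn its step functions rather than have them written by hand.

```python
from functools import reduce


def fold(step, initial_state, sequence):
    """Consume a sequence one element at a time, carrying a state along."""
    return reduce(step, sequence, initial_state)


def unfold(step, state, should_stop, max_steps=100):
    """Emit elements from a state until should_stop says we are done."""
    outputs = []
    for _ in range(max_steps):
        output, state = step(state)
        if should_stop(output):
            break
        outputs.append(output)
    return outputs


# Toy "encoder": the state is just a running sum of the inputs.
summary = fold(lambda state, x: state + x, 0, [1, 2, 3])

# Toy "decoder": count down from the summary, stopping when we reach zero.
decoded = unfold(lambda s: (s, s - 1), summary, lambda out: out == 0)

print(summary, decoded)
```

The fold squeezes the whole input into one state, and the unfold produces outputs from that state alone, which is the shape of the encoder/decoder idea that follows.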

Quick tour of RNNs

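[Diagram: an RNN unrolled over four steps. The elements of the sequence x_{0}, x_{1}, x_{2}, x_{3} go in, the outputs y_{0}, y_{1}, y_{2}, y_{3} come out, and the hidden state h_{0}, h_{1}, h_{2}, h_{3}, h_{4} is passed along as memory. The same weights are used at every step, where the neural network things happen.]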

Quick tour of RNNs

h_{t} = \phi ( W_{x}^{T} x_{t} + W_{h}^{T} h_{t - 1} + b_{h} )
y_{t} = \phi ( W_{y}^{T} h_{t} + b_{y} )

x_{t}: sequence element
h_{t - 1}: can think of this as "memory"
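
The recurrence can be sketched in a few lines of NumPy. This is a sketch under assumptions, not the talk's own code: it takes \phi to be tanh, reads the output from the updated hidden state, uses separate biases `b_h` and `b_y`, and picks small random weights and arbitrary sizes. The function name `rnn_step` is made up for illustration.

```python
import numpy as np


def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y, phi=np.tanh):
    """One step of a simple recurrent cell.

    x_t:    (input_dim,)  current sequence element
    h_prev: (hidden_dim,) hidden state from the previous step ("memory")
    """
    h_t = phi(W_x.T @ x_t + W_h.T @ h_prev + b_h)
    y_t = phi(W_y.T @ h_t + b_y)
    return y_t, h_t


rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2  # arbitrary sizes

W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_y = rng.normal(scale=0.1, size=(hidden_dim, output_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

# The same weights are reused at every step; only the hidden state changes.
sequence = rng.normal(size=(4, input_dim))
h = np.zeros(hidden_dim)
outputs = []
for x_t in sequence:
    y_t, h = rnn_step(x_t, h, W_x, W_h, W_y, b_h, b_y)
    outputs.append(y_t)
outputs = np.stack(outputs)  # (sequence length, output_dim)
```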

LSTM/GRU

Long Short Term Memory (LSTM)

  • has a longer term memory
  • does a remember/forget computation
  • really slow

Gated Recurrent Unit (GRU)

  • also has a memory
  • handles remembering/forgetting differently
  • much faster, but still slow
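
One concrete way to see a difference between the two is to count parameters: an LSTM layer has four sets of gate weights and a GRU has three. The sketch below uses PyTorch's `nn.LSTM` and `nn.GRU`; the layer sizes are arbitrary, the helper `count_parameters` is made up for illustration, and a smaller parameter count is only a rough proxy for speed.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


input_dim, hidden_dim = 8, 16  # arbitrary sizes, chosen for illustration

lstm = nn.LSTM(input_dim, hidden_dim)
gru = nn.GRU(input_dim, hidden_dim)

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)

# The LSTM also carries a separate cell state alongside the hidden state,
# whereas the GRU keeps everything in a single hidden state.
print(f"LSTM: {lstm_params} parameters, GRU: {gru_params} parameters")
```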
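
seq2seq

[Diagram: an Encoder RNN reads the inputs ..., x_{N - 2}, x_{N - 1}, x_{N}, passing its hidden state (h_{N - 3}, ...) along, and ends with the final hidden state h_{N}. A Decoder RNN starts from h_{N} and steps through the states d_{0}, d_{1}, d_{2}, d_{3} to produce the outputs y_{0}, y_{1}, y_{2}, y_{3}.]

h_{N}: this thing is really useful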

Show me the code

import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim)

    def forward(self, x):
        # x: (input_length, batch_size, input_dim)
        outputs, hidden = self.gru(x)
        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(output_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, hidden):
        # x: (1, batch_size, output_dim), i.e. a single time step
        output, hidden = self.gru(x, hidden)
        prediction = self.out(output[0])  # (batch_size, output_dim)

        return prediction, hidden


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target_length, device):
        # source: (input_length, batch_size, input_dim)
        batch_size = source.size(1)
        output_dim = self.decoder.out.out_features

        # create something to hold the predicted outputs
        outputs = torch.zeros(target_length, batch_size, output_dim, device=device)

        encoder_output, encoder_hidden = self.encoder(source)

        # use the encoder's final hidden state as the decoder's initial hidden state
        decoder_hidden = encoder_hidden.to(device)

        # we need an input here, in NLP we will typically use a special token
        decoder_input = torch.zeros(1, batch_size, output_dim, device=device)

        for t in range(target_length):
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
            outputs[t] = decoder_output
            # feed the prediction back in as the next decoder input
            decoder_input = decoder_output.unsqueeze(0)

        return outputs
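
The shapes are the easiest thing to get wrong in an encode-then-decode loop, so here is a small self-contained sketch that uses `nn.GRU` directly and feeds each prediction back in as the next decoder input. All sizes, the variable names, and the all-zeros start input are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_len, target_len, batch_size = 7, 4, 2   # illustrative sizes
input_dim, output_dim, hidden_dim = 3, 1, 16

encoder_gru = nn.GRU(input_dim, hidden_dim)
decoder_gru = nn.GRU(output_dim, hidden_dim)
project = nn.Linear(hidden_dim, output_dim)

# nn.GRU expects (seq_len, batch, features) unless batch_first=True.
source = torch.randn(input_len, batch_size, input_dim)

with torch.no_grad():
    # hidden: (num_layers=1, batch, hidden_dim), the summary of the input.
    _, hidden = encoder_gru(source)

    # Start the decoder from an all-zeros "start" element (an assumption).
    decoder_input = torch.zeros(1, batch_size, output_dim)
    predictions = []
    for _ in range(target_len):
        step_output, hidden = decoder_gru(decoder_input, hidden)
        prediction = project(step_output[0])       # (batch, output_dim)
        predictions.append(prediction)
        decoder_input = prediction.unsqueeze(0)    # back to (1, batch, output_dim)

    outputs = torch.stack(predictions)             # (target_len, batch, output_dim)

print(outputs.shape)
```

The only thing passed from the encoder to the decoder is `hidden`; the decoder never sees the input sequence itself.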

sorry

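Or in Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    GRU, BatchNormalization, Dense, GaussianNoise, RepeatVector, TimeDistributed,
)

# n_steps_in, n_steps_out and n_features describe your data and must be set beforehand
model = Sequential()
model.add(GaussianNoise(0.01, input_shape=(n_steps_in, n_features)))
model.add(GRU(20, activation='relu'))                          # encoder: one vector per example
model.add(RepeatVector(n_steps_out))                           # copy it once per output step
model.add(GRU(20, activation='relu', return_sequences=True))   # decoder
model.add(BatchNormalization())
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')

model.summary()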
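
In the Keras model, the first GRU returns a single vector per example, and `RepeatVector` copies that vector once per output step so the second GRU has a sequence to read. Here is a NumPy sketch of just that reshaping step, with made-up sizes and a stand-in array in place of the real encoder output:

```python
import numpy as np

batch_size, hidden_units, n_steps_out = 2, 20, 5  # illustrative sizes

# Stand-in for the encoder GRU's output: one vector per example.
encoded = np.arange(batch_size * hidden_units, dtype=float).reshape(batch_size, hidden_units)

# What RepeatVector(n_steps_out) does: (batch, features) -> (batch, n_steps_out, features)
repeated = np.repeat(encoded[:, np.newaxis, :], n_steps_out, axis=1)

print(encoded.shape, "->", repeated.shape)
```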

Real example I'm playing with

ugly matplotlib charts ftw

Problems with seq2seq

  • they're really slow
  • you can't parallelise the training across time steps (each step needs the previous hidden state)
  • that means they're really slow
  • which means you can't build up HUGE general models (like we can in CV) and do transfer learning
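
The bottleneck is visible in the recurrence itself: the input projections for every time step can be computed in one matrix multiply, but each hidden state needs the previous one, so that part has to be a loop. Here is a NumPy sketch with made-up sizes, random weights, and tanh assumed as the nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 3, 4  # illustrative sizes

X = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# Parallel part: every time step's input projection in a single matmul.
input_proj = X @ W_x + b  # (seq_len, hidden_dim)

# Sequential part: h_t depends on h_{t-1}, so the steps cannot run side by side.
h = np.zeros(hidden_dim)
hidden_states = []
for t in range(seq_len):
    h = np.tanh(input_proj[t] + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)  # (seq_len, hidden_dim)
```

Batching still lets many sequences run at once; it is the loop over time inside each sequence that cannot be spread out.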