Decoder-only stack from torch.nn.Transformer for self-attending autoregressive generation

Hi everybody,

I want to build a Transformer that consists only of decoder blocks. I would normally code this completely from scratch, but first I need a proof of concept to see whether the model is feasible.
I stumbled upon the nn.Transformer class.
But it seems the model has to have both an encoder and a decoder.

I only need the attention and the ability to predict tokens, as the input is a Batch_size * Seq_len * Embedding tensor. It is not a sequence-to-sequence task, but does the model attend to itself when I provide the same tensor as both src and tgt?

My question is: when I create a transformer with:

T = torch.nn.Transformer(num_encoder_layers=0)

seq = torch.rand(1, 512, 512)

T(seq, seq)

tensor([[[ 0.2780,  0.8620, -0.2263,  ...,  1.0728, -0.4022,  0.9872],
         [-0.2340,  0.1427, -0.3868,  ...,  1.0756,  1.6993,  0.6485],
         [-0.6118, -0.5644, -0.9284,  ..., -3.0683, -0.0306, -0.1220],
         ...,
         [-1.1515, -0.1376, -0.6824,  ..., -0.9080,  0.9986,  0.0290],
         [ 0.2444, -0.7368, -1.5312,  ..., -0.9956, -1.4545,  1.9276],
         [-1.0432, -1.4984, -0.0433,  ...,  1.0828,  0.9358,  0.3096]]])


I get an output. But does that output make sense? Is it attending to itself? Or should the tgt sequence be my actual embedded target sequence rather than the input?

Sorry if these questions are dumb, guys.

For my use case I want the model to build sequences autoregressively. Is that possible with this implementation?

Cheers everybody!

Hi,

You might want to look into BERT and GPT-3; both are Transformer-based architectures.
BERT uses only the encoder part, whereas GPT-3 uses only the decoder part.

Both of them have the ability to predict tokens, but in different ways.

For simplicity I will show you some examples with HuggingFace Transformers, a library that provides these models pretrained. (Even the PyTorch documentation for transformers refers to this library. To install it, simply run pip install transformers.)

One of the use cases for BERT is to predict what the [MASK] token should be, taking the whole sentence into consideration.

from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-cased')
unmasker("Paris is the [MASK] of France.")
# Output:
[{'score': 0.9861817359924316,
  'sequence': 'Paris is the capital of France.',
  'token': 2364,
  'token_str': 'capital'},
 {'score': 0.0037214006297290325,
  'sequence': 'Paris is the center of France.',
  'token': 2057,
  'token_str': 'center'},
 {'score': 0.003259770106524229,
  'sequence': 'Paris is the Capital of France.',
  'token': 6299,
  'token_str': 'Capital'},
 {'score': 0.0021571300458163023,
  'sequence': 'Paris is the centre of France.',
  'token': 2642,
  'token_str': 'centre'},
 {'score': 0.0009026133338920772,
  'sequence': 'Paris is the city of France.',
  'token': 1331,
  'token_str': 'city'}]

For GPT we can look at the use case of predicting the next words, given a starting point. Here I am actually using GPT-2, but the concept is the same.

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm learning about machine learning,", max_length=30, num_return_sequences=5)
# Output:
[{'generated_text': "Hello, I'm learning about machine learning, I'm learning from it. In fact, this summer I think the field of machine learning may be really"},
 {'generated_text': "Hello, I'm learning about machine learning, which is, again, why I'm posting a bunch of tutorials about it. This isn't a bad"},
 {'generated_text': 'Hello, I\'m learning about machine learning, but I don\'t get much of a connection anymore with the new machine learning tools I\'m getting into."'},
 {'generated_text': "Hello, I'm learning about machine learning, a more general term here.\n\nThis article will show you how to make a machine learning framework for"},
 {'generated_text': "Hello, I'm learning about machine learning, but I like my work because I like to teach. I need to give myself a good foundation so that"}]

That being said, you can look into fine-tuning these models, or one of their many variations.

But if you want to train them from scratch, they are going to need a lot of data and compute resources. (For example, BERT was trained on the entire English Wikipedia, among other text.)

Regarding your question about whether the output makes sense and whether the model is attending to itself:

You can plot a heatmap of your output to see what is actually happening.

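import torch
import seaborn as sns
import matplotlib.pyplot as plt

words = 20  # number of "words" (sequence positions)

T = torch.nn.Transformer(num_encoder_layers=0)
seq = torch.rand(words, 512)  # unbatched input: (sequence length, embedding dim)
out = T(seq, seq)  # the same tensor is passed as src and tgt

with torch.no_grad():
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(11, 5), gridspec_kw={'width_ratios': [.5, .5, 0.05]})
    # Left panel: pairwise dot products of the input with itself
    map1 = seq @ seq.T
    sns.heatmap(map1.numpy(), cmap="YlGnBu", cbar=False, ax=ax1)
    # Right panel: pairwise dot products of the input with the output
    map2 = seq @ out.T
    sns.heatmap(map2.numpy(), cmap="YlGnBu", ax=ax2, cbar_ax=ax3)

plt.show()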

[Image: two heatmaps, input against input on the left and input against output on the right]

This would be an example for a sentence with 20 “words”. On the left you see the similarity matrix (pairwise dot products) of the input values against the input values. Since they are the same, we get a strong diagonal, whereas all other values seem to have little to no relation. On the right we have the input values against the output values. Here the diagonal is also strong, but all other entries have undergone transformations, because the attention layers mix information between positions. In a trained model, these transformations would place words with similar semantic context closer to each other. Note that nn.Transformer does not add a positional encoding on its own; that has to be applied to the input beforehand.
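The dot-product heatmaps only compare inputs and outputs; they are not the attention weights themselves. As a rough sketch of how to look at the weights directly, nn.MultiheadAttention returns them as its second output. The sizes and the single stand-alone attention layer here are assumptions for illustration, not part of the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

L, E = 6, 16  # illustrative sequence length and embedding size
attn = nn.MultiheadAttention(embed_dim=E, num_heads=2, batch_first=True)
attn.eval()

x = torch.rand(1, L, E)  # (batch, seq_len, embedding)

# Boolean mask: True marks positions a query is NOT allowed to attend to.
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    # Self-attention: query, key and value are the same tensor.
    _, full_weights = attn(x, x, x)
    _, causal_weights = attn(x, x, x, attn_mask=causal_mask)

# Both have shape (1, L, L), averaged over the heads. Row i holds the
# weights that position i puts on every position j.
print(full_weights[0])
print(causal_weights[0])
```

Each row sums to 1. Without a mask every position attends to every other position; with the causal mask the upper triangle is zero, so a position only sees itself and earlier positions. Either matrix can be passed to sns.heatmap in the same way as above.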

So the answer would be: yes, the sequence is attending to itself. You can also play around with masking the source or the target.
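If it helps, here is a minimal sketch of a decoder-only stack in plain PyTorch. It rests on a few assumptions that are not in the posts above: the sizes N, L and E are made up, and I use nn.TransformerEncoder with a causal mask in place of nn.Transformer, because with src = tgt the decoder's cross-attention can still see every position of src, including future ones. Also note that nn.Transformer expects (seq_len, batch, embedding) by default, so a (1, 512, 512) tensor is read as 512 sequences of length 1 unless batch_first=True is set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N, L, E = 2, 10, 64  # batch size, sequence length, embedding dim (illustrative)

# Self-attention blocks; the causal mask below is what makes them "decoder-only".
layer = nn.TransformerEncoderLayer(
    d_model=E, nhead=4, dim_feedforward=128, batch_first=True
)
decoder_only = nn.TransformerEncoder(layer, num_layers=2)
decoder_only.eval()  # disable dropout so the two runs below are comparable

# -inf above the diagonal: position i may only attend to positions <= i.
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

x = torch.rand(N, L, E)
x_changed = x.clone()
x_changed[:, -1] = torch.rand(N, E)  # alter only the last token

with torch.no_grad():
    out = decoder_only(x, mask=causal_mask)
    out_changed = decoder_only(x_changed, mask=causal_mask)

# Earlier positions must not depend on the last token.
earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
last_changed = not torch.allclose(out[:, -1], out_changed[:, -1], atol=1e-5)
print(out.shape, earlier_unchanged, last_changed)
```

Changing the last token leaves the outputs at all earlier positions untouched, which is the property autoregressive generation needs. To actually predict tokens, a linear layer projecting from E to the vocabulary size would still have to go on top, and a positional encoding would have to be added to x.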

Hope this helps :wink:

Let me know if something is not clear or you need more help.

(Bonus: BertViz is a very cool library that can help you visualize the attention heads of BERT and how the words affect each other.)


Hi Matias, thank you for your rigorous reply and help!

The heat maps were really impressive. Thanks also for pointing out all those details regarding the attention!

I have a lot of GPUs and data, so that is not the problem :D!

Sorry, it was late yesterday, so I think I forgot to mention a key detail. I already have preprocessed tensors; they come from a BERT-like model with axial attention. I only want to feed those in and, on their basis, autoregressively predict a sequence after prepending a conditional token (like the CTRL Transformer from huggingface).
N = Batch_size
S = Number of sentences
L = Seq_len
E = Embedding_dim

So: multidimensional text → axial attention BERT (N*S*L*E) → Mean (N*L*E) → Decoder Stack (N*Vocab_length*1) → sampling → output sequence

I looked into huggingface, and their implementation of GPT-2 did not seem straightforward to modify so that it takes tensors instead of strings. If it is possible, though, that would be the best option, as it is integrated into DeepSpeed and they have the clm_train script readily available. I also have a suitable BPE tokenizer pretrained in huggingface for the output vocab. So it would be a good fit, but I did not see a way to do it after reading through their docs and some of the GPT-2 code they provide on GitHub.

Hope this clarifies!

I am not going to claim I know what I am doing here :sweat_smile:, but I think you can use the GitHub repository as a guide to see how to work with the GPT2 classes directly.

Here is a small example that uses GPT2 directly and not through their pipeline. This way you can feed tensors directly to the model. Here I am using the GPT2 model that outputs the raw hidden states without a specific head on top. If you look through the source code, you will find further variants with a head on top, for example for language modeling, multiple choice or sequence classification.

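import torch
from transformers.models.gpt2 import GPT2Model, GPT2Config

# Default config, randomly initialised weights (nothing is downloaded)
config = GPT2Config()
model = GPT2Model(config)
print(model.config)

# (batch, sequence length, embedding dim); 768 is the default n_embd
sentence = torch.rand(2, 20, 768)

# Pass embeddings directly instead of token ids
output = model(inputs_embeds=sentence)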
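To get from hidden states to an actual token sequence, something like the following might work. Treat it as a sketch under assumptions: the tiny config, the random features tensor standing in for your upstream model's output, the greedy argmax step and the four generation steps are all made up for illustration. It uses GPT2LMHeadModel, which puts a language-modeling head on top of the same model, and it assumes your embedding dim E equals the model's n_embd (otherwise a linear projection would be needed in between).

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Small, randomly initialised model so the sketch runs without a download.
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)
model.eval()

N, S, L, E = 2, 3, 5, config.n_embd
features = torch.rand(N, S, L, E)  # stand-in for the upstream encoder output
prefix = features.mean(dim=1)      # average over sentences: (N, L, E)

embed_tokens = model.get_input_embeddings()  # token id -> embedding lookup
embeds = prefix
generated = []

with torch.no_grad():
    for _ in range(4):
        logits = model(inputs_embeds=embeds).logits   # (N, current_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick: (N,)
        generated.append(next_token)
        # Embed the chosen token and append it to the running input.
        next_embed = embed_tokens(next_token).unsqueeze(1)  # (N, 1, E)
        embeds = torch.cat([embeds, next_embed], dim=1)

tokens = torch.stack(generated, dim=1)  # (N, 4) generated token ids
print(tokens)
```

This recomputes the whole prefix at every step, which is fine for a proof of concept but slow for long sequences. Replacing argmax with sampling from the softmax of the last-step logits gives the sampling step of your pipeline.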

Hope this helps a little bit more!


Hey Matias,
That comment was a godsend! Thank you, I had completely overlooked the optional arguments!

Thank you for your help! I think I can manage the rest :slight_smile:
