Hi!
EDIT: I’m simplifying my question considerably from its previous version.
I want to implement a decoder-only transformer that will be trained in an unsupervised manner only. It will be used to generate, as one does in inference mode, a set of sentences. These sentences, however, should be generated probabilistically rather than greedily (that is, the next word is sampled from the probability distribution over the next token, rather than always taking the most probable token). Based on Umar Jamil's YouTube video on coding transformers, I believe I can write a function that generates such sentences.
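For concreteness, here is a minimal sketch of the distinction I mean between greedy and probabilistic selection of the next token (probs here is just a made-up distribution over a tiny vocabulary, not output from my model):

import torch

probs = torch.tensor([[0.1, 0.7, 0.2]])                  # (batch_size=1, vocab_size=3)
greedy_next = probs.argmax(dim=1, keepdim=True)          # always picks token 1
sampled_next = torch.multinomial(probs, num_samples=1)   # usually token 1, but sometimes 0 or 2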
Now, my cost function is computed from statistics of these randomly generated sentences. I do not fully understand how torch tensors with requires_grad=True work. How do I make sure that, once I compute correlations from the tensors representing the generated sentences, I can backpropagate through them?
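To illustrate what I mean, here is a toy example of how I understand autograd to behave on a purely continuous computation (this is not my actual cost function, just the pattern I am hoping to reproduce with the generated sentences):

import torch

x = torch.randn(4, 8, requires_grad=True)   # leaf tensor tracked by autograd
y = 2.0 * x + 1.0                           # differentiable ops extend the graph
cost = (y.mean(dim=0) ** 2).sum()           # a statistic computed from y
cost.backward()                             # gradients flow back to x
print(x.grad.shape)                         # torch.Size([4, 8])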
Here’s the code that generates the random sentences from the model.
import torch

def generate_random_sentences(model, batch_size, vocab_size, max_len, device):
    # Initialize the decoder input with a random first token
    decoder_input = torch.randint(low=0, high=vocab_size, size=(batch_size, 1), device=device)  # (batch_size, 1)
    while decoder_input.size(1) < max_len:
        # Build the causal mask for the current target length
        decoder_mask = causal_mask(decoder_input.size(1)).unsqueeze(0).to(device)  # (1, 1, curr_seq_len, curr_seq_len)
        # Run the decoder on the tokens generated so far
        out = model.decode(decoder_input, decoder_mask)  # (batch_size, curr_seq_len, d_model)
        # Project the last position onto the vocabulary
        # (assumed to be a probability distribution; apply softmax first if project returns raw logits)
        prob = model.project(out[:, -1])  # (batch_size, vocab_size)
        # Inverse-CDF sampling: draw a uniform number and locate it in the cumulative distribution
        prob_cumul = prob.cumsum(dim=1)  # (batch_size, vocab_size)
        next_word = torch.searchsorted(prob_cumul, torch.rand(batch_size, 1, device=device))  # (batch_size, 1)
        # Append the sampled token and continue
        decoder_input = torch.cat([decoder_input, next_word], dim=1)
    return decoder_input  # (batch_size, max_len)
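And this is roughly how I call it (batch_size=32 and max_len=50 are just example values; model, vocab_size, and device are the same objects I train with):

sentences = generate_random_sentences(model, batch_size=32, vocab_size=vocab_size, max_len=50, device=device)
print(sentences.shape)  # torch.Size([32, 50]) -- integer token ids, one sentence per row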