Hello. I have the following task: given a string such as "This is a nice string", I have to tokenize it and compute the log probability of each token. I assume this means that, for example, for the token nice I need to compute $\log p(\text{nice} \mid \text{This is a})$.
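In other words (this is my own reading of the task, not something stated in it), the per-token log probabilities should sum to the log probability of the whole string via the chain rule:

$$\log p(w_1, \dots, w_n) = \sum_{i=1}^{n} \log p(w_i \mid w_1, \dots, w_{i-1})$$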
This is the code I have written to achieve this:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
text = "This is a nice string."
tokens = tokenizer.encode(text, return_tensors="pt").to(device)  # move inputs to the same device as the model
tokenized_sequence = tokenizer.convert_ids_to_tokens(tokens[0].tolist())
print(f"Tokenized sequence: {tokenized_sequence} \n")
with torch.no_grad():  # no gradients needed, we are only scoring
    outputs = model(tokens)
logits = outputs.logits  # shape: [1, sequence_length, vocab_size]
log_probs = torch.log_softmax(logits, dim=-1)
for idx, token in enumerate(tokenized_sequence):
    token_id = tokenizer.convert_tokens_to_ids(token)
    log_prob = log_probs[0, idx, token_id].item()
    print(f"token: {token}, log prob: {log_prob}")
There is something I'm not sure about. In the line log_prob = log_probs[0, idx, token_id].item(), should I use log_prob = log_probs[0, idx-1, token_id].item() instead? In other words, do the logits at position $j$ in a sequence give the distribution over the token at position $j$ itself (given the previous tokens as context), or over the next token at position $j+1$, so that a shift is necessary?
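For reference, here is a small check I sketched to probe this myself (comparing the argmax prediction at each position against the actual tokens is my own debugging idea, not something from the transformers docs). If the logits at position $j$ predict the token at position $j+1$, then the argmax at position $j$ should look like a plausible next token rather than the token at $j$ itself:

# My own debugging sketch: print, for each position, the actual token there
# and the token the model assigns the highest probability to at that position.
top_ids = logits[0].argmax(dim=-1)  # shape: [sequence_length]
for idx, token in enumerate(tokenized_sequence):
    predicted = tokenizer.convert_ids_to_tokens(top_ids[idx].item())
    print(f"position {idx}: actual = {token!r}, argmax prediction = {predicted!r}")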