Below is the embedding model being trained on a list of words, following the skip-gram method of the word2vec algorithm:
SkipGram(
  (embed): Embedding(63641, 300)
  (output): Linear(in_features=300, out_features=63641, bias=True)
  (log_softmax): LogSoftmax(dim=1)
)
In the tutorial, the following input/target data is passed in (a sample with batch size 8):
for inputs, targets in get_batches(train_words, 8):
    steps += 1
    inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
    inputs, targets = inputs.to(device), targets.to(device)
    print('input_shape:', inputs.shape, 'output_shape:', targets.shape)
    break
input_shape: torch.Size([36]) output_shape: torch.Size([36])
The idea, of course, is to train the weights of the Embedding layer to approximate the context relations given as input.
My question is: the output shape of the model is clearly different from the shape of the targets being passed in, and yet it still trains (!). How is that possible? Shouldn't I have to convert the targets to one-hot encoding for the loss computation to work?
To be specific, each target is just a single integer, while the softmax layer is presumably expected to get a one-hot encoded version of that target integer. Is PyTorch managing this internally, or am I missing something?
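For reference, this is how I picture the shapes going into the loss. The snippet below is just a standalone toy sketch with made-up sizes (a 10-word vocabulary and a batch of 4), not the actual model or data:

import torch
import torch.nn as nn

# Toy sizes, made up for illustration: 10-word vocabulary, batch of 4
n_vocab, batch_size = 10, 4

# What the model outputs: one row of log-probabilities per input word -> shape [4, 10]
log_ps = torch.log_softmax(torch.randn(batch_size, n_vocab), dim=1)

# What get_batches yields as targets: plain class indices -> shape [4]
targets = torch.randint(0, n_vocab, (batch_size,))

criterion = nn.NLLLoss()
loss = criterion(log_ps, targets)  # runs fine without one-hot encoding the targets
print(loss.item())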
[EDIT]
Model definition:
import torch
import torch.nn as nn


class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        self.embed = nn.Embedding(n_vocab, n_embed)
        self.output = nn.Linear(n_embed, n_vocab)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.embed(x)
        scores = self.output(x)
        log_ps = self.log_softmax(scores)
        return log_ps
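A quick shape check on the forward pass (toy sizes here instead of the real 63641-word vocabulary) shows what I mean about the output shape:

# Sanity check with toy sizes (not the real vocabulary)
model = SkipGram(n_vocab=100, n_embed=16)
x = torch.LongTensor([3, 7, 42])  # a batch of 3 word indices
print(model(x).shape)             # torch.Size([3, 100]) -- one row of log-probs per input word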
Training loop:
import torch.optim as optim

# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding_dim = 300  # you can change this if you want

model = SkipGram(len(vocab_to_int), embedding_dim).to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

print_every = 500
steps = 0
epochs = 1

# train for some number of epochs
for e in range(epochs):
    # get input and target batches
    for inputs, targets in get_batches(train_words, 512):
        steps += 1
        inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
        inputs, targets = inputs.to(device), targets.to(device)

        log_ps = model(inputs)
        loss = criterion(log_ps, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if steps % print_every == 0:
            # get validation examples and their cosine similarities
            valid_examples, valid_similarities = cosine_similarity(model.embed, device=device)
            _, closest_idxs = valid_similarities.topk(6)  # top-k highest similarities
            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...")