How is PyTorch handling the target shape mismatch here?

Below is the embedding model, which is trained on a list of words using the skip-gram method of the word2vec algorithm:

  (embed): Embedding(63641, 300)
  (output): Linear(in_features=300, out_features=63641, bias=True)
  (log_softmax): LogSoftmax(dim=1)

In the tutorial, the following input/target data is passed in (sample with batch size 8):

for inputs, targets in get_batches(train_words, 8):
    steps += 1
    inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
    inputs, targets = inputs.to(device), targets.to(device)
    print('input_shape:', inputs.shape, 'output_shape:', targets.shape)
input_shape: torch.Size([36]) output_shape: torch.Size([36])

The idea, of course, is to train the weights of the Embedding layer to approximate the context relations given as input.

My question: the output shape of the model is clearly different from the shape of the targets, and yet it still trains (!). How is that possible? Shouldn't I have to convert the targets to one-hot encoding for the loss computation to work?
To be specific, each target is a single integer, while the softmax output would seem to need a one-hot encoded version of that integer to be compared against. Is PyTorch managing this internally, or am I missing something?

Model definition:

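from torch import nn

class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        # nn.Module must be initialised before submodules are assigned
        super().__init__()
        self.embed = nn.Embedding(n_vocab, n_embed)
        self.output = nn.Linear(n_embed, n_vocab)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.embed(x)                  # (N,) -> (N, n_embed)
        scores = self.output(x)            # (N, n_embed) -> (N, n_vocab)
        log_ps = self.log_softmax(scores)  # log-probabilities over the vocabulary
        return log_ps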
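
To make the shapes in question concrete, here is a minimal sketch (not part of the tutorial) that runs the same three layers on random indices. The sizes `n_vocab = 50`, `n_embed = 8` and `n_pairs = 36` are assumptions chosen so it runs quickly; the model above uses 63641 and 300.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes; the thread's model uses n_vocab=63641, n_embed=300.
n_vocab, n_embed, n_pairs = 50, 8, 36

embed = nn.Embedding(n_vocab, n_embed)
output = nn.Linear(n_embed, n_vocab)
log_softmax = nn.LogSoftmax(dim=1)

# Random stand-ins for one batch of (input word, context word) index pairs.
inputs = torch.randint(0, n_vocab, (n_pairs,))   # shape (36,)
targets = torch.randint(0, n_vocab, (n_pairs,))  # shape (36,)

# One row of log-probabilities over the vocabulary per input word.
log_ps = log_softmax(output(embed(inputs)))      # shape (36, 50)

# NLLLoss accepts (N, C) log-probabilities together with (N,) class indices.
loss = nn.NLLLoss()(log_ps, targets)             # 0-dim tensor (mean over the batch)

print(inputs.shape, log_ps.shape, targets.shape, loss.shape)
```

The model output has one extra dimension compared with the targets, and that is the shape pairing `nn.NLLLoss` expects: `(N, C)` for the input and `(N,)` for the target.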

Training loop:

# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding_dim=300 # you can change, if you want

model = SkipGram(len(vocab_to_int), embedding_dim).to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

print_every = 500
steps = 0
epochs = 1

# train for some number of epochs
for e in range(epochs):
    # get input and target batches
    for inputs, targets in get_batches(train_words, 512):
        steps += 1
        inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
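        inputs, targets = inputs.to(device), targets.to(device)

        # log_ps: (N, n_vocab) log-probabilities; targets: (N,) class indices
        log_ps = model(inputs)
        loss = criterion(log_ps, targets)

        # backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()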

        if steps % print_every == 0:                  
            # getting examples and similarities      
            valid_examples, valid_similarities = cosine_similarity(model.embed, device=device)
            _, closest_idxs = valid_similarities.topk(6) # topk highest similarities
            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))

I think you need to provide some more context.

Thanks, I’ve updated the main post. Please let me know if more info is needed.

The model definition is missing. Provide a minimum working example.

I’ve added the model definition and the training loop; I hope that helps. I can share the utility function definitions as well if you like, but I did not want to clutter the post. Below are the batch function and a sample output of get_batches:

def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    n_batches = len(words)//batch_size
    # only full batches
    words = words[:n_batches*batch_size]
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
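        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            # one (input, target) pair per context word of batch_x
            y.extend(batch_y)
            x.extend([batch_x] * len(batch_y))
        yield x, y
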
int_text = [i for i in range(20)]
x,y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)
x
 [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y
 [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]

Thank you for your interest.

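So the answers to your questions are: 1. No, you don’t have to one-hot encode the targets. NLLLoss takes class indices directly: for each sample it picks the log-probability at the target index, which gives the same result as multiplying by a one-hot vector, so no one-hot tensor is needed. 2. Yes, PyTorch handles this internally. For example: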

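import torch
from torch.nn import NLLLoss

loss = NLLLoss()
y = torch.LongTensor([2, 1])  # target class indices, shape (2,)
y_pred = torch.Tensor([[0.1, 0, 0.9], [0.0, 0.5, 0.5]])  # probabilities, shape (2, 3)
print(loss(torch.log(y_pred), y))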

y and y_pred do not have the same shape, but PyTorch handles it. This is very convenient for single-label classification problems, where each sample belongs to exactly one class. Your problem is one of these: each input word in a batch is paired with exactly one target word index.
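
To see why index targets are enough, here is a small sketch comparing the built-in loss with two manual computations, one that indexes the log-probabilities and one that uses an explicit one-hot mask. The sizes, the seed and the target values are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_samples, n_classes = 4, 5  # hypothetical sizes

# Log-probabilities as a model ending in LogSoftmax(dim=1) would produce them.
log_ps = F.log_softmax(torch.randn(n_samples, n_classes), dim=1)
targets = torch.tensor([2, 0, 4, 1])  # one class index per sample

# Built-in loss: takes (N, C) log-probabilities and (N,) indices.
builtin = F.nll_loss(log_ps, targets)

# Manual version 1: pick the log-probability at each target index.
picked = log_ps[torch.arange(n_samples), targets]
by_index = -picked.mean()

# Manual version 2: the same thing written with an explicit one-hot mask.
one_hot = F.one_hot(targets, num_classes=n_classes).float()
by_one_hot = -(one_hot * log_ps).sum(dim=1).mean()

print(builtin.item(), by_index.item(), by_one_hot.item())
```

All three values agree up to floating-point error. The one-hot version only multiplies every non-target entry by zero, which is why passing indices is sufficient.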