Why is PyTorch OK with this weird operation instead of failing?

Please take a moment to have a look at the notebook here. It is a simple tutorial on LSTMs from Udemy's PyTorch course.
There are two sections in this IPython notebook that confuse me greatly.

  1. Why is it necessary to use contiguous() when using an LSTM?
    and more importantly:
  2. Why doesn't the training procedure fail because of the following code snippet:
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            print(f'output.shape: {output.shape}')
            print(f'y.shape :{targets.shape}')
            print(targets[0,:])
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # ... (loss printing elided in this excerpt)

What I’m specifically referring to is this line:

loss = criterion(output, targets.view(batch_size*seq_length).long())

Basically, the author is passing the network's output, which has a one-hot-sized feature dimension and shape (batch_size, sequence_length, features), together with a plain, not one-hot encoded target tensor of shape (batch_size, sequence_length)!
Why does it not fail? How is cross-entropy doing its job when the two tensors are not both one-hot encoded?!
If you go and one-hot encode the targets as well, you will face this error:
RuntimeError: multi-target not supported at C:/w/1/s/windows/pytorch/aten/src\THCUNN/generic/ClassNLLCriterion.cu:15

The usage of contiguous() does not seem to do any good, and the mixed-shape call above seems to be the only way to get this to work!
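
For reference, here is a minimal sketch (my own, not from the notebook) of the one case where I would expect contiguous() to actually matter, namely calling view() on a non-contiguous tensor:

import torch

lstm_out = torch.randn(4, 10, 8)      # (batch, seq_len, hidden); freshly created, contiguous
t = lstm_out.transpose(0, 1)          # a non-contiguous *view* of the same storage
# t.view(-1, 8)                       # would raise RuntimeError: view size is not compatible...
flat = t.contiguous().view(-1, 8)     # copy into contiguous memory first, then view() works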

I also have a side question: what does weight = next(self.parameters()).data mean?
Why did the author do:

weight = next(self.parameters()).data
hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
          weight.new(self.n_layers, batch_size, self.n_hidden).zero_())

What is this weight.new? How does he/she know what arguments to pass? Why didn't the author simply use torch.zeros() instead and do:

hidden_state = torch.zeros(num_layers * num_directions, batch_size, hidden_size).to(device)
cell_state = torch.zeros_like(hidden_state)
hidden_states = (hidden_state, cell_state)
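
If I understand the docs correctly, tensor.new(sizes) allocates a tensor with the same dtype and device as the source tensor, so grabbing any parameter is just a trick to inherit the model's dtype and device. A minimal sketch of my understanding (the model and all sizes below are my own placeholders, not from the notebook):

import torch
import torch.nn as nn

net = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
weight = next(net.parameters()).data        # any parameter, used only for dtype/device

# The notebook's spelling: new() inherits dtype/device, zero_() fills in place.
h0 = weight.new(2, 4, 16).zero_()

# What I believe is the equivalent modern spelling:
h0_alt = torch.zeros(2, 4, 16, dtype=weight.dtype, device=weight.device)

assert h0.shape == h0_alt.shape and h0.dtype == h0_alt.dtype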

Can anyone please explain to me what is happening here?
I greatly appreciate it.

For this part, what I can tell you is that nn.CrossEntropyLoss() takes the output as raw scores with one entry per class (the same layout a one-hot encoding would have) and takes the targets as plain class indices that are not one-hot encoded. Internally it applies LogSoftmax followed by NLLLoss, which simply picks out the log-probability at each target index, so the targets never need to be one-hot encoded.

You can find this in the documentation and its source code.
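
A minimal sketch (shapes chosen arbitrarily) of the expected input format, plus the manual computation it corresponds to:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(6, 10)              # (N, C): raw scores, one entry per class
targets = torch.randint(0, 10, (6,))     # (N,): plain class indices, not one-hot
loss = criterion(logits, targets)

# Equivalent manual computation: log-softmax, then pick out each target's entry.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(6), targets].mean()
assert torch.allclose(loss, manual)

# Passing float one-hot targets instead is what triggered the
# "multi-target not supported" error on older PyTorch versions.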
