Please take a moment to look at the notebook here. It is a simple tutorial on LSTMs from Udemy's PyTorch course.
There are two sections in this IPython notebook that confuse me greatly.
- Why is it necessary to use contiguous() when using an LSTM?
and, more importantly:
- Why doesn't the training procedure fail because of the following code snippet?
counter = 0
n_chars = len(net.chars)

for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    for x, y in get_batches(data, batch_size, seq_length):
        counter += 1

        # One-hot encode our data and make them Torch tensors
        x = one_hot_encode(x, n_chars)
        inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

        if train_on_gpu:
            inputs, targets = inputs.cuda(), targets.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)
        print(f'output.shape: {output.shape}')
        print(f'y.shape: {targets.shape}')
        print(targets[0, :])

        # calculate the loss and perform backprop
        loss = criterion(output, targets.view(batch_size*seq_length).long())
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        opt.step()

        # loss stats
        if counter % print_every == 0:
What I'm specifically referring to is this line:
loss = criterion(output, targets.view(batch_size*seq_length).long())
Basically, here the author passes an output with a one-hot-style class dimension, of shape (batch, sequence_length, features), together with a normal, not-one-hot-encoded target tensor of shape (batch_size, sequence_length)!
Why does it not fail? How is cross-entropy doing its job when the two tensors are not both one-hot encoded?
If you go ahead and one-hot encode the targets as well, you get the error:
RuntimeError: multi-target not supported at C:/w/1/s/windows/pytorch/aten/src\THCUNN/generic/ClassNLLCriterion.cu:15
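For reference, this is how I currently understand nn.CrossEntropyLoss: it takes raw logits plus integer class indices, not one-hot targets. A toy sketch of my own with made-up sizes, not code from the notebook:

import torch
import torch.nn as nn

batch_size, seq_length, n_chars = 2, 3, 5   # made-up toy sizes
criterion = nn.CrossEntropyLoss()

# logits as the model would produce them after flattening: (batch*seq_length, n_chars)
logits = torch.randn(batch_size * seq_length, n_chars)

# targets as plain class indices: (batch*seq_length,) -- not one-hot
targets = torch.randint(0, n_chars, (batch_size * seq_length,))

loss = criterion(logits, targets)   # runs fine
print(loss.item())

# one-hot targets of shape (batch*seq_length, n_chars) are what seem to trigger
# the "multi-target not supported" error on the PyTorch version the notebook uses:
# one_hot = nn.functional.one_hot(targets, n_chars).float()
# criterion(logits, one_hot)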
Using contiguous() does not seem to help here, and passing the targets as plain class indices, as in the snippet above, seems to be the only way to get this to work!
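For context, this is the pattern I believe the notebook applies to the LSTM output before the fully connected layer (a minimal sketch of my own with made-up sizes, not the actual model code):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 7, 10)             # (batch, seq_len, features)
out, hidden = lstm(x)                 # out: (batch, seq_len, hidden_size)

# .view() requires the underlying memory to be contiguous; calling .contiguous()
# first makes the reshape safe even if the output is not laid out contiguously
flat = out.contiguous().view(-1, 16)  # (batch*seq_len, hidden_size)
print(flat.shape)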
I also have a side question: what does weight = next(self.parameters()).data mean?
Why did the author do:
weight = next(self.parameters()).data
hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
          weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
What is this weight.new? How does the author know what arguments to pass? Why didn't the author simply use torch.zeros() instead and do:
hidden_state = torch.zeros(num_layers * num_directions, batch_size, hidden_size).to(device)
cell_state = torch.zeros_like(hidden_state)
hidden_states = (hidden_state, cell_state)
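As far as I can tell, weight.new(...) just allocates an uninitialized tensor with the same dtype and device as that parameter, so calling .zero_() on it gives zeros that automatically live wherever the model lives. A tiny sketch of what I mean (my own example, not from the notebook):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=16, num_layers=2, batch_first=True)

weight = next(lstm.parameters()).data   # any parameter tensor of the model
h0 = weight.new(2, 4, 16).zero_()       # zeros with weight's dtype and device

# which should be equivalent to spelling out dtype/device explicitly:
h0_alt = torch.zeros(2, 4, 16, dtype=weight.dtype, device=weight.device)
print(torch.equal(h0, h0_alt))          # True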
Can anyone please explain to me what is happening here?
I greatly appreciate it.