I’m using an LSTM, and this error keeps occurring when the hidden state size is larger than 16. At 16 the network outputs nan instead.

The model is defined as:

class LSTM(nn.Module):
     def __init__(self, input_dim, hidden_dim, batch_size, output_dim=1, num_layers=1, dropout=0, h0=None, c0=None):
         super(LSTM, self).__init__()
         self.input_dim = input_dim #n_row*n_col
         self.hidden_dim = hidden_dim
         self.batch_size = batch_size
         self.output_dim = output_dim
         self.num_layers = num_layers
         self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, dropout=dropout)
         self.decoder = nn.Linear(self.hidden_dim, self.output_dim)
         #Initialize hidden states, default is zero
         if h0 is None:
             self.h0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_dim)
             self.c0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_dim)
         else:
             self.h0 = h0
             self.c0 = c0
         if torch.cuda.is_available():
             self.h0 = self.h0.cuda()
             self.c0 = self.c0.cuda()
     #Forward pass
     def forward(self, input):
         #Input to LSTM has shape (seq_length, batch_size, n_row*n_col)
         #LSTM output has shape (seq_length, batch_size, hidden_dim)
         lstm_out, self.hidden = self.lstm(input, (self.h0, self.c0))
         #Decoder output has shape (seq_length, batch_size, output_dim)
         prediction = self.decoder(lstm_out)
         return prediction, self.hidden
     #Propagate one step in forward pass
     def step(self, input, h, c):
         #Input to LSTM has shape (1, batch_size, n_row*n_col)
         #LSTM output has shape (1, batch_size, hidden_dim)
         if torch.cuda.is_available():
             h = h.cuda()
             c = c.cuda()
             input = input.cuda()
         lstm_out, self.hidden = self.lstm(input, (h, c))
         # Decoder output has shape (1, batch_size, output_dim)
         prediction = self.decoder(lstm_out)
         return prediction, self.hidden

and trained with:

 loss_fn = nn.MSELoss()
 optimizer = optim.Adam(params=model.parameters(), lr=lr, weight_decay=weight_decay)
 for epoch in range(n_epochs):
     loss_total = 0
     for batch_idx, batch in enumerate(dataloader):
         if batch_idx == end_idx:
             break
         input_seq = Variable(batch["input"].view(seq_length, batch_size, -1))
         output_seq = Variable(batch["output"].view(seq_length, batch_size, -1))
         if torch.cuda.is_available():
             input_seq = input_seq.cuda()
             output_seq = output_seq.cuda()
         _, (h, c) = model(input_seq[0:seen_step])
         empty_input = torch.zeros_like(input_seq[0:1])
         fut_prediction = []
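         for t in range(fut_step):
             prediction, (h, c) = model.step(empty_input, h, c)
             fut_prediction.append(prediction)
         # Stack the per-step predictions along the time dimension
         pred_seq = torch.cat(fut_prediction, dim=0)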
         truth_seq = output_seq[seen_step:]
         loss = loss_fn(pred_seq, truth_seq)
         loss_total += loss.detach().item()
         optimizer.zero_grad()
         loss.backward()
         if grad_clip != 0:
             torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
         optimizer.step()

The code is a bit hard to read, but could it be that you never detach the hidden state, so the computation graph keeps growing throughout training?
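To illustrate what detaching means here: a minimal sketch, assuming the hidden state is carried over from one batch to the next. (The model above restarts from its stored zero state on every forward call, in which case this would not apply.) The sizes and the names `lstm`, `decoder` and `losses` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
seq_length, batch_size, input_dim, hidden_dim = 5, 3, 4, 8

lstm = nn.LSTM(input_dim, hidden_dim)
decoder = nn.Linear(hidden_dim, 1)
params = list(lstm.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

h = torch.zeros(1, batch_size, hidden_dim)
c = torch.zeros(1, batch_size, hidden_dim)

losses = []
for step in range(3):
    x = torch.randn(seq_length, batch_size, input_dim)
    y = torch.randn(seq_length, batch_size, 1)

    # Keep the values of the state but drop the history of earlier batches.
    # Without this, backward() would also try to traverse the graphs of all
    # previous batches.
    h, c = h.detach(), c.detach()

    out, (h, c) = lstm(x, (h, c))
    loss = loss_fn(decoder(out), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The detach happens at the start of each batch, so gradients still flow through all time steps inside a batch but never across batch boundaries.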


I am using the zero initialized hidden state for every forward pass. Is that what you mean?

Yes, that’s what I mean. Where are you re-initializing it?

Did you find out the line number/function call where it happens?
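One way to answer that question: CUDA kernels run asynchronously, so the Python line shown in a traceback is often not the call that actually failed. The sketch below shows two standard debugging aids, the `CUDA_LAUNCH_BLOCKING=1` environment variable (it has to be set before CUDA is initialized) and `torch.autograd.set_detect_anomaly`. The helper `first_nonfinite` is hypothetical and only there to show a NaN check.

```python
import os

# Must be set before the first CUDA call so kernels run synchronously
# and the traceback points at the call that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def first_nonfinite(named_tensors):
    """Return the name of the first tensor containing NaN or inf, else None."""
    for name, tensor in named_tensors:
        if not bool(torch.isfinite(tensor).all()):
            return name
    return None


# Anomaly mode makes backward() raise at the op that produced a NaN gradient.
# It slows training down, so enable it only while debugging.
with torch.autograd.set_detect_anomaly(True):
    x = torch.tensor([1.0, 4.0], requires_grad=True)
    y = torch.sqrt(x).sum()
    y.backward()

checked = first_nonfinite([("x", x.detach()), ("grad", x.grad)])
bad = first_nonfinite([
    ("ok", torch.ones(2)),
    ("broken", torch.tensor([0.0, float("nan")])),
])
```

In a training loop, the same check could be run on the loss and on `model.named_parameters()` after each step to find the first iteration where values stop being finite.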

Unrelated to the question: is it intended to feed an empty input to all the decoder time steps?

I am also facing this error when using a pretrained resnet50:

data_transforms = {
    'train': transforms.Compose([
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

data_dir = "data"
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                              shuffle=True, num_workers=4)
               for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def imshow(inp, title=None):
    """Imshow for Tensor."""
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    plt.imshow(inp)
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause a bit so that plots are updated

# Get a batch of training data
inputs, classes = next(iter(dataloaders['train']))

# Make a grid from batch
out = torchvision.utils.make_grid(inputs)

imshow(out, title=[class_names[x] for x in classes])

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        # print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

Epoch 0/24

/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THCUNN/ void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "", line 179, in
  File "", line 109, in train_model
  File "/home/raman/anaconda2/envs/pyto3/lib/python3.6/site-packages/torch/", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/raman/anaconda2/envs/pyto3/lib/python3.6/site-packages/torch/autograd/", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
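
The assertion `t >= 0 && t < n_classes` in the log above is raised by the NLL/cross-entropy loss kernel when a target label falls outside `[0, n_classes)`, where `n_classes` is the width of the model output. One possible cause, which is an assumption here because the model setup is not shown in the post, is a final layer sized for fewer classes than `ImageFolder` finds as class directories. A minimal sketch of a check that gives a readable error on the CPU side; `check_targets` is a hypothetical helper, and a small `nn.Linear` stands in for the network's last layer.

```python
import torch
import torch.nn as nn


def check_targets(outputs, labels):
    """Raise a readable error if any label is outside [0, n_classes)."""
    n_classes = outputs.size(1)
    lo, hi = int(labels.min()), int(labels.max())
    if lo < 0 or hi >= n_classes:
        raise ValueError(
            "labels span [{}, {}] but the model outputs {} classes".format(
                lo, hi, n_classes)
        )


# Stand-in for the last layer of the network (hypothetical sizes).
num_features, num_classes = 16, 2
head = nn.Linear(num_features, num_classes)

features = torch.randn(4, num_features)
good_labels = torch.tensor([0, 1, 1, 0])
bad_labels = torch.tensor([0, 1, 2, 0])  # class 2 does not exist for a 2-way head

outputs = head(features)
check_targets(outputs, good_labels)      # passes silently
loss = nn.CrossEntropyLoss()(outputs, good_labels)

try:
    check_targets(outputs, bad_labels)
    message = None
except ValueError as err:
    message = str(err)

# Sizing the head to the number of classes in the data removes the mismatch.
head = nn.Linear(num_features, int(bad_labels.max()) + 1)
check_targets(head(features), bad_labels)
```

In the training loop above, the call would go right before `criterion(outputs, labels)`, and `len(class_names)` gives the number of classes the dataset actually contains.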

I would like to add my experience with this error.
My question concerns CUDA 9.0 and cuDNN 7.5.1. The error is similar to the one above, so sharing my experience may be useful to others. I have installed both successfully and configured CUDA for both TensorFlow and PyTorch.

I am testing torch on the GPU with the code attached in the link.

I got the following error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I tried a few things, but I think this has something to do with CUDA, even though CUDA itself works with torch.

Any ideas are welcome.

With Regards
Sanpreet Singh

Hello Sanpreet Singh

I have gone through the error you are facing, and I have run into such issues in the past. From my knowledge and experience, the problem comes from mismatched versions of CUDA, cuDNN and torch. If you are using an RTX series GPU, CUDA 9.0 has some compatibility issues, so it would be better to use CUDA 10.0 instead. I also recommend doing this in a virtual environment. If you don't know how to create one, please follow these steps:

  • Install pip first using this command:
    sudo apt-get install python3-pip

  • Then install virtualenv using pip3:
    sudo pip3 install virtualenv

  • Now create a virtual environment:
    virtualenv venv (you can use any name instead of venv)

  • Activate your virtual environment:
    source venv/bin/activate

Now, please visit and install torch with CUDA 10. If you face any problem, please have a look at the link below, where I have added a screenshot for it.

I am also adding the commands in case you don't want to look them up yourself.

pip3 install

pip3 install torchvision
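
After installing, the versions that torch was built against can be checked from Python. A small sketch using `torch.version.cuda`, `torch.backends.cudnn.version()` and `torch.cuda.get_device_name`; the `report` dictionary is only for illustration, and the tiny convolution is a smoke test, not a full validation of the setup.

```python
import torch
import torch.nn as nn

report = {
    "torch": torch.__version__,
    "cuda_build": torch.version.cuda,         # None for CPU-only builds
    "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "cuda_available": torch.cuda.is_available(),
}

if report["cuda_available"]:
    report["device"] = torch.cuda.get_device_name(0)

device = torch.device("cuda:0" if report["cuda_available"] else "cpu")

# Smoke test: a tiny convolution; on a GPU this normally goes through cuDNN.
conv = nn.Conv2d(3, 4, kernel_size=3, padding=1).to(device)
out = conv(torch.randn(2, 3, 8, 8, device=device))

for key, value in report.items():
    print("{}: {}".format(key, value))
print("conv output shape:", tuple(out.shape))
```

If the convolution fails with a cuDNN error while `cuda_available` is true, that points to a mismatch between the GPU, the driver and the CUDA/cuDNN versions of the installed build rather than to the model code.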

With regards
Ekta Smothra


Thanks Ekta. I tried your solution and it worked.
