RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:26

ash · March 13, 2020, 9:16pm

I am new to PyTorch and want to understand how many-to-many RNN classification is done in torch for a time series dataset. The dataset has 135 features and 121 time steps, with a label for each time step. There are a total of 10 labels.

This is my code:

seq_len = 121
batch_size = 702
features = 135
Train_tensor = Train_tensor.view(batch_size, seq_len, features)
trialtgt = Traintgt_tensor.view(batch_size, seq_len, 1).long()

# Implement RNN network

class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):

        batch_size = x.size(0)

        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)
        print("Hidden", hidden.shape)

        # Passing in the input and hidden state into the model and obtaining outputs
        #print("X", x.shape)
        out, hidden = self.rnn(x, hidden)
        print("Output", out.shape)
        print("Hidden", hidden.shape)

        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)
        print("Linear layer", out.shape)

        return out, hidden

    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device)
        return hidden


# Instantiate the model with hyperparameters

model = Model(input_size=features, output_size= 1, hidden_dim=500, n_layers=1)

# We’ll also set the model to the device that we defined earlier (default is CPU)

model = model.to(device)

# Define hyperparameters

n_epochs = 100
lr=0.01

# Define Loss, Optimizer

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Training Run

for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() # Clears existing gradients from previous epoch
    Train_tensor = Train_tensor.to(device)
    Train_tensor = Train_tensor.float()
    output, hidden = model(Train_tensor)
    Traintgt_tensor = Traintgt_tensor.to(device)
    Traintgt_tensor= Traintgt_tensor.float()
    loss = criterion(output, Traintgt_tensor.view(-1).long())
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly


    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

The shape of training set is as follows:

Train_tensor shape torch.Size([702, 121, 135])
Traintgt_tensor shape torch.Size([84942])
trialtgt shape torch.Size([702, 121, 1])

where 702 is the batch size, 121 is the sequence length and 135 is the number of features

The error message I am getting is:

RuntimeError Traceback (most recent call last)
in ()
11 loss = criterion(output, Traintgt_tensor.view(-1).long())
12 #loss = criterion(output, trialtgt)
---> 13 loss.backward() # Does backpropagation and calculates gradients
14 optimizer.step() # Updates the weights accordingly
15

~/anaconda3/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
193 products. Defaults to False.
194 """
--> 195 torch.autograd.backward(self, gradient, retain_graph, create_graph)
196
197 def register_hook(self, hook):

~/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:26

Could you please help me solve this issue or help me understand whether I have implemented the RNN correctly!

ptrblck · March 15, 2020, 7:06am

Is your model working on the CPU?
Usually you would get a better error message by running your model on the CPU.
However, if your model only raises this error on the GPU, could you rerun the code with:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here?
Often the class indices are not in the expected range of [0, nb_classes-1].
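A quick way to check this before calling the loss is sketched below. It assumes the targets are an integer tensor and that the model emits `nb_classes` logits per time step; `check_targets` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import torch

def check_targets(target, nb_classes):
    """Return True if every class index lies in [0, nb_classes - 1]."""
    lo, hi = int(target.min()), int(target.max())
    return 0 <= lo and hi <= nb_classes - 1

# Hypothetical labels covering all 10 classes (0..9)
target = torch.arange(10).repeat(3)

# Valid when the model has 10 output units
ok_10 = check_targets(target, nb_classes=10)

# With a single output unit, only class index 0 would be valid
ok_1 = check_targets(target, nb_classes=1)

print(ok_10, ok_1)
```

If the check fails on the CPU side, the device-side assert on the GPU is very likely caused by the same out-of-range indices.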

ash · March 16, 2020, 6:11pm

I modified the code a bit and ran it on the CPU. The code and the stack trace are as follows:

# Training Run

for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() # Clears existing gradients from previous epoch
    Train_tensor = Train_tensor.float()
    output, hidden = model(Train_tensor)
    Traintgt_tensor= Traintgt_tensor.float()
    loss = criterion(output, Traintgt_tensor.view(-1,1))
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

Error:
Hidden torch.Size([1, 702, 500])
Output torch.Size([702, 121, 500])
Linear layer torch.Size([84942, 1])

RuntimeError Traceback (most recent call last)
in ()
9 #Traintgt_tensor = Traintgt_tensor.to(device)
10 Traintgt_tensor= Traintgt_tensor.float()
---> 11 loss = criterion(output, Traintgt_tensor.view(-1,1))
12 #loss = criterion(output, trialtgt)
13 loss.backward() # Does backpropagation and calculates gradients

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
914 def forward(self, input, target):
915 return F.cross_entropy(input, target, weight=self.weight,
--> 916 ignore_index=self.ignore_index, reduction=self.reduction)
917
918

~/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
2019 if size_average is not None or reduce is not None:
2020 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2021 return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
2022
2023

~/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
1836 .format(input.size(0), target.size(0)))
1837 if dim == 2:
-> 1838 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
1839 elif dim == 4:
1840 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: 1D target tensor expected, multi-target not supported

ptrblck · March 16, 2020, 8:00pm

Most likely you are passing the target tensor in the wrong shape.
For a multi-class classification use case with nn.CrossEntropyLoss, the model output should have the shape [batch_size, nb_classes], while the target should have the shape [batch_size] and contain the class indices in the range [0, nb_classes-1].
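As a rough illustration of those shapes for a many-to-many setup, the sketch below flattens the time dimension into the batch dimension, so there is one row of logits and one class index per time step. The small batch size and the random tensors are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

batch_size, seq_len, nb_classes = 4, 121, 10  # small batch, for illustration only

# One row of logits per time step: [batch_size * seq_len, nb_classes]
logits = torch.randn(batch_size * seq_len, nb_classes)

# One class index per time step, flattened to 1D and cast to int64
target = torch.randint(0, nb_classes, (batch_size, seq_len, 1)).view(-1).long()

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)

print(logits.shape, target.shape)
```

Passing the target as `view(-1, 1)` instead of `view(-1)` gives it a second dimension, and a float target of that shape is not a tensor of class indices, which is what the "multi-target not supported" message complains about.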

ash · March 16, 2020, 8:36pm

@ptrblck Thanks a lot. There were 10 classes and I was giving the output size as 1, instead of 10. The issue is solved.
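For anyone landing here later, a minimal sketch of the corrected setup might look like the following. It is not the exact code from this thread: the hidden size and batch size are shrunk and the data is random, purely so that it runs quickly on the CPU. The relevant change is `output_size=10`, matching the 10 classes.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        # When no initial hidden state is passed, nn.RNN starts from zeros
        out, hidden = self.rnn(x)
        # Flatten [batch, seq_len, hidden_dim] to [batch * seq_len, hidden_dim]
        out = out.contiguous().view(-1, self.hidden_dim)
        return self.fc(out), hidden

# output_size equals the number of classes (10), not 1
model = Model(input_size=135, output_size=10, hidden_dim=32, n_layers=1)

x = torch.randn(4, 121, 135)               # assumed toy batch of 4 sequences
target = torch.randint(0, 10, (4 * 121,))  # one class index per time step

output, hidden = model(x)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()

print(output.shape, hidden.shape)
```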
