RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCTensorMathCompare.cuh:82

I'm training a model, and after running for anywhere between 50k and 0.5 million epochs it fails with this error, again and again! I've attached the terminal error and the piece of code where the error occurs. Kindly help me!

The code is attached below:

def train_batch_MLE(self, enc_out, enc_hidden, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, batch):
    ''' Calculate Negative Log Likelihood Loss for the given batch. In order to reduce exposure bias,
    pass the previous generated token as input with a probability of 0.25 instead of ground truth label
    :param enc_out: Outputs of the encoder for all time steps (batch_size, length_input_sequence, 2*hidden_size)
    :param enc_hidden: Tuple containing final hidden state & cell state of encoder. Shape of h & c: (batch_size, hidden_size)
    :param enc_padding_mask: Mask for encoder input; Tensor of size (batch_size, length_input_sequence) with values of 0 for pad tokens & 1 for others
    :param ct_e: encoder context vector for time_step=0 (eq 5 in
    :param extra_zeros: Tensor used to extend vocab distribution for pointer mechanism
    :param enc_batch_extend_vocab: Input batch that stores OOV ids
    :param batch: batch object
    '''
    dec_batch, max_dec_len, dec_lens, target_batch = get_dec_data(batch)                        #Get input and target batches for training decoder
    step_losses = []

    copy_loss = []
    s_t = (enc_hidden[0], enc_hidden[1])                                                        #Decoder hidden states
    x_t = get_cuda(T.LongTensor(len(enc_out)).fill_(self.start_id))                             #Input to the decoder
    prev_s = None                                                                               #Used for intra-decoder attention (section 2.2 in
    sum_temporal_srcs = None                                                                    #Used for intra-temporal attention (section 2.1 in
    for t in range(min(max_dec_len, config.max_dec_steps)):
        use_ground_truth = get_cuda((T.rand(len(enc_out)) > 0.25)).long()                       #0/1 mask: 1 = feed the ground truth label, 0 = feed the previously sampled token
        x_t = use_ground_truth * dec_batch[:, t] + (1 - use_ground_truth) * x_t                 #Select decoder input based on the use_ground_truth mask
        x_t = self.model.embeds(x_t)
        final_dist, s_t, ct_e, sum_temporal_srcs, prev_s = self.model.decoder(x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s)
        target = target_batch[:, t]
        log_probs = T.log(final_dist + config.eps)

        step_loss = F.nll_loss(log_probs, target, reduction="none", ignore_index=self.pad_id)
        step_losses.append(step_loss)                                                           #Collect per-step losses; they are stacked after the loop
        # except:
        #     print('')
        x_t = T.multinomial(final_dist, 1).squeeze()                                            #Sample words from final distribution which can be used as input in next time step
        is_oov = (x_t >= config.vocab_size).long()                                              #Mask indicating whether sampled word is OOV
        x_t = (1 - is_oov) * x_t.detach() + (is_oov) * self.unk_id                              #Replace OOVs with [UNK] token
        # print("ssss",x_t)
    losses = T.sum(T.stack(step_losses, 1), 1)                                                  #unnormalized losses for each example in the batch; (batch_size)
    batch_avg_loss = losses / dec_lens                                                          #Normalized losses; (batch_size)
    mle_loss = T.mean(batch_avg_loss)                                                           #Average batch loss
    # copy_losses = T.sum(T.stack(copy_loss, 1), 1)  # unnormalized losses for each example in the batch; (batch_size)
    # batch_avg_copy_loss = copy_losses.float() / dec_lens  # Normalized losses; (batch_size)
    # avg_copy_loss = T.mean(batch_avg_copy_loss)
    # mle_loss += avg_copy_loss*5
    return mle_loss

Based on the error message, it seems that an internal check failed in the sampleMultinomialOnce kernel.
However, I cannot find the THCNumerics<T>::ge(val, zero) operation and guess that you might be hitting one of these checks. Which PyTorch version are you using, and could you update to the latest release in case you are using an older one?
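To answer the version question, and to make the Python stack trace point at the operation that actually asserted, a small sketch like the one below may help. It assumes the script is started fresh; `describe_environment` is a hypothetical helper name, not part of PyTorch. `CUDA_LAUNCH_BLOCKING=1` makes CUDA kernel launches synchronous, so it has to be set before the first CUDA call and will slow training down.

```python
import os

# Assumption: this runs at the very top of the training script,
# before anything touches CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def describe_environment():
    """Collect the version details that are useful in a bug report."""
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_environment())
```

Running the failing batch on the CPU is another option, since CPU errors are usually reported at the offending call with a readable message.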

I was using the latest release at first, but other articles about this error suggested PyTorch 1.1.0, so I switched to it. However, the latest PyTorch throws the same error!

I reran it and got another error similar to the old one!

Could you check the inputs to the multinomial call and make sure they contain valid values?
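One way to run that check is sketched below. It is only an illustration: `check_multinomial_input` is a hypothetical helper, and the three conditions are the usual requirements for a probability tensor passed to `torch.multinomial` (finite values, no negative entries, and a positive sum in every row).

```python
import torch


def check_multinomial_input(probs):
    """Return a list of problems found in a (batch, classes) probability tensor."""
    problems = []
    if not torch.isfinite(probs).all():
        problems.append("contains NaN or inf")
    if (probs < 0).any():
        problems.append("contains negative entries")
    if (probs.sum(dim=-1) <= 0).any():
        problems.append("has a row whose probabilities sum to zero or less")
    return problems


# Small demonstration on CPU tensors.
valid = torch.tensor([[0.2, 0.8], [0.5, 0.5]])
with_nan = torch.tensor([[float("nan"), 1.0]])
all_zero = torch.zeros(1, 3)

print(check_multinomial_input(valid))     # no problems expected
print(check_multinomial_input(with_nan))
print(check_multinomial_input(all_zero))
```

In the training loop from the question, the helper could be called on `final_dist.detach()` just before `T.multinomial(final_dist, 1)`, raising an exception and dumping the offending batch whenever the returned list is non-empty.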