Custom criterion calculation is slow on CUDA

I wrote a custom criterion function as my loss, but it turns out to be slower on the GPU than on the CPU, both when computing the loss value and when calling loss.backward(). Here is the code defining the criterion.

import torch
import torch.nn as nn
from torch.distributions.negative_binomial import NegativeBinomial


class NegativeBinomialLoss(nn.Module):
    def __init__(self):
        """
        This module is a custom loss criterion for the negative binomial
        distribution, implementing -sum(log(likelihood)).
        """
        super(NegativeBinomialLoss, self).__init__()

    def forward(
        self,
        net_output: torch.Tensor,
        y: torch.Tensor,
    ) -> torch.Tensor:
        """
        Args:
            net_output: predicted values, i.e. the network output
            y: targets

        Returns: the loss tensor
        """
        # alpha is the number of successes, lambda is the number of failures
        param_alpha = net_output.t()[0]
        param_lambda = net_output.t()[1]
        log_likelihoods = torch.empty(len(y), device=net_output.device)
        for i in range(len(y)):
            # In torch.distributions.negative_binomial.NegativeBinomial(total_count, probs),
            # total_count is the number of successes and probs is the failure probability
            nb = NegativeBinomial(
                total_count=param_alpha[i],
                probs=param_lambda[i] / (param_alpha[i] + param_lambda[i]),
            )
            log_likelihoods[i] = nb.log_prob(y[i])
        return -log_likelihoods.sum()

I tried moving the criterion to CUDA as well, but nothing changed.

criterion = NegativeBinomialLoss().cuda()
loss = criterion(output, targets)
loss.backward()

Is there any way to speed this up?

Update: it seems I figured it out by avoiding the for loop in the criterion.