Simplify time-consuming operation and keep autograd mechanics

The code snippets below work, but my implementation is very slow. Is there a better way to speed up the code without breaking the autograd mechanism?

**Snippet 1**

        theta = torch.tril(torch.Tensor(self.N_parameters, self.N_parameters).to(device)).expand(self.N_subjects, self.N_parameters, self.N_parameters)
        for i in range(0, self.N_subjects):
            self.chol_cov_theta.data[i] = torch.tril(self.chol_cov_theta.data[i])
            self.chol_cov_theta.data[i] -= torch.diag(torch.diag(self.chol_cov_theta.data[i]))
            theta[i] = self.chol_cov_theta[i] + torch.diag(torch.exp(self.log_diag_chol_cov_theta[i]))
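
For reference, a loop-free version of Snippet 1 might look like the sketch below. It assumes `chol_cov_theta` has shape `(N_subjects, N_parameters, N_parameters)` and `log_diag_chol_cov_theta` has shape `(N_subjects, N_parameters)`; the function name `build_chol_factors` and the toy sizes are hypothetical. Unlike the loop, it masks the upper triangle out of place instead of overwriting the stored values, so autograd simply assigns zero gradient to the masked entries.

```python
import torch

def build_chol_factors(chol_cov_theta, log_diag_chol_cov_theta):
    # Hypothetical helper; the shapes are assumptions, not taken from the post.
    # chol_cov_theta: (S, P, P), log_diag_chol_cov_theta: (S, P)
    # Keep only the strictly lower triangle of every matrix in the batch.
    strict_lower = torch.tril(chol_cov_theta, diagonal=-1)
    # Build a batch of diagonal matrices holding exp(log_diag).
    diag = torch.diag_embed(torch.exp(log_diag_chol_cov_theta))
    return strict_lower + diag

# Toy usage with made-up sizes.
torch.manual_seed(0)
S, P = 4, 3
chol = torch.randn(S, P, P, requires_grad=True)
log_diag = torch.randn(S, P, requires_grad=True)
theta = build_chol_factors(chol, log_diag)
```

Both `torch.tril` and `torch.diag_embed` operate on batches and are differentiable, so no per-subject loop or in-place write is needed here.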

**Snippet 2**

    theta_j = []
    for i in range(0, self.N_subjects):
        L_j = torch.inverse(theta[i]).t()
        theta_j.append((self.m_theta_j[i].view(-1,1) + torch.mm(L_j, sampler_j[i].view(-1,1))).view(1,-1))
    theta_j = torch.cat(theta_j)
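
Snippet 2 could probably be batched in a similar way. The sketch below assumes the loop computes, per subject, the mean plus the transposed inverse of `theta[i]` applied to the sample, and that `m_theta_j` and `sampler_j` are tensors of shape `(N_subjects, N_parameters)`. The function name `sample_theta_j` and the toy data are hypothetical.

```python
import torch

def sample_theta_j(theta, m_theta_j, sampler_j):
    # Hypothetical helper; the shapes are assumptions, not taken from the post.
    # theta: (S, P, P), m_theta_j: (S, P), sampler_j: (S, P)
    # Batched inverse, then transpose the last two dimensions.
    L = torch.inverse(theta).transpose(-1, -2)
    # Batched matrix-vector product: (S, P, P) @ (S, P, 1) -> (S, P)
    shift = torch.bmm(L, sampler_j.unsqueeze(-1)).squeeze(-1)
    return m_theta_j + shift

# Toy usage with made-up sizes (float64 to keep the comparison tight).
torch.manual_seed(0)
S, P = 4, 3
theta = torch.tril(torch.randn(S, P, P, dtype=torch.float64), diagonal=-1)
theta = theta + torch.diag_embed(torch.exp(torch.randn(S, P, dtype=torch.float64)))
theta.requires_grad_(True)
m = torch.randn(S, P, dtype=torch.float64)
z = torch.randn(S, P, dtype=torch.float64)
out = sample_theta_j(theta, m, z)  # shape (S, P)
```

If `theta` is always lower triangular, a triangular solve such as `torch.linalg.solve_triangular` (available in recent PyTorch releases) might avoid forming the inverse explicitly, though that is a variation on the original code rather than a drop-in copy of it.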