The code snippets below work, but my implementation is very slow. Is there a better way to optimize them without breaking the autograd mechanism?
**Snippet 1**
```python
theta = torch.tril(torch.Tensor(self.N_parameters, self.N_parameters).to(device)).expand(self.N_subjects, self.N_parameters, self.N_parameters)
for i in range(0, self.N_subjects):
    # Keep only the lower triangle, then zero out its diagonal...
    self.chol_cov_theta.data[i] = torch.tril(self.chol_cov_theta.data[i])
    self.chol_cov_theta.data[i] -= torch.diag(torch.diag(self.chol_cov_theta.data[i]))
    # ...and rebuild theta[i] with a positive (exponentiated) diagonal.
    theta[i] = self.chol_cov_theta[i] + torch.diag(torch.exp(self.log_diag_chol_cov_theta[i]))
```
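The whole loop can likely be replaced by batched operations, since `torch.tril` (with `diagonal=-1` to drop the diagonal in one step) and `torch.diag_embed` both operate over the last two dimensions. A minimal sketch, with stand-in tensors in place of the `self.*` attributes and hypothetical shapes:

```python
import torch

# Hypothetical stand-ins for self.N_subjects / self.N_parameters.
N_subjects, N_parameters = 4, 3

# Stand-ins for self.chol_cov_theta and self.log_diag_chol_cov_theta.
chol_cov_theta = torch.randn(N_subjects, N_parameters, N_parameters, requires_grad=True)
log_diag_chol_cov_theta = torch.randn(N_subjects, N_parameters, requires_grad=True)

# diagonal=-1 keeps only the strictly lower triangle of every (P, P) slice,
# replacing the per-subject tril + diag subtraction in one batched call.
strict_lower = torch.tril(chol_cov_theta, diagonal=-1)

# diag_embed builds a batch of diagonal matrices from the exponentiated
# log-diagonals, so the diagonal is positive by construction.
theta = strict_lower + torch.diag_embed(torch.exp(log_diag_chol_cov_theta))

# No in-place writes through .data, so gradients flow through both parameters.
theta.sum().backward()
```

Avoiding the writes through `.data` matters here: they bypass autograd, whereas the batched version keeps everything on the graph.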
**Snippet 2**
```python
theta_j = []
for i in range(0, self.N_subjects):
    L_j = torch.inverse(theta[i]).t()
    theta_j.append((self.m_theta_j[i].view(-1, 1) + torch.mm(L_j, sampler_j[i].view(-1, 1))).view(1, -1))
theta_j = torch.cat(theta_j)
```
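This loop can also be batched. Since `inverse(A).t() @ v == solve(A.T, v)`, the per-subject inverse can be replaced by one batched `torch.linalg.solve`, which is both faster and more numerically stable than forming explicit inverses. A sketch with stand-in tensors for `theta`, `self.m_theta_j`, and `sampler_j` (shapes assumed from the question):

```python
import torch

N_subjects, N_parameters = 4, 3

# Stand-in theta: strictly lower triangle plus a positive diagonal,
# matching the construction in Snippet 1 and guaranteeing invertibility.
theta = torch.tril(torch.randn(N_subjects, N_parameters, N_parameters), diagonal=-1) \
    + torch.diag_embed(torch.rand(N_subjects, N_parameters) + 0.5)
m_theta_j = torch.randn(N_subjects, N_parameters)
sampler_j = torch.randn(N_subjects, N_parameters)

# inverse(theta[i]).t() @ v  ==  solve(theta[i].T, v), batched over subjects.
solved = torch.linalg.solve(theta.transpose(-2, -1), sampler_j.unsqueeze(-1))
theta_j = m_theta_j + solved.squeeze(-1)  # shape (N_subjects, N_parameters)
```

Because `theta` is triangular, `torch.linalg.solve_triangular` (with `upper=True` after the transpose) should be cheaper still, if your PyTorch version provides it.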