Determinant Slow on GPU

I have a tensor, X:

X.shape = torch.Size([B,N,C,C])

I am trying to calculate the determinant of each C×C matrix, i.e. batched over the B and N dimensions:

X_det = X.view(-1, C, C)   # flatten the batch dims: (B*N, C, C)
X_det = torch.det(X_det)   # one determinant per matrix: (B*N,)
X_det = X_det.view(B, N).unsqueeze(-1).repeat(1, 1, C)   # (B, N, C)

This function is exceptionally slow on GPU, which is not the case when running on CPU.

I am using PyTorch 1.6.0.

What can I do to implement this function quickly on the GPU?

What’s the size of your tensor, i.e. what are B, N, and C?
How have you benchmarked?
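
For reference, here is a minimal sketch of how one might time this fairly. CUDA kernels are launched asynchronously, so the timer should only be read after torch.cuda.synchronize(). The helper name time_det and the sizes B, N, C below are placeholders of mine, not values from your post.

```python
import time
import torch

def time_det(x, n_iters=10):
    """Return the average seconds per torch.det call on a batch of matrices."""
    torch.det(x)  # warm-up call so one-time setup cost is not timed
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for pending GPU work before starting
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.det(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping
    return (time.perf_counter() - start) / n_iters

B, N, C = 8, 16, 4  # placeholder sizes, replace with the real ones
X = torch.randn(B, N, C, C)

cpu_time = time_det(X.view(-1, C, C))
print(f"CPU: {cpu_time:.6f} s per call")

if torch.cuda.is_available():
    gpu_time = time_det(X.cuda().view(-1, C, C))
    print(f"GPU: {gpu_time:.6f} s per call")
```

Without the synchronize calls, the GPU number mostly reflects launch overhead or whatever work happened to be queued before, so the comparison with the CPU can be misleading in either direction.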

Generally, linear algebra algorithms have a harder time exploiting GPU parallelism than many other tasks, because the direct methods are fairly sequential.
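
If C turns out to be very small (an assumption on my part, since the sizes have not been stated), one option is to skip the factorization and write the determinant out in closed form, which is purely elementwise and parallelizes over the batch. The sketch below assumes C = 3; det3x3 is a hypothetical helper, not a PyTorch function.

```python
import torch

def det3x3(m):
    """Closed-form determinant for a batch of 3x3 matrices of shape (..., 3, 3)."""
    a, b, c = m[..., 0, 0], m[..., 0, 1], m[..., 0, 2]
    d, e, f = m[..., 1, 0], m[..., 1, 1], m[..., 1, 2]
    g, h, i = m[..., 2, 0], m[..., 2, 1], m[..., 2, 2]
    # Cofactor expansion along the first row
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Sanity check against torch.det on random data (assumed shape B=8, N=16, C=3)
X = torch.randn(8, 16, 3, 3, dtype=torch.float64)
result = det3x3(X)
print(result.shape)
print(torch.allclose(result, torch.det(X)))
```

This only makes sense for tiny matrices; the number of terms grows quickly with C, and the expansion is less numerically robust than an LU-based determinant.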

Best regards