Performance of LU factorization: can it be improved?

Dear PyTorch users,

I want to do batched LU factorisation as fast as possible for some medium-sized matrices (around 200 x 200). However, when I run a simple benchmark it is much slower than expected. On a GTX 1080 Ti I get about 30 GFLOP/s in single precision. Considering that the card has a single-precision peak of roughly 10 TFLOP/s, I find that a bit disappointing. Is there something I can do to improve the situation, or is this the performance to be expected? For instance, does anyone have experience with the MAGMA build of PyTorch and know whether it improves performance? (I already know some improvement should be possible, since TensorFlow is about twice as fast.)
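
(For reference, the back-of-the-envelope arithmetic behind that comparison, using the same 2/3 * m**3 flop count for LU as in the benchmark below:)

m = 200
flops_per_lu = 2 / 3 * m**3        # ~5.3 MFLOP for one 200 x 200 factorisation
print(flops_per_lu / 30e9 * 1e6)   # ~178 us per system at the ~30 GFLOP/s I measure
print(flops_per_lu / 10e12 * 1e6)  # ~0.5 us per system at the quoted peak rate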

Best
Anders

Benchmarking code used:

import torch
import time

# Benchmark parameters
rep = 1000 # Batch size
loop = 25 # Number of repeats
m = 200 # Matrix size (m x m)

# cuda:0 corresponds to a GTX 1080 Ti
settings = {'dtype': torch.float32, 'device': 'cuda:0'}


def print_no_flops(t, m):
    # Standard operation count for LU factorisation of an m x m matrix
    lu_ops_total = 2 / 3 * m**3
    GFlops = 1e9  # 1 GFLOP = 1e9 floating-point operations
    print(f"Performance: {lu_ops_total / t / GFlops:.1f} GFLOP/s")


# Create data (1 system)
A = torch.randn((m, m), **settings)
ATA = A.t().mm(A)

# Make it a batch
ATA = ATA.unsqueeze(0) * torch.ones((rep, 1, 1), **settings)


# Time the calculation
tic = time.perf_counter()
for i in range(loop):
    LU = torch.lu(ATA)
toc = time.perf_counter()

dt = toc-tic
dt_per_system = dt / (rep * loop)
print(f"Solve time per system = {dt_per_system * 1e6} us")
print_no_flops(dt_per_system, m)
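
One caveat about the timing itself: CUDA kernels are launched asynchronously, so the loop above can in principle stop the clock before all the queued factorisations have finished. A variant of the same measurement with a warm-up call and explicit torch.cuda.synchronize() (reusing ATA, rep, loop, m and print_no_flops from above) would look roughly like this:

# Warm-up call so one-off overhead (kernel selection, workspace
# allocation) is not included in the measurement
torch.lu(ATA)
torch.cuda.synchronize()

tic = time.perf_counter()
for i in range(loop):
    LU = torch.lu(ATA)
torch.cuda.synchronize()  # wait for all queued kernels to finish
toc = time.perf_counter()

dt_per_system = (toc - tic) / (rep * loop)
print(f"Factorisation time per system = {dt_per_system * 1e6:.1f} us")
print_no_flops(dt_per_system, m)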