Sorry, I didn't finish writing and accidentally clicked to post. Is there any particular reason why calculating the batched inverse of a (minibatches, channels, 3, 3) tensor would be extremely slow?

For context, I have a modified Fast-SCNN in which one specific portion at the tail end performs a least-squares fit for every channel in every minibatch. The input to the least-squares module is minibatches*channels weight tensors from the earlier layers, and I compute the coefficients of a parabola via the normal equations, B = (X^T * X)^-1 * X^T * Y. The only bottleneck in this entire segment is computing (X^T * X)^-1, which is essentially one 3x3 matrix inverse per channel per minibatch. Yet this single line, of the form torch.inverse(place), takes around 70-80% of the total forward pass. This is running on CUDA 10. Could this be kernel-launch overhead, or is there another issue?
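To make the setup concrete, here is a minimal sketch of what that tail-end step looks like. The shapes, the design matrix, and the variable names (B, C, N, etc.) are illustrative placeholders, not my actual network's values, but the slow line is the same batched 3x3 inverse:

```python
import torch

# Illustrative shapes, not the real network's dimensions.
B, C, N = 8, 64, 16  # minibatches, channels, points per parabola fit

# Design matrix for a parabola fit: columns are x^2, x, 1.
x = torch.linspace(0.0, 1.0, N)
X = torch.stack([x ** 2, x, torch.ones_like(x)], dim=-1)  # (N, 3)
X = X.expand(B, C, N, 3)                                  # shared across batch/channels
Y = torch.randn(B, C, N, 1)                               # per-channel targets

XtX = X.transpose(-2, -1) @ X  # (B, C, 3, 3)
XtY = X.transpose(-2, -1) @ Y  # (B, C, 3, 1)

# This is the slow line: one 3x3 inverse per channel per minibatch.
coeffs = torch.inverse(XtX) @ XtY  # (B, C, 3, 1) parabola coefficients
```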