I have the following two settings for one sequence of length 1200:

Setting 1: input size (batch size, sequence length) = (1, 1200)
Setting 2:
for i in range(3):
    input size (batch size, sequence length) = (3, 512) (this includes padding)
I assumed Setting 2 would be faster than Setting 1, and that holds when I use the GPU, but on the CPU Setting 2 is slower than Setting 1.
Can someone please explain why this is the case?
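For reference, a minimal sketch of the two timing setups. The model here is a hypothetical stand-in (an embedding plus a linear layer, not the actual transformer from the question), and the split/padding of the 1200 tokens into a (3, 512) batch is an assumption about how Setting 2 is constructed:

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in model; the real workload is a transformer.
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
tokens = torch.randint(0, 100, (1, 1200))

# Setting 1: one forward pass over the full (1, 1200) sequence.
start = time.perf_counter()
with torch.no_grad():
    out1 = model(tokens)
t1 = time.perf_counter() - start

# Setting 2: the sequence split into a (3, 512) batch
# (the last chunk zero-padded), run inside the loop from the question.
chunks = torch.zeros(3, 512, dtype=torch.long)
chunks.view(-1)[:1200] = tokens.view(-1)  # pad to 3 * 512 = 1536 positions
start = time.perf_counter()
with torch.no_grad():
    for i in range(3):
        out2 = model(chunks)
t2 = time.perf_counter() - start

print(out1.shape, out2.shape, t1, t2)
```

On a real workload you would also add warm-up iterations before timing, since the first forward pass often includes one-time setup costs.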
To the original question: yes, PyTorch uses MAGMA and cuSolver for the GPU implementation (we are working on adding more cuSolver methods to the backend) and LAPACK on the CPU (I'm unsure if other CPU libs are used).
My best guess is that LAPACK (or whichever CPU linalg library is used) uses e.g. AVX instructions for specific shapes, which can create performance cliffs, but I would recommend profiling the workloads properly to see the broader picture.
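One way to profile properly is torch.profiler, which breaks the measurement down per operator. A minimal sketch, again using a hypothetical linear layer as a stand-in for the real model:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)  # stand-in for the real model
x = torch.randn(3, 512)

# Record CPU activity per operator; record_shapes lets us group
# the results by input shape afterwards.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Per-operator breakdown, grouped by input shape, sorted by CPU time.
table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10)
print(table)
```

Running this on both settings and comparing the tables shows which operators (and which shapes) account for the time difference.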
Thank you for explaining. I'm currently just measuring the time for the two settings mentioned above. Are you suggesting breaking the time measurement down layer by layer in the transformer? If possible, can you please suggest how to profile the workloads better?
torch.utils.benchmark is a nice tool for creating benchmark tables for different setups of specific operations, and it can help show the performance for different input shapes etc.
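A minimal sketch of how this could look for the two shapes from the question (the linear layer is a hypothetical stand-in for the real model; torch.utils.benchmark.Timer handles warm-up and synchronization for you):

```python
import torch
import torch.utils.benchmark as benchmark

model = torch.nn.Linear(512, 512)  # stand-in for the real model

results = []
for batch, seq in [(1, 1200), (3, 512)]:
    # Feature dim of 512 matches the stand-in layer, not the real model.
    x = torch.randn(batch, seq, 512)
    timer = benchmark.Timer(
        stmt="model(x)",
        globals={"model": model, "x": x},
        label="forward",
        sub_label=f"({batch}, {seq})",
        description="cpu",
    )
    results.append(timer.timeit(10))

# Print a comparison table across the input shapes.
benchmark.Compare(results).print()
```

Repeating the same sweep with the model and inputs on the GPU would make the CPU/GPU crossover from the original question visible in one table.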