I have the following two settings for one sequence of length 1200:

Setting 1: input size (batch size, sequence length) = (1, 1200)
Setting 2:
for i in range(3):
    input size (batch size, sequence length) = (3, 512) (this includes padding)
I assumed Setting 2 would be faster than Setting 1, and that holds when I use the GPU, but on the CPU Setting 2 is slower than Setting 1.
Can someone please explain why this is the case?
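For reference, a minimal sketch of the two timing setups. The model here is a hypothetical stand-in (an embedding plus a linear layer, not the actual transformer from the question), and the split/padding of the 1200 tokens into a (3, 512) batch is an assumption about how Setting 2 is constructed:

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in model; the real workload is a transformer.
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
tokens = torch.randint(0, 100, (1, 1200))

# Setting 1: one forward pass over the full (1, 1200) sequence.
start = time.perf_counter()
with torch.no_grad():
    out1 = model(tokens)
t1 = time.perf_counter() - start

# Setting 2: the sequence split into a (3, 512) batch
# (the last chunk zero-padded), run inside the loop from the question.
chunks = torch.zeros(3, 512, dtype=torch.long)
chunks.view(-1)[:1200] = tokens.view(-1)  # pad to 3 * 512 = 1536 positions
start = time.perf_counter()
with torch.no_grad():
    for i in range(3):
        out2 = model(chunks)
t2 = time.perf_counter() - start

print(out1.shape, out2.shape, t1, t2)
```

On a real workload you would also add warm-up iterations before timing, since the first forward pass often includes one-time setup costs.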
To the original question: yes, PyTorch uses MAGMA and cuSolver for the GPU implementation (we are working on adding more cuSolver methods to the backend) and LAPACK on the CPU (I'm unsure if other CPU libs are used).
My best guess is that LAPACK (or whichever CPU linalg library is used) uses e.g. AVX instructions for specific shapes, which can create performance cliffs, but I would recommend profiling the workloads properly to see the broader picture.
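One way to profile properly is torch.profiler, which breaks the measurement down per operator. A minimal sketch, again using a hypothetical linear layer as a stand-in for the real model:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)  # stand-in for the real model
x = torch.randn(3, 512)

# Record CPU activity per operator; record_shapes lets us group
# the results by input shape afterwards.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Per-operator breakdown, grouped by input shape, sorted by CPU time.
table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10)
print(table)
```

Running this on both settings and comparing the tables shows which operators (and which shapes) account for the time difference.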
Thank you for explaining. I'm currently just measuring the time for the two settings mentioned above. Are you suggesting breaking the time measurement down layer by layer in the transformer? If possible, can you please suggest how to profile the workloads better?
torch.utils.benchmark is a nice tool for creating benchmark tables for different setups of specific operations, and it can help show the performance for different input shapes etc.
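A minimal sketch of how this could look for the two shapes from the question (the linear layer is a hypothetical stand-in for the real model; torch.utils.benchmark.Timer handles warm-up and synchronization for you):

```python
import torch
import torch.utils.benchmark as benchmark

model = torch.nn.Linear(512, 512)  # stand-in for the real model

results = []
for batch, seq in [(1, 1200), (3, 512)]:
    # Feature dim of 512 matches the stand-in layer, not the real model.
    x = torch.randn(batch, seq, 512)
    timer = benchmark.Timer(
        stmt="model(x)",
        globals={"model": model, "x": x},
        label="forward",
        sub_label=f"({batch}, {seq})",
        description="cpu",
    )
    results.append(timer.timeit(10))

# Print a comparison table across the input shapes.
benchmark.Compare(results).print()
```

Repeating the same sweep with the model and inputs on the GPU would make the CPU/GPU crossover from the original question visible in one table.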