When I run these algorithms on my own 2080 Ti GPU, algorithm A is slightly faster than B. However, when I submit them to a P100 GPU on the cluster, algorithm A is about 2x slower than B.
I am fairly sure I am measuring GPU time correctly: essentially, I record two CUDA timing events around the code and then force synchronization before reading the elapsed time.
The running time of algorithm B barely changes between the two GPUs. However, algorithm A runs much slower on the P100. Is it normal for an algorithm to show such a drastic difference across devices?
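For reference, here is a minimal sketch of the event-based timing pattern described above, assuming PyTorch and an available CUDA device (the helper name `time_gpu_ms` and the warmup/iteration counts are my own choices, not from the original post):

```python
import torch

def time_gpu_ms(fn, warmup=10, iters=100):
    """Time a GPU callable with CUDA events; returns milliseconds per call."""
    # Warm up so one-time costs (CUDA context init, kernel caching) are excluded.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()          # make sure no prior work is still queued
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()          # force sync before reading the elapsed time
    return start.elapsed_time(end) / iters
```

Without the final `torch.cuda.synchronize()`, `elapsed_time` would be read before the kernels finish, since CUDA launches are asynchronous.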
Based on the versions, it looks like you are using the PyTorch binaries.
If that’s the case, your local CUDA installation won’t be used; the CUDA runtime shipped with the binaries will be used instead.
Could you install binaries with matching versions on both machines and, if possible, post your code so that we can profile it and have a look, please?