I am comparing two algorithms A and B.
When I run these algorithms on my own 2080 Ti GPU, algorithm A is slightly faster than B. However, when I submit them to a P100 GPU on the cluster, algorithm A is about 2x slower than B.
I am pretty sure that I am measuring time correctly on the GPU. Essentially, I set up two CUDA timing events and then force a synchronization to record the time.
The running time of algorithm B is about the same on both GPUs. However, algorithm A runs much slower on the P100. Is it normal for an algorithm to show such a drastic difference on different devices?
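For reference, the event-based timing described above can be sketched roughly as follows. This is a minimal sketch that assumes PyTorch; `time_op` is a hypothetical helper name, the warm-up and iteration counts are arbitrary, and the CPU fallback is only there so the snippet also runs on a machine without a GPU.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def time_op(fn, warmup=3, iters=10):
    """Return the mean time per call of fn() in milliseconds."""
    use_cuda = torch is not None and torch.cuda.is_available()

    # Warm-up runs keep one-time costs (CUDA context creation,
    # cuDNN algorithm selection, memory allocation) out of the measurement.
    for _ in range(warmup):
        fn()

    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # finish pending work before starting
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until the end event has completed
        return start.elapsed_time(end) / iters  # elapsed_time is in ms

    # CPU fallback (assumption: only used when no GPU is visible)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters
```

Calling something like `time_op(lambda: model(x))` for both algorithms on both machines, with identical inputs, should make the numbers directly comparable.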
Are you using exactly the same CUDA, cuDNN, NCCL, etc. libraries?
Mismatched versions of these libraries can cause a large difference in performance.
I think I am using cuDNN. Is there any way to check that? I thought cuDNN was the default option.
Okay, on my local machine with the 2080 Ti, the CUDA version is 10.0 and cuDNN is 7603. On the server with the P100, the CUDA version is 10.1 and cuDNN is 7603.
They are not exactly the same. But the server actually uses a newer CUDA version, so shouldn't it be faster? It is just weird.
Based on the versions, it looks like you are using the PyTorch binaries.
If that’s the case, your local CUDA installation won’t be used; the one shipped with the binaries will be used instead.
Could you install the same binaries with matching versions and, if possible, post your code for profiling so that we can have a look, please?
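To compare the two environments, the versions that PyTorch itself reports can be printed on both machines. This is a minimal sketch that assumes PyTorch is installed; `environment_report` is a hypothetical helper name, not part of PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def environment_report():
    """Collect the library versions PyTorch actually uses."""
    if torch is None:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        # CUDA runtime bundled with the binaries, not the system toolkit
        "cuda": torch.version.cuda,
        # Integer such as 7603 for cuDNN 7.6.3; None if cuDNN is unavailable
        "cudnn": torch.backends.cudnn.version(),
        "cudnn_enabled": torch.backends.cudnn.enabled,
        "gpu": None,
    }
    if torch.cuda.is_available():
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If the output differs between the local machine and the cluster, that is the first thing to align before profiling.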