Large running time difference between P100 and V100

I am writing a script to register two images using PyTorch, and I have tested my code on both a V100 and a P100 GPU. It turns out the P100 (running time about 30 seconds) is much slower than the V100 (2-3 seconds). Usually a V100 gives only a 2x or 3x speedup over a P100. Does anyone have an idea what causes this large difference?

Environment:

Ubuntu: 16.04
PyTorch: 1.1.0
CUDA: 9.2

My code is like:

fixed = torch.from_numpy(fixed).float().cuda()
moving = torch.from_numpy(moving).float().cuda()
# make theta a leaf tensor so it can be optimized directly
theta = torch.eye(3, 4).unsqueeze(0).float().cuda().requires_grad_()
optim = torch.optim.Adam([theta])  # optimizer over the affine parameters
for i in range(max_iteration):
    grid = torch.nn.functional.affine_grid(theta, fixed.size())
    output = torch.nn.functional.grid_sample(moving, grid)
    optim.zero_grad()
    loss = loss_fn(fixed, output)
    loss.backward()
    optim.step()

Are you measuring the time for a specific operation or the forward/backward pass?
If so, note that you would need to synchronize the code before starting and stopping the timer via torch.cuda.synchronize(), since CUDA operations are asynchronous.
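For reference, here is a minimal timing sketch (the helper name `timed_gpu` is hypothetical, not part of PyTorch) that synchronizes on both sides of the measured region so the wall-clock time reflects the GPU work and not just the kernel launches:

```python
import time
import torch

def timed_gpu(fn):
    """Run fn() and return (result, elapsed_seconds).

    CUDA kernels launch asynchronously, so we synchronize before
    reading the clock at both ends of the measured region. The guard
    lets the same helper run on CPU-only machines.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start
```

An alternative is to record `torch.cuda.Event(enable_timing=True)` pairs around the region and read `start_event.elapsed_time(end_event)`, which measures GPU time directly.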

Thanks for your reply. I am measuring the time for the registration process (the for loop); the data loading step is excluded. I have added torch.cuda.synchronize() before starting and stopping the timer, and the difference still exists.

Related upstream issue.
Let’s continue the discussion there.