Same code much slower on better hardware

I have the following code snippet that I run on two machines. Machine 1 is a server I have been given access to, with better hardware, and machine 2 is a slower desktop of my own. The server has more RAM, a better CPU, and an A40, while my desktop has an older i7, less RAM, and an RTX 2080. The snippet prints the versions/hardware and the time it takes to do the following operation. I am somewhat new to running things on a server, but I have been assured that I am running the code the way it is supposed to be run there.

The test:

import sys
import torch
import torch.nn as nn
import einops
import time

device = "cuda"

print(f"Python version {sys.version}")
print(f"Torch version {torch.__version__}")
print(f"Cuda version {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}\n")

sample_input = torch.rand(size=[4, 512, 256]).to(device)
conv1 = nn.ConvTranspose3d(kernel_size=16, stride=16, in_channels=256, out_channels=8).to(device)

torch.cuda.synchronize()
start_time = time.time()

out = einops.rearrange(sample_input, "b (x y z) em -> b em x y z", x=8, y=8, z=8)
out = conv1(out)

torch.cuda.synchronize()
print(out.shape)

end_time = time.time()
time_elapsed = end_time - start_time

print(f'Time taken: {time_elapsed:.6f} seconds')

Output from the server:

Python version 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Torch version 1.12.1
Cuda version 11.6
GPU: NVIDIA A40

torch.Size([4, 8, 128, 128, 128])
Time taken: 93.002562 seconds

Output from my desktop:

Python version 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Torch version 1.12.1+cu102
Cuda version 10.2
GPU: NVIDIA GeForce RTX 2080

torch.Size([4, 8, 128, 128, 128])
Time taken: 0.156402 seconds

I have tried swapping out the einops rearrange for a view and a permute, and the results did not change. What could be the reason for this slowdown on better hardware? I do not have much control over the versions on the server.
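
For reference, the view/permute variant I tried looked roughly like this (it produces the same [4, 256, 8, 8, 8] tensor as the rearrange):

# Equivalent to einops.rearrange(sample_input, "b (x y z) em -> b em x y z", x=8, y=8, z=8)
out = sample_input.view(4, 8, 8, 8, 256).permute(0, 4, 1, 2, 3)
out = conv1(out)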

All other operations, such as Conv3d or fully connected layers, run without this issue on both machines.
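
For instance, continuing from the snippet above, a plain Conv3d timed the same way is quick on both machines (the layer sizes here are just illustrative, not exactly what I ran):

conv3d = nn.Conv3d(in_channels=256, out_channels=8, kernel_size=3, padding=1).to(device)
x = torch.rand(size=[4, 256, 8, 8, 8]).to(device)

torch.cuda.synchronize()
start_time = time.time()

out3d = conv3d(x)

torch.cuda.synchronize()
print(f'Conv3d time: {time.time() - start_time:.6f} seconds')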

The ~90 s almost looks as if your node would JIT compile something before the actual run.
I would recommend profiling the code via e.g. Nsight Systems to check the timeline and see what your node is doing.
Alternatively, you could also add a few debug print statements to check whether the actual workload is even executed or if e.g. the very first operation is already hanging.
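
For the debug timings, something along these lines would already narrow it down (a minimal sketch using CUDA events to time just the transposed convolution; printing timestamps after torch.cuda.synchronize() calls works just as well):

import torch
import torch.nn as nn
import einops

device = "cuda"
x = torch.rand(4, 512, 256, device=device)
conv = nn.ConvTranspose3d(in_channels=256, out_channels=8, kernel_size=16, stride=16).to(device)

out = einops.rearrange(x, "b (x y z) em -> b em x y z", x=8, y=8, z=8)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
out = conv(out)           # time only the transposed convolution
end.record()
torch.cuda.synchronize()  # wait until the recorded events have finished
print(f"conv: {start.elapsed_time(end):.3f} ms")  # elapsed_time returns milliseconds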

I must admit that going to version 2.0 sort of fixed this issue, as the time to run this operation is now 0.995 seconds. However, this is still much slower than on my desktop. For versions below 2.0 I have the following update:

Since my previous message I have done the following testing (code and outputs below the list):

  • I cannot install Nsight as I do not have admin rights on the server I am working from.
  • I have added the print statements and the code spends almost all the time in the convolution still.
  • I have added a backprop step to see the speed of that.
  • If I run the transposed convolution on random input (no rearrange beforehand), it is really fast as well.

Modified snippet with the prints:

import sys
import torch
import torch.nn as nn
import einops
import time


device = "cuda"

print(f"Python version {sys.version}")
print(f"Torch version {torch.__version__}")
print(f"Cuda version {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}\n")


sample_input = torch.rand(size=[4, 512, 256]).to(device)
sample_label = torch.rand(size=[4, 8, 128, 128, 128]).to(device)

conv1 = nn.ConvTranspose3d(kernel_size=16, stride=16, in_channels=256, out_channels=8).to(device)

optimizer = torch.optim.Adam(params=conv1.parameters(), lr=0.0001)
loss_fn = nn.MSELoss()

print("Start")

torch.cuda.synchronize()
start_time = time.time()

out = einops.rearrange(sample_input, "b (x y z) em -> b em x y z", x=8, y=8, z=8)

torch.cuda.synchronize()
print(f'Rearrange done: {time.time() - start_time:.6f} seconds')

out = conv1(out)

torch.cuda.synchronize()
print(f'Convolution done: {time.time() - start_time:.6f} seconds')

loss = loss_fn(out, sample_label)
loss.backward()
optimizer.step()

torch.cuda.synchronize()
print(f'Backprop done: {time.time() - start_time:.6f} seconds')

torch.cuda.synchronize()
print(out.shape)

end_time = time.time()
time_elapsed = end_time - start_time

print(f'Total time taken: {time_elapsed:.6f} seconds')

Output:

Python version 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Torch version 1.12.1
Cuda version 11.6
GPU: NVIDIA A40

Start
Rearrange done: 0.000546 seconds
Convolution done: 93.002193 seconds
Backprop done: 93.016308 seconds
torch.Size([4, 8, 128, 128, 128])
Total time taken: 93.016403 seconds

So you can see that the convolution still takes practically all of the time, while the backprop is essentially instant.

Separate rearrange:

torch.cuda.synchronize()
start_time = time.time()

example = torch.rand(size=[4, 512, 256]).to(device)
out = einops.rearrange(example, "b (x y z) em -> b em x y z", x=8, y=8, z=8)

torch.cuda.synchronize()
print(out.shape)

end_time = time.time()
time_elapsed = end_time - start_time

print(f'Total time taken: {time_elapsed:.6f} seconds')

Output:

torch.Size([4, 256, 8, 8, 8])
Total time taken: 0.007333 seconds

Separate convolution:

torch.cuda.synchronize()
start_time = time.time()

example = torch.rand(size=[4, 256, 8, 8, 8]).to(device)
conv = nn.ConvTranspose3d(kernel_size=16, stride=16, in_channels=256, out_channels=8).to(device)
out = conv(example)

torch.cuda.synchronize()
print(out.shape)

end_time = time.time()
time_elapsed = end_time - start_time

print(f'Total time taken: {time_elapsed:.6f} seconds')

Output:

torch.Size([4, 8, 128, 128, 128])
Total time taken: 0.085341 seconds

So it seems the problem only occurs when these operations are combined. Maybe this gives more insight. If you want me to test any modifications to the snippet, let me know.

Good to hear the issue is no longer seen in 2.0.0. Let me know if you are still seeing the same or a similar issue and could share a full profile to narrow down where the slowdown is coming from.
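
In case you (or anyone else hitting this) can capture one, a minimal sketch using the built-in torch.profiler would already give per-kernel timings and a Chrome trace without needing admin rights on the node:

import torch
import torch.nn as nn
import einops
from torch.profiler import profile, ProfilerActivity

device = "cuda"
x = torch.rand(4, 512, 256, device=device)
conv = nn.ConvTranspose3d(in_channels=256, out_channels=8, kernel_size=16, stride=16).to(device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    out = einops.rearrange(x, "b (x y z) em -> b em x y z", x=8, y=8, z=8)
    out = conv(out)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto to inspect the timeline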

After the rearrange, you need the following call to actually lay the tensor out in that shape in memory; this solves your issue on older versions of PyTorch:

out = out.contiguous()

See the page about Tensor views for more detail.
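
In your original snippet that would look roughly like this (the is_contiguous() check is only there to illustrate the point):

out = einops.rearrange(sample_input, "b (x y z) em -> b em x y z", x=8, y=8, z=8)
print(out.is_contiguous())  # expected False: the rearrange ends in a permute, which returns a non-contiguous view
out = out.contiguous()      # materialize the data in the new memory layout before the conv
out = conv1(out)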

Thanks! It does make sense, I suppose.