I run the following code snippet on two machines. Machine 1 is a server I have been given access to, with better hardware: more RAM, a better CPU, and an A40. Machine 2 is my slower desktop: an older i7, less RAM, and an RTX 2080. The snippet prints the versions/hardware and the time it takes to do the operation below. I am somewhat new to running things on a server, but I am assured that I am running the code the way it is supposed to be run there.
The test:
import sys
import torch
import torch.nn as nn
import einops
import time
device = "cuda"
print(f"Python version {sys.version}")
print(f"Torch version {torch.__version__}")
print(f"Cuda version {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}\n")
sample_input = torch.rand(4, 512, 256, device=device)
conv1 = nn.ConvTranspose3d(kernel_size=16, stride=16, in_channels=256, out_channels=8).to(device)
torch.cuda.synchronize()
start_time = time.time()
out = einops.rearrange(sample_input, "b (x y z) em -> b em x y z", x=8, y=8, z=8)
out = conv1(out)
torch.cuda.synchronize()
end_time = time.time()
print(out.shape)
time_elapsed = end_time - start_time
print(f'Time taken: {time_elapsed:.6f} seconds')
Output from the server:
Python version 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Torch version 1.12.1
Cuda version 11.6
GPU: NVIDIA A40
torch.Size([4, 8, 128, 128, 128])
Time taken: 93.002562 seconds
Output from my desktop:
Python version 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Torch version 1.12.1+cu102
Cuda version 10.2
GPU: NVIDIA GeForce RTX 2080
torch.Size([4, 8, 128, 128, 128])
Time taken: 0.156402 seconds
I have tried swapping out the einops call for a view and a permute, and the results did not change. What could be the reason for this slowdown on the better hardware? I do not have much control over the versions installed on the server.
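For completeness, the view + permute variant I swapped in was essentially this (shapes hardcoded to match the snippet; the permute reproduces the einops pattern "b (x y z) em -> b em x y z"):

```python
import torch

sample_input = torch.rand(4, 512, 256)
# split the 512 positions into an 8x8x8 grid, then move the
# embedding dim into the channel position expected by ConvTranspose3d
out = sample_input.view(4, 8, 8, 8, 256).permute(0, 4, 1, 2, 3).contiguous()
print(out.shape)  # torch.Size([4, 256, 8, 8, 8])
```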
All other operations, such as Conv3d or fully connected layers, run without this issue on both machines.
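For reference, since CUDA ops run asynchronously and the very first call can pay one-time costs (context initialization, kernel loading, cuDNN algorithm selection), a warmed-up timing of just the conv would look something like this sketch (same shapes as above; falls back to CPU when no GPU is available):

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(4, 256, 8, 8, 8, device=device)
conv = nn.ConvTranspose3d(in_channels=256, out_channels=8,
                          kernel_size=16, stride=16).to(device)

# Warm-up: exclude one-time costs from the measurement
for _ in range(2):
    _ = conv(x)
if device == "cuda":
    torch.cuda.synchronize()

start = time.time()
out = conv(x)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the async kernel to finish
elapsed = time.time() - start
print(out.shape, f"{elapsed:.6f} seconds")
```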