Slow backward() on better hardware

I am seeing 3-4x slower training when running the following simple code on newer hardware:

import torch
import time

device = torch.device('cuda:0')

class NNClass(torch.nn.Module):
    def __init__(self):
        super(NNClass, self).__init__()
        self.fc1 = torch.nn.Linear(3, 128)
        self.fclast = torch.nn.Linear(128, 3)

    def forward(self, inp):
        out = self.fc1(inp)
        return self.fclast(out)

theNN = NNClass().to(device)
lossfunc = torch.nn.MSELoss()
inp = torch.zeros(size=(64, 3), device=device)
labels = torch.zeros(size=(64, 3), device=device)

tic = time.perf_counter()
for idx in range(10000000):
    Fout = theNN(inp)
    loss = lossfunc(Fout, labels)
    loss.backward()

    if idx % 1000 == 0:
        print(f'idx: {idx}')
    if idx == int(3e4):
        break

toc = time.perf_counter()
print(f'wallclock: {toc - tic}')
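
(Side note on the timing itself: CUDA kernels launch asynchronously, so a stricter measurement would synchronize the device before reading the clock. A minimal variant of the timing block, reusing the same model and tensors as above, could look like this:)

# Wait for all queued GPU work before starting/stopping the clock.
torch.cuda.synchronize()
tic = time.perf_counter()
for idx in range(int(3e4)):
    Fout = theNN(inp)
    loss = lossfunc(Fout, labels)
    loss.backward()
torch.cuda.synchronize()
toc = time.perf_counter()
print(f'wallclock (synchronized): {toc - tic}')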

The inferior hardware is a laptop (Lenovo X1 Carbon) connected to an eGPU containing a 1080 Ti, and the superior hardware is a desktop with an i9-10940X processor (on an ASUS Pro WS X299 Sage II motherboard) and three GPUs: two 2080 Tis and one 1080 Ti. Both systems run Ubuntu 18.04 with NVIDIA driver 460, and both have PyTorch installed via Anaconda with the recommended install command:

conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

Basically, when I run the exact same code on both systems, the one with the superior hardware is about 3-4x slower when training on the GPU (I used a single GPU on the superior hardware). The inferior hardware finishes in about 12 seconds, while the superior hardware takes about 46 seconds. As a sanity check, when I train on both systems with the CPU only, the superior hardware wins as expected.

Any help is much appreciated!

Hi,

Which version of pytorch are you using?
Also which version of cudnn are you using on each machine (if you have a custom one)?

Thanks for the reply! It looks like PyTorch is 1.7.1 and cuDNN is 8.0.5, for both machines. When I run ‘conda list’ on both, I see version 1.7.1 with the build py3.8_cuda11.0.221_cudnn8.0.5_0 from the pytorch channel.
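
(For reference, the versions can also be read directly from inside Python; a quick sketch:)

import torch

print(torch.__version__)                # e.g. 1.7.1
print(torch.version.cuda)               # e.g. 11.0
print(torch.backends.cudnn.version())   # e.g. 8005 for cuDNN 8.0.5
print(torch.cuda.get_device_name(0))    # name of the first visible GPU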

That looks good.

@ptrblck any idea what could be causing this?

I would try to reduce the number of variables a bit to further isolate the issue.
If I understand the use case correctly, @arrowtea has already used the same PyTorch, CUDA, and cuDNN versions on both machines. The only difference would be the hardware.

Could you run the test on the 1080 in your desktop and compare it to the 1080 in the laptop?
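
Something along these lines should pin the script to a particular card (the device index here is just an example and depends on the enumeration order on your desktop, which you can check with nvidia-smi):

import torch

# Hypothetical index: pick whichever slot the 1080 Ti shows up in.
device = torch.device('cuda:2')
print(torch.cuda.get_device_name(device))  # confirm it is the 1080 Ti

Alternatively, launching the script with CUDA_VISIBLE_DEVICES=2 restricts the process to that GPU without changing the code.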


On the desktop I ran it with a 1080ti and got the same results.

I just tried comparing TensorFlow code, and I also got the same results: the desktop was slower than the laptop (checking nvidia-smi to make sure the appropriate GPUs were used). Sorry for the trouble, as this is a PyTorch forum. I'll go ahead and see if I can find a solution, perhaps from NVIDIA support.

But if anyone has suggestions for solutions or perhaps where I can start to look for solutions, please let me know! Thank you!

“The same results” means that your desktop is slower using the 1080 vs. your laptop with a 1080?
If so, I would profile your code and check it for other bottlenecks besides the GPU. E.g. in case you are using a fast SSD in your laptop vs. an old HDD in your desktop, the performance of the data loading pipeline would tank.

EDIT: you could also profile the achieved GPU memory bandwidth on both systems and check if the hardware is a limiting factor (e.g. a shared PCIe connection etc.).
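
A rough host-to-device bandwidth check could look like this (the tensor size and repeat count are arbitrary examples):

import torch

device = torch.device('cuda:0')
# ~256 MB of pinned host memory for an async copy benchmark
x_cpu = torch.empty(64 * 1024 * 1024, dtype=torch.float32).pin_memory()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(10):
    x_gpu = x_cpu.to(device, non_blocking=True)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)           # milliseconds between the events
gb = 10 * x_cpu.numel() * 4 / 1e9      # total bytes copied, in GB
print(f'host -> device: {gb / (ms / 1e3):.2f} GB/s')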

Yeah, using the desktop with the 1080 was slower than the laptop with the 1080. I’m also using an SSD for both the desktop and laptop.

In any case, I examined the CPU usage of both the laptop and the desktop and saw that the laptop under load was running at about 4.4 GHz on all cores, while the desktop under load was staying at the minimum clock of 1.2 GHz on all cores (all of this while training with the GPUs). This prompted me to check the CPU frequency ratios in the BIOS.
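
(If anyone wants to reproduce the check, something like this prints the current per-core frequencies from sysfs; the path assumes the standard Linux cpufreq layout:)

import glob

# scaling_cur_freq is reported in kHz; convert to GHz for readability.
for path in sorted(glob.glob('/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq')):
    with open(path) as f:
        khz = int(f.read().strip())
    print(f'{path.split("/")[5]}: {khz / 1e6:.2f} GHz')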

So I went into the BIOS and, instead of leaving the CPU frequency ratios on the default Auto setting, manually set them to the advertised 4.8 GHz, which seemed to fix things in PyTorch. And actually, when I stress tested the CPUs under the Auto setting, the maximum any individual core reached was about 4.2 GHz. So I went back and manually set the CPU frequency ratios to 4.1 GHz just as a test, and even there I saw a 2x speedup over the Auto setting.

So definitely something fishy is going on with the motherboard.

Also, what's even weirder is that the PyTorch code did speed up, but the TensorFlow code did not. Pretty weird!

Thanks for the suggestions though!
