Torch.cat is much slower on GPU than CPU

Hello, I found that torch.cat runs slower on GPU than on CPU. Does anyone know the reason?

Result on CPU

time cost for autograd: 0.01325
time cost for cat: 0.00016

Result on GPU

time cost for autograd: 0.00249
time cost for cat: 0.00131

Here is the code. I ran it on a Tesla M40 with PyTorch 1.1.0.

import time
import torch
import torch.nn as nn
import torch.autograd as autograd


class netG(nn.Module):
    def __init__(self):
        super(netG, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 2, bias=False),
            nn.Linear(2, 2, bias=True)
        )
        self.weight_init()

    def forward(self, x):
        return self.net(x)

    def weight_init(self):
        self.net[0].weight.data = torch.Tensor([[0.5], [1.0]])
        self.net[1].weight.data = torch.Tensor([[1.0, 0.0],
                                                [2.0, 1.0]])
        self.net[1].bias.data = torch.Tensor([-0.5, 1.0])


class netD(nn.Module):
    def __init__(self):
        super(netD, self).__init__()
        self.net = nn.Linear(2, 1, bias=True)
        self.weight_init()

    def forward(self, x):
        return self.net(x)

    def weight_init(self):
        self.net.weight.data = torch.Tensor([[3.0, 0.0]])
        self.net.bias.data = torch.Tensor([0.0])


if __name__ == '__main__':
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    if device.type == 'cuda':
        torch.cuda.init()  # create the CUDA context up front so it is not counted in the timings
    z = torch.tensor([2.0], device=device)
    G = netG().to(device)
    D = netD().to(device)
    loss = D(G(z))
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)  # flush pending GPU work before starting the clock
    start = time.time()
    grad_g = autograd.grad(loss, list(G.parameters()), create_graph=True, retain_graph=True)
    grad_d = autograd.grad(loss, list(D.parameters()), create_graph=True, retain_graph=True)
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)  # wait for the backward kernels to finish
    end = time.time()
    grad_time = end - start
    print('time cost for autograd: {:.5f}'.format(grad_time))

    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    start = time.time()
    grad_g_vec = torch.cat([g.contiguous().view(-1) for g in grad_g])
    grad_d_vec = torch.cat([g.contiguous().view(-1) for g in grad_d])
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)  # wait for the cat kernels to finish
    end = time.time()
    grad_time = end - start
    print('time cost for cat: {:.5f}'.format(grad_time))
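
For reference, a more careful way to time the cat would be to warm it up and average over many calls rather than timing a single launch. Here is a rough sketch (the helper name and the warm-up/iteration counts are mine, and it assumes grad_g and device from the script above):

import time
import torch


def time_cat(grads, device, n_warmup=10, n_iters=100):
    # warm up so one-time costs (CUDA context, allocator) are not measured
    for _ in range(n_warmup):
        torch.cat([g.contiguous().view(-1) for g in grads])
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.cat([g.contiguous().view(-1) for g in grads])
    if device.type == 'cuda':
        # wait for all queued kernels before stopping the clock
        torch.cuda.synchronize(device=device)
    return (time.perf_counter() - start) / n_iters

Called as time_cat(grad_g, device) right after the grads are computed, this gives a per-call average instead of a single noisy sample.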

Well, I know GPUs run slowly on small workloads: an operation with little data and few calculations will often run faster on a CPU than on a GPU.
From game dev, a GPU shines when there is a large amount of data and the same calculations are repeated over and over. That is why deep learning models like the GPU: tuning the weights is many, many small calculations that can all happen at once, since a GPU has thousands of cores, while a CPU can only do one thing at a time in each core.

Here is a nice post from an NVIDIA forum that might explain it better than I did:
https://devtalk.nvidia.com/default/topic/953975/sequential-code-is-faster-than-parallel-how-is-it-possible-/
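
If you want to see the crossover directly, you can sweep the size of the tensors being concatenated and time it on both devices; below some size the CPU wins, above it the GPU wins. A rough sketch (the helper name, sizes, and loop counts are arbitrary):

import time
import torch


def bench_cat(device, n_elems, n_iters=100):
    a = torch.randn(n_elems, device=device)
    b = torch.randn(n_elems, device=device)
    for _ in range(10):          # warm-up
        torch.cat([a, b])
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.cat([a, b])
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    return (time.perf_counter() - start) / n_iters


for n in [10, 1000, 100000, 10000000]:
    cpu_t = bench_cat(torch.device('cpu'), n)
    if torch.cuda.is_available():
        gpu_t = bench_cat(torch.device('cuda:0'), n)
        print('n={:>9d}  cpu {:.6f}s  gpu {:.6f}s'.format(n, cpu_t, gpu_t))
    else:
        print('n={:>9d}  cpu {:.6f}s'.format(n, cpu_t))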

Thanks for sharing. So if I understand correctly, torch.cat is a bandwidth-bound operation rather than a compute-bound one.
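
One way to sanity-check that would be to estimate the effective bandwidth the cat achieves and compare it with the card's peak memory bandwidth. Just a sketch (the helper name is mine, and it assumes grad_g and device from the script above):

import time
import torch


def effective_bandwidth_gbs(grads, device, n_iters=100):
    # bytes moved per cat: each element is read from its source and written to the output
    n_bytes = 2 * sum(g.numel() * g.element_size() for g in grads)
    torch.cat([g.contiguous().view(-1) for g in grads])          # warm-up
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.cat([g.contiguous().view(-1) for g in grads])
    if device.type == 'cuda':
        torch.cuda.synchronize(device=device)
    return n_bytes * n_iters / (time.perf_counter() - start) / 1e9   # GB/s

If that figure comes out far below the card's peak memory bandwidth, the small-tensor case is really dominated by fixed per-launch overhead rather than by the copy itself.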

I also tried a larger model. You're right! The GPU is faster than the CPU there.

import time
import torch
import torch.nn as nn
import torch.autograd as autograd


class netG(nn.Module):
    def __init__(self):
        super(netG, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(32, 1024, bias=False),
            nn.Linear(1024, 512, bias=True)
        )

    def forward(self, x):
        return self.net(x)



class netD(nn.Module):
    def __init__(self):
        super(netD, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512),
            nn.Linear(512, 1)
        )

    def forward(self, x):
        return self.net(x)


if __name__ == '__main__':
    device = torch.device('cuda:0') if torch.cuda.is_available() else 'cpu'
    # device = 'cpu'
    print(device)
    torch.cuda.init()
    z = torch.randn(32, device=device)
    G = netG().to(device)
    D = netD().to(device)
    loss = D(G(z))
    if device == torch.device('cuda:0'):
        torch.cuda.synchronize(device=device)
    start = time.time()
    grad_g = autograd.grad(loss, list(G.parameters()), create_graph=True, retain_graph=True)
    grad_d = autograd.grad(loss, list(D.parameters()), create_graph=True, retain_graph=True)
    if device == torch.device('cuda:0'):
        torch.cuda.synchronize(device=device)
    end = time.time()
    grad_time = end - start
    print('time cost for autograd: {:.5f}'.format(grad_time))

    if device == torch.device('cuda:0'):
        torch.cuda.synchronize(device=device)
    start = time.time()
    grad_g_vec = torch.cat([g.contiguous().view(-1) for g in grad_g])
    grad_d_vec = torch.cat([g.contiguous().view(-1) for g in grad_d])
    if device == torch.device('cuda:0'):
        torch.cuda.synchronize(device=device)
    end = time.time()
    grad_time = end - start
    print('time cost for cat: {:.5f}'.format(grad_time))

Result on the big model

cpu
time cost for autograd: 0.03366
time cost for cat: 0.00306
cuda:0
time cost for autograd: 0.00318
time cost for cat: 0.00147

torch.cat just copies one tensor after another, so it is a very small calculation (if you can even call it that). With such a small amount of data and so little work to do, it runs slower on a GPU than on a CPU: launching the kernel and getting the data to the GPU's cores is fixed overhead that such a tiny operation can't hide. The CPU also has to read the data in, but it doesn't pay that start-up overhead.

So by the time the GPU has started up, the CPU has already finished the calculation.
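
You can get a feel for that fixed launch cost by timing a tiny cat on the GPU with CUDA events; the handful of microseconds per call is roughly the overhead the CPU never pays. A rough sketch (assumes a CUDA device is available):

import torch

x = torch.randn(8, device='cuda:0')
y = torch.randn(8, device='cuda:0')

for _ in range(10):                      # warm-up
    torch.cat([x, y])
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n_iters = 1000
start.record()
for _ in range(n_iters):
    torch.cat([x, y])                    # 16 floats: essentially all launch overhead
end.record()
torch.cuda.synchronize()
print('per-call time on GPU: {:.2f} us'.format(start.elapsed_time(end) * 1000 / n_iters))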

Okay, I got it! Thank you so much

No worries, glad to help.