GPU 5 times slower than CPU

I want to test PyTorch on my GPU, but I am doing something very wrong, as it takes 5 times longer than on the CPU. Actually, even if
device = "cpu"
the ".to(device)" calls slow it down from 9 seconds to 12. On the GPU it takes 50 seconds.
There might be a problem with my installation, but I do have CUDA installed and torch.cuda.is_available() returns True.

Below are the most important code snippets; it is hard to provide a short working example.
I think there is something wrong with my net definition.

class Net(nn.Module):
    def __init__(self, d=1):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(d, internal_neurons).to(device)  # TODO: THIS
        self.fc3 = nn.Linear(internal_neurons, 1).to(device)

        self.fc = []

        for k in range(hidden_layer_count):
            self.fc.append(nn.Linear(internal_neurons, internal_neurons).to(device))

    def forward(self, y):
        y = activation1(self.fc1(y))

        for k in range(hidden_layer_count):
            y = activation1(self.fc[k](y))

        y = activation2(self.fc3(y))
        return y

net = Net()
net.to(device)


def train():
    last_epoch = 10e10
    net.train()
    losses = []
    for epoch in range(1, epochs):
        x_train = Variable(torch.from_numpy(x_values)).float()
        y_train = np.zeros_like(x_train)

        llist = []

        for k in range(x_train.shape[0]):
            h = x_train[k] * torch.ones(1)
            y_train[k] = target_function.f(x_train[k])
            h = h.to(device)  # TODO: This
            llist.append(loss_func(net(h), y_train[k]))

        loss = sum(llist)
        losses.append(loss.item())
        opt.zero_grad()
        loss.backward()
        opt.step()

I’m not sure what exactly you’ve profiled and how the model is created, but based on the currently posted model architecture it seems that the actual workload is small.
In that case, you might see the overhead of the kernel launches etc. and the GPU might not be able to provide a speedup compared to the CPU run.
This would be especially visible if you have a lot of small kernels (e.g. a lot of linear layers each with a tiny workload).

CUDA graphs, which is an upcoming feature, could reduce this overhead, as only a single launch would be triggered.
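The launch overhead described above can be made visible with a rough timing sketch. It compares many one-sample forward passes against a single batched pass over the same inputs. The layer sizes, the Tanh activation, and the helper names `sync` and `timed` are assumptions picked for illustration, not taken from the original code.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical sizes, chosen to resemble the small network in the question.
model = nn.Sequential(
    nn.Linear(1, 150), nn.Tanh(),
    nn.Linear(150, 150), nn.Tanh(),
    nn.Linear(150, 1),
).to(device)
x = torch.rand(1000, 1, device=device)


def sync():
    # CUDA kernels run asynchronously; wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()


def timed(fn):
    # Return the result of fn() together with its wall-clock duration in seconds.
    sync()
    start = time.perf_counter()
    out = fn()
    sync()
    return out, time.perf_counter() - start


with torch.no_grad():
    model(x)  # warm-up call, excluded from the timings
    # 1000 tiny forward passes: each one launches its own set of small kernels.
    per_sample, t_loop = timed(
        lambda: torch.cat([model(x[k:k + 1]) for k in range(x.shape[0])])
    )
    # One forward pass over the whole batch: the same math in far fewer launches.
    batched, t_batch = timed(lambda: model(x))

print(f"{device}: per-sample {t_loop:.4f}s, batched {t_batch:.4f}s")
```

Both variants compute the same outputs, so any gap between the two timings comes from per-call overhead rather than from the arithmetic itself.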

Thanks for the answer. I forgot to mention that I use 50-150 hidden neurons and 1-3 layers. The test I posted the numbers for used the maximum of both.

I am not experienced in PyTorch, so I am not sure if my code allows for parallelization.

Also, I am not sure if my installation works well. I tried to find an example online to compare my CPU vs GPU speed, but couldn't find one I can execute myself in PyTorch.

Lastly, I am quite surprised that my code was slowed down by 33% when I just added .to("cpu") calls…
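For a CPU-versus-GPU comparison that can be run directly, a minimal sketch could time a large matrix multiplication on each available device. The function name `time_matmul`, the matrix size, and the repeat count are arbitrary choices for illustration. The workload is deliberately large, since a tiny one mostly measures launch overhead.

```python
import time
import torch


def time_matmul(device, n=1024, repeats=5):
    """Return the average seconds per n x n matrix multiplication on `device`."""
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    torch.matmul(a, b)  # warm-up: the first CUDA call includes one-time setup
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device.type == "cuda":
        torch.cuda.synchronize()  # GPU calls return before the work is finished
    return (time.perf_counter() - start) / repeats


devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda:0"))

timings = {str(d): time_matmul(d) for d in devices}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per matmul")
```

Without the `torch.cuda.synchronize()` calls, the GPU timing would only measure how long it takes to queue the work, not to finish it.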

I figured out the main problem: it was my installation. I hadn't installed torch with CUDA support, and torch 1.4.1 didn't warn about this. It said CUDA was available, which confused me, but I didn't question it.

After I updated the torch package, version 1.7.1 triggered an assertion that I had installed torch without CUDA support. After reinstalling, it works OK.
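A quick way to check an installation like this is to print what the build reports and then run an actual computation on the GPU. This is only a sketch of such a check, and it falls back to the CPU when no CUDA device is visible.

```python
import torch

print("torch version:", torch.__version__)
# None for CPU-only builds, otherwise the CUDA version the build was compiled against.
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    # Run a real computation on the GPU and copy the result back to the host.
    x = torch.ones(3, device="cuda:0")
    result = (x * 2).sum().item()
else:
    # CPU fallback so the script still runs without a GPU.
    result = (torch.ones(3) * 2).sum().item()

print("sanity result:", result)
```

If `torch.version.cuda` prints `None`, the installed package is a CPU-only build, regardless of what is installed system-wide.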

Now there is no time difference between using .to("cpu") calls and the plain CPU computation, which reassures me a lot (both come in at 7 seconds). With .to("cuda:0") it takes 11 seconds, which is still considerably slower than the CPU, but I can work with that and try things out.

I would appreciate it if somebody with experience would comment on the way I define my net and call it in the training loop. To me it looks wrong, but I am too inexperienced to find a better way.
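One possible restructuring, offered as a sketch rather than a definitive fix: keep the hidden layers in an nn.ModuleList so they are registered with the module, move the model to the device once, and feed all samples through the net as a single batch instead of looping over them. The names `internal_neurons` and `hidden_layer_count` come from the code above. The tanh activation, the sine target, the Adam optimizer, and the summed MSE loss are assumptions standing in for `activation1`, `target_function.f`, `opt`, and `loss_func`, which were not shown.

```python
import numpy as np
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
internal_neurons, hidden_layer_count = 50, 3  # within the ranges mentioned above


class Net(nn.Module):
    def __init__(self, d=1):
        super().__init__()
        self.fc1 = nn.Linear(d, internal_neurons)
        # nn.ModuleList registers the hidden layers, so net.parameters() and
        # net.to(device) include them. A plain Python list does not do this.
        self.fc = nn.ModuleList(
            [nn.Linear(internal_neurons, internal_neurons)
             for _ in range(hidden_layer_count)]
        )
        self.fc3 = nn.Linear(internal_neurons, 1)

    def forward(self, y):
        y = torch.tanh(self.fc1(y))  # tanh stands in for activation1 (assumption)
        for layer in self.fc:
            y = torch.tanh(layer(y))
        return self.fc3(y)  # activation2 left out (assumption: regression output)


torch.manual_seed(0)
net = Net().to(device)  # a single call moves every registered layer
opt = torch.optim.Adam(net.parameters(), lr=1e-2)  # assumed optimizer
loss_func = nn.MSELoss(reduction="sum")  # assumed; mirrors summing per-sample losses

# Build the data once, outside the epoch loop, as (N, 1) tensors on the device.
x_values = np.linspace(0.0, 1.0, 100, dtype=np.float32)
x_train = torch.from_numpy(x_values).unsqueeze(1).to(device)
y_train = torch.sin(x_train)  # stand-in for target_function.f (assumption)

net.train()
losses = []
for epoch in range(200):
    opt.zero_grad()
    loss = loss_func(net(x_train), y_train)  # one batched forward pass per epoch
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With a plain list, the hidden layers would be missing from net.parameters(), so an optimizer built from it would never update them. Batching also replaces the many tiny per-sample GPU calls and host-to-device copies with a few larger ones per epoch.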