GPU taking more time than CPU for one iteration [ First Post ]

Hi, I am new to PyTorch. I have written a basic custom network. The training framework below takes about 5 minutes for one update_model call (one iteration) on the CPU, but approximately 7 minutes on the GPU. Is there a better way of accumulating loss values? I want the loss from each layer, averaged over an entire batch. So, is there a way to do batch training with a custom-written forward function? Also, while training the model, RAM usage keeps piling up, and I assume this is because of accumulating the loss from each batch. Is that assumption right?

The code below defines one layer of the network:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from math import log

# assumes K is defined globally
class DetNet(nn.Module):
    def __init__(self):
        super(DetNet, self).__init__()
        # nn.Parameter sets requires_grad=True by default
        self.W1 = nn.Parameter(torch.randn(8 * K, 5 * K, dtype=torch.float64))
        self.b1 = nn.Parameter(torch.randn(8 * K, 1, dtype=torch.float64))
        self.W2 = nn.Parameter(torch.randn(K, 8 * K, dtype=torch.float64))
        self.b2 = nn.Parameter(torch.randn(K, 1, dtype=torch.float64))
        self.W3 = nn.Parameter(torch.randn(2 * K, 8 * K, dtype=torch.float64))
        self.b3 = nn.Parameter(torch.randn(2 * K, 1, dtype=torch.float64))
        self.t  = nn.Parameter(torch.randn(1, 1, dtype=torch.float64))

    def forward(self, x, v, y, H):
        M1 = torch.matmul(torch.transpose(H, 0, 1), y)
        M2 = torch.matmul(torch.transpose(H, 0, 1), torch.matmul(H, x))
        con = torch.cat((M1, x, M2, v))
        z = F.relu(torch.matmul(self.W1, con) + self.b1)
        y = torch.matmul(self.W2, z) + self.b2
        # match the dtype and device of the inputs; otherwise this
        # tensor stays float32 on the CPU and breaks the GPU run
        one_K = torch.ones(K, 1, dtype=y.dtype, device=y.device)
        x_k = (F.relu(y + one_K * self.t) / torch.abs(self.t)
               - F.relu(y - one_K * self.t) / torch.abs(self.t) - one_K)
        v_k = torch.matmul(self.W3, z) + self.b3
        return (x_k, v_k)

The code below updates the network for one epoch:

def update_model(optimizer):
    n_samples = 3500  # number of samples for each iteration
    H = varying_channel()
    loss = torch.zeros(1, dtype=torch.float64)
    for sample in range(n_samples):
        # generate one random data sample with entries in {-1, +1}
        x_main = torch.DoubleTensor([[2 * round(np.random.rand()) - 1] for cnt in range(K)])
        v = torch.zeros(2 * K, 1, dtype=torch.float64)
        y = received_signal(x_main, H)
        x_tilde = ZF_decoder(H, y)
        x = x_main
        # pass through each layer, accumulating the weighted loss
        for cnt in range(3 * K):
            x, v = DetLayers[cnt](x, v, y, H)  # call the module, not .forward()
            curr_loss = log(cnt + 1) * torch.sum((x_main - x) ** 2) / torch.sum((x_main - x_tilde) ** 2)
            loss = loss + curr_loss
    loss = loss / n_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The code below creates the specified number of layers and runs the model for the specified number of iterations:

# create the layers and collect all of their parameters
DetLayers = []
ParamLayers = []
for cnt in range(3 * K):
    curr_layer = DetNet()
    DetLayers.append(curr_layer)
    ParamLayers.extend(curr_layer.parameters())

# create the optimizer
optimizer = optim.Adam(ParamLayers, lr=0.01)

# number of iterations
for iteration in range(1):
    val = update_model(optimizer)
    print(val)

Thanks in advance. 🙂

Yes, essentially you are building a very large computational graph until backward is called. I don’t know what the value of K is, but it’s possible that the GPU version is dominated by data transfers from host to device.
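
One way to keep memory bounded with your current structure is to call backward once per sample: each sample’s graph is freed immediately, while the gradients keep accumulating in the parameters’ .grad buffers until step(). A minimal sketch, reusing your helpers and loss weighting (untested):

def update_model(optimizer):
    H = varying_channel()
    n_samples = 3500
    optimizer.zero_grad()
    running_loss = 0.0
    for _ in range(n_samples):
        x_main = torch.DoubleTensor([[2 * round(np.random.rand()) - 1] for _ in range(K)])
        v = torch.zeros(2 * K, 1, dtype=torch.float64)
        y = received_signal(x_main, H)
        x_tilde = ZF_decoder(H, y)
        x = x_main
        sample_loss = torch.zeros(1, dtype=torch.float64)
        for cnt in range(3 * K):
            x, v = DetLayers[cnt](x, v, y, H)
            sample_loss = sample_loss + log(cnt + 1) * torch.sum((x_main - x) ** 2) / torch.sum((x_main - x_tilde) ** 2)
        # backward here frees this sample's graph right away;
        # the gradients of all samples still sum up before step()
        (sample_loss / n_samples).backward()
        running_loss += sample_loss.item() / n_samples
    optimizer.step()
    return running_loss

This computes the same averaged gradient as your version, but peak memory is roughly one sample’s graph instead of 3500 of them.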

Hello @futscdav, thanks for the reply. I have taken K as 30 there. How much better, computationally, is a batched training setup compared to mine, which accumulates the loss sample by sample? And is there a way to tackle the above problem and leverage the GPU resources correctly?

The big thing is that autograd will not split the graph on a sample-by-sample basis when you use batch operations. That alone helps a ton, but sometimes it cannot be done; I don’t know what your application is. I would start there and see if you still have the GPU throughput issue.
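
For instance, torch.matmul broadcasts over a leading batch dimension, so your forward can process all samples in one shot. A rough sketch, under the assumption that H is shared across the batch and x, v, y gain a leading batch dimension (x: (B, K, 1), v: (B, 2K, 1), y: (B, m, 1)):

def forward(self, x, v, y, H):
    Ht = torch.transpose(H, 0, 1)                     # (K, m)
    M1 = torch.matmul(Ht, y)                          # broadcasts to (B, K, 1)
    M2 = torch.matmul(Ht, torch.matmul(H, x))         # (B, K, 1)
    con = torch.cat((M1, x, M2, v), dim=1)            # (B, 5K, 1)
    z = F.relu(torch.matmul(self.W1, con) + self.b1)  # (B, 8K, 1)
    q = torch.matmul(self.W2, z) + self.b2            # (B, K, 1)
    one_K = torch.ones(K, 1, dtype=x.dtype, device=x.device)
    t = torch.abs(self.t)
    x_k = F.relu(q + one_K * self.t) / t - F.relu(q - one_K * self.t) / t - one_K
    v_k = torch.matmul(self.W3, z) + self.b3          # (B, 2K, 1)
    return (x_k, v_k)

The per-layer loss then averages over the batch, e.g. log(cnt + 1) * torch.mean(torch.sum((x_main - x) ** 2, dim=1) / torch.sum((x_main - x_tilde) ** 2, dim=1)), and backward is called once on the batch-averaged total.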