Running a function on GPU

We can define an NN model and call model.cuda() to train the network on a GPU, where the optimization usually runs much faster than on a CPU. Is there a similar provision for optimizing custom functions rather than NN modules? For example, below is a custom function my_func that I am optimizing; it looks like optimizing on the GPU takes more time than on the CPU. Is it actually possible to achieve better speeds with GPUs in these cases?

import torch

def my_func(x, n):
    # element-wise transform followed by n logistic-map style iterations
    y = torch.sqrt(x) * x * (100 - x) / (3 + x)
    for i in range(n):
        y = y * (1 - y)
    return y

x = torch.ones(1, 1, device='cuda', requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.005)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for epoch in range(100):
        loss = -my_func(x, 1)
        optimizer.zero_grad()               # clear gradients for this training step
        loss.backward()                     # backpropagation, compute gradients
        optimizer.step()                    # apply gradients
        print('Epoch: ', epoch, '| train loss: %.4f' % loss.cpu().detach().numpy())

# print(prof.key_averages().table(sort_by="self_cpu_time_total"))
print(prof.key_averages().table())

Output:

-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                        CPU time        CUDA time            Calls        CPU total       CUDA total
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
sqrt                                        30.136us         30.935us              200       6027.214us       6186.993us
mul                                         27.928us         28.647us             1400      39099.503us      40105.608us
rsub                                        38.489us         39.156us              200       7697.704us       7831.288us
add                                         25.421us         26.067us              500      12710.364us      13033.745us
div                                         23.400us         24.024us              400       9360.093us       9609.656us
neg                                         32.287us         32.855us              500      16143.420us      16427.533us
torch::autograd::GraphRoot                  23.321us          7.905us              100       2332.057us        790.491us
NegBackward                                 94.102us         86.811us              100       9410.230us       8681.093us
MulBackward0                                70.496us         70.423us              300      21148.770us      21126.795us
RsubBackward1                               81.579us         81.515us              200      16315.861us      16302.944us
DivBackward0                               148.883us        148.674us              100      14888.263us      14867.443us
AddBackward0                                 5.043us          4.756us              100        504.294us        475.605us
SqrtBackward                                69.178us         69.117us              100       6917.797us       6911.730us
torch::autograd::AccumulateGrad             21.419us         21.917us              100       2141.940us       2191.669us
mul_                                        24.608us         24.226us              200       4921.564us       4845.275us
add_                                        22.829us         23.233us              200       4565.831us       4646.686us
addcmul_                                    21.290us         21.806us              100       2128.989us       2180.561us
addcdiv_                                    19.587us         20.430us              100       1958.716us       2043.045us
to                                          55.470us         55.027us              100       5546.997us       5502.680us
empty                                        6.057us          5.972us              100        605.682us        597.248us
detach                                       5.344us          5.164us              100        534.369us        516.393us
detach_                                      4.048us          3.775us               99        400.801us        373.734us
zero_                                       15.756us         15.235us               99       1559.891us       1508.311us

GPUs excel at doing the exact same thing to a lot of data items (say 512 or more) at once. Your 1-element input just doesn’t match what they’re good at. If you instead pass in a large vector and compute the function for each element, you will see a speedup.
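For instance, here is a minimal sketch of the batched variant (the 4096 by 4096 size and the x_small / x_large names are only for illustration; a CUDA device is assumed where available):

import torch

def my_func(x, n):
    # same function as above; every op is element-wise, so it applies
    # to all entries of x in parallel
    y = torch.sqrt(x) * x * (100 - x) / (3 + x)
    for i in range(n):
        y = y * (1 - y)
    return y

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# a single element cannot keep the GPU busy: kernel-launch overhead dominates
x_small = torch.ones(1, 1, device=device, requires_grad=True)

# the same per-element work on a large batch is where the GPU pulls ahead
x_large = torch.ones(4096, 4096, device=device, requires_grad=True)

loss = -my_func(x_large, 1).sum()   # reduce to a scalar before calling backward()
loss.backward()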
Some additional speed can be gained with the PyTorch fuser, which is invoked automatically when you decorate your function with @torch.jit.script (and add the type annotation n: int for n).
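A minimal sketch of the scripted variant, reusing the same function body (the name my_func_scripted is only illustrative; the annotation on n is the essential change):

import torch

@torch.jit.script
def my_func_scripted(x, n: int):
    # TorchScript can fuse chains of element-wise ops like these into fewer kernels
    y = torch.sqrt(x) * x * (100 - x) / (3 + x)
    for i in range(n):
        y = y * (1 - y)
    return y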

Best regards

Thomas