No GPU utilization when running script

I wrote a model for latent space representation of graphs in PyTorch, using autograd to minimize a cost function. As far as I understand, it should be possible to run this on the GPU, but when I run my script I get 0% GPU utilization.

import numpy as np
import torch
import torch.autograd as autograd
import pickle
import networkx as nx


'''
Pairwise euclidean distance
'''
def distMatrix(m):
    n = m.size(0)
    d = m.size(1)
    x = m.unsqueeze(1).expand(n, n, d)
    y = m.unsqueeze(0).expand(n, n, d) 
    return torch.sqrt(torch.pow(x - y, 2).sum(2) + 1e-4)

def lossIdx(tY):
    d = -distMatrix(tZ)+B
    sigmoidD = torch.sigmoid(d)
    
    #calculating cost
    reduce = torch.add(torch.mul(tY, torch.log(sigmoidD)), torch.mul((1-tY), torch.log(1-sigmoidD)))
    #remove diagonal
    reduce[torch.eye(n).byte().cuda()] = 0
    return -reduce.sum()

def createBAnetwork(n, m, clusters):
    a = nx.barabasi_albert_graph(n,m)
    b = nx.barabasi_albert_graph(n,m)

    c = nx.union(a,b,rename=('a-', 'b-'))
    c.add_edge('a-0', 'b-0')

    for i in range(clusters-2):
        c = nx.convert_node_labels_to_integers(c)
        c = nx.union(a,c,rename=('a-', 'b-'))
        c.add_edge('a-0', 'b-0')
    return(c)

c = createBAnetwork(1000,3,3)
Y = np.asarray(nx.adjacency_matrix(c).todense())

#loading adjacency matrix from pickle file
#Y = pickle.load(open( "data.p", "rb" ))
n = np.shape(Y)[0]
k = 2

Z = np.random.rand(n,k)

tZ = autograd.Variable(torch.cuda.FloatTensor(Z), requires_grad=True)
B = autograd.Variable(torch.cuda.FloatTensor([0]), requires_grad=True)
tY = autograd.Variable(torch.cuda.FloatTensor(Y.astype("uint8")), requires_grad=False)

optimizer = torch.optim.Adam([tZ, B], lr = 1e-3)

steps = 50000

for i in range(steps):
    optimizer.zero_grad()
    l = lossIdx(tY).cuda()
    l.backward(retain_graph=True)
    optimizer.step()
    del l

It works and solves the problem, just not on the GPU; it runs on the CPU instead.

Any help on what's going wrong, or how to get this running on the GPU, would be highly appreciated.

The file can be found here: https://drive.google.com/open?id=1j6kkD9-YNW11iFfVYBzjYrvssFNcqSXf

How do you know it's not working on the GPU?
Since you initialized your data on the GPU, it would throw an error if that weren't possible.

The Windows Task Manager shows 0% GPU utilization while running the script; the data is loaded onto GPU memory, but the GPU itself is not being utilized. When running other scripts, particularly for neural networks, I get up to 80-90% GPU utilization. Here it just stays at 0%.

OK, I see. Maybe your actual workload is just small compared to the data movement (idx = torch.randperm(n)[:1000].cuda()) and expanding (x = m.unsqueeze(1).expand(n, n, d)).

Have you tried running it on the CPU? If the workload is really small (I cannot estimate it without the shapes), the data transfer overhead is probably bigger than the performance advantage of the GPU.
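
As a quick sanity check (just a sketch, reusing lossIdx and the tensors from your script), you could time a single step with explicit synchronization, since CUDA calls are asynchronous and a plain timer can be misleading:

import time
import torch

torch.cuda.synchronize()  # wait for any pending GPU work
t0 = time.time()

l = lossIdx(tY)           # forward pass on the GPU tensors
l.backward()              # backward pass

torch.cuda.synchronize()  # make sure all CUDA kernels have finished
print('one step took {:.4f}s'.format(time.time() - t0))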

I updated the code to remove the slicing; it was originally there because matrices of around 8000x8000 use up all of my GPU memory (11 GB). This problem is, however, small enough to fit entirely in memory (4039x4039).

Without the indexing it still doesn't run on the GPU.

When I run this script it executes fine, but it only uses the CPU and the GPU memory.

This is from running the script; note that the 5% is from using the Snipping Tool to take the screenshot…

I assume the window in the top left is showing the utilization?
For other scripts it's showing a high usage?
Could you try to profile your code with e.g. torch.utils.bottleneck?
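
For reference, bottleneck is run as a module wrapping your script from the command line:

python -m torch.utils.bottleneck my_script.py   # my_script.py is a placeholder for your file name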

Also, could you create random inputs with your shapes, so that I could try it on my machine?

You can use the file from the Google Drive link, with dimensions 4039x4039, or the function I just added to the original question; it generates a network with an adjacency matrix of size 3000x3000.

I ran the bottleneck tool on the script with the data.p file and 100 steps and got these results. I'm not sure how to interpret them; there seems to be barely any CUDA time.


Could you change

tY = autograd.Variable(torch.cuda.FloatTensor(Y.astype("uint8")), requires_grad=False)

to

tY = autograd.Variable(torch.cuda.FloatTensor(Y.astype(np.float32)), requires_grad=False)

and run it again?


Not much of a difference. I think the bottleneck might be in the


reduce[torch.eye(n).byte().cuda()] = 0

line.

Might be. That's what I suspected: some copy or transfer ops are taking more time than the actual computation.
Since the slicing index seems to be constant, could you pre-compute it somehow?
Or is it just an example?

I need to set the diagonal to 0 before taking the sum.

reduce[torch.eye(n).byte().cuda()] = 0

is what I am doing right now. Is there another, more efficient way of doing it?

edit:
I am currently creating an array with the indices, so I don't have to call .eye every time, and then using that list as an index to overwrite the diagonal:

diagArray = np.array([np.array(range(0,n)),np.array(range(0,n))])

You could pre-compute it somewhere outside of the function, like you are already doing with tZ and B:

# computed once, outside of lossIdx
idx = torch.eye(n).byte().cuda()
...
# inside lossIdx
reduce[idx] = 0

Alternatively, you could multiply reduce with a mask.
As a side note, I'm not sure it's a good idea to shadow the Python function name reduce with a variable.
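
For example, a minimal sketch of the mask idea (n = 5 and r here are just stand-ins for your shapes and per-pair loss terms):

import torch

n = 5
r = torch.randn(n, n).cuda()      # stand-in for the per-pair loss terms
mask = 1 - torch.eye(n).cuda()    # zeros on the diagonal, ones elsewhere
loss = -(r * mask).sum()          # the diagonal no longer contributes to the sum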

I've changed it a bit and get a peak GPU utilization (GTX 1070) of approx. 62%.

import time

import numpy as np
import torch


def distMatrix(m):
    n = m.size(0)
    d = m.size(1)
    x = m.unsqueeze(1).expand(n, n, d)
    y = m.unsqueeze(0).expand(n, n, d) 
    return torch.sqrt(torch.pow(x - y, 2).sum(2) + 1e-4)

def lossIdx(tY):
    d = -distMatrix(tZ)+B
    sigmoidD = torch.sigmoid(d)
    
    #calculating cost
    r = torch.add(torch.mul(tY, torch.log(sigmoidD)), torch.mul((1-tY), torch.log(1-sigmoidD)))
    #remove diagonal
    r = r * idx
    return -r.sum()

device = 'cuda:0'

Y = np.asarray(np.random.randn(3000, 3000))

n = np.shape(Y)[0]
k = 2
idx = 1 - torch.eye(n).to(device)

Z = np.random.rand(n,k)

tZ = torch.tensor(Z, dtype=torch.float, requires_grad=True, device=device)
B = torch.tensor([0.], requires_grad=True, device=device)
tY = torch.from_numpy(Y.astype(np.float32))
tY.requires_grad_(True)
tY = tY.to(device)

optimizer = torch.optim.Adam([tZ, B], lr = 1e-3)

steps = 100
torch.cuda.synchronize()
t0 = time.time()
for i in range(steps):
    optimizer.zero_grad()
    l = lossIdx(tY)  # the .cuda() call is no longer needed; the loss is already on the GPU
    l.backward(retain_graph=False)
    optimizer.step()

torch.cuda.synchronize()
print(time.time() - t0)

It seems to work, still not much utilization:

but at least there is some CUDA action now. The bottleneck is now the sum function; I can't really do anything about that one, can I?

I'm not sure what your code is doing, to be honest, but I think it's a good sign that a necessary operation is now what takes the time.

It creates a k-dimensional latent space representation tZ of a graph Y given in adjacency matrix format.
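
Reading it off the code: with $z_i$ the rows of tZ, each potential edge is modeled as $p_{ij} = \sigma(B - \lVert z_i - z_j \rVert)$, and the loss is the binary cross-entropy between these probabilities and the adjacency matrix, summed over all pairs except the diagonal:

$$\mathcal{L} = -\sum_{i \neq j} \left( Y_{ij}\log p_{ij} + (1 - Y_{ij})\log(1 - p_{ij}) \right)$$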

Thank you so much for your time, and especially for showing me that super cool and powerful bottleneck tool!
