Running out of memory, when there seemingly should be enough

I am running a latent space model on a network. The network itself doesn't take up more than 2 GB of memory when stored in the computer's local RAM.

But when I try to run my PyTorch model, I get the following error:

RuntimeError: cuda runtime error (2) : out of memory at c:\anaconda2\conda-bld\pytorch_1519501749874\work\torch\lib\thc\generic/THCTensorMathPointwise.cu:343

I am running on a GTX 1080 Ti, which should have more than enough memory to handle this.

def distMatrix(m):
    n = m.size(0)
    d = m.size(1)
    x = m.unsqueeze(1).expand(n, n, d)
    y = m.unsqueeze(0).expand(n, n, d)
    return torch.sqrt(torch.pow(x - y, 2).sum(2) + 1e-4)

def loss(tY):
    d = -distMatrix(tZ)+B
    sigmoidD = torch.sigmoid(d)
    reduce = tY*torch.log(sigmoidD)+(1-tY)*torch.log(1-sigmoidD)
    # remove the diagonal (self-pairs); n is the number of nodes
    n = tY.size(0)
    reduce[torch.eye(n).byte().cuda()] = 0
    return -reduce.sum()

tZ = autograd.Variable(torch.cuda.FloatTensor(Z), requires_grad=True)
B = autograd.Variable(torch.cuda.FloatTensor([0]), requires_grad=True)

tY = autograd.Variable(torch.cuda.FloatTensor(Ytest), requires_grad=False)

losses = []
biases = []
testScore = []

learning_rate = 1e-3
epochs = 10000

sigmoid = np.vectorize(lambda x: math.exp(-np.logaddexp(0, -x)))

percentDone = 0
percent = 2
for i in range(1,epochs):
    count = (float(i)/epochs)*100
    if count % percent == 0:
        print(count)

    l = loss(tY)
    l.backward(retain_graph=True)
    losses.append(float(l))
    biases.append(B.data)
    tZ.data = tZ.data - learning_rate * tZ.grad.data
    B.data = B.data - learning_rate * B.grad.data
    
    tZ.grad.data.zero_()
    B.grad.data.zero_()

Z is 25059 by 2, Ytest is 25059 by 25059

Any hints on how to avoid the memory error would be greatly appreciated.

I think the loss calculation might blow up your memory usage.
Let’s have a look at distMatrix.
You are calling this function with tZ, which has dimensions [25059, 2] and therefore has 50118 elements.
Then you are creating x and y. They have the same shape of [25059, 25059, 2], so 1,255,906,962 elements each.

If we use 4 bytes (float32) for each element, we would have (50118+2*1255906962)*4 / 1e9 ~ 10.05GB memory consumption for this method.
Add the roughly 2GB for the network itself and you end up with more or less 12GB.
For the training you also need to store some activations, which take up additional memory.
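The estimate above can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic: it counts the two [n, n, d] tensors the same way the estimate does, and the variable names are made up for illustration. It also adds the size of Ytest, which is an n by n float32 matrix on the GPU.

```python
# Back-of-the-envelope memory estimate for distMatrix, assuming float32.
n, d = 25059, 2              # shape of Z, as given in the question
bytes_per_float32 = 4

z_elements = n * d                   # elements in tZ
expanded_elements = n * n * d        # elements in each [n, n, d] tensor

dist_bytes = (z_elements + 2 * expanded_elements) * bytes_per_float32
dist_gb = dist_bytes / 1e9

# Ytest is [n, n] float32 and lives on the GPU as well.
y_gb = n * n * bytes_per_float32 / 1e9

print(f"distMatrix intermediates: {dist_gb:.2f} GB")
print(f"Ytest: {y_gb:.2f} GB")
```

Either way, the [n, n, d] intermediates dominate, so shrinking them is where the savings are.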

How much memory does your GPU have?

The GPU has 11 GB of memory.
I am just trying to optimize the cost function (loss). There is no neural network part in my solution, so I do not think I am storing any activations or weights, unless I am misunderstanding how it works (which is quite likely…).

How do I go about solving this problem? Mini-batching? I am new in this field, any hints, ideas or nudges would be a huge help! :smiley:

The backward() call on your loss calculates the gradients you use in the following lines. To do that, autograd keeps some intermediate values from the forward pass, which also need memory.
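If you want to see how much the forward and backward pass really allocate, you could measure the peak on a small network first. This is a rough sketch: `peak_memory_mb` is a hypothetical helper, not something from the code above, and it assumes a reasonably recent PyTorch that provides `torch.cuda.reset_peak_memory_stats`.

```python
import torch

def peak_memory_mb(fn):
    """Run fn() and return the peak CUDA memory allocated by tensors, in MB.

    Returns None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()   # start the peak counter from the current usage
    fn()
    torch.cuda.synchronize()               # make sure all queued kernels have finished
    return torch.cuda.max_memory_allocated() / 1e6

# Example usage, with loss and tY as defined in the question:
# print(peak_memory_mb(lambda: loss(tY).backward()))
```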

Batching might help. Is 25059 your batch size? Do you need it to be that large?
Could you explain the use case a bit more? Maybe the solution is straightforward. :wink:
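To make the batching idea concrete, here is one way it could look for this loss: process the rows of the matrix in blocks, call backward() on each block, and let the gradients accumulate. This is a minimal sketch under a few assumptions. It uses the current PyTorch API instead of Variable, the names `block_loss`, `accumulate_gradients` and `block_size` are made up for illustration, and it swaps the hand-written log-sigmoid terms for `binary_cross_entropy_with_logits`, which computes the same quantity in a numerically stable way.

```python
import torch
import torch.nn.functional as F

def block_loss(z, bias, y_rows, start):
    """Loss for rows start .. start + b of the full [n, n] matrix."""
    b = y_rows.size(0)
    zb = z[start:start + b]                           # [b, d]
    diff = zb.unsqueeze(1) - z.unsqueeze(0)           # [b, n, d] instead of [n, n, d]
    dist = torch.sqrt(diff.pow(2).sum(2) + 1e-4)      # [b, n]
    logits = bias - dist
    # Equals -(y * log(sigmoid) + (1 - y) * log(1 - sigmoid)), element-wise.
    nll = F.binary_cross_entropy_with_logits(logits, y_rows, reduction="none")
    # Zero out the diagonal entries (self-pairs) that fall inside this block.
    mask = torch.ones_like(nll)
    rows = torch.arange(b, device=z.device)
    mask[rows, rows + start] = 0
    return (nll * mask).sum()

def accumulate_gradients(z, bias, y, block_size):
    """Fill z.grad and bias.grad block by block; return the total loss."""
    total = 0.0
    for start in range(0, z.size(0), block_size):
        y_rows = y[start:start + block_size]
        loss = block_loss(z, bias, y_rows, start)
        loss.backward()        # this block's graph is freed here; gradients add up
        total += loss.item()
    return total
```

After `accumulate_gradients` returns, the update step can stay as it is in the question (subtract `learning_rate` times the gradient, then zero the gradients). Only one [b, n, d] tensor per intermediate is alive at a time, so `block_size` trades memory for the number of loop iterations. Ytest could presumably also stay in CPU memory and be moved to the GPU one block at a time.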

It's a latent space model for graphs: it takes in a graph and a Z vector with K dimensions and then translates it to latent space. This is done by optimizing the loss function (the one shown in the question).

I absolutely do not need batches of that size, making them smaller would be great. I am just at a loss as to how to go about doing that.

Furthermore, when I run my code (on a small network that fits in RAM) I get (relatively) low GPU usage.

What could be the cause of this? Shouldn't it be able to run at close to 100%?