Prevent intermediate states from accumulating in a loop

I’m trying to accumulate results into a variable inside a loop, as follows. In each iteration of the loop, I do some computation using the parameters and add the result to total_act. Note that the actual operation I’m trying to do is more complicated; this is just a minimal example that reproduces the problem.

import torch
import torch.nn as nn
from torch.nn import Parameter

num_x = 20000
num_y = 1000
emb_dim = 500

class Model(torch.nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.x_embed = Parameter(torch.FloatTensor(num_x, emb_dim))
        self.y_embed = Parameter(torch.FloatTensor(num_y, emb_dim))

        self.w = Parameter(torch.FloatTensor(emb_dim, emb_dim))

    def forward(self):
        total_act = 0

        for i in range(num_y):
            import pdb; pdb.set_trace()  # break here to inspect GPU memory usage

            trans = self.x_embed - self.y_embed[i]
            trans = torch.mm(trans, self.w)  # If we remove this line, the problem doesn't occur

            total_act += trans  # every iteration's graph stays referenced through total_act

        return total_act

model = Model()
model = model.cuda()

final_act = model.forward()

What I notice is that the memory occupied on the GPU keeps increasing with each iteration of the for loop. I’ve put a pdb breakpoint in the loop so that memory consumption can be tracked at each iteration.

My guess is that this is due to the intermediate states of the various variables occupying space, as though a separate graph is being built for every single iteration of the loop. How do I solve this problem?
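For reference, this is how I’m checking the usage at the pdb prompt, using torch.cuda.memory_allocated() (which reports the memory currently held by tensors on the default CUDA device); the number it returns keeps growing with every iteration:

# Run at the pdb prompt inside the loop: memory held by tensors, in MiB
print(torch.cuda.memory_allocated() / 1024 ** 2)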

Hello,

In my view, you could use a temporary buffer to perform the calculation and release it after accumulating. But I think that if the variable requires gradients, the intermediate computation graph is needed for each iteration, and that is unavoidable.

If you find a better method, please let me know. Thank you.
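Edit: here is a rough sketch of what I meant, written as an extra method on your Model. It only helps when you do not need gradients through the loop: under torch.no_grad() no graph is recorded, so each iteration’s intermediates are freed and memory stays flat.

def forward_no_grad(self):
    # Temporary buffer that is reused across iterations; no autograd
    # history is recorded, so memory does not grow with the loop.
    total_act = torch.zeros(num_x, emb_dim, device=self.w.device)
    with torch.no_grad():
        for i in range(num_y):
            trans = torch.mm(self.x_embed - self.y_embed[i], self.w)
            total_act += trans  # in-place accumulation into the buffer
    return total_act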

Hi,

Thank you for your reply!

I didn’t quite get what you meant by “use a temporary buffer to perform the calculation”. Could you provide an example?

Also, I managed to find a temporary workaround by defining a custom autograd Function and writing its backward pass manually (code follows). This is far from satisfactory, so I’m still on the lookout for better solutions.

import torch
import torch.nn as nn
from torch.nn import Parameter

num_x = 20000
num_y = 1000
emb_dim = 500

class MyFunc(torch.autograd.Function):

    @staticmethod
    def forward(ctx, X, Y, W):

        ctx.save_for_backward(X, Y, W)
        total_act = 0

        num_y = Y.shape[0]

        for i in range(num_y):
            trans = X - Y[i].expand_as(X)
            trans = torch.mm(trans, W)
            total_act += trans

        return total_act

    @staticmethod
    def backward(ctx, grad_output):
        X, Y, W = ctx.saved_tensors
        num_y = Y.shape[0]
        sum_Y = torch.sum(Y, dim=0)

        # d/dX of sum_i (X - Y[i]) W is num_y * grad_output W^T
        grad_X = num_y * torch.mm(grad_output, torch.t(W))
        # every row of Y receives the same gradient: minus the column sums of grad_output W^T
        grad_Y = -torch.sum(torch.mm(grad_output, torch.t(W)), dim=0).repeat(num_y, 1)
        # d/dW of sum_i (X - Y[i]) W is (num_y * X - sum_i Y[i])^T grad_output
        grad_W = torch.mm(torch.t((num_y * X) - sum_Y.expand_as(X)), grad_output)

        return (grad_X, grad_Y, grad_W)
 
class Model(torch.nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.x_embed = Parameter(torch.FloatTensor(num_x, emb_dim))
        self.y_embed = Parameter(torch.FloatTensor(num_y, emb_dim))

        self.w = Parameter(torch.FloatTensor(emb_dim, emb_dim))

        self.custom_op = MyFunc.apply

    def custom_forward(self):
        return self.custom_op(self.x_embed, self.y_embed, self.w)

model = Model()
model = model.cuda()

final_act = model.custom_forward()
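As a sanity check on the hand-written backward, I compare it against numerical gradients with torch.autograd.gradcheck on tiny double-precision inputs (gradcheck needs float64, and the sizes are deliberately small). This snippet is meant to run after the definitions above:

# Tiny double-precision inputs for the finite-difference comparison
X = torch.randn(6, 4, dtype=torch.double, requires_grad=True)
Y = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
W = torch.randn(4, 4, dtype=torch.double, requires_grad=True)

# Checks the analytic gradients returned by MyFunc.backward
print(torch.autograd.gradcheck(MyFunc.apply, (X, Y, W)))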