Memory (RAM) usage keeps going up every step

Hello, first of all I would like to say that I like PyTorch so far and am eager to see what it does in the future.

I am training a char-RNN, written as a custom Module because I want to save the last hidden state. However, my memory (RAM) usage grows with every step, and I don’t know where the leak is or what causes it.

Thanks in advance.

my model:

class CharRNN(torch.nn.Module):
    def __init__(...):
        ...
        self.hidden_state = None

    def forward(self, x):
        rnn_out, hidden = self.gru(x.view(1, 1, -1), self.hidden_state)
        logits = self.dense(rnn_out.view(1, -1))
        self.hidden_state = repackage_hidden(hidden)

        return logits

def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)
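
For readers on PyTorch 0.4 or later, where Variable was merged into Tensor, the same idea is usually written with `.detach()`. This is only a minimal sketch, assuming the hidden state is either a single tensor (GRU) or a tuple of tensors (LSTM). The name `detach_hidden` and the layer sizes are hypothetical and not taken from the original post:

```python
import torch


def detach_hidden(h):
    """Return the hidden state cut off from the graph that produced it."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    # LSTM-style hidden states are tuples of tensors; recurse into them.
    return tuple(detach_hidden(v) for v in h)


# Hypothetical sizes, chosen only to keep the example small.
gru = torch.nn.GRU(input_size=4, hidden_size=8)
hidden = None
for _ in range(3):
    x = torch.randn(1, 1, 4)        # (seq_len, batch, features)
    out, hidden = gru(x, hidden)
    hidden = detach_hidden(hidden)  # earlier steps are no longer reachable from it
```

The detached hidden state still carries its values into the next step, but backpropagation stops at it, so the graph from earlier steps can be released.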

my train loop:

# init model
char_RNN = CharRNN(n_feature, n_hidden, n_layer, n_feature, dropout)
loss_fn = torch.nn.CrossEntropyLoss()
char_RNN.double()
char_RNN.cuda()
print(char_RNN)
optimizer = torch.optim.Adam(char_RNN.parameters(), lr=1e-4)


# init dataset
.....

losses = []

for i_epoch in range(n_epoch):
    rand_idx = list(range(len(dataset)))  # a list, so random.shuffle can reorder it in place
    random.shuffle(rand_idx)

    for i_minibatch in range(len(dataset)):
        x, y = dataset[rand_idx[i_minibatch]]
        x = Variable(torch.LongTensor(x), requires_grad=False).cuda()
        y = Variable(torch.LongTensor(y), requires_grad=False).cuda()
        char_RNN.zero_grad()

        loss = 0
        for i_char in range(len(x)):
            out = char_RNN(x[i_char])
            loss = loss + loss_fn(out, y[i_char])
        losses.append(loss.data[0] / float(len(x)))

        loss.backward()
        optimizer.step()

        if i_minibatch % 500 == 0:
            # print summary every 500 steps
            summary = 'epoch {} batch {} avg. loss: {}'.format(i_epoch, i_minibatch, sum(losses) / float(len(losses)))
            print(summary)
            losses = []

Is the leak on CPU memory or GPU memory? How quickly is it leaking?
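
One way to answer the "how quickly" part for CPU memory is to log the process's peak resident set size every so often. The sketch below uses only the Python standard library and assumes a Unix-like system. The helper name `peak_rss_mb` is hypothetical, and the loop merely simulates a leak rather than running a real training step:

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


history = []
leak = []  # stand-in for whatever a training step accidentally keeps alive
for step in range(5):
    leak.append(b"x" * (10 * 1024 * 1024))  # retain roughly 10 MiB per step
    history.append(peak_rss_mb())
    print("step {} peak RSS: {:.1f} MiB".format(step, history[-1]))
```

If the logged value keeps climbing across steps of a real training loop, the growth is on the CPU side. GPU memory has to be checked separately, for example with `nvidia-smi`.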

CPU memory. It’s slow enough that you don’t notice it at first. My 8 GB of CPU memory is full after the first epoch (approx. 144,000 mini-batches) and it starts spilling to swap. Also, from epoch 2 onward the network becomes really slow: it only utilizes 5%-35% of the GPU, whereas in the first epoch it was always around 90%.

I can upload the source code along with the dataset in the morning if you want.

Do you have cuDNN installed and enabled?

If so, could you check whether your observation is the same as the issue described here? https://github.com/pytorch/pytorch/issues/3665

Yes.
I think it’s the same issue. I’ve tried running with cuDNN disabled, and so far my memory usage is constant. Thank you.
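
For anyone who wants to try the same workaround, cuDNN can be switched off globally through `torch.backends.cudnn.enabled`. This is a minimal sketch with hypothetical layer sizes. It is not the original poster's script, and whether it helps depends on the PyTorch and cuDNN versions involved:

```python
import torch

# Globally turn off cuDNN; RNN layers then fall back to PyTorch's native kernels.
torch.backends.cudnn.enabled = False

# Hypothetical sizes, chosen only to keep the example small.
gru = torch.nn.GRU(input_size=4, hidden_size=8)
if torch.cuda.is_available():
    gru = gru.cuda()
device = next(gru.parameters()).device

out, hidden = gru(torch.randn(1, 1, 4, device=device))
print(torch.backends.cudnn.enabled, tuple(out.shape))
```

Set the flag before building and running the model. Training will likely be slower without cuDNN, so this is best treated as a diagnostic step or a temporary workaround.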

It would be nice if this issue were mentioned in the nn documentation, to keep others from stumbling into the same problem.

Does anyone know what causes the CPU memory usage? As I understand it, after calling .cuda() the ops should be computed on the GPU, so I would not expect much CPU memory to be used. In practice, though, CPU memory usage is large (>1 GB).

Hi all, the same thing is happening to me, but not always, and that’s the weird part: the inconsistency. I’m using CPU only on a gcloud VM. Note that I had a successful run of 100 epochs. I then changed the learning rate and the input/hidden dimensions very slightly, and memory usage kept increasing until my job was killed.

Are you storing some variables which were not detached from the computation graph, e.g. the loss in a list?
This would increase the memory usage in each iteration, so I’m currently a bit lost as to why your code seems to run well most of the time.
Also, could you share a (small) code snippet so that we could have a look?
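
To illustrate the pattern being asked about, here is a small sketch contrasting a log that stores loss tensors with one that stores plain floats. The model, data, and list names are hypothetical and only meant to show the difference, assuming PyTorch 0.4 or later where `.item()` is available:

```python
import torch

# Hypothetical toy model and data, just to produce a loss each step.
model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

leaky_log = []  # holds tensors: each one keeps a reference to its autograd graph
safe_log = []   # holds plain Python floats: nothing else is retained

for step in range(3):
    x = torch.randn(8, 4)
    y = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    leaky_log.append(loss)        # anti-pattern: still attached to autograd
    safe_log.append(loss.item())  # just the scalar value
```

Each entry in `leaky_log` still has a `grad_fn`, so memory grows with the number of logged steps, while `safe_log` only grows by one float per step.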

Thank you for the tip! I was detaching the losses before appending them, but I had a duplicated line that appended the loss to the list without detaching it, and I completely overlooked it. That was the issue. Thank you!
