GPU memory usage increases as the output sequence length increases

Hi folks,

I’m using wavenet to predict n time steps based on an input sequence. The training code looks like this.

import torch
from torch.autograd import Variable

def train_batch(self, input_batch, target_batch):
    # Wrap the incoming tensors so autograd can track them
    input_batch = Variable(input_batch)
    target_batch = Variable(target_batch)

    # Zero gradients of the optimizer
    self.optimizer.zero_grad()

    target_len = target_batch.size(0)
    batch_size = input_batch.size(0)

    # Move new Variables to CUDA
    predictions = Variable(torch.zeros(target_len, batch_size))
    if USE_CUDA:
        input_batch = input_batch.cuda()
        target_batch = target_batch.cuda()
        predictions = predictions.cuda()

    # Autoregressive loop: feed the last predicted frame back in as input
    for t in range(target_len):
        output_batch = wavenet(input_batch)
        predictions[t] = output_batch[:, 0, -1]
        # Keep the last frame 3-D ([:, :, -1:]) so it can be concatenated along dim=2
        input_batch = torch.cat([input_batch[:, :, 1:], output_batch[:, :, -1:]], dim=2)

    # Loss calculation and backpropagation
    loss = self.criterion(predictions, target_batch)
    loss.backward()

I found that GPU memory usage is proportional to target_len. My understanding is that PyTorch allocates memory for a wavenet instance at every time step. If that is right, it should not be necessary: since each time step needs the output of the previous one, the process is executed sequentially anyway, so one wavenet instance should be enough. As a result, my implementation with a single wavenet layer takes about 5 GB of GPU memory, while a TensorFlow implementation with three wavenet layers can run on an 8 GB GPU. Is there any solution? Is there a way to control how this memory is allocated?

Thanks!

BTW, a similar topic is here:

The difference is that they found the GPU memory is proportional to the input length, while I found it is also proportional to the output length.

Hi,

The thing is that, to be able to do backpropagation, you need to keep some intermediary results computed during the forward pass.
So it is expected that memory usage increases as you run more forward passes, because there are more intermediary results to save.
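
To make that concrete, here is a minimal toy sketch (not the thread’s code; a made-up nn.Linear stands in for wavenet): every forward pass inside the loop adds nodes to the graph that is still reachable from the stored outputs, so the saved intermediary buffers cannot be freed until backward() has run.

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    net = nn.Linear(128, 128)              # stand-in for the wavenet model
    x = Variable(torch.randn(4, 128))
    outputs = []
    for t in range(50):
        x = net(x)                         # activations are saved for backward
        outputs.append(x)                  # keeping outputs keeps every step's graph alive
    loss = torch.stack(outputs).sum()      # the loss depends on all 50 steps
    loss.backward()                        # only now can the saved buffers be freed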

Thanks @albanD!

IMHO, the computations of each time step are independent, so the related resources should be released once that time step’s computation is done.

If you only want to do a forward pass and have all resources released, you can create the input Variable with volatile=True.
If you want to call backward, you need to keep the intermediary results for everything that you are going to backpropagate through.
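
As a minimal sketch of the evaluation case, assuming the pre-0.4 Variable API used in this thread (newer versions would use torch.no_grad() instead), with input_tensor standing in for your raw data:

    # Forward-only pass: volatile=True tells autograd not to record the graph,
    # so no intermediary buffers are kept and memory does not grow per step.
    input_batch = Variable(input_tensor, volatile=True)
    output_batch = wavenet(input_batch)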

I tried to cut the connections between each time step.
I changed this

    input_batch = torch.cat([input_batch[:, :, 1:], output_batch[:, :, -1:]], dim=2)

to

    input_batch = Variable(torch.cat([input_batch[:, :, 1:].data, output_batch[:, :, -1:].data], dim=2))

but got the same result. I’m a little confused: because I recreate input_batch, the intermediary results should not exist anymore. Later I thought it is not about the intermediary results, but that the compiler doesn’t know the relation between the time steps and just creates an instance for each time step, which wastes a lot of memory. Since the compiler is not that smart yet, is there any way to tell it explicitly that I only need one instance here?

I guess you keep the Variable output_batch that was the output of the previous time step? As long as this Variable exists, the intermediary results will remain.
If you look back at the tutorials, you should do one of the following things:

  • For evaluation, just use volatile=True.
  • When computing gradients, wrap your input in a Variable, forward it, call zero_grad, backward, step, and then start over (see the sketch after this reply). If you keep the network output around, it will keep the corresponding graph alive.

There is no compiler in PyTorch; the graph is created as you work with Variables and destroyed when those Variables no longer exist in Python.
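
A rough sketch of that training pattern, reusing the thread’s wavenet, criterion and optimizer names (the data loader is hypothetical); the key point is that nothing from one iteration is carried into the next, so each iteration’s graph can be freed:

    for input_tensor, target_tensor in loader:     # hypothetical data loader
        input_batch = Variable(input_tensor)        # fresh Variables each iteration
        target_batch = Variable(target_tensor)
        optimizer.zero_grad()
        output_batch = wavenet(input_batch)
        loss = criterion(output_batch, target_batch)
        loss.backward()                             # releases this iteration's saved buffers
        optimizer.step()
        # do not keep output_batch or loss past this point, or their graph stays alive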

@albanD Thanks!
Yes. The wavenet model uses the Nth step’s output_batch as the input_batch of the (N+1)th step. The issue happens during training.

Is it possible that the wavenet model is created for each time step, which takes too much memory? The model is much bigger than the variables.

Intermediary results are usually much more memory hungry than the model parameters.
This is expected behavior.
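
For a rough, illustrative sense of scale (numbers made up, not from this thread): a 1-D convolution with 64 input channels, 64 output channels and kernel size 2 has about 64 × 64 × 2 + 64 ≈ 8.3k parameters, roughly 33 KB in fp32, while a single saved activation for a batch of 32 sequences of length 4000 with 64 channels is 32 × 64 × 4000 ≈ 8.2M floats, roughly 33 MB, and one such buffer is kept per layer per outer time step until backward() runs.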

Really? I thought it was the model that takes most of the memory. Do you know how to trace and calculate the space used by the intermediary variables?

I don’t know about any automated way to do this.

I read this post. My issue is similar to the second problem in it.

I actually feed the output back as part of the input to the model tens of times, and then calculate the loss. If the model graph and its variables are generated in every loop iteration, that will take a lot of memory. I’m also not sure whether this lengthens the backward path and makes the backpropagation harder.

In TensorFlow, I can use while_loop to run a model several times and then calculate the loss.
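
For reference, a rough PyTorch sketch of that pattern, reusing the thread’s names (wavenet, criterion, input_batch, target_batch, target_len): a plain Python for loop plays the role of while_loop, but every iteration that stays connected to the loss adds its intermediary buffers to the graph; the commented-out detach() line marks where the graph could be cut, trading gradient flow across time steps for flat memory.

    predictions = []
    for t in range(target_len):
        output_batch = wavenet(input_batch)
        predictions.append(output_batch[:, 0, -1])
        next_frame = output_batch[:, :, -1:]
        # next_frame = next_frame.detach()   # cutting the graph here keeps memory flat,
        #                                    # but gradients stop flowing across time steps
        input_batch = torch.cat([input_batch[:, :, 1:], next_frame], dim=2)
    loss = criterion(torch.stack(predictions), target_batch)
    loss.backward()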