I’m using a WaveNet to predict n time steps based on an input sequence. The training code looks like this:
def train_batch(self, input_batch, target_batch):
    input_batch = Variable(input_batch)
    target_batch = Variable(target_batch)

    # Zero gradients of the optimizer
    self.optimizer.zero_grad()

    target_len = target_batch.size(0)
    batch_size = input_batch.size(0)

    # Move new Variables to CUDA
    predictions = Variable(torch.zeros(target_len, batch_size))
    if USE_CUDA:
        input_batch = input_batch.cuda()
        target_batch = target_batch.cuda()
        predictions = predictions.cuda()

    for t in range(target_len):
        output_batch = wavenet(input_batch)
        predictions[t] = output_batch[:, 0, -1]
        # Drop the oldest sample and feed the newest prediction back in
        input_batch = torch.cat([input_batch[:, :, 1:], output_batch[:, :, -1:]], dim=2)

    # Loss calculation and backpropagation
    loss = self.criterion(predictions, target_batch)
    loss.backward()
    self.optimizer.step()
I found that the GPU memory usage is proportional to target_len. My understanding is that PyTorch allocates memory for the wavenet graph at every time step. If that’s right, it shouldn’t be necessary: each time step needs the output of the previous step, so the process runs sequentially anyway, and one wavenet instance should be enough. As a result, my implementation with a single wavenet layer takes about 5 GB of GPU memory, while a TensorFlow implementation with three wavenet layers fits on an 8 GB GPU. Is there any solution? Is there any way to control how the memory is allocated?
Thanks!
BTW, a similar topic is here:
The difference is that they found the GPU memory is proportional to the input length, while I found it is also proportional to the output length.
The thing is that to be able to do backpropagation, you need to keep some intermediary results computed during the forward pass.
So it is expected that the memory usage increases when you run more forward passes, because there are more intermediary results to save.
If you only want to do a forward pass and have all resources released, you can create the input Variable with volatile=True.
If you want to backward, you need to keep intermediary results on everything that you are going to backward.
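For example, a rough sketch of the forward-only case (input_tensor here is just a placeholder name for your raw input, and wavenet is the model from your code):

from torch.autograd import Variable

# volatile=True marks the whole computation as inference-only: no intermediary
# results are saved for backward, so memory does not grow with the number of steps.
input_batch = Variable(input_tensor, volatile=True)
output_batch = wavenet(input_batch)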
I tried recreating the input_batch, but got the same result. I’m a little confused: since I recreate the input_batch, the intermediary results should no longer exist. Later I started to think it is not about the intermediary results, but that the compiler doesn’t know the relation between the time steps and just creates an instance for each one, which wastes a lot of memory. Since the compiler isn’t that smart yet, is there any way to tell it explicitly that I only need one instance here?
I guess you keep the Variable output_batch that was the output of the previous time step? As long as this Variable exists, the intermediary results will remain.
If you look back at the tutorials, you should do one of the following things:
For evaluation, just use volatile=True.
When computing gradients, you should wrap your input in a Variable, forward it, zero_grad, backward, step, and then start over (a sketch of this is below). If you keep the network output around, it will keep the corresponding graph alive.
There is no compiler in pytorch; the graph is just created when you work with Variables and destroyed when these Variables no longer exist in Python.
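A rough sketch of that pattern as it would sit inside your training method, reusing the names from your code (loader is just a placeholder for wherever your batches come from):

for input_tensor, target_tensor in loader:        # placeholder data source
    input_batch = Variable(input_tensor.cuda())   # wrap a fresh input each iteration
    target_batch = Variable(target_tensor.cuda())
    self.optimizer.zero_grad()
    output_batch = wavenet(input_batch)           # forward pass builds the graph
    loss = self.criterion(output_batch, target_batch)
    loss.backward()                               # backward frees the intermediary buffers
    self.optimizer.step()
    # Do not keep output_batch or loss around between iterations; once they are
    # overwritten, the corresponding graph is released.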
I read this post. My issue is similar to the second problem in it.
I actually feed the output back as part of the input to the model tens of times, and only then calculate the loss. If the model graph and its variables are generated in every loop iteration, that takes a lot of memory. I’m also not sure whether this lengthens the backward path and makes backprop harder.
In TensorFlow, I can use while_loop to run the model several times and then calculate the loss.
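I’m not sure this is acceptable for your model, but if you can live with gradients flowing only within each time step (not back through the fed-back samples), one way to keep memory roughly constant in target_len is to call backward for each step separately and detach the sample you feed back. A sketch, as a variation of the loop in your train_batch:

self.optimizer.zero_grad()
total_loss = 0
for t in range(target_len):
    output_batch = wavenet(input_batch)
    loss_t = self.criterion(output_batch[:, 0, -1], target_batch[t])
    loss_t.backward()              # frees this step's graph immediately
    total_loss += loss_t.data[0]   # keep only the number, not the graph
    # detach() cuts the graph at the feedback point, so the next step
    # does not extend the previous step's graph
    input_batch = torch.cat([input_batch[:, :, 1:],
                             output_batch[:, :, -1:].detach()], dim=2)
self.optimizer.step()

This is only a sketch of the memory behaviour, not full backprop through time, so whether it fits your training objective is up to you.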