In my model I sum the outputs of three layers and pass that sum through a tanh and another layer, as in the code below. With torch 0.3, adding the three variables does not increase GPU memory usage, but in 0.4 the memory usage increases on every variable addition, producing an OOM after a few iterations (this step is the attention of a decoder).
How can one fix this so that adding these variables produces no increase in GPU memory usage, and why does this happen in 0.4 but not in 0.3?
processed_query = self.query_layer(query)
processed_memory = self.memory_layer(memory)
processed_attention_weights = self.location_layer(attention_weights)
alignment = processed_query + processed_attention_weights + processed_memory
alignment = self.v(F.tanh(alignment))
This could be a regression. Could you build the latest master and verify that the memory is increased there?
@richard The increase in memory on the addition happens with 0.4.0a0+539b1ed but doesn’t happen with 0.4.0a0. Is it correct to say that 0.4.0a0 came earlier than 0.4.0a0+539b1ed?
Just to make sure, iterations here mean training iterations, right? That is, graph is freed after each iteration (usually via
Iterations here mean decoder iterations that happen within a single training iteration. We hit OOM before getting to loss.backward() with 0.4.0a0+539b1ed.
I see. Might be a regression then. Could you try current master and see if it is fixed already?
@SimonW, @richard I upgraded to pytorch 0.4.0a0+02b758f and it continues to run out of memory on a 16 GB GPU with batch size 32, whereas with pytorch 0.3 it could run the same batch size of 32 on a smaller 12 GB GPU.
It is able to go past the decoder steps but fails in subsequent operations, more specifically when computing dropout.
Definitely a regression then! Since we don’t have access to your code, could you try to come up with a small reproducing example? If this commit doesn’t OOM, you can use
torch.cuda.max_memory_allocated() to print out the memory usage (http://pytorch.org/docs/master/cuda.html#memory-management) for the small example.
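As a rough illustration of the suggestion above, the sketch below wraps a piece of code and reports the peak memory that tensors allocated on the GPU while it ran. The helper name `peak_memory_mb` is hypothetical, and the sketch assumes a recent PyTorch release where `torch.cuda.reset_peak_memory_stats()` is available (it is not part of 0.4). Note that `torch.cuda.max_memory_allocated()` only counts memory occupied by tensors, so the number can be well below what `nvidia-smi` shows for the process.

```python
import torch


def peak_memory_mb(fn):
    """Run fn() and return the peak CUDA memory allocated by tensors, in MB.

    Hypothetical helper for illustration. Returns None when no GPU is
    available, in which case fn() is still executed on the CPU.
    """
    if not torch.cuda.is_available():
        fn()
        return None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # assumes a recent PyTorch release
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2


device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Conv1d(512, 512, kernel_size=5, stride=1, padding=2).to(device)
data = torch.randn(1, 512, 107, device=device)

# Measure a single forward pass of a small layer.
peak = peak_memory_mb(lambda: net(data))
print("peak allocated (MB):", peak)
```

Running the same script under two PyTorch versions gives numbers that can be compared directly, which is easier to reason about than the total process footprint.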
Sorry, the comparison is between 0.2 and 0.4. Even with simple code like the snippet below, the difference in memory footprint is large:
0.4.0a0+02b758f leaves a 952 MB footprint, whereas 0.2 takes 369 MB…
import torch
net = torch.nn.Conv1d(512, 512, kernel_size=(5,), stride=1, padding=2).cuda()
data = torch.autograd.Variable(torch.FloatTensor(1, 512, 107)).cuda()
out = net(data)
What would be the best strategy to produce the small reproducing example? The model in question is a seq2seq with attention and I’m not clear what parts can be removed such that the code is simple but the OOM remains…
I think I have an explanation for the difference:
For every decoder step the model in question would compute a linear forward pass on a fixed input.
It could be that 0.2 cached this value, given that the input is fixed, whereas 0.4 recomputed it and added the result to the graph at every iteration.
I’m able to run the code without problems with 0.2 and 0.4 now that the linear transform on that fixed input is computed only once and reused during the decoder steps!
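For anyone landing here later, a minimal sketch of that change is shown below. It is not the original model: the layer sizes, the tensor shapes, and the `decode` function are made up for illustration, and the location layer is left out. The point is only that the projection of the fixed memory is computed once before the loop instead of once per decoder step, which avoids adding the same linear op to the graph at every step. Whether that alone accounts for the memory difference between versions is not established in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
memory_layer = nn.Linear(8, 4)
query_layer = nn.Linear(6, 4)
v = nn.Linear(4, 1)

memory = torch.randn(2, 5, 8)   # (batch, time, dim), fixed for all decoder steps
queries = torch.randn(3, 2, 6)  # one query of shape (batch, dim) per decoder step


def decode(hoist):
    """Run the decoder steps; return (alignments, number of memory_layer calls)."""
    calls = 0
    outputs = []
    processed_memory = None
    for query in queries:
        # With hoist=True the projection is computed on the first step only
        # and reused afterwards; with hoist=False it is recomputed every step.
        if processed_memory is None or not hoist:
            processed_memory = memory_layer(memory)
            calls += 1
        processed_query = query_layer(query).unsqueeze(1)  # (batch, 1, 4)
        alignment = v(torch.tanh(processed_query + processed_memory))
        outputs.append(alignment.squeeze(-1))              # (batch, time)
    return torch.stack(outputs), calls


reused, calls_reused = decode(hoist=True)
recomputed, calls_recomputed = decode(hoist=False)
print(calls_reused, calls_recomputed)
```

Both variants produce the same alignments, so the change should be safe whenever the memory really is constant across decoder steps.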
Thank you all.
Glad to know that it works now. Afaik, we don’t cache output values (maybe I’m wrong on this). It also doesn’t explain the memory usage difference in the conv1d example you gave above…