So, a fairly typical sequence in many PyTorch training/optimization or evaluation loops:
- Create and initialize a Tensor
- Transfer (maybe) to GPU
- Initialize a Variable with this Tensor
- Variable used from here on, Tensor left dangling
What’s the best practice in this case for optimal memory usage, especially when this may be done for multiple BxCxHxW tensors? To me it begs for a ‘del <tensor_name>’ after the Variable is initialized, but I’m not sure whether there are tricks under the hood that make this unnecessary, as I don’t see it done.
It’s often possible to wrap the tensor init within the variable creation but that’s not always conducive to clean code.
# two alternative initializations (torch.normal takes a tensor of means, not a size):
modifier = torch.normal(torch.zeros(input_var.size()), std=0.001)  # random init, or:
# modifier = torch.zeros(input_var.size()).float()                 # zero init
modifier = modifier.cuda()
modifier_var = autograd.Variable(modifier, requires_grad=True)
del modifier  # best to do this by default if modifier is never used again in the train/eval loop???
A Variable shares the same memory as its underlying Tensor, so there are no memory savings from deleting the Tensor afterwards. That being said, you can also replace the Tensor variable with a Variable containing the same data:
x = torch.rand(5)
x = Variable(x)
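Why this rebinding pattern costs nothing extra can be sketched in pure Python (a rough analogy, relying on CPython's reference counting; `Tensorish` is a hypothetical stand-in for a Tensor, and the tuple plays the role of the wrapping Variable): after the rebind, the wrapper is the only thing keeping the original object alive, and dropping the wrapper frees it immediately.

```python
import weakref

class Tensorish:
    """Hypothetical stand-in for a torch Tensor."""

t = Tensorish()
watch = weakref.ref(t)      # watch the original object's lifetime
t = ("Variable", t)         # rebind, like x = Variable(x): wrapper now holds the data
assert watch() is not None  # still alive, referenced only via the wrapper
t = None                    # drop the wrapper
assert watch() is None      # original object freed immediately (CPython refcounting)
```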
The only case where you might see some (small) savings is when you reach the end of the training loop: there you might want to delete all references to the input tensor.
for i, (x, y) in enumerate(train_loader):
    x = Variable(x)
    y = Variable(y)
    output = model(x)
    # ... compute loss and update ...
    del x, y, output
This ensures that you won’t have double the memory necessary for the inputs: the train_loader first allocates the memory for the next batch, and only then assigns it to x. But note that if your input tensor is relatively small, the savings from doing this are negligible and not worth it (you also need to make sure that all references to x are deleted, which is why I del output as well).
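The "double the memory" point can be sketched without torch at all (`Batch` and `loader` are hypothetical stand-ins for an input tensor and train_loader; the behaviour relies on CPython's reference counting): the loop variable keeps the previous batch alive while the loader builds the next one, and still pins the last batch after the loop until it is deleted.

```python
import weakref

class Batch:
    """Hypothetical stand-in for a batch tensor."""

refs = []

def loader(n=3):
    # stands in for train_loader: it allocates the next batch
    # before the loop variable is rebound to it
    for _ in range(n):
        b = Batch()
        refs.append(weakref.ref(b))
        yield b

for batch in loader():
    pass  # compute model and update

# after the loop, the name `batch` still pins the last batch in memory:
assert refs[-1]() is not None
del batch                  # analogous to `del x, y, output`
assert refs[-1]() is None  # freed immediately
```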
Thanks, I thought I’d read somewhere about the memory being shared, but the last time I ran a quick test it looked to be different. Looking back, I just realized that I didn’t use an in-place op in my quick test; after fixing that, I confirmed they are in fact sharing memory, and all is well with the universe.
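For anyone wanting to reproduce that sanity check without torch, the same shared-storage behaviour can be mimicked with a memoryview over a bytearray (a rough analogy, not the PyTorch mechanism): an in-place write through one name is visible through the other, because wrapping made no copy.

```python
buf = bytearray(b"aaaa")       # plays the role of the Tensor's storage
view = memoryview(buf)         # plays the role of the wrapping Variable (no copy)
buf[0] = ord("b")              # in-place modification through the original name
assert bytes(view) == b"baaa"  # visible through the wrapper: storage is shared
```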
I’ll keep the second case in mind: with models working on large images where the targets are also masks/images, a few hundred MB could tip the balance, though that is still small in comparison to the model parameters and gradient state.
How about the case where the tensor is created inside a function? If we exit the function, does that mean the data stored on the device (the GPU in this case) will be deleted?
for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    # do something
    result = ...
    del data, target  # is this step necessary?
foo() # first call
foo() # second call
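A tensor bound only to a function local loses its last reference when the function returns, and CPython frees the Python object right away; PyTorch's CUDA caching allocator then keeps the raw GPU memory cached for reuse, so nvidia-smi may still show it as occupied until torch.cuda.empty_cache() is called. A minimal pure-Python sketch of the scoping part (`DeviceTensor` is a hypothetical stand-in for a tensor moved to the GPU):

```python
import weakref

class DeviceTensor:
    """Hypothetical stand-in for a tensor moved to the GPU."""

refs = []

def foo():
    t = DeviceTensor()  # would be data.to(device) in the real code
    refs.append(weakref.ref(t))
    # no `del` needed: `t` is a local, released when foo returns

foo()  # first call
assert refs[0]() is None  # freed as soon as foo returned
foo()  # second call allocates a fresh object
assert refs[1]() is None
```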
I am still confused. People often create intermediate tensors as the results of operations; these hold different values than the tensors they came from. But sometimes we create them just to make the code easier to read, and they aren’t otherwise useful. How does PyTorch manage these temp tensors?
In fact, when do we ever have to call del manually?
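Rarely: CPython's reference counting frees a temporary tensor the moment its last reference disappears — for example when the name is rebound or goes out of scope — and PyTorch's allocator can then recycle that memory. A pure-Python sketch of the rebinding case (`T` is a hypothetical stand-in for a tensor):

```python
import weakref

class T:
    """Hypothetical stand-in for a tensor."""
    def __add__(self, other):
        return T()       # each operation allocates a new object

a, b = T(), T()
tmp = a + b              # intermediate result
watch = weakref.ref(tmp)
tmp = a + b              # rebinding drops the old intermediate...
assert watch() is None   # ...which is freed immediately (CPython refcounting)
```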