Hi,
I have read the related posts about best practices for usage on GPU, and this question is different.
I am tuning my batch size so that the largest batch uses around 90% of GPU memory. Thus, some batches will use close to 90%, while others (most of them) use around 50%.
I am noticing that if 2 large batches (both using close to 90%) are processed consecutively (i.e. one forward pass and one backward pass each), training crashes with an out-of-memory error. I suspect this is because the GPU is unable to clear the memory of the previous batch in time. However, if large batches are interspersed with smaller batches, it seems to work fine.
Is there a rule of thumb for GPU utilization in such cases, i.e. an optimal memory usage that works with the garbage collection?
Thanks for the pointer. I checked and as far as I can tell I’m not keeping a variable alive for more than it should be. Explicitly deleting variables does seem to help, but shouldn’t garbage collection immediately free dereferenced memory anyway?
A follow-up: does nvidia-smi accurately report memory usage? I managed to run my training session on a Tesla K20 (4G memory). When I run the same training (same data and batching) on a 10G GPU, nvidia-smi says I’m using 9G! Clearly, it can’t really need that much, because I just successfully ran it on a 4G GPU.
This is a sign of holding on to a variable for too long. Could you post your code?
No. PyTorch uses a cached GPU memory allocator. Read more here: CUDA semantics — PyTorch master documentation. Notice that some monitoring methods are not available in a public release yet (they exist on github master).
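To make the cached-allocator point concrete, here is a minimal sketch that compares the memory occupied by live tensors with the memory the allocator has reserved from the driver. It assumes a reasonably recent PyTorch release in which torch.cuda.memory_allocated and torch.cuda.memory_reserved are available; the helper name cuda_memory_report is made up for this post and is not a PyTorch API.

```python
try:
    import torch
except ImportError:  # lets the sketch load on a machine without PyTorch
    torch = None


def cuda_memory_report(device=0):
    """Return allocated vs. reserved GPU memory in MiB, or None without CUDA.

    cuda_memory_report is a hypothetical helper name, not part of PyTorch.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        # Memory currently occupied by live tensors.
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # Memory held by the caching allocator, including free cached blocks.
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
    }


if __name__ == "__main__":
    print(cuda_memory_report())
```

The reserved figure is usually the larger of the two and is closer to what nvidia-smi shows, since blocks that PyTorch has cached for reuse still count as used from the driver's point of view. Calling torch.cuda.empty_cache() hands unused cached blocks back to the driver, which lowers the nvidia-smi number but does not give PyTorch itself any extra room.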
Thanks again! I’m making changes on a fork of OpenNMT-py, so posting code might be difficult. I will try to make an example code to reproduce this. Also, I realize as I’m writing this that I’m working with an older version of OpenNMT-py so I will try updating as well.
Thanks for the pointer once again regarding GPU mem allocation.
Sounds to me like you might be accidentally using the following anti-pattern:
for input, answer in batches:
    optimizer.zero_grad()
    output = model(input)
    loss_var = loss_func(output, answer)
    loss_var.backward()
    optimizer.step()
    del loss_var  # missing this!
Without the del, loss_var is still holding a reference to gradients all the way into the second iteration of the loop, including during the calculation of loss_func. Python can’t garbage collect the old value until after the new one has been computed and assigned to loss_var!
If it’s not clear what I mean, try this:
loss_var = None
for input, answer in batches:
    optimizer.zero_grad()
    output = model(input)
    print(loss_var)  # still the previous iteration's loss at this point
    loss_var = loss_func(output, answer)
    loss_var.backward()
    optimizer.step()
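The same effect can be reproduced without a GPU. The sketch below is only an illustration: FakeLoss and run_loop are hypothetical stand-ins for the loss tensor and the training loop above, and it relies on CPython's reference counting, which frees an object as soon as the last reference to it disappears. It records whether the previous iteration's loss object is still alive at the moment the next one is about to be computed.

```python
import weakref


class FakeLoss:
    """Hypothetical stand-in for a loss tensor, used only to track lifetime."""

    def __init__(self, step):
        self.step = step


def run_loop(steps, delete_at_end):
    """Report, per iteration, whether the previous loss object is still alive."""
    previous_alive = []
    previous_ref = None
    for step in range(steps):
        # This is the point where loss_func(output, answer) would be running.
        if previous_ref is not None:
            previous_alive.append(previous_ref() is not None)
        loss_var = FakeLoss(step)
        # A weak reference observes the object without keeping it alive.
        previous_ref = weakref.ref(loss_var)
        if delete_at_end:
            del loss_var  # drops the last strong reference right away
    return previous_alive


if __name__ == "__main__":
    print(run_loop(3, delete_at_end=False))  # [True, True]
    print(run_loop(3, delete_at_end=True))   # [False, False]
```

So garbage collection does free the object immediately, but only once nothing refers to it any more. Without the del, the name loss_var itself is that remaining reference until it gets rebound to the new loss.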