Best Practices for Maximum GPU Utilization

I have read the related posts about best practices for usage on GPU, and this question is different.

I am tuning my batch size so that the largest batch uses around 90% of GPU memory. Thus, some batches use close to 90%, while others (most of them) use around 50%.

I am noticing that if 2 large batches (both using close to 90%) are processed consecutively (i.e. one forward prop and one back prop each), training crashes with an out-of-memory error. I suspect this is because the GPU is unable to clear the memory of the previous batch in time. However, if large batches are interspersed with smaller batches, it seems to work fine.

Is there a rule of thumb for GPU utilization in such cases, i.e. an optimal memory usage that works with the garbage collection?


You are probably holding a reference to some Variable from the previous iteration. This can prevent the graph from being freed properly in some cases. See

Thanks for the pointer. I checked and as far as I can tell I’m not keeping a variable alive for longer than it should be. Explicitly deleting variables does seem to help, but shouldn’t garbage collection immediately free dereferenced memory anyway?

A follow-up: does nvidia-smi accurately report memory usage? I managed to run my training session on a Tesla K20 (4G memory). When I run the same training (same data and batching) on a 10G GPU, nvidia-smi says I’m using 9G! Clearly, this can’t be right, because I just successfully ran it on a 4G GPU.

This is a sign of holding on to a variable for too long. Could you post your code?

No. PyTorch uses a caching GPU memory allocator, so nvidia-smi shows the memory PyTorch has reserved, not just the memory occupied by live tensors. Read more here: CUDA semantics — PyTorch master documentation. Notice that some monitoring methods are not available in a public release yet (they exist on github master).
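To see the gap between what tensors occupy and what the allocator has cached, something like the following may help. It is only a sketch: it assumes a recent PyTorch release in which torch.cuda.memory_allocated and torch.cuda.memory_reserved are available, and the helper name gpu_memory_report is made up for illustration.

```python
def gpu_memory_report():
    """Return allocator statistics in MiB, or None when no CUDA device is usable.

    Hypothetical helper, not part of PyTorch.
    """
    try:
        import torch  # imported here so the sketch degrades gracefully without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        # Memory currently occupied by live tensors.
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        # Memory held by the caching allocator, including freed blocks kept for reuse.
        "reserved_mib": torch.cuda.memory_reserved() / mib,
    }


print(gpu_memory_report())
```

The figure nvidia-smi shows should be closer to the reserved number than to the allocated one. Calling torch.cuda.empty_cache() returns unused cached blocks to the driver, which lowers what nvidia-smi reports but does not free memory that live tensors still occupy.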

Thanks again! I’m making changes on a fork of OpenNMT-py, so posting code might be difficult. I will try to make an example code to reproduce this. Also, I realize as I’m writing this that I’m working with an older version of OpenNMT-py so I will try updating as well.

Thanks for the pointer once again regarding GPU mem allocation.

Sounds to me like you might be accidentally using the following anti-pattern:

for input, answer in batches:
    output = model(input)
    loss_var = loss_func(output, answer)
    del loss_var # missing this!

Without the del, loss_var still holds a reference to the previous iteration's result (and the graph behind it) all the way through the second iteration of the loop, including during the calculation of loss_func. Python can’t garbage collect the old value until after the new one has been computed and assigned to loss_var!

If it’s not clear what I mean, try this:

loss_var = None
for input, answer in batches:
    output = model(input)
    print(loss_var)  # after the first iteration, this is still the previous loss
    loss_var = loss_func(output, answer)
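The same lifetime effect can be checked without a GPU. The sketch below uses only the standard library and relies on CPython's reference counting; FakeLoss and run are hypothetical names standing in for a real loss tensor and training loop.

```python
import weakref


class FakeLoss:
    """Hypothetical stand-in for a loss tensor that keeps its graph alive."""

    def __init__(self, step):
        self.step = step


def run(num_steps, delete_early):
    """Return, per step, whether the previous loss was still alive mid-step."""
    still_alive = []
    previous = None  # weak reference to the loss created in the previous step
    loss_var = None
    for step in range(num_steps):
        # This is where the forward pass for the new batch would run.
        if previous is not None:
            still_alive.append(previous() is not None)
        loss_var = FakeLoss(step)  # the old object is released only after this assignment
        previous = weakref.ref(loss_var)
        if delete_early:
            del loss_var  # drop the only strong reference before the next step
    return still_alive


print(run(3, delete_early=False))  # [True, True]
print(run(3, delete_early=True))   # [False, False]
```

Without the del, the previous loss is still alive while the next batch is being processed, so for a moment two batches' worth of memory are held at once.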

Good luck!


Can this be caused by storing a snapshot of the weights, e.g. best_weights = copy.deepcopy(model.state_dict())?