Best Practices for Maximum GPU Utilization

Hi,
I have read the related posts about best practices for usage on GPU, and this question is different.

I am tuning my batch size so that the largest batch uses around 90% of GPU memory. Thus, some batches use close to 90%, while others (most of them) use around 50%.

I am noticing that if two large batches (both using close to 90%) are processed consecutively (i.e. one forward prop and backward prop each), it crashes with an out-of-memory error. I suspect this is because the GPU is unable to clear the memory of the previous batch. However, if large batches are interspersed with smaller batches, it seems to work fine.

Is there a rule of thumb for GPU utilization in such cases, i.e. an optimal memory usage that plays well with garbage collection?

You are probably holding a reference to some Variable from the previous iteration. In some cases this can prevent the graph from being freed properly. See http://pytorch.org/docs/master/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
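
For example, here is a minimal, self-contained sketch (toy model and data, not the code from this thread) of the accumulation pattern that FAQ entry warns about:

import torch
import torch.nn as nn

# Toy setup just so the sketch runs on its own.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.MSELoss()

total_loss = 0
for step in range(1000):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_func(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # Anti-pattern: accumulating the loss tensor itself keeps each
    # iteration's autograd history reachable, so memory usage grows
    # over time (on the GPU this eventually shows up as OOM).
    total_loss += loss
    # Fix: keep only the Python number, e.g.
    # total_loss += float(loss)   # or loss.item() on newer versions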

Thanks for the pointer. I checked, and as far as I can tell I’m not keeping a variable alive for longer than it should be. Explicitly deleting variables does seem to help, but shouldn’t garbage collection free memory as soon as it is no longer referenced anyway?

A follow-up question: does nvidia-smi accurately report memory usage? I managed to run my training session on a Tesla K20 (4 GB memory). When I run the same training (same data and batching) on a 10 GB GPU, nvidia-smi says I’m using 9 GB! Clearly that can’t be right, because I just successfully ran it on a 4 GB GPU.

This is a sign of holding on to a variable for too long. Could you post your code?

No. PyTorch uses a caching GPU memory allocator, so nvidia-smi reports the memory reserved by the allocator (plus CUDA context overhead) rather than the memory actually occupied by live tensors. Read more here: CUDA semantics — PyTorch master documentation. Note that some monitoring methods are not available in a public release yet (they exist on GitHub master).
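
For what it’s worth, here is a small sketch of how the allocator’s own counters compare to what nvidia-smi shows, assuming a recent PyTorch build that exposes torch.cuda.memory_allocated() and torch.cuda.memory_reserved() (the latter was called memory_cached() in older versions):

import torch

# Requires a CUDA device. nvidia-smi roughly tracks memory_reserved()
# (blocks held by the caching allocator) plus CUDA context overhead,
# which can be much larger than memory_allocated() (memory actually
# used by live tensors).
x = torch.randn(1024, 1024, device="cuda")
del x  # the tensor is gone, but the allocator keeps the block cached

print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MiB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**2, "MiB")

# Return cached blocks to the driver so nvidia-smi drops as well
# (normally unnecessary; mostly useful for debugging or interop).
torch.cuda.empty_cache()
print("reserved after empty_cache:", torch.cuda.memory_reserved() / 1024**2, "MiB")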

Thanks again! I’m making changes on a fork of OpenNMT-py, so posting code might be difficult. I will try to put together example code that reproduces this. Also, I realize as I’m writing this that I’m working with an older version of OpenNMT-py, so I will try updating as well.

Thanks once again for the pointer regarding GPU memory allocation.

Sounds to me like you might be accidentally using the following anti-pattern:

for input, answer in batches:
    optimizer.zero_grad()
    output = model(input)
    loss_var = loss_func(output, answer)
    loss_var.backward()
    optimizer.step()
    del loss_var  # the anti-pattern is missing this line!

Without the del, loss_var is still holding a reference to the gradients all the way through the second iteration of the loop, including during the calculation of loss_func. Python can’t garbage-collect the old value until after the new one has been computed and assigned to loss_var!

If it’s not clear what I mean, try this:

loss_var = None
for input, answer in batches:
    optimizer.zero_grad()
    output = model(input)
    print(loss_var)  # non-None from the second iteration on: the old value is still alive here
    loss_var = loss_func(output, answer)
    loss_var.backward()
    optimizer.step()
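
If you do want to keep the loss value around between iterations (for logging, say), one option is to hold only a plain Python number and drop the Variable references at the end of each step. Here is a minimal sketch, assuming a PyTorch version that has Tensor.item() (on older releases you’d use loss_var.data[0] instead):

running_loss = 0.0
for input, answer in batches:
    optimizer.zero_grad()
    output = model(input)
    loss_var = loss_func(output, answer)
    loss_var.backward()
    optimizer.step()
    # Keep only a plain Python float for logging; accumulating loss_var
    # itself would keep each iteration's graph reachable.
    running_loss += loss_var.item()
    # Drop the big references now instead of holding them through the
    # next iteration's forward pass.
    del output, loss_var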

Good luck!

Can this be caused by storing a snapshot of the weights, e.g. best_weights = copy.deepcopy(model.state_dict())?