I am trying to pass some inputs through a network. I cannot simply call net(input_), because I cannot push all the inputs to the GPU at once due to memory limits. I therefore wrote a function that passes the inputs in batches:
import torch

def batch_pass(net, input_, batch, device):
    print('cuda memory before batch pass', torch.cuda.memory_allocated(device=device))
    # one extra batch if len(input_) is not a multiple of batch
    l = 0 if len(input_) % batch == 0 else 1
    r = []
    for i in range(len(input_) // batch + l):
        in_ = input_[i * batch:(i + 1) * batch].to(device)
        r += [net(in_)]
        del in_
        torch.cuda.empty_cache()
        print('cuda memory during batch pass', torch.cuda.memory_allocated(device=device))
    print('cuda memory after batch pass', torch.cuda.memory_allocated(device=device))
    return torch.vstack(r)
However, the allocated CUDA memory keeps growing, and calling del in_ and torch.cuda.empty_cache() seems to have no effect.
The output is:
cuda memory before batch pass 26404352
cuda memory during batch pass 65019904
cuda memory during batch pass 103635456
cuda memory during batch pass 142251008
cuda memory during batch pass 180866560
...
cuda memory during batch pass 13232923136
cuda memory during batch pass 13271538688
cuda memory during batch pass 13310154240
cuda memory during batch pass 13348769792
cuda memory during batch pass 13387385344
And the function cannot finish, failing with this error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 12.48 GiB already allocated; 9.44 MiB free; 15.05 GiB reserved in total by PyTorch)
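
As far as I understand, torch.cuda.empty_cache() only returns cached blocks that no live tensor references anymore back to the driver; it cannot free memory that something (here, presumably the results accumulated in r) still holds on to. A minimal check of that behaviour (the tensor x is just for illustration):

import torch

x = torch.zeros(1024, 1024, device='cuda')  # ~4 MiB, still referenced
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())        # unchanged: x is still alive

del x                                       # drop the last reference
print(torch.cuda.memory_allocated())        # drops by the size of x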
How can I pass input_ through the network without using too much CUDA memory?
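
For reference, this is the variant I am considering: a minimal sketch assuming the outputs are only needed for inference (so no autograd graph has to be kept) and that they can live on the CPU. batch_pass_inference is just my working name for it.

def batch_pass_inference(net, input_, batch, device):
    l = 0 if len(input_) % batch == 0 else 1
    r = []
    with torch.no_grad():  # do not build an autograd graph for the outputs
        for i in range(len(input_) // batch + l):
            in_ = input_[i * batch:(i + 1) * batch].to(device)
            # move each result to the CPU so the GPU only ever holds
            # one batch of inputs and activations at a time
            r += [net(in_).cpu()]
    return torch.vstack(r)

I suspect the torch.no_grad() part matters most, since without it every output stored in r keeps its whole graph of intermediate activations alive on the GPU. Is this the right approach?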