I have a model that operates on variable-length inputs. At inference time I have to split the data into fixed-size chunks, otherwise I get a CUDA OOM error. I set the chunk size as large as possible, so as to fill my GPU memory. The code goes like this:
(...)
model.eval()
# Forward chunks
chunks = extract_chunks(input)  # Returns numpy arrays
for chunk in chunks:
    x, y_true = collate([chunk])  # Collate a single chunk into a batch, returns CPU tensors
    x, y_true = x.cuda(), y_true.cuda()
    y_pred = model(x)
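(For context, extract_chunks and collate are roughly along these lines. This is a simplified sketch just to show the shapes involved, not the exact implementation; the chunk size and the feature/target split are purely illustrative:)

import numpy as np
import torch

CHUNK_SIZE = 4096  # illustrative value, tuned so one chunk (almost) fills the GPU

def extract_chunks(data):
    # Slice the variable-length input array into fixed-size chunks
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def collate(chunks):
    # Stack a list of chunks into a batch of CPU tensors;
    # treating the last column as the target is only for illustration
    batch = torch.from_numpy(np.stack(chunks)).float()
    return batch[..., :-1], batch[..., -1]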
I monitor my GPU memory usage (with nvidia-smi) while running this code.
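(I assume the same numbers can also be read from inside the script with PyTorch's own memory counters, e.g. by printing inside the loop as below; as far as I know, nvidia-smi additionally counts the CUDA context and the caching allocator's reserved-but-unused memory:)

torch.cuda.reset_peak_memory_stats()
for chunk in chunks:
    x, y_true = collate([chunk])
    x, y_true = x.cuda(), y_true.cuda()
    y_pred = model(x)
    # Memory currently held by tensors / peak since the reset, in MiB
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"peak: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")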
I noticed that if I add the lines
# Free the memory
del x, y_pred, y_true
at the end of the loop body, memory usage is halved and I can effectively double my chunk size. There also seems to be no overhead from doing this. What is going on here?
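For reference, the modified loop looks like this:

for chunk in chunks:
    x, y_true = collate([chunk])
    x, y_true = x.cuda(), y_true.cuda()
    y_pred = model(x)
    # Free the memory
    del x, y_pred, y_true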