Forward pass in chunks

Hi,

I have a large dataset that needs a full forward pass, and I cannot take advantage of mini-batches due to the characteristics of my evaluation metric.
Since the whole dataset does not fit into GPU memory, my first thought was to split it into chunks, compute partial outputs of my model, and concatenate them after transferring them to the CPU.

Basically, this is what I have done so far:

import torch
from torch.utils.data import DataLoader

chunks = DataLoader(data, batch_size=batch_size)

model = model.to(device)

output = []
for batch_idx, batch in enumerate(chunks):
    batch = batch.to(device)
    this_out = model(batch)
    output.append(this_out.cpu())  # move the partial output to the CPU

output = torch.cat(output, dim=0)

But I found that after the transfer from GPU to CPU, the whole computational graph for each chunk is kept and stays in GPU memory, so eventually I run into an OOM error. I'm not really sure why this happens even though I moved 'this_out' to the CPU.
This is quite a large overhead, since the graph should have the same structure for every chunk. Does anyone have a good solution for this kind of situation?

This is expected, since the to() operation is differentiable when tensors are moved between devices, and all intermediate tensors created in the computation graph will stay on their original device. The backward pass would thus move the gradient back to the original device.
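
As a minimal sketch of this behavior (using a hypothetical small model, not your actual setup), you can check that the output still carries a grad_fn after .cpu(), i.e. it is still attached to the graph whose intermediate activations live on the GPU:

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()          # hypothetical small model
x = torch.randn(4, 10, device="cuda")

out = model(x).cpu()                     # the device copy is recorded by autograd
print(out.grad_fn)                       # prints a *Backward grad_fn, so the graph is still alive
# The intermediates referenced by this graph stay in GPU memory
# until `out` (and with it the graph) is freed.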

If you only want to collect the outputs for some evaluation metric, you could execute the forward pass in a torch.no_grad() context, which will not store the intermediate tensors, and then torch.cat/stack the outputs.
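
A minimal sketch of that approach, reusing the names from your snippet (model, chunks, and device are assumed to be defined as above):

model.eval()                    # also disables dropout/batchnorm updates during evaluation
output = []
with torch.no_grad():           # no graph is built, so no intermediates are kept on the GPU
    for batch in chunks:
        batch = batch.to(device)
        output.append(model(batch).cpu())

output = torch.cat(output, dim=0)

Since no autograd graph is created inside the no_grad() block, only the current batch and its activations occupy GPU memory at any time, and the memory is freed after each iteration.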