Hi,
I have a large dataset that must be passed through the model in a single forward pass, and I cannot take advantage of mini-batches due to the nature of my evaluation metric.
Since the whole dataset does not fit into GPU memory, my first thought was to split it into chunks, compute a partial output of the model for each chunk, and concatenate the partial outputs after transferring them to the CPU.
Basically, this is what I have done:
import torch
from torch.utils.data import DataLoader

chunks = DataLoader(data, batch_size=batch_size)   # split the dataset into chunks
model = model.to(device)

output = []
for batch in chunks:
    batch = batch.to(device)            # move one chunk to the GPU
    this_out = model(batch)             # partial forward pass
    output.append(this_out.cpu())       # move the partial output back to the CPU
output = torch.cat(output, dim=0)       # concatenate along the batch dimension
But I found that a separate computational graph is built for each chunk and stays in GPU memory even after the output has been transferred to the CPU, so eventually I run into an OOM error. I'm not really sure why this happens even though I moved 'this_out' to the CPU.
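To illustrate what I mean, here is a minimal, self-contained check (the nn.Linear model and the random batch are just stand-ins for my actual model and data; I assume gradients are enabled, which is the default):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 2).to(device)          # stand-in for the real model
batch = torch.randn(4, 8, device=device)    # stand-in for one chunk

this_out = model(batch)
print(this_out.requires_grad)   # True: autograd recorded the forward pass
print(this_out.grad_fn)         # graph node referencing GPU activations

cpu_out = this_out.cpu()
print(cpu_out.requires_grad)    # still True: .cpu() copies the data but does not detach it
print(cpu_out.grad_fn)          # still set, so the GPU-side graph is kept alive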
This seems like a large overhead, given that the graph structure should be the same for every chunk. Does anyone have a good solution for this kind of situation?