Multiple forward passes, tuple mining

Hello,
I want to perform multiple forward passes of full batches on a GPU, where each batch uses the complete VRAM. I want to store the outputs temporarily and perform tuple mining on the concatenated batches.
This is my approach, but I run out of VRAM:

stored_embeddings = []
stored_labels = []
for i, (inputs, labels) in enumerate(loader):
    # Perform the forward pass
    outputs = model(inputs)
    # Mining on CPU if required
    if cpu_mining:
        outputs_cpu = outputs.cpu()
        del outputs
        torch.cuda.empty_cache()
        if (i + 1) % cpu_mining == 0:
            # concat embeddings and pass to miner, then compute loss based on them
            outputs = torch.cat([torch.cat(stored_embeddings), outputs_cpu])
            labels = torch.cat([torch.cat(stored_labels), labels])
        else:
            # fetch more data before mining
            stored_embeddings.append(outputs_cpu)
            stored_labels.append(labels)
            continue

What is the correct way of doing this?
Any help is appreciated.

You are keeping the computation graph, with its intermediate tensors on the GPU, alive via outputs_cpu.
Moving outputs to the CPU and deleting outputs will not free the computation graph, since outputs_cpu is still attached to it, as seen here:

import torch
from torchvision import models

model = models.resnet50().cuda()
inputs = torch.randn(8, 3, 224, 224).cuda()

print(torch.cuda.memory_allocated()/1024**2)
# 102.32177734375

outputs = model(inputs)
print(torch.cuda.memory_allocated()/1024**2)
# 759.55859375

outputs_cpu = outputs.cpu()
del outputs # this does not delete the intermediates, as outputs_cpu still holds a reference to them
print(torch.cuda.memory_allocated()/1024**2)
# 759.52783203125 !!!

# here you are concatenating the outputs_cpu tensor, which still carries the entire
# computation graph with its intermediates on the GPU
outputs = torch.cat([torch.cat(stored_embeddings), outputs_cpu])

.detach() the outputs before moving them to the CPU, and del outputs will then properly free the graph:

outputs_cpu = outputs.detach().cpu()
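
A minimal sketch continuing the snippet above (same model and inputs); the exact numbers depend on the model and setup, but the allocated memory should drop once outputs is deleted:

outputs = model(inputs)
outputs_cpu = outputs.detach().cpu()  # CPU copy without a reference to the graph
del outputs                           # the intermediates can now be freed
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() / 1024**2)
# drops back close to the initial allocation (exact value depends on the setup)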

Thanks, you are a gem in this forum. Just to be sure: this won’t break later backpropagation? The docs say: “The result will never require gradient.” But I guess this does not mean that it removes the old ones?

Detaching the output will break the computation graph and you will not be able to call .backward() on outputs_cpu or on any loss calculated from it.
If you need to call backward and calculate the gradients through outputs_cpu or any tensor created from it, you would need to keep the computation graph alive, but then you would also store a graph in each iteration, thus increasing the GPU memory usage.
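
A minimal sketch of the first point, using a small linear model as a stand-in: calling backward on a loss built only from a detached output raises an error, since autograd has no graph to traverse:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

out = model(x)
out_detached = out.detach()

loss = out_detached.sum()
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn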

Then there is no option to keep the gradients and free the GPU?

I’m not sure we are talking about the same issue.
Are you referring to already calculated gradients, which are stored in the .grad attributes of the parameters, by “keep the gradients”? If so, these .grad attributes will not be changed in any way.
Detaching the output will cut it from the computation graph, which stores the intermediate forward activations (not gradients) that are needed to calculate the gradients in the backward call.
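
A minimal sketch of the distinction, again with a small linear model as a stand-in: .grad attributes that were already populated are untouched by detaching a new output:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

model(x).sum().backward()                 # populates model.weight.grad
grad_before = model.weight.grad.clone()

out_detached = model(x).detach()          # detach a new output
print(torch.equal(model.weight.grad, grad_before))
# True - detaching does not modify the stored gradients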

Sorry for being imprecise, I mean the forward activations needed to calculate the gradients in the backward call.

Ah OK. Yes, you will not be able to delete the forward activations and still calculate the gradients, since they are needed during backpropagation.
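
For the original cross-batch mining goal, a sketch of one common compromise (not discussed above): keep the graph only for the current batch and treat the stored, detached embeddings as a memory bank, so gradients flow through the current batch only. Here miner, loss_fn and optimizer are placeholders for whatever mining/loss/optimizer setup is used in the original loop, not a specific library API:

stored_embeddings, stored_labels = [], []

for inputs, labels in loader:
    inputs, labels = inputs.cuda(), labels.cuda()
    outputs = model(inputs)                           # graph is kept for this batch only

    if stored_embeddings:
        ref_emb = torch.cat(stored_embeddings).cuda()       # detached, no graph attached
        ref_labels = torch.cat(stored_labels).cuda()
        all_emb = torch.cat([outputs, ref_emb])
        all_labels = torch.cat([labels, ref_labels])
    else:
        all_emb, all_labels = outputs, labels

    tuples = miner(all_emb, all_labels)               # mine across current + stored batches
    loss = loss_fn(all_emb, all_labels, tuples)
    loss.backward()                                   # gradients reach the current batch only
    optimizer.step()
    optimizer.zero_grad()

    # store without the graph; note the stored embeddings become slightly stale,
    # and in practice the lists would be capped to bound memory
    stored_embeddings.append(outputs.detach().cpu())
    stored_labels.append(labels.cpu())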