Multiple forward passes, tuple mining

Hello,
I want to perform multiple forward passes of full batches on a GPU, where each batch uses the complete VRAM. I want to store the outputs temporarily and perform tuple mining on the concatenated batches.
This is my approach, but I run out of VRAM:

stored_embeddings = []
stored_labels = []
for i, (inputs, labels) in enumerate(loader):
    # Perform the forward pass
    outputs = model(inputs)
    # Mining on CPU if required
    if cpu_mining:
        outputs_cpu = outputs.cpu()
        del outputs
        torch.cuda.empty_cache()
        if (i + 1) % cpu_mining == 0:
            # concat embeddings and pass to miner, then compute loss based on them
            outputs = torch.cat([torch.cat(stored_embeddings), outputs_cpu])
            labels = torch.cat([torch.cat(stored_labels), labels])
        else:
            # fetch more data before mining
            stored_embeddings.append(outputs_cpu)
            stored_labels.append(labels)
            continue

What is the correct way of doing this?
Any help is appreciated.

You are keeping the computation graph, with its intermediate tensors on the GPU, alive via outputs_cpu.
Moving outputs to the CPU and deleting outputs will not free the computation graph, since outputs_cpu is still attached to it, as seen here:

import torch
from torchvision import models

model = models.resnet50().cuda()
inputs = torch.randn(8, 3, 224, 224).cuda()

print(torch.cuda.memory_allocated()/1024**2)
# 102.32177734375

outputs = model(inputs)
print(torch.cuda.memory_allocated()/1024**2)
# 759.55859375

outputs_cpu = outputs.cpu()
del outputs # this does not delete the intermediates, as outputs_cpu still holds a reference to them
print(torch.cuda.memory_allocated()/1024**2)
# 759.52783203125 !!!

# here you are concatenating the outputs_cpu tensor, which still carries the entire
# computation graph with its intermediates on the GPU
outputs = torch.cat([torch.cat(stored_embeddings), outputs_cpu])

.detach() the outputs before moving them to the CPU, and del outputs will then properly free the graph:

outputs_cpu = outputs.detach().cpu()
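
A minimal sketch continuing the snippet above (same model and inputs); the exact numbers depend on the model and setup, but the allocated memory should drop once outputs is deleted:

outputs = model(inputs)
outputs_cpu = outputs.detach().cpu()  # CPU copy without a reference to the graph
del outputs                           # the intermediates can now be freed
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() / 1024**2)
# drops back close to the initial allocation (exact value depends on the setup)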

Thanks, you are a gem in this forum. Just to be sure: this won’t break later backpropagation? The docs say: “The result will never require gradient.” But I guess this does not mean that it removes the old ones?

Detaching the output will break the computation graph and you will not be able to call .backward() on outputs_cpu or on any loss calculated from it.
If you need to call backward and calculate the gradients through outputs_cpu or any tensor created from it, you would need to keep the computation graph alive, but then you would also store a graph in each iteration, thus increasing the GPU memory usage.
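
A minimal sketch of the first point, using a small linear model as a stand-in: calling backward on a loss built only from a detached output raises an error, since autograd has no graph to traverse:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

out = model(x)
out_detached = out.detach()

loss = out_detached.sum()
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn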

Then there is no option to keep the gradients and free the GPU?

I’m not sure we are talking about the same issue.
Are you referring to already calculated gradients, which are stored in the .grad attributes of the parameters, by “keep the gradients”? If so, these .grad attributes will not be changed in any way.
Detaching the output will cut it from the computation graph, which stores the intermediate forward activations (not gradients) that are needed to calculate the gradients in the backward call.
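
A minimal sketch of the distinction, again with a small linear model as a stand-in: .grad attributes that were already populated are untouched by detaching a new output:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

model(x).sum().backward()                 # populates model.weight.grad
grad_before = model.weight.grad.clone()

out_detached = model(x).detach()          # detach a new output
print(torch.equal(model.weight.grad, grad_before))
# True - detaching does not modify the stored gradients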

Sorry for being imprecise, I mean the forward activations needed to calculate the gradients in the backward call.

Ah OK. Yes, you will not be able to delete the forward activations and still calculate the gradients, since they are needed during backpropagation.
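
For the original cross-batch mining goal, a sketch of one common compromise (not discussed above): keep the graph only for the current batch and treat the stored, detached embeddings as a memory bank, so gradients flow through the current batch only. Here miner, loss_fn and optimizer are placeholders for whatever mining/loss/optimizer setup is used in the original loop, not a specific library API:

stored_embeddings, stored_labels = [], []

for inputs, labels in loader:
    inputs, labels = inputs.cuda(), labels.cuda()
    outputs = model(inputs)                           # graph is kept for this batch only

    if stored_embeddings:
        ref_emb = torch.cat(stored_embeddings).cuda()       # detached, no graph attached
        ref_labels = torch.cat(stored_labels).cuda()
        all_emb = torch.cat([outputs, ref_emb])
        all_labels = torch.cat([labels, ref_labels])
    else:
        all_emb, all_labels = outputs, labels

    tuples = miner(all_emb, all_labels)               # mine across current + stored batches
    loss = loss_fn(all_emb, all_labels, tuples)
    loss.backward()                                   # gradients reach the current batch only
    optimizer.step()
    optimizer.zero_grad()

    # store without the graph; note the stored embeddings become slightly stale,
    # and in practice the lists would be capped to bound memory
    stored_embeddings.append(outputs.detach().cpu())
    stored_labels.append(labels.cpu())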