PyTorch Computation Graph not getting freed

Hi,

I’m training a neural network in PyTorch and am running into a “CUDA out of memory” error. The cause seems to be that the computation graph created by PyTorch is not being freed after optimizer.step() and before the loss for the next batch is calculated. Here are the details:

PyTorch version: 0.4.0 (stable)
GPU: NVIDIA 1080Ti
CUDA Version: 9.0

Model Details:
Model contains two parameters (embeddings) of shapes as follows:
Embedding Param1 = 704990 x dim_size
Embedding Param2 = 957760 x dim_size
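
For reference, a quick back-of-the-envelope check of the parameter memory at dim_size=300, assuming float32 weights (my own estimate, not a measurement):

    dim_size = 300
    bytes_per_float = 4
    param1_mb = 704990 * dim_size * bytes_per_float / 2**20   # ~807 MB
    param2_mb = 957760 * dim_size * bytes_per_float / 2**20   # ~1096 MB
    print(param1_mb + param2_mb)   # ~1903 MB, roughly the MODEL + CUDA figure
                                   # below (2490 MB) minus the 584 MB CUDA context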

Memory details for dim_size=50
CUDA Initialization: 584 MB / 11172 MB
MODEL + CUDA: 904 MB / 11172 MB
FORWARD PASS: 1010 MB / 11172 MB
BACKWARD PASS: 1986 MB / 11172 MB
OPTIMIZER STEP: 2354 MB / 11172 MB
After del loss, del batch: 1962 MB / 11172 MB

Memory details for dim_size=300
CUDA Initialization: 584 MB / 11172 MB
MODEL + CUDA: 2490 MB / 11172 MB
FORWARD PASS: 2514 MB / 11172 MB
BACKWARD PASS: 8232 MB / 11172 MB
OPTIMIZER STEP: 10428 MB / 11172 MB
After del loss, del batch: 8228 MB / 11172 MB
FORWARD PASS on second batch: CUDA OOM

The problem in the dim_size=300 case is that as soon as the second batch is loaded and run through the model, it goes out of memory, since 8228 MB plus the roughly 5 GB needed for the pass exceeds the GPU RAM.
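
(The figures above look like overall GPU usage. To check whether the memory is really still held by live tensors, rather than just cached by PyTorch’s allocator, a small helper along these lines can be called after each stage; torch.cuda.memory_allocated() and torch.cuda.memory_cached() should be available in 0.4.0, but treat this as a sketch:)

    import torch

    def report(tag):
        # allocated = memory held by live tensors; cached = memory the caching
        # allocator keeps for reuse (this is what empty_cache() releases)
        alloc = torch.cuda.memory_allocated() / 2**20
        cached = torch.cuda.memory_cached() / 2**20
        print('{}: allocated {:.0f} MB, cached {:.0f} MB'.format(tag, alloc, cached))

    report('after optimizer.step()')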

Currently, after loss.backward() and optimizer.step(), I execute the following operations to free up memory:
del loss
del input_batch
torch.cuda.empty_cache()

Is there anything that can be done to make sure the computation graph is completely deleted, so that before the second batch loads, the memory occupied by PyTorch on the GPU is the same as it was just before the first forward pass?

Could you post your code that’s causing this?

# text_input = batch_size x 2000
# entity_input = batch_size x 31 (1 pos, 30 neg samples)
# entity_embedding.shape[1] = 300
# linear W = 300x300, linear Bias = 300
def forward(self, text_input, entity_input):
    sentence_embedding = self.linear(F.normalize(torch.sum(self.word_embedding(text_input), dim=1), dim=-1))
    denominator = logsumexp(torch.sum(self.entity_embedding(entity_input) * sentence_embedding.unsqueeze(1), 2), 1)
    numerator = torch.sum(self.entity_embedding(entity_input)[:, 0, :] * sentence_embedding, 1)
    return torch.sum(denominator - numerator)
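
(logsumexp isn’t defined above; presumably it’s a numerically stable helper roughly along these lines, shown here only for completeness:)

    def logsumexp(x, dim):
        # subtract the per-row max before exponentiating to avoid overflow
        max_x, _ = torch.max(x, dim=dim, keepdim=True)
        return (max_x + torch.log(torch.sum(torch.exp(x - max_x), dim=dim, keepdim=True))).squeeze(dim)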

Training loop

words = torch.from_numpy(word_file.root.data[start:start+batch_size, :].astype(int))
entities = torch.from_numpy(entity_file.root.data[start:start+batch_size, :].astype(int))
optimizer.zero_grad()
words.cuda()
entities.cuda()

loss = model(words, entities)
loss.backward(retain_graph=False)
optimizer.step()

del loss
del words
del entities
torch.cuda.empty_cache()
batchno = batchno + 1
start = start + batch_size

Your training loop looks a bit strange: you are not assigning words and entities back, so they should still be on the CPU.
Could you check the device of these tensors?
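
Just to be explicit, tensor.cuda() is not an in-place operation; it returns a new tensor on the GPU, so the result has to be assigned back, e.g. (sketch):

    words = words.cuda()        # .cuda() returns a GPU copy; the original stays on the CPU
    entities = entities.cuda()  # without the reassignment the model would see CPU tensors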

The input tensors are on the same GPU that I’m running the model on.

Any advice on how I can definitely make sure that the graph is deleted? torch.cuda.empty_cache() isn’t working in this case.

As long as some object holds a reference to the graph, it cannot be freed.
Make sure you are not storing tensors that are attached to the graph, e.g. loss, in a list or similar.
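
For example, a logging pattern like the commented-out line below keeps every batch’s graph alive, while the other stores only a Python float (losses is a hypothetical list, just to illustrate):

    losses = []
    # losses.append(loss)        # holds the loss tensor -> the whole graph stays in memory
    losses.append(loss.item())   # detached Python float; the graph can be freed after backward()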