I’m training a neural network in PyTorch and am running into a “CUDA out of memory” error. The reason seems to be that the computation graph created by PyTorch is not freed after optimizer.step() and before the loss for the next batch is calculated. Here are the details:
PyTorch version: 0.4.0 (stable)
GPU: NVIDIA 1080Ti
CUDA Version: 9.0
Model Details:
Model contains two parameters (embeddings) of shapes as follows:
Embedding Param1 = 704990 x dim_size
Embedding Param2 = 957760 x dim_size
Memory details for dim_size=50
CUDA Initialization: 584 MB / 11172 MB
MODEL + CUDA: 904 MB / 11172 MB
FORWARD PASS: 1010 MB / 11172 MB
BACKWARD PASS: 1986 MB / 11172 MB
OPTIMIZER STEP: 2354 MB / 11172 MB
After del loss, del batch: 1962 MB / 11172 MB
Memory details for dim_size=300
CUDA Initialization: 584 MB / 11172 MB
MODEL + CUDA: 2490 MB / 11172 MB
FORWARD PASS: 2514 MB / 11172 MB
BACKWARD PASS: 8232 MB / 11172 MB
OPTIMIZER STEP: 10428 MB / 11172 MB
After del loss, del batch: 8228 MB / 11172 MB
FORWARD PASS on second batch: CUDA OOM
With dim_size=300, the run goes out of memory as soon as the second batch is loaded and backward is called on it, because the 8228 MB still in use plus the ~5 GB that backward adds is more than the GPU's 11172 MB.
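As a sanity check on the numbers above, here is a minimal sketch that estimates the size of the two embedding tables from their shapes. It assumes the parameters are stored as float32 (4 bytes per element), which the post does not state; `embedding_mib` is a hypothetical helper name used only for illustration.

```python
# Rough size estimate for the two embedding tables.
# Assumption: parameters are float32, i.e. 4 bytes per element.
ROWS = 704990 + 957760  # rows of Param1 + Param2, taken from the shapes above
BYTES_PER_FLOAT32 = 4


def embedding_mib(dim_size):
    """Return the combined size of both tables in MiB for a given dim_size."""
    return ROWS * dim_size * BYTES_PER_FLOAT32 / 2**20


for dim_size in (50, 300):
    print(dim_size, round(embedding_mib(dim_size)), "MiB")
```

Under that assumption the estimates are close to the jump from "CUDA Initialization" to "MODEL + CUDA" in both runs (904 - 584 = 320 MB and 2490 - 584 = 1906 MB). A dense gradient for these parameters would be the same size again, and it stays allocated after backward.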
Currently, after loss.backward() and optimizer.step(), I run the following to free up memory:
del loss
del input_batch
torch.cuda.empty_cache()
Is there anything I can do to make sure the computation graph is completely freed, so that before the second batch is loaded the GPU memory used by PyTorch is the same as it was just before the forward pass on the first batch?
My training loop:
words = torch.from_numpy(word_file.root.data[start:start+batch_size, :].astype(int))
entities = torch.from_numpy(entity_file.root.data[start:start+batch_size, :].astype(int))
optimizer.zero_grad()
words.cuda()
entities.cuda()
loss = model(words, entities)
loss.backward(retain_graph=False)
optimizer.step()
del loss
del words
del entities
torch.cuda.empty_cache()
batchno = batchno + 1
start = start + batch_size
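Your training loop looks a bit strange: words.cuda() and entities.cuda() return new tensors on the GPU rather than moving the originals in place, and you are not assigning the results back, so words and entities should still be on the CPU.
Could you check the device of these tensors?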
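To illustrate the point, here is a minimal sketch of checking the device and assigning the moved tensors back. The tensor shapes are made up, `move_batch` is a hypothetical helper, and the code falls back to the CPU when no GPU is available so it runs anywhere; it is not your actual data loading.

```python
import torch


def move_batch(words, entities, device):
    """Return copies of both tensors on `device`; the results must be assigned."""
    words = words.to(device)
    entities = entities.to(device)
    return words, entities


# Assumption: use the GPU when present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in batches (made-up shapes) in place of the data read from the files.
words = torch.zeros(4, 3, dtype=torch.long)
entities = torch.zeros(4, 3, dtype=torch.long)

# Discarding the return value, as in the original loop, leaves `words` on the CPU.
words.to(device)
print("after unassigned move:", words.device)

# Assigning the result back is what actually gives the model GPU tensors.
words, entities = move_batch(words, entities, device)
print("after assigned move:", words.device, entities.device)
```

Printing words.device and entities.device right before the call to model(words, entities) in your loop should show whether the batch ever reaches the GPU.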