CUDA Out of Memory issues when training a simple model

I’m getting a weird OOM issue when training my model on GPU. My model has very few parameters, just an embedding layer (about 20000 x 300) and a matrix param (300 x 20000). In theory it should only consume a few hundred MB of memory and easily fit on the GPU, but during training my GPU memory consumption skyrockets to over 10GB after running for just a few minutes.
Here is the code of my model

	import torch.nn as nn

	class Model(nn.Module):
		def __init__(self, hidden_size, embedding, layer_num=1):
			super(Model, self).__init__()
			self.layer_num = layer_num
			self.hidden_size = hidden_size  # assumed equal to the embedding size
			self.embedding = embedding
			self.voc_size = embedding.vocab_size
			self.embed_size = embedding.embedding_size
			# Projects each embedding vector to vocabulary logits
			self.param = nn.Linear(self.hidden_size, self.voc_size)

		def forward(self, inputs, lengths):
			embed = self.embedding.embedding(inputs)
			out = self.param(embed)  # (L,B,vocab_size)
			return out

And here is my training loop

	for epoch in range(epoches):
		for idx, batch in enumerate(next_batch(lines, BATCH_SIZE)):
			pad_sents, lengths, pad_labels, mask, _, _ = batch2train(emb, batch)
			out = model(pad_sents, lengths)
			ret = out
			labels = pad_labels
			# Flatten (L,B,vocab_size) -> (L*B,vocab_size) to match the flat labels
			ret = ret.view(ret.size(0) * ret.size(1), -1)
			labels = labels.view(-1)
			loss = criterion(ret, labels)
			loss.backward()
			torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)

I tried to print all my tensor objects using the following code

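import gc
import torch

def get_tensors():
    # Yield every tensor (or object wrapping a tensor in .data) tracked by the garbage collector
    for obj in gc.get_objects():
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            yield obj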

It seems like there are not many tensor objects in use during training, but the overall memory consumption is abnormally high. Any idea what might be the cause, and how can I print out all objects that reside in GPU memory?

How much RAM does your GPU have, what is the batch size, and what is the tensor size?
I’ve found that the larger the images/tensors, the more channels (RGB vs Grayscale) and the larger the batch-sizes, the more GPU RAM is used. Run nvidia-smi and monitor GPU RAM whilst playing around with your parameters?

Hi Alex.
My batch size is set to 32, which is pretty small. I’m running on a Titan X which has about 12GB of VRAM. I’m trying to implement a simple language model, so all my inputs are text.
Here are all the tensor objects returned by Python’s gc module, printed in (type, shape, size in MB) format

[(<class 'torch.nn.parameter.Parameter'>, (17794, 300), 21.3528), 
(<class 'torch.nn.parameter.Parameter'>, (17794, 300), 21.3528), 
(<class 'torch.Tensor'>, (17794, 300), 21.3528), 
(<class 'torch.Tensor'>, (17794, 300), 21.3528),
(<class 'torch.Tensor'>, (17794, 300), 21.3528), 
(<class 'torch.Tensor'>, (17794, 300), 21.3528), 
(<class 'torch.nn.parameter.Parameter'>, (17794,), 0.071176),
(<class 'torch.Tensor'>, (17794,), 0.071176), 
(<class 'torch.Tensor'>, (17794,), 0.071176), 
(<class 'torch.Tensor'>, (1824, 17794), 129.825024),
(<class 'torch.Tensor'>, (1824,), 0.007296), 
(<class 'torch.Tensor'>, (57, 32, 17794), 129.825024),
(<class 'torch.Tensor'>, (57, 32), 0.007296), 
(<class 'torch.Tensor'>, (57, 32), 0.007296), 
(<class 'torch.Tensor'>, (57, 32), 0.007296), 
(<class 'torch.Tensor'>, (32,), 0.000128),
(<class 'torch.Tensor'>, (), 4e-06)]
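
For reference, the sizes in this listing are consistent with 4 bytes per element and with MB meaning 10^6 bytes. A quick check, assuming float32 throughout (the helper name `tensor_mb` is only for illustration and is not from the original post):

```python
from math import prod

def tensor_mb(shape, bytes_per_element=4):
    """Size of a dense tensor in MB (10**6 bytes); 4 bytes per element is an assumption."""
    return prod(shape) * bytes_per_element / 1e6

embedding_mb = tensor_mb((17794, 300))   # embedding / linear weight
logits_mb = tensor_mb((57, 32, 17794))   # model output for one batch

print(embedding_mb)
print(logits_mb)
```

Summed over the whole listing this comes to under 400 MB, so the live tensors alone do not account for the 10GB reported by the GPU.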

Do you load all your data into CUDA memory?
How much residual CUDA RAM do you have, after you instantiate your model?
Try with watch -n 1 nvidia-smi whilst running your code.
It’s rarely the model that uses all the memory, but either the copies for the batches, or the dataset itself IMHO, at least that is what’s happened to me many times.

I tried PyTorch’s own memory helper functions and found another weird phenomenon.
I added the following code to every iteration

print('current memory allocated: {}'.format(torch.cuda.memory_allocated() / 1024 ** 2))
print('max memory allocated: {}'.format(torch.cuda.max_memory_allocated() / 1024 ** 2))
print('cached memory: {}'.format(torch.cuda.memory_cached() / 1024 ** 2))

And this is what I get

current memory allocated: 145.43115234375
max memory allocated: 2584.5517578125
cached memory: 7989.5

Looks like the currently allocated memory is pretty normal, but the max allocated memory and the cached memory are abnormally high. I tried calling torch.cuda.empty_cache() after each iteration, which does decrease the VRAM usage significantly, but I still get an OOM error after running for a longer time.
So why does the caching memory allocator eat up so much VRAM?

Seems like the problem is closely related to my input tensor size. Each time an input tensor is larger than any tensor seen before, the cached memory size increases significantly. The memory growth happens in three places: first at out = self.param(embed) in the forward function, second at loss = criterion(ret, labels), and third at loss.backward().
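
Each of those three points is a place where a new tensor of the full output shape (L x B x vocab_size) can be created. A rough back-of-the-envelope sketch, assuming float32, the vocabulary size and batch size from the listing earlier in the thread, and that `criterion` is a cross-entropy-style loss that keeps a log-softmax of the same shape (the thread does not say which loss is used, so the factor of 3 is an assumption, not a measurement):

```python
VOCAB_SIZE = 17794   # from the tensor listing earlier in the thread
BATCH_SIZE = 32
BYTES_PER_FLOAT = 4  # assumes float32

def logits_mb(seq_len):
    """Size in MB (10**6 bytes) of one (seq_len, batch, vocab) float32 tensor."""
    return seq_len * BATCH_SIZE * VOCAB_SIZE * BYTES_PER_FLOAT / 1e6

def rough_step_mb(seq_len, copies=3):
    """Assumed copies: linear output, the loss's log-softmax, and the gradient."""
    return copies * logits_mb(seq_len)

for seq_len in (57, 100, 200):
    print(seq_len, round(logits_mb(seq_len), 1), round(rough_step_mb(seq_len), 1))
```

Because these tensors scale linearly with the sequence length, every new longest batch may need blocks larger than anything already held in the cache, which would be consistent with the jumps described above.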

So it appears that once I get an input tensor that is the largest seen so far during training, the cached memory size increases for every operation after the linear transformation. Sometimes the cached memory gets released when the GPU is running out of VRAM, and sometimes PyTorch just throws an OOM error and aborts. What might be the reason for this?
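
One workaround sometimes suggested for variable-length inputs is to run the longest batch first, so that the largest buffers are allocated once and can be reused by later, smaller batches. A minimal sketch, assuming each batch is a non-empty list of token-id lists; `longest_first` is a hypothetical helper, not part of PyTorch, and whether it helps in this particular case is untested:

```python
def longest_first(batches):
    """Move the batch containing the longest sequence to the front; keep the rest in order."""
    if not batches:
        return []
    # Index of the batch whose longest sentence is the longest overall
    idx = max(range(len(batches)), key=lambda i: max(len(s) for s in batches[i]))
    return [batches[idx]] + batches[:idx] + batches[idx + 1:]

# Toy data: three batches of token-id lists with different maximum lengths
batches = [
    [[1, 2], [3]],
    [[4, 5, 6, 7, 8], [9]],
    [[10, 11, 12]],
]
ordered = longest_first(batches)
print([max(len(s) for s in b) for b in ordered])
```

Capping the maximum sentence length is another option with a similar aim, since it puts an upper bound on the largest tensor the allocator ever has to serve.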

I’m sorry I don’t know, I think your best bet is to try and wait for a reply from one of the pytorch developers.
What happens when you try with other networks? Have you tried with one of the pretrained nets?

Hi, did you solve this issue yet? It seems like I ran into the same issue.

I trained my model with 8 GPUs and the first one always gave me an OOM error. So I moved my model to the last 7 GPUs and left the first GPU only for calculating the loss and calling loss.backward(). But this still used an abnormally large amount of memory on the first GPU.

I have an embedding layer (95000 x 1024) and a decoder layer that shares its parameters with the embedding layer. When the batch size is 16, the output size is 16 x 255 x 9500, and it consumed 9000 MB of memory on the first GPU.

Hi, has any solution been found for this problem? I’m facing the same issue. Thanks

Any ideas here?
A batch of 32 images of size 320x320x3 takes around 40GB of VRAM and I have no idea why