CUDA error: out of memory - huge embedding layer

I am working on a CNN autoencoder for text with an architecture and loss function like this:
[screenshot: model architecture and loss function]

import torch
import torch.nn.functional as F

def computeCrossEntropy(log_prob, target):
    # Per-sentence summed NLL (size_average=False), then averaged over the batch.
    loss = [F.nll_loss(sentence_emb_matrix, word_ids, size_average=False) for sentence_emb_matrix, word_ids in zip(log_prob, target)]
    average_loss = sum([torch.sum(l) for l in loss]) / log_prob.size()[0]
    return average_loss

I understand the embedding is huge, and I can only use a batch size of 8; otherwise I run out of memory (when loss.backward() is called). Apparently, the line below in the loss function takes up 60% of the computation time:

[torch.sum(l) for l in loss]

Are there any ways I can deal with this memory issue so I can increase my batch size, for example by having the embedding layer processed somewhere else, assuming I have multiple GPUs? Sorry for my ignorance, as I am pretty new to this. I would like to know the best practice for handling this kind of situation to improve training time. Thanks a lot in advance for the guidance.
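
For reference, assuming log_prob has shape (batch, seq_len, vocab_size) and target has shape (batch, seq_len), I believe an equivalent version of this loss without the Python loop would look roughly like this (just a sketch, not profiled):

import torch.nn.functional as F

def computeCrossEntropyFlat(log_prob, target):
    # Flatten the batch and sequence dimensions, sum the NLL over every token,
    # then divide by the batch size (same result as the loop version above).
    batch_size, seq_len, vocab_size = log_prob.size()
    total = F.nll_loss(log_prob.reshape(-1, vocab_size), target.reshape(-1), reduction='sum')
    return total / batch_size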

If you have several GPUs you can use DataParallel, which splits the batch across the GPUs you have.
If, even after doing that, the embedding is still too big, you can go inside the model and manually assign the GPU on which each tensor should live (e.g. process conv1 on gpu0, conv2 on gpu1, and so on). However, this is not optimal, since the GPUs are then used sequentially rather than in parallel.

Check https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
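
Roughly, the two options look like this (a minimal sketch with toy conv layers standing in for your real encoder and decoder; adjust the device ids to your setup):

import torch
import torch.nn as nn

# Toy stand-ins for the real encoder/decoder, just to illustrate placement.
encoder = nn.Sequential(nn.Conv1d(300, 128, kernel_size=3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv1d(128, 300, kernel_size=3, padding=1))

# Option 1: data parallelism. One replica of the model per GPU; the batch is split along dim 0.
model = nn.DataParallel(nn.Sequential(encoder, decoder), device_ids=[0, 1, 2, 3]).cuda()

# Option 2: model parallelism. Put sub-modules on different GPUs by hand
# and move the activations between them (the GPUs then work sequentially).
class SplitAutoencoder(nn.Module):
    def __init__(self, enc, dec):
        super().__init__()
        self.enc = enc.to('cuda:0')
        self.dec = dec.to('cuda:1')

    def forward(self, x):
        z = self.enc(x.to('cuda:0'))
        return self.dec(z.to('cuda:1'))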

@JuanFMontesinos Thank you very much for the reply. I have already tried both DataParallel and DistributedDataParallel. I thought the former would replicate the model on multiple GPUs, so all of them would hold the embedding layer, making no difference in this case. Am I right?

For the distributed one, I had to lower the batch size from 8 to 2 to run without running out of memory, which offsets its increased speed.

P.S. I am using 4 GPUs. Have been stuck on this issue for days…

Well, if you can use, say, a batch size of 8 with one GPU, DataParallel will replicate the model and process 2 samples of your batch on each GPU. Therefore you should have free memory on all your GPUs, which should allow you to increase your batch size.

Can you check the memory usage before and after using DataParallel?

You would only have a problem if your GPU could not store the embedding even with batch size 1; if your model fits on one GPU, DataParallel should definitely let you increase the batch size.
The scaling is not linear, because there is usually one GPU that requires more memory (typically cuda:0): inputs transferred from CPU to GPU are initially stored there, and some optimizers also store their parameters there.

That's why it would be good for you to check the per-GPU memory usage with and without DataParallel.
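
For example, something like this after a forward/backward pass, run once without and once with DataParallel, will show where the memory actually goes (a minimal sketch):

import torch

def print_gpu_memory(tag=''):
    # Memory currently held by tensors on each GPU, plus the peak seen so far.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(i) / 1024 ** 2
        print(f"{tag} cuda:{i}: {allocated:.0f} MB allocated, {peak:.0f} MB peak")

# e.g. call print_gpu_memory('after backward') right after loss.backward()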

Thanks for the help. I tweaked the model a little bit so the encoder and decoder share the same embedding, since it's not supposed to be trainable. Now I am able to use a batch size of 20 on a single GPU. But with DataParallel I am unable to increase my batch size: I tried 24, which already gives me an error. Watching the memory usage, the main GPU where the results are gathered blows up…
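
The sharing I mean looks roughly like this (a sketch with made-up layer sizes; from_pretrained with freeze=True keeps the embedding non-trainable, and the decoder's output projection here is just one way to reuse the same weights):

import torch
import torch.nn as nn

# Made-up sizes for the sketch: 100k-word vocabulary, 300-dim pretrained vectors.
pretrained_vectors = torch.randn(100000, 300)

# One frozen embedding shared by both sides instead of a copy in encoder and decoder.
shared_emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

class Encoder(nn.Module):
    def __init__(self, emb):
        super().__init__()
        self.emb = emb                              # same module object, no copy
        self.conv = nn.Conv1d(300, 128, kernel_size=3, padding=1)

    def forward(self, word_ids):                    # word_ids: (batch, seq_len)
        x = self.emb(word_ids).transpose(1, 2)      # (batch, 300, seq_len)
        return self.conv(x)

class Decoder(nn.Module):
    def __init__(self, emb):
        super().__init__()
        self.emb = emb                              # reuse the same frozen weights
        self.proj = nn.Linear(128, 300)

    def forward(self, hidden):                      # hidden: (batch, 128, seq_len)
        h = self.proj(hidden.transpose(1, 2))       # (batch, seq_len, 300)
        logits = h @ self.emb.weight.t()            # (batch, seq_len, vocab_size)
        return torch.log_softmax(logits, dim=-1)

encoder, decoder = Encoder(shared_emb), Decoder(shared_emb)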

I was setting the GPU IDs wrongly. I realised that there is a GPU imbalance issue: the main GPU is fully used while the others only use about a quarter of their 12xxx MB. Is this normal?

Well, it depends on several things. I found that some optimizers require a lot of memory; I ran into this issue with Adam. Those optimizers save their parameters on the default GPU. I'm trying to figure out whether there is a way of solving that problem, but it seems there is no simple way to store those parameters on the CPU or on another GPU.

At the same time, when you load input variables they have to be stored on a GPU. You can try to save some memory by allocating the input on a GPU other than the default one.

Anyway, you can try to use a simpler optimizer such as SGD.
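
For example (a quick sketch, with a toy layer standing in for your model): Adam keeps two extra buffers per trainable parameter, while plain SGD without momentum keeps none, so the switch alone can free a noticeable amount of memory.

import torch
import torch.nn as nn

model = nn.Linear(300, 300).cuda()   # stand-in for the real autoencoder

# Adam stores two running averages (exp_avg, exp_avg_sq) per parameter,
# roughly tripling the memory taken by the trainable parameters themselves.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Plain SGD (no momentum) keeps no per-parameter state.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)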
