Splitting my model across 4 GPUs and a CUDA out of memory problem

In order to deal with high-resolution 2K images, progressive GAN training, and limited GPU memory (12 GB), I am forced to split my model across 4 GPUs. However, I keep running into a CUDA out of memory error.
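(For context, the split is plain model parallelism; roughly the sketch below, with made-up layer names and sizes rather than my actual network: each stage lives on its own GPU and the activations are moved between devices inside forward().)

import torch.nn as nn

class SplitGenerator(nn.Module):
    # Sketch only: each stage sits on its own GPU; activations hop devices in forward().
    def __init__(self):
        super(SplitGenerator, self).__init__()
        self.stage0 = nn.Conv2d(3, 64, 3, padding=1).cuda(0)
        self.stage1 = nn.Conv2d(64, 64, 3, padding=1).cuda(1)
        self.stage2 = nn.Conv2d(64, 64, 3, padding=1).cuda(2)
        self.stage3 = nn.Conv2d(64, 3, 3, padding=1).cuda(3)

    def forward(self, x):
        x = self.stage0(x.cuda(0))
        x = self.stage1(x.cuda(1))
        x = self.stage2(x.cuda(2))
        return self.stage3(x.cuda(3))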

And the strange thing is that it happens after one iteration (meaning both the forward and the backward pass have already completed). I have checked my code, and there is definitely no place where it does anything like the following:

total_loss = 0

for iteration in ...:
    loss = …
    total_loss = loss + total_loss

Apart from the case above, I really cannot think of how a CUDA out of memory error could show up only after the first iteration (and I do not accumulate gradients, either).
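(Aside, for other readers: the safe version of that pattern, assuming PyTorch 0.4 or newer, accumulates a plain Python number instead of the loss tensor, so each iteration's graph can be freed:)

total_loss = 0.0

for iteration in ...:
    loss = ...
    total_loss += loss.item()   # item() (or loss.detach()) drops the autograd graph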

Can anyone help?

My memory usage after the first iteration

memory.used [MiB], memory.free [MiB]
4229 MiB, 7978 MiB
3686 MiB, 8521 MiB
1104 MiB, 11103 MiB
5754 MiB, 6453 MiB

Best Regards

My error message:
Traceback (most recent call last):
  File "train.py", line 215, in
    loss_G1.backward()
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory

If the OOM happens on the second iteration, it may be because the variables inside the loop are still alive during the second run. Python has function scoping (not block or loop scoping), so any variables assigned during the first iteration remain alive in subsequent iterations until they are reassigned.

Here is another post about this:

See my comment there about how you can rewrite your program to avoid having two versions alive at once.
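A minimal sketch of what I mean, with hypothetical variable names:

for data, target in loader:
    output = model(data)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Without this, `output` and `loss` stay alive until they are reassigned in the
    # next iteration, so two sets of activations can briefly coexist on the GPU.
    del output, loss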


Ok, now I have a more detailed analysis of what happened.

First iteration

At the start of the iteration:
memory.used [MiB], memory.free [MiB]
533 MiB, 11674 MiB
1147 MiB, 11060 MiB
533 MiB, 11674 MiB
1147 MiB, 11060 MiB

After the forward pass:
memory.used [MiB], memory.free [MiB]
2903 MiB, 9304 MiB
2232 MiB, 9975 MiB
2899 MiB, 9308 MiB
2233 MiB, 9974 MiB

At the end of the iteration, after the forward and backward passes:
memory.used [MiB], memory.free [MiB]
1104 MiB, 11103 MiB
4231 MiB, 7976 MiB
5754 MiB, 6453 MiB
3718 MiB, 8489 MiB
memeda!

Second iteration:

memory.used [MiB], memory.free [MiB]
1104 MiB, 11103 MiB
4231 MiB, 7976 MiB
5754 MiB, 6453 MiB
3718 MiB, 8489 MiB
memory.used [MiB], memory.free [MiB]
3289 MiB, 8918 MiB
4440 MiB, 7767 MiB
5754 MiB, 6453 MiB
4344 MiB, 7863 MiB

memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB
memeda!

Third iteration:

memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB
memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB

Traceback (most recent call last):
  File "train.py", line 218, in
    loss_G1.backward()
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory

This is weird, since this time it ran through 3 iterations instead of 2.

The number of iterations the program can run through is very different each time: sometimes 2, sometimes 3, and sometimes it runs through without any CUDA OOM error.

I am wondering what the cause may be.

Theory one: by default, .cuda() should send the result to the first GPU (the default device). Is it possible that it behaves otherwise?
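(If it helps to check that theory, tensors can be pinned to an explicit device instead of relying on the default; a small sketch, assuming at least 3 visible GPUs:)

import torch

x = torch.randn(4, 3, 256, 256)
x0 = x.cuda()         # current device, GPU 0 unless changed
x1 = x.cuda(1)        # explicitly GPU 1
with torch.cuda.device(2):
    x2 = x.cuda()     # current device is GPU 2 inside this context
print(x0.get_device(), x1.get_device(), x2.get_device())   # prints: 0 1 2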

I am just confused about where I should look, since the program is huge and there is way too much code to go through.

Best Regards

I am having the same issue: sometimes it runs through more iterations and sometimes fewer. If I modify the number of sub-processes used when loading the dataset, it seems to also affect the number of iterations it can perform, even though that does not make sense to me, since the batches are loaded on the CPU and the training itself is still sequential. In my case, when I say iterations, I mean forward passes (batch passes).

Hi!
I encountered the same problem, and I can't find a proper way to show my memory usage after every training iteration.
Could you share a code sample showing how you print the memory usage after every iteration?
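(For reference, one way to log it each iteration is sketched below; the tables above match nvidia-smi's query format, and PyTorch's caching allocator also exposes counters, assuming your version has torch.cuda.memory_allocated / memory_cached.)

import subprocess
import torch

def log_gpu_memory():
    # Driver-side view, same format as the tables above (includes CUDA context overhead).
    print(subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.free",
         "--format=csv"]).decode())
    # Allocator-side view, per device.
    for d in range(torch.cuda.device_count()):
        print("cuda:%d  allocated %.0f MiB, cached %.0f MiB" % (
            d,
            torch.cuda.memory_allocated(d) / 2.0 ** 20,
            torch.cuda.memory_cached(d) / 2.0 ** 20))

Calling log_gpu_memory() at the end of every iteration should give output comparable to the tables above.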