OK, now I have a more detailed analysis of what happened.
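For context, the memory snapshots below come from querying nvidia-smi at different points in the training loop, roughly like this (the helper name is just illustrative):

```python
import subprocess

def log_gpu_memory(tag):
    # Print used/free memory for every GPU, as reported by nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.free",
         "--format=csv"])
    print(tag)
    print(out.decode())
```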
First iteration:
At the start of the iteration:
memory.used [MiB], memory.free [MiB]
533 MiB, 11674 MiB
1147 MiB, 11060 MiB
533 MiB, 11674 MiB
1147 MiB, 11060 MiB
After the forward pass:
memory.used [MiB], memory.free [MiB]
2903 MiB, 9304 MiB
2232 MiB, 9975 MiB
2899 MiB, 9308 MiB
2233 MiB, 9974 MiB
At the end of the iteration, after the forward and backward passes:
memory.used [MiB], memory.free [MiB]
1104 MiB, 11103 MiB
4231 MiB, 7976 MiB
5754 MiB, 6453 MiB
3718 MiB, 8489 MiB
memeda! (the marker printed at the end of each iteration)
Second iteration:
memory.used [MiB], memory.free [MiB]
1104 MiB, 11103 MiB
4231 MiB, 7976 MiB
5754 MiB, 6453 MiB
3718 MiB, 8489 MiB
memory.used [MiB], memory.free [MiB]
3289 MiB, 8918 MiB
4440 MiB, 7767 MiB
5754 MiB, 6453 MiB
4344 MiB, 7863 MiB
memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB
memeda!
Third iteration:
memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB
memory.used [MiB], memory.free [MiB]
4720 MiB, 7487 MiB
5591 MiB, 6616 MiB
5888 MiB, 6319 MiB
5054 MiB, 7153 MiB
Traceback (most recent call last):
  File "train.py", line 218, in <module>
    loss_G1.backward()
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/scratch_net/biwidl207/ligua/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: out of memory
This is weird, since this time it ran through 3 iterations instead of 2.
The number of iterations the program gets through varies from run to run: sometimes 2, sometimes 3, and sometimes it finishes without any CUDA OOM error.
I am wondering what the cause may be.
Theory one: by default, .cuda() should send the result to the first GPU (device 0). Is it possible it behaves otherwise?
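To sanity-check this theory, I suppose I can print the default device and where a freshly allocated tensor actually ends up, something like:

```python
import torch

# Default device index that a bare .cuda() targets
# (normally 0 unless torch.cuda.set_device() was called somewhere).
print(torch.cuda.current_device())

# Where a tensor moved with .cuda() (no explicit device argument) actually lands.
x = torch.zeros(1).cuda()
print(x.get_device())  # should match the default device above
```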
I am just confused about where I should look, since the program is huge and there is far too much code to go through.
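To narrow down where to look, one thing I can try is logging PyTorch's own allocator statistics at the same three points, since the nvidia-smi numbers also include memory held by the caching allocator (the helper below is just a sketch, assuming torch.cuda.memory_allocated() is available in this PyTorch version):

```python
import torch

def log_torch_memory(tag):
    # Memory currently held by live tensors vs. the peak so far,
    # as seen by PyTorch's own allocator (not the nvidia-smi view).
    print("%s: allocated=%.1f MiB, max_allocated=%.1f MiB" % (
        tag,
        torch.cuda.memory_allocated() / 1024.0 ** 2,
        torch.cuda.max_memory_allocated() / 1024.0 ** 2))
```

If the allocated value itself keeps growing across iterations, something in the training loop is holding on to tensors (for example, a reference that keeps the autograd graph alive); if it stays flat, the growth seen by nvidia-smi is mostly cached blocks.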
Best Regards