I have an issue where GPU memory suddenly increases after several epochs. I cannot pinpoint a single event that leads to this increase, and it is not a gradual accumulation over time: GPU memory stays nearly constant for several epochs, then suddenly more than doubles, and training finally crashes with an out-of-memory error (I observed this in nvidia-smi). Every epoch does identical work, so I don't see an obvious reason for this.
This issue first appeared when I extended my semantic segmentation model with a second cross-entropy loss. What I basically want to do is split the image into two regions defined by a binary mask. One part of the image then goes through loss1 and the other through loss2, so that I can weight different parts of the image differently.
Here are some snippets from the training code:
```python
import gc

import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Loss
class CrossEntropyLoss2d(nn.Module):
    def __init__(self, weight=None, size_average=True, ignore_index=255):
        super(CrossEntropyLoss2d, self).__init__()
        # log-softmax + NLL over the 2D spatial dimensions
        self.nll_loss = nn.NLLLoss2d(weight, size_average, ignore_index)

    def forward(self, inputs, targets):
        return self.nll_loss(F.log_softmax(inputs), targets)

ce_loss_criterion = CrossEntropyLoss2d()

# Main training code (inside loop)
seg_out = <Network outputs one-hot>                            # BxCxHxW logits
mask = Variable(<A batch of binary masks BxHxW>, requires_grad=False)
seg_gt = Variable(<Ground truth labels>, requires_grad=False)  # BxHxW

# Route each region to its own loss by setting the other region to ignore_index
seg_gt_masked_1 = seg_gt.clone()
seg_gt_masked_2 = seg_gt.clone()
seg_gt_masked_1[mask == 0] = 255  # loss1 only sees pixels where mask == 1
seg_gt_masked_2[mask == 1] = 255  # loss2 only sees pixels where mask == 0

seg_loss = ce_loss_criterion(seg_out, seg_gt_masked_1) * weight_1
seg_loss += ce_loss_criterion(seg_out, seg_gt_masked_2) * weight_2

# Attempted workaround: drop the masked label copies before backward
del seg_gt_masked_1
del seg_gt_masked_2
gc.collect()

seg_loss.backward()
opt.step()
```
Does anyone see an obvious issue with the approach above?
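One alternative I could imagine is to drop the two label clones and instead weight an unreduced per-pixel loss by the mask. The following is only a sketch, not code I am running: it assumes a PyTorch version where F.cross_entropy accepts reduction='none' (not available in 0.2/0.3), and masked_ce_loss is a made-up helper name.

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(seg_out, seg_gt, mask, weight_1, weight_2, ignore_index=255):
    # seg_out: BxCxHxW logits, seg_gt: BxHxW labels, mask: BxHxW binary
    per_pixel = F.cross_entropy(seg_out, seg_gt,
                                ignore_index=ignore_index,
                                reduction='none')  # BxHxW, 0 at ignored pixels
    valid = seg_gt != ignore_index
    region_1 = (mask == 1) & valid  # pixels that loss1 sees above
    region_2 = (mask == 0) & valid  # pixels that loss2 sees above
    # Mirrors the two size-averaged losses: mean per region, then weighted sum
    # (assumes both regions are non-empty in every batch).
    return (weight_1 * per_pixel[region_1].mean()
            + weight_2 * per_pixel[region_2].mean())
```

With that formulation the del / gc.collect() workaround would be unnecessary, since no extra label tensors are created in the first place.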
Any comments regarding my code, or suggestions on how to find the root of this issue, are welcome.
I tried PyTorch 0.2 and 0.3 and different machines (NVIDIA Titan, both the "old" and "new" variant), and all of them lead to this crash. I use multi-GPU training, but I also tried a single GPU, and I started from a batch size where only 6 GB of the 12 GB per GPU were used. Still, after several epochs the whole thing suddenly crashes. As you can see above, I already tried del and gc.collect(), but to no effect.
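To narrow this down further, my plan is to instrument the training loop roughly as below. This sketch assumes PyTorch >= 0.4, where torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() exist (on 0.2/0.3 I can only poll nvidia-smi from outside); the function names are my own.

```python
import gc

import torch

def log_gpu_memory(epoch, device=0):
    # Allocator view of GPU memory in MiB; nvidia-smi reports more because
    # it also counts the CUDA context and PyTorch's cached blocks.
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print('epoch %d: allocated %.1f MiB (peak %.1f MiB)' % (epoch, alloc, peak))

def count_live_cuda_tensors():
    # Census of CUDA tensors still referenced from Python; steady growth here
    # would mean something keeps references (and autograd graphs) alive.
    return sum(1 for obj in gc.get_objects()
               if torch.is_tensor(obj) and obj.is_cuda)
```

Calling both at the end of every epoch should at least show whether the jump comes from the allocator itself or from a growing number of live tensors right before the crash.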