GPU memory usage suddenly increases after several epochs and then an out-of-memory error occurs

Hi,

I have the issue that GPU memory suddenly increases after several epochs. I cannot observe a single event that leads to this increase, and it is not an accumulated increase over time. GPU memory stays nearly constant for several epochs, but then it suddenly uses more than double the amount of memory and finally crashes with an out-of-memory error (I observed this in nvidia-smi). Every epoch does identical work, so I don't see an obvious reason for this.

This issue first appeared when I extended my semantic segmentation model with a second cross-entropy loss. What I basically want to do is split the image into two regions defined by a particular binary mask. One part of the image then goes through loss1 and the other through loss2, so that I can weight different parts of the image differently.

Here are some snippets from the training code:

# Loss
import torch.nn as nn
import torch.nn.functional as F

class CrossEntropyLoss2d(nn.Module):
    def __init__(self, weight=None, size_average=True, ignore_index=255):
        super(CrossEntropyLoss2d, self).__init__()
        # pixels labelled with ignore_index (255) are excluded from the loss
        self.nll_loss = nn.NLLLoss2d(weight, size_average, ignore_index)

    def forward(self, inputs, targets):
        # inputs: BxCxHxW class scores, targets: BxHxW class indices
        # log_softmax is taken over the class dimension (dim 1 for 4D inputs)
        return self.nll_loss(F.log_softmax(inputs), targets)

ce_loss_criterion = CrossEntropyLoss2d()

# Main training code (inside loop)
seg_out = <Network outputs one-hot>                                    # BxCxHxW class scores
mask = Variable(<A batch of binary masks BxHxW>, requires_grad=False)
seg_gt = Variable(<Ground truth labels>, requires_grad=False)          # BxHxW
# each masked target keeps one region and marks the other with the
# ignore_index 255, so each loss term only sees "its" pixels
seg_gt_masked_1 = seg_gt.clone()
seg_gt_masked_2 = seg_gt.clone()
seg_gt_masked_1[mask == 0] = 255
seg_gt_masked_2[mask == 1] = 255
seg_loss  = ce_loss_criterion(seg_out, seg_gt_masked_1) * weight_1
seg_loss += ce_loss_criterion(seg_out, seg_gt_masked_2) * weight_2
del seg_gt_masked_1
del seg_gt_masked_2
gc.collect()
seg_loss.backward()
opt.step()

Does anyone see an obvious issue with the approach above?

Any comments regarding my code or what I could do to find the root of this issue are welcome.

I tried with PyTorch 0.2 and 0.3 and on different machines (an "old" and a "new" NVIDIA Titan variant), but both lead to this crash. I use multi-GPU training but also tried a single GPU, and I started from a batch size where only 6 GB per GPU were used (the GPUs have 12 GB). Still, after several epochs the whole thing suddenly crashes. As you can see, I already tried "del" and "gc.collect()", but with no effect.
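
One thing I still want to try is to dump all CUDA tensors the garbage collector can see at the end of each epoch, to check whether something keeps references alive. This is only a debugging sketch (on 0.2/0.3 a Variable wraps its tensor in .data, hence the extra check):

import gc
import torch

def dump_live_cuda_tensors():
    """Print the type and size of every CUDA tensor the GC still sees.

    Debugging sketch: if the count or total size grows epoch after epoch,
    something is holding references (e.g. a stored loss Variable with its graph).
    """
    for obj in gc.get_objects():
        try:
            # plain tensors, and Variables (which wrap a tensor in .data on 0.2/0.3)
            tensor = obj if torch.is_tensor(obj) else getattr(obj, 'data', None)
            if torch.is_tensor(tensor) and tensor.is_cuda:
                print(type(obj).__name__, tuple(tensor.size()))
        except Exception:
            pass  # some objects raise on attribute access; ignore them

# intended to be called once per epoch, e.g. after opt.step() of the last batch
# dump_live_cuda_tensors()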

I have a theory: Maybe those two lines

seg_loss  = ce_loss_criterion(seg_out, seg_gt_masked_1) * weight_1
seg_loss += ce_loss_criterion(seg_out, seg_gt_masked_2) * weight_2

in combination with

seg_loss.backward()

lead to a situation where the network is implicitly split into two copies and each loss is backpropagated separately. Then it would be no surprise that I occasionally observe double the memory consumption. Can anyone confirm or disprove this?
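
To make the pattern explicit, here is a tiny toy example of what I mean (hypothetical tensors, not my real model): two losses computed on the same output, summed with weights, and backpropagated once.

import torch
from torch.autograd import Variable

x = Variable(torch.randn(4, 3), requires_grad=True)
out = x * 2                         # stand-in for the network output

loss1 = out.pow(2).mean()           # stand-in for the first masked loss
loss2 = out.abs().mean()            # stand-in for the second masked loss
total = 0.7 * loss1 + 0.3 * loss2   # same "sum of weighted losses" pattern as above

total.backward()                    # a single backward pass through the shared graph
print(x.grad)                       # one gradient buffer, accumulated from both losses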

Ok, my theory has been dismantled. I tried the following using the brand-new PyTorch 0.3 functionality:


seg_out = <Network outputs one-hot>                                    # BxCxHxW class scores
mask = Variable(<A batch of binary masks BxHxW>, requires_grad=False)
seg_gt = Variable(<Ground truth labels>, requires_grad=False)
# reduce=False (new in 0.3) returns a per-pixel loss map (BxHxW) instead of a scalar;
# the flag has to reach nn.NLLLoss2d inside the criterion for this to work
seg_loss = ce_loss_criterion(seg_out, seg_gt, reduce=False)
# weight the two regions of the loss map, then reduce to a scalar
seg_loss[mask == 1] = seg_loss[mask == 1] * weight_2
seg_loss[mask == 0] = seg_loss[mask == 0] * weight_1
seg_loss = seg_loss.mean()
seg_loss.backward()
opt.step()

But it still randomly runs out of memory. It is disturbing that as soon as you do something slightly different from the standard vanilla pipeline (network -> prediction -> cross-entropy), you run into one problem after another. That usually indicates that test coverage is thin.
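
The next variant I want to try avoids the masked in-place assignments entirely and just multiplies the per-pixel loss map by a precomputed weight map. This is only a sketch under the same assumption as above, i.e. that the criterion returns a BxHxW loss map when called with reduce=False:

# build a float weight map once from the binary mask:
# weight_1 where mask == 0, weight_2 where mask == 1
weight_map = mask.float() * weight_2 + (1 - mask.float()) * weight_1

per_pixel_loss = ce_loss_criterion(seg_out, seg_gt, reduce=False)  # BxHxW
seg_loss = (per_pixel_loss * weight_map).mean()

seg_loss.backward()
opt.step()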