Sudden OOM after several epochs of training

jangbi · August 18, 2018, 7:28am

I am implementing Fast R-CNN (based on VGG16) in PyTorch. My graphics card is a GTX 1060 (6 GB version).

During the training stage, the program always crashes (out of memory) after several epochs of training.

The strange part is that the error sometimes occurs after 100~200 epochs, and sometimes after only 5~10 epochs.

I use max_memory_allocated to show the memory usage, and it stays at about 4000M in each epoch. Since the memory usage remains constant before the OOM occurs, I think there may be no memory leak. nvidia-smi reports approximately 5600M. But then, suddenly, the crash occurs.

Here’s the training code.

    for epoch in range(num_epochs):
        scheduler.step()

        tensor, origin_image, boxes, labels = next(dataiter)

        optimizer.zero_grad()

        rois, roi_indices = proposal_generator(origin_image)

        # only one image per epoch is supported for now
        boxes = boxes[0]
        labels = labels[0]
        sample_rois, gt_roi_loc, gt_roi_label = proposal_target_creator(rois, boxes, labels)
        n_sample = sample_rois.size(0)
        roi_indices = torch.zeros(n_sample, dtype=torch.int)

        tensor = tensor.to(device)
        sample_rois = sample_rois.to(device)
        gt_roi_loc = gt_roi_loc.to(device)
        roi_indices = roi_indices.to(device)
        gt_roi_label = gt_roi_label.to(device)

        roi_cls_loc, roi_scores = model(tensor, sample_rois, roi_indices)

        gt_roi_label = gt_roi_label.long()
        class_loss = class_criterion(roi_scores, gt_roi_label)
        regression_loss = loc_loss(bbox_regression_criterion, n_sample,
                                   roi_cls_loc, gt_roi_loc, gt_roi_label)

        loss = class_loss + regression_loss
        loss.backward()
        print(torch.cuda.max_memory_allocated() / (1024 * 1024))
        running_loss += loss.item()
        print(epoch)
        if epoch % 100 == 0:
            print('[%d] loss: %.3f' %
                  (epoch, running_loss / 100))
            running_loss = 0.

        optimizer.step()

Update:
Here’s another question. I used max_memory_allocated to print the memory usage before training starts; the result is approximately 500M, which is close to the size of VGG’s parameters. However, after the first epoch it suddenly rises to 3500M, and then stays stable at about 4000M after several epochs. I calculated the size of tensor, sample_rois, roi_indices and gt_roi_label, and the sum is only about 10M. So why does the memory usage rise so high?
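A rough way to get a feel for the gap between "inputs are about 10M" and the observed usage is to count the values that the VGG16 convolution layers produce for one image. The sketch below is only a lower-bound estimate under stated assumptions: the 600x800 input size is hypothetical (the post does not give the image size), values are float32, and only conv outputs are counted. RoI pooling, the fully connected layers applied to every sampled RoI, gradients, and cuDNN workspace are all ignored, so it does not account for the full 3500M.

```python
# Channel configuration of the 13 VGG16 conv layers; "M" marks a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512]


def vgg16_conv_activation_bytes(height, width, bytes_per_value=4):
    """Bytes needed to hold every conv-layer output for one input image.

    Assumes 3x3 convs with padding 1 (spatial size unchanged) and
    2x2 max-pooling with stride 2 (spatial size halved, rounded down).
    """
    total_values = 0
    h, w = height, width
    for item in VGG16_CFG:
        if item == "M":
            h, w = h // 2, w // 2
        else:
            total_values += item * h * w
    return total_values * bytes_per_value


# Hypothetical input size; replace with the real image dimensions.
size_mb = vgg16_conv_activation_bytes(600, 800) / (1024 * 1024)
print("conv activations (forward only): %.1f MB" % size_mb)
```

Because the estimate scales with height x width, a dataset with varying image sizes would give a different number for every image.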

Daniel_Dagnino · August 19, 2018, 3:59pm

The model parameters are only a small part of the total memory that you need to run a model. Even though the model parameters only require 500M, to run the model you also need to keep in memory the values output by every layer. For example, take a simple first layer Conv2d with a 3x3 kernel that acts on an RGB image and gives 2 output channels, i.e. (Cin,Cout,H,W)=(3,2,3,3). In this case the first layer only has 54 weights (ignoring the bias). However, if the input image size is (C,H,W)=3x128x128, the output of the first layer will have 128x128x2=32768 values (assuming padding that keeps the input size). Moreover, depending on the optimizer you are using, the memory will increase further: for instance, Adam saves two vectors of the same size as the parameters of the model, so that adds 2x500M. However, the increase in memory after several epochs is strange.
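The arithmetic in this example can be checked with a few lines of plain Python. This is a minimal sketch: the helper names are made up for illustration, the bias is ignored as in the text, and the convolution is assumed to use stride 1 with padding that preserves the spatial size.

```python
def conv2d_weight_count(c_in, c_out, k_h, k_w):
    """Number of weights in a Conv2d kernel (bias not included)."""
    return c_in * c_out * k_h * k_w


def conv2d_output_values(c_out, height, width):
    """Number of output values, assuming the spatial size is preserved."""
    return c_out * height * width


def adam_state_values(n_params):
    """Adam keeps two running averages per parameter."""
    return 2 * n_params


weights = conv2d_weight_count(c_in=3, c_out=2, k_h=3, k_w=3)
outputs = conv2d_output_values(c_out=2, height=128, width=128)

print("weights:", weights)
print("output values:", outputs)
print("output values per weight: %.0f" % (outputs / weights))
print("extra Adam state values:", adam_state_values(weights))
```

The ratio of output values to weights grows with the image area, which is why the early, high-resolution layers of a network tend to dominate activation memory.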

jangbi · August 19, 2018, 4:08pm

Yep, the crash occurs very suddenly, while the max_memory_allocated output and nvidia-smi’s output remain constant. Sometimes I can train for about 200 epochs, and sometimes only 10~20.
I mainly referenced these two repos: https://github.com/chenyuntc/simple-faster-rcnn-pytorch and https://github.com/pytorch/examples/tree/d8d378c31d2766009db400ac03f41dd837a56c2a/fast_rcnn. The first one reports that it only takes about 3G of GPU memory. My program should use less memory, since I’ve only implemented the Fast R-CNN part.

jangbi · August 19, 2018, 4:13pm

Here are 3 pictures from one experiment. max_memory_allocated rises quickly in the first epoch or two, and remains almost constant for many epochs afterwards. Then, all of a sudden, the OOM occurs.