I am new to computer vision. I have been trying to change the backbone of Mask R-CNN to ResNet-101 as follows:
```
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False)
backbone = resnet_fpn_backbone('resnet101', pretrained=False)
model.backbone = backbone
```
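A minimal sketch of an alternative construction, assuming torchvision's `MaskRCNN` class and `resnet_fpn_backbone` helper: build the model directly around the ResNet-101 FPN backbone instead of overwriting `model.backbone` afterwards, so the detection and mask heads are created to match the backbone from the start.

```
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Construct Mask R-CNN directly on a ResNet-101 FPN backbone; pretrained
# ImageNet weights in the backbone usually make early training more stable.
backbone = resnet_fpn_backbone('resnet101', pretrained=True)
model = MaskRCNN(backbone, num_classes=91)  # 91 = COCO classes incl. background
```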
When I start training, the losses blow up to NaN and the system exits. I do not know what else I can try. If someone can help, I would be very grateful.
```
Epoch: [0]  [ 0/60]  eta: 0:00:46  lr: 0.000090  loss: 66788.3438 (66788.3438)  loss_classifier: 553.5748 (553.5748)  loss_box_reg: 1551.7709 (1551.7709)  loss_mask: 63966.0742 (63966.0742)  loss_objectness: 543.3094 (543.3094)  loss_rpn_box_reg: 173.6078 (173.6078)  time: 0.7696  data: 0.2757  max mem: 11470
Loss is nan, stopping training
{'loss_classifier': tensor(0., device='cuda:0', grad_fn=), 'loss_box_reg': tensor(6.8742e+30, device='cuda:0', grad_fn=), 'loss_mask': tensor(nan, device='cuda:0', grad_fn=), 'loss_objectness': tensor(8.7334e+11, device='cuda:0', grad_fn=), 'loss_rpn_box_reg': tensor(1.5060e+12, device='cuda:0', grad_fn=)}
An exception has occurred, use %tb to see the full traceback.
```
Thanks for the tip!
Hi, I am running in Google Colab. I managed to get the code to run, but now I have a new issue: I used pretrained COCO weights (ResNet-101 FPN) from maskrcnn-benchmark, but the learning rate decay never updates.
```
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=5,
                                               gamma=0.1)
```
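For reference, with `step_size=5` and `gamma=0.1` the learning rate should drop tenfold every five epochs (0.005 → 0.0005 → 0.00005). A quick sanity check, assuming the scheduler is the only thing touching the lr (the training call is elided here):

```
for epoch in range(15):
    print(epoch, optimizer.param_groups[0]['lr'])  # lr used for this epoch
    # ... train one epoch here ...
    lr_scheduler.step()  # expected: 0.005 for epochs 0-4, 0.0005 for 5-9, 0.00005 after
```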
Thanks for your reply. Please see below a snippet of my code, based on train_one_epoch from the detection engine.py. Any help would be appreciated.
```
import math
import os
import sys

import torch
import utils  # torchvision detection reference utils

eval = {'train_loss': [], 'valid_loss': []}
for epoch in range(num_epochs):
    t_loss = 0
    v_loss = 0
    total_img = 0
    model.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)

    lr_scheduler = None
    if epoch == 0:
        warmup_factor = 1. / 1000
        warmup_iters = min(1000, len(data_loader) - 1)
        lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
    print("lr", lr_scheduler)

    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
        total_img += 1
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        loss_value = losses_reduced.item()

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        t_loss += loss_value

        if lr_scheduler is not None:
            lr_scheduler.step()

        metric_logger.update(loss=losses_reduced, **loss_dict_reduced)
        metric_logger.update(lr=optimizer.param_groups[0]["lr"])

    torch.save(model.state_dict(), os.path.join(out_dir, 'model{}.pth'.format(epoch)))
    eval['train_loss'].append(t_loss / total_img)
```
Hi Patrick,
I found my mistake: I was using the same name for the lr scheduler inside and outside the loop, so the StepLR created outside was being overwritten. I just renamed the scheduler outside the loop and it worked.
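For anyone who lands here later, a minimal sketch of that fix (the names `step_scheduler` and `warmup_scheduler` are illustrative, not from the original code): give the epoch-level StepLR its own name so the per-iteration warmup scheduler no longer shadows it, and step each at the right granularity.

```
# Epoch-level decay, created once, under its own name.
step_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(num_epochs):
    # Per-iteration warmup only during the first epoch.
    warmup_scheduler = None
    if epoch == 0:
        warmup_iters = min(1000, len(data_loader) - 1)
        warmup_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, 1. / 1000)

    for images, targets in data_loader:
        # ... forward / backward / optimizer.step() as in the snippet above ...
        if warmup_scheduler is not None:
            warmup_scheduler.step()  # steps every iteration

    step_scheduler.step()  # steps once per epoch -> lr decays every 5 epochs
```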