I am new to computer vision. I have been trying to change the backbone of Mask R-CNN to ResNet-101 as follows:
```
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False)
backbone = resnet_fpn_backbone('resnet101', pretrained=False)
model.backbone = backbone
```
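A minimal sketch of an alternative construction, assuming torchvision's `MaskRCNN` class and `resnet_fpn_backbone` helper: build the model directly around the ResNet-101 FPN backbone instead of overwriting `model.backbone` afterwards, so the detection and mask heads are created to match the backbone from the start.

```
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Construct Mask R-CNN directly on a ResNet-101 FPN backbone; pretrained
# ImageNet weights in the backbone usually make early training more stable.
backbone = resnet_fpn_backbone('resnet101', pretrained=True)
model = MaskRCNN(backbone, num_classes=91)  # 91 = COCO classes incl. background
```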
When I start training, the losses blow up to NaN and the system exits. I do not know what else I can try. If someone can help, I would be very grateful.
```
Epoch: [0]  [ 0/60]  eta: 0:00:46  lr: 0.000090  loss: 66788.3438 (66788.3438)  loss_classifier: 553.5748 (553.5748)  loss_box_reg: 1551.7709 (1551.7709)  loss_mask: 63966.0742 (63966.0742)  loss_objectness: 543.3094 (543.3094)  loss_rpn_box_reg: 173.6078 (173.6078)  time: 0.7696  data: 0.2757  max mem: 11470
Loss is nan, stopping training
{'loss_classifier': tensor(0., device='cuda:0', grad_fn=), 'loss_box_reg': tensor(6.8742e+30, device='cuda:0', grad_fn=), 'loss_mask': tensor(nan, device='cuda:0', grad_fn=), 'loss_objectness': tensor(8.7334e+11, device='cuda:0', grad_fn=), 'loss_rpn_box_reg': tensor(1.5060e+12, device='cuda:0', grad_fn=)}
An exception has occurred, use %tb to see the full traceback.
```
Thanks for the tip!
Hi, I am running in Google Colab. I managed to get the code to run, but now I have a new issue: I used pretrained COCO weights (ResNet-101 FPN) from maskrcnn-benchmark, but the learning rate decay never updates.
```
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=5,
                                               gamma=0.1)
```
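For reference, with `step_size=5` and `gamma=0.1` the learning rate should drop tenfold every five epochs (0.005 → 0.0005 → 0.00005). A quick sanity check, assuming the scheduler is the only thing touching the lr (the training call is elided here):

```
for epoch in range(15):
    print(epoch, optimizer.param_groups[0]['lr'])  # lr used for this epoch
    # ... train one epoch here ...
    lr_scheduler.step()  # expected: 0.005 for epochs 0-4, 0.0005 for 5-9, 0.00005 after
```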
Thanks for your reply. Please see below a snippet of my code, based on train_one_epoch from the detection engine.py. Any help would be appreciated.
```
import math
import os
import sys

import torch
import utils  # torchvision detection reference utils

eval = {'train_loss': [], 'valid_loss': []}
for epoch in range(num_epochs):
    t_loss = 0
    v_loss = 0
    total_img = 0
    model.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    header = 'Epoch: [{}]'.format(epoch)

    lr_scheduler = None
    if epoch == 0:
        warmup_factor = 1. / 1000
        warmup_iters = min(1000, len(data_loader) - 1)
        lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
    print("lr", lr_scheduler)

    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
        total_img += 1
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        loss_value = losses_reduced.item()

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        t_loss += loss_value

        if lr_scheduler is not None:
            lr_scheduler.step()

        metric_logger.update(loss=losses_reduced, **loss_dict_reduced)
        metric_logger.update(lr=optimizer.param_groups[0]["lr"])

    torch.save(model.state_dict(), os.path.join(out_dir, 'model{}.pth'.format(epoch)))
    eval['train_loss'].append(t_loss / total_img)
```
Hi Patrick,
I found my mistake: I was using the same name for the lr scheduler inside and outside the loop, so the StepLR created outside was being overwritten. I just renamed the scheduler outside the loop and it worked.
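For anyone who lands here later, a minimal sketch of that fix (the names `step_scheduler` and `warmup_scheduler` are illustrative, not from the original code): give the epoch-level StepLR its own name so the per-iteration warmup scheduler no longer shadows it, and step each at the right granularity.

```
# Epoch-level decay, created once, under its own name.
step_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(num_epochs):
    # Per-iteration warmup only during the first epoch.
    warmup_scheduler = None
    if epoch == 0:
        warmup_iters = min(1000, len(data_loader) - 1)
        warmup_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, 1. / 1000)

    for images, targets in data_loader:
        # ... forward / backward / optimizer.step() as in the snippet above ...
        if warmup_scheduler is not None:
            warmup_scheduler.step()  # steps every iteration

    step_scheduler.step()  # steps once per epoch -> lr decays every 5 epochs
```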