I am very excited to see a library-supported implementation of Faster R-CNN and the COCO dataset wrappers… however, I cannot get mine to train.
Tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
Repo: https://github.com/EMCP/torchvision-faster-rcnn/blob/master/model_components/training.py
It seems I need to fix a NaN loss, and I am wondering if the "mask" part of the COCO data is to blame.
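In case the data really is the culprit: here is a quick sanity check I could run over the raw COCO JSON (pure stdlib, just a sketch). Boxes with zero or negative width/height become empty RoIs after the xywh → xyxy conversion, which is a common source of NaN in the box-regression losses:

```python
import json

def find_degenerate_boxes(coco_json_path):
    """Return (annotation id, bbox) pairs whose COCO bbox [x, y, w, h]
    has non-positive width or height -- a frequent cause of NaN in
    loss_rpn_box_reg / loss_box_reg."""
    with open(coco_json_path) as f:
        coco = json.load(f)
    bad = []
    for ann in coco.get('annotations', []):
        x, y, w, h = ann['bbox']
        if w <= 0 or h <= 0:
            bad.append((ann['id'], ann['bbox']))
    return bad
```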
I was also stuck trying to use a pretrained ResNet-101 here instead of the smaller default network, but that didn't take either.
/home/emcp/anaconda3/envs/pytorch_150/bin/python /home/emcp/Dev/git/EMCP/faster-rcnn-torchvision/model_components/training.py
Using hyperParameters:
{'hyperParameters': {'anchor_ratios': [0.5, 1, 2],
'anchor_scales': [4, 8, 16, 32],
'batch_size': 1,
'bbox_inside_weights': [1.0, 1.0, 1.0, 1.0],
'bbox_normalize_means': [0.0, 0.0, 0.0, 0.0],
'bbox_normalize_stds': [0.1, 0.1, 0.2, 0.2],
'bbox_normalize_targets_precomputed': True,
'bg_threshold_high': 0.5,
'bg_threshold_low': 0.0,
'bias_decay': False,
'checkpoint_interval': 10000,
'crop_resize_with_max_pool': False,
'display': 20,
'display_interval': 100,
'double_bias': False,
'epoch_max': 100,
'epoch_start': 0,
'feat_stride': [16],
'fg_fraction': 0.25,
'fg_threshold': 0.5,
'has_rpn': True,
'learning_decay_gamma': 0.1,
'learning_decay_step': 3,
'learning_rate': 0.005,
'learning_weight_decay': 0.0005,
'max_num_gt_boxes': 50,
'max_size': 1000,
'momentum': 0.9,
'net': 'res101',
'optimizer': 'sgd',
'pixel_means': [[[102.9801, 115.9465, 122.7717]]],
'pooling_mode': 'align',
'pooling_size': 7,
'pre_trained_model_path': '/home/emcp/Dev/git/EMCP/faster-rcnn-pytorch-native/model_components/models_pretrained/resnet101_caffe.pth',
'proposal_method': 'gt',
'random_seed': 3,
'rpn_batch_size': 256,
'rpn_bbox_inside_weights': [1.0, 1.0, 1.0, 1.0],
'rpn_clobber_positives': False,
'rpn_fg_fraction': 0.5,
'rpn_min_size': 8,
'rpn_negative_overlap': 0.3,
'rpn_nms_thresh': 0.7,
'rpn_positive_overlap': 0.7,
'rpn_positive_weight': -1.0,
'rpn_post_nms_top_n': 2000,
'rpn_pre_nms_top_n': 12000,
'scales': [1394],
'testing': {'bbox_reg': True,
'check_epoch': 1,
'check_point': 477,
'check_session': 1,
'enable_visualization': True,
'has_rpn': True,
'max_size': 400,
'nms': 0.3,
'parallelization_mode': 0,
'rpn_post_nms_top_n': 1000,
'scales': [400]},
'trim_height': 600,
'trim_width': 600,
'truncated': False,
'use_all_gt': True,
'use_class_agnostic_regression': False,
'use_flipped': True,
'use_pretrained_net': True},
'pytorch_engine': {'enable_cuda': True,
'enable_multiple_gpus': False,
'enable_tfb': True,
'num_workers': 1,
'resume_checkpoint': False,
'resume_checkpoint_epoch': 1,
'resume_checkpoint_num': 0,
'resume_checkpoint_session': 1,
'session': 1},
'resnet': {'fixed_blocks': 1}}
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
.
/home/emcp/anaconda3/envs/pytorch_150/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
/opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
Loss is nan, stopping training
{'loss_classifier': tensor(2.9877, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(17.7084, device='cuda:0',
grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)}
Process finished with exit code 1
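As a stopgap I'm considering guarding the training step rather than letting it crash: skip any batch whose summed loss is non-finite and clip gradients so `loss_rpn_box_reg` can't blow up a later step. A rough sketch (the loss-dict shape follows the torchvision detection API; `max_norm=10.0` is an arbitrary value I picked, not anything from the tutorial):

```python
import torch

def train_step(model, optimizer, images, targets, max_norm=10.0):
    """One guarded training step: skip the update on a non-finite loss
    instead of aborting, and clip gradient norms before the optimizer
    step. Returns the loss value, or None if the batch was skipped."""
    optimizer.zero_grad()
    loss_dict = model(images, targets)   # dict of per-component losses
    loss = sum(loss_dict.values())
    if not torch.isfinite(loss):
        print('non-finite loss, skipping batch:', loss_dict)
        return None
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```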