Using FasterRCNN model for object detection (copy_if failed to synchronize: device-side assert triggered)

juanmed · June 21, 2019, 5:21pm

Hi,

I am a beginner in PyTorch. I am currently trying to use the recently released (torchvision 0.3) models for object detection.

I have read the examples here and here, but I am unable to make them work. Specifically, the examples do segmentation, and I only want to do object detection. I think that is why I cannot adapt the examples to my script.

The error I constantly get is:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "detector_totales.py", line 222, in <module>
    main()
  File "detector_totales.py", line 186, in main
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=1)
  File "/home/fer/git_clone/eleccionesGT2019/engine.py", line 30, in train_one_epoch
    loss_dict = model(images, targets)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torchvision/models/detection/roi_heads.py", line 534, in forward
    class_logits, box_regression, labels, regression_targets)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torchvision/models/detection/roi_heads.py", line 34, in fastrcnn_loss
    sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

I have tried creating the network following the documentation in h**ps://github.com/pytorch/vision/blob/master/torchvision/models/detection/faster_rcnn.py like this:

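import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

def get_mobilenet_model(num_classes):
    # Follow the example in 

    # Use only the convolutional features of MobileNetV2 as the backbone
    backbone = torchvision.models.mobilenet_v2(pretrained=True).features
    # FasterRCNN needs to know the number of output channels of the backbone
    backbone.out_channels = 1280

    # One feature map, with 5 anchor sizes x 3 aspect ratios per location
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),))
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0], output_size=7, sampling_ratio=2)

    model = FasterRCNN(backbone, num_classes=num_classes, rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    return model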

and also like this:

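from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a pretrained model and replace its box predictor head
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)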

I have 31 classes in total, but they do not start at 0 and are not sequential. I am not sure whether this can be an issue.
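
The assertion `t >= 0 && t < n_classes` in the log above fires when a target label falls outside the range the classification head expects. A minimal sketch of a check that could be run before training, assuming each image's labels are available as a plain list of ints (for a tensor, `labels.tolist()`). The helper name `find_bad_labels` and the sample ids are hypothetical and not part of torchvision:

```python
def find_bad_labels(all_labels, num_classes):
    """Return (sample_index, label) pairs that fall outside [0, num_classes).

    all_labels: one list of int class ids per image (hypothetical layout).
    num_classes: the value passed to the model, background included.
    """
    bad = []
    for sample_index, labels in enumerate(all_labels):
        for label in labels:
            if not 0 <= label < num_classes:
                bad.append((sample_index, label))
    return bad


# Example: raw ids that are neither zero-based nor sequential.
raw = [[3, 17], [42], [5, 5, 99]]
print(find_bad_labels(raw, num_classes=31))  # [(1, 42), (2, 99)]
```

Running the script on the CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, usually gives a more precise traceback for this kind of device-side assert.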

Does anyone have any idea what the issue could be? The full script is available at (h**ps://github.com/juanmed/eleccionesGT2019/blob/master/detector_totales.py). I would appreciate any help.

juanmed · June 23, 2019, 10:40am

I was able to get a little bit further. It seems the problem was with the class definitions. Originally the classes did not start at 0 and were not sequential. I redefined them to start from 0 and be sequential, and the previous error seems to be resolved.
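
A minimal sketch of such a remapping, assuming the raw class ids are plain ints. One assumption here differs from the description above: torchvision's detection models reserve label 0 for the background, so this sketch maps the raw ids to 1..N, and the model would then be built with `num_classes = N + 1`. The helper name `build_label_map` and the sample ids are hypothetical:

```python
def build_label_map(raw_ids):
    """Map arbitrary class ids to contiguous ids 1..N (0 is kept for background)."""
    unique_ids = sorted(set(raw_ids))
    return {raw: new for new, raw in enumerate(unique_ids, start=1)}


raw_ids = [3, 17, 42, 5, 99, 17, 3]
label_map = build_label_map(raw_ids)
remapped = [label_map[r] for r in raw_ids]
num_classes = len(label_map) + 1  # +1 for the background class

print(label_map)
print(remapped, num_classes)
```

Keeping the inverse of `label_map` around makes it possible to translate predictions back to the original ids at inference time.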

However, after trying to run I got a CUDA out of memory error, even on an NVIDIA RTX GPU with 24 GB of memory, which seems very strange since the model is based on MobileNet:

Traceback (most recent call last):
  File "detector_totales.py", line 230, in <module>
    main()
  File "detector_totales.py", line 166, in main
    model = model.to(device)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/home/jfmy/Repositories/eleccionesGT2019/.tse/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 23.62 GiB total capacity; 65.31 MiB already allocated; 60.75 MiB free; 20.69 MiB cached)

After trying to run on CPU, the model does start training and there is some output but a new error, which I do not understand, appears:

Traceback (most recent call last):
  File "detector_totales.py", line 230, in <module>
    main()
  File "detector_totales.py", line 196, in main
    evaluate(model, test_loader, device=device)
  File "/home/fer/git_clone/eleccionesGT2019/.tse/lib/python3.5/site-packages/torch/autograd/grad_mode.py", line 43, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/fer/git_clone/eleccionesGT2019/engine.py", line 78, in evaluate
    coco = get_coco_api_from_dataset(data_loader.dataset)
  File "/home/fer/git_clone/eleccionesGT2019/coco_utils.py", line 205, in get_coco_api_from_dataset
    return convert_to_coco_api(dataset)
  File "/home/fer/git_clone/eleccionesGT2019/coco_utils.py", line 155, in convert_to_coco_api
    image_id = targets["image_id"].item()
KeyError: 'image_id'

Any help on both errors will be appreciated.

juanmed · June 23, 2019, 11:37am

I was able to solve the second issue, the missing ‘image_id’ key. It was my error, and I noticed it after rereading the example here. I had only added the ‘boxes’ and ‘labels’ keys to the target dictionary, but the keys ‘image_id’, ‘area’ and ‘iscrowd’ are also necessary. I created them by following the same example.

Any help with the CUDA out of memory error would be welcome.
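
A minimal sketch of a target dictionary with all five keys, following the layout used in the torchvision detection tutorial. The helper names `target_fields` and `to_tensors` are hypothetical; boxes are assumed to be `[xmin, ymin, xmax, ymax]` in pixels, and `torch` is imported only inside `to_tensors` so the plain-Python part can be run on its own:

```python
def target_fields(boxes, labels, image_index):
    """Build the fields the COCO-style evaluation expects, as plain Python values."""
    if len(boxes) != len(labels):
        raise ValueError("boxes and labels must have the same length")
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    return {
        "boxes": boxes,
        "labels": labels,
        "image_id": [image_index],
        "area": areas,
        "iscrowd": [0] * len(boxes),  # assumption: no crowd annotations
    }


def to_tensors(fields):
    """Convert the plain fields to tensors with the dtypes used in the tutorial."""
    import torch  # local import: only needed when building real targets

    return {
        "boxes": torch.as_tensor(fields["boxes"], dtype=torch.float32),
        "labels": torch.as_tensor(fields["labels"], dtype=torch.int64),
        "image_id": torch.tensor(fields["image_id"]),
        "area": torch.as_tensor(fields["area"], dtype=torch.float32),
        "iscrowd": torch.as_tensor(fields["iscrowd"], dtype=torch.int64),
    }


fields = target_fields([[10, 20, 50, 80], [0, 0, 5, 5]], [1, 2], image_index=7)
print(fields["area"])
```

In a dataset's `__getitem__`, the result of `to_tensors(fields)` would be returned together with the image.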

juanmed · June 23, 2019, 12:59pm

After running training for 100 epochs, all IoU results are 0.0 and -1.0, which seems quite strange. I am using learning rate = 0.005, momentum = 0.5 and weight_decay = 0.005, which, according to some posts I read about such results (AP = 0.0), might fix them. Could anything in the model definition be leading to these results? Any comments will be appreciated.

Test:  [0/1]  eta: 0:00:27  model_time: 26.7939 (26.7939)  evaluator_time: 0.0214 (0.0214)  time: 27.9693  data: 1.1538  max mem: 0
Test: Total time: 0:00:28 (28.1102 s / it)
Averaged stats: model_time: 26.7939 (26.7939)  evaluator_time: 0.0214 (0.0214)
Accumulating evaluation results...
DONE (t=0.06s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000

juanmed · June 24, 2019, 10:56am

I think I might have some clues about the cause of the CUDA out of memory error. My dataset has several bounding boxes (>20) per image and, according to this and this, calculating IoU on the GPU is memory expensive.

In those issues, they suggest reducing the batch size. I thought my batch size of 2 was low enough, but after I changed it to 1, the training process did start and has not stopped for 70 epochs. I wanted to implement the suggestion in the thread on moving the IoU calculation to the CPU (which can severely slow down training).

Does anyone know where the IoU calculation is performed for the detection models? Any comments will be appreciated.
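
For reference, torchvision computes pairwise IoU in `torchvision.ops.boxes.box_iou`, which the RPN and the RoI heads use to match anchors and proposals against ground-truth boxes. The plain-Python sketch below performs the same kind of computation (it is not the library's code) and shows why memory grows with the number of boxes: the result has one entry per (ground-truth box, candidate box) pair. Boxes are assumed to be `[xmin, ymin, xmax, ymax]`, and the sample boxes are made up:

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def pairwise_iou(boxes_a, boxes_b):
    """Return a len(boxes_a) x len(boxes_b) matrix of IoU values."""
    result = []
    for a in boxes_a:
        row = []
        for b in boxes_b:
            # Width and height of the overlap, clamped at zero when disjoint.
            inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = inter_w * inter_h
            union = box_area(a) + box_area(b) - inter
            row.append(inter / union if union > 0 else 0.0)
        result.append(row)
    return result


gt = [[0, 0, 10, 10]]
candidates = [[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30]]
print(pairwise_iou(gt, candidates))
```

With G ground-truth boxes and A anchors, the matrix has G x A entries per image, so both a large number of boxes and a large number of anchors increase its size.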

juanmed · June 27, 2019, 3:28am

I was able to work around the CUDA out of memory error. I did not solve it, since I have not yet tried moving the IoU calculation to the CPU, but at least now I can train with batch_size > 1.

In the end I simply reduced the number of anchors created from 5 to 3, and reduced the size of the input images. This allowed me to increase the batch size from 1 to 8.
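
A rough sketch of why both changes reduce memory, assuming a single feature map with stride 32 (the total stride of the MobileNetV2 `features` backbone) and that the number of anchors per location is the number of sizes times the number of aspect ratios. It is also assumed here that "from 5 to 3" refers to the anchor sizes; the image dimensions are illustrative and not taken from the thread:

```python
def total_anchors(img_h, img_w, n_sizes, n_ratios, stride=32):
    """Approximate anchor count for one image on a single feature map."""
    feat_h = -(-img_h // stride)  # ceiling division
    feat_w = -(-img_w // stride)
    return feat_h * feat_w * n_sizes * n_ratios


before = total_anchors(800, 1333, n_sizes=5, n_ratios=3)
after = total_anchors(400, 667, n_sizes=3, n_ratios=3)
print(before, after)
```

In torchvision, the anchor sizes are set through `AnchorGenerator(sizes=..., aspect_ratios=...)` and the input resolution through the `min_size` and `max_size` arguments of `FasterRCNN`.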

While doing this, I needed to resize both input images and targets (bounding box, area). Luckily, the detection models released with torchvision 0.3 already include resizing transforms that take care of both images and targets. Please take a look at this thread.
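
A minimal sketch of what resizing the targets involves, assuming boxes in `[xmin, ymin, xmax, ymax]` format and sizes given as `(height, width)`. The helper name `resize_boxes_and_area` is hypothetical; it only illustrates the arithmetic and is not torchvision's transform:

```python
def resize_boxes_and_area(boxes, original_size, new_size):
    """Scale boxes from original_size to new_size; both sizes are (height, width)."""
    ratio_h = new_size[0] / original_size[0]
    ratio_w = new_size[1] / original_size[1]
    resized = [
        [x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h]
        for x1, y1, x2, y2 in boxes
    ]
    # Areas are recomputed from the resized boxes so they stay consistent.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in resized]
    return resized, areas


boxes = [[100, 200, 300, 400]]
resized, areas = resize_boxes_and_area(boxes, original_size=(800, 1000), new_size=(400, 500))
print(resized, areas)
```

Halving both image dimensions halves every box coordinate and divides each area by four.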

I will close this issue now.

Pcamellon · December 12, 2021, 11:54pm

Hi! The same for me. Can you share your code or point me to a good tutorial?