Hi,
For the last couple of weeks I have been struggling to debug a new problem when training my detection model. A couple of weeks ago training worked fine, but now I suddenly get an error I have never seen before.
I build my model and run training roughly like this, but still hit the error:
```python
# Model
torch.set_default_dtype(torch.float)
backbone = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
backbone.roi_heads.box_predictor.cls_score.out_features = len(classes)
backbone.roi_heads.box_predictor.bbox_pred.out_features = 4 * len(classes)
```
```python
# Training
for epoch in range(epochs):
    train_one_epoch(net, optimizer, train_loader, device, epoch, print_freq=10)
    evaluate(net, test_loader, device=device)

print("Time for Total Training {:0.2f}".format(time.time() - start_time))
return net
```
I get an error that looks like this:
```
     26 for epoch in range(epochs):
---> 27     train_one_epoch(net, optimizer, train_loader, device, epoch, print_freq=10)
     28     evaluate(net, test_loader, device=device)
     29

/content/engine.py in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
     44     if not math.isfinite(loss_value):
     45         print("Loss is {}, stopping training".format(loss_value))
---> 46         print(loss_dict_reduced)
     47         sys.exit(1)
     48

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    253             create_graph=create_graph,
    254             inputs=inputs)
--> 255         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    256
    257     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable.execution_engine.run_backward(
    148         tensors, grad_tensors, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150
    151

RuntimeError: Found dtype Double but expected Float.
```
I also have my boxes and labels encoded like this:

```python
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.as_tensor(labels, dtype=torch.int64)
```

and my images are float tensors.
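While debugging I noticed that a tensor created from a numpy array keeps numpy's default `float64` dtype even with `torch.set_default_dtype(torch.float)` set, which would match the "Found dtype Double but expected Float" message. A minimal standalone check (not my actual dataset code):

```python
import numpy as np
import torch

torch.set_default_dtype(torch.float)

img = np.random.rand(3, 64, 64)       # numpy defaults to float64
t = torch.as_tensor(img)              # dtype is inherited from the array
print(t.dtype)                        # torch.float64, i.e. Double

t = torch.as_tensor(img, dtype=torch.float32)  # explicit cast avoids it
print(t.dtype)                        # torch.float32
```

So could a `float64` image (or target tensor) slipping through somewhere be causing this, even though I cast boxes explicitly?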
How do I get rid of this runtime error?
All of my code, with my dataset class, imported libraries, and training loop, can be checked over here.
Thanks for the help,
Sarthak