Torch.jit.trace devices

Hey, I tried to save a pretrained model with torch.jit.trace and it says that not all tensors are on the same device (CUDA or CPU). But I don’t understand which tensors it means, and how can I fix this, please?
Thanks!

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-22-d126d5b7e7f6> in <module>()
      2 
      3 input_tensor = torch.rand(1,3,224,224)
----> 4 script_model = torch.jit.trace(model, input_tensor)
      5 script_model.save("models/fRCNN_resnet50.pt")

14 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    438                             _pair(0), self.dilation, self.groups)
    439         return F.conv2d(input, weight, bias, self.stride,
--> 440                         self.padding, self.dilation, self.groups)
    441 
    442     def forward(self, input: Tensor) -> Tensor:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument weight in method wrapper_thnn_conv2d_forward)
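For context, the message means the convolution’s weight and its input sit on different devices: the model’s parameters are on cuda:0 while the example input passed to torch.jit.trace lives on the CPU. A minimal sketch of how to confirm which tensors are involved, assuming model and input_tensor are the objects from the trace call above:

print(next(model.parameters()).device)  # device of the weights, e.g. cuda:0
print(input_tensor.device)              # device of the example input, e.g. cpu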

Is this a new issue, or a duplicate of the post here where you might have forgotten to follow up on the last answer?

Hello, it’s a new one for me. I tried other code, with Google Colab this time. The error is shorter than the previous one. Do you think it’s the same thing? And do you know if there is a solution?
Thank you!

It’s hard to tell whether the issues you are seeing are the same, as I wasn’t able to reproduce it using your previous code snippet and would need more information on how to reproduce the error.

I built my model with this code, and I load a model I already saved with torch.save(model, path_to_model).

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

CUDA_LAUNCH_BLOCKING = "1"  # note: a plain assignment like this does not set the actual environment variable

torch.set_default_tensor_type(torch.DoubleTensor)  # newly created tensors default to float64
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
print("device is ", device)

num_classes = 2 

# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=10,
                                               gamma=0.5)

model.load_state_dict(torch.load(path_to_model))

model.eval()

input_tensor = torch.rand(1,3,224,224)
script_model = torch.jit.trace(model, input_tensor)
script_model.save("models/fRCNN_resnet50.pt")
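Worth noting in the snippet above: model is moved to device, but input_tensor is created on the CPU, which matches the error message. A minimal sketch of a device-consistent version of the last three lines (an assumed fix for the device mismatch only; tracing a detection model can still fail for other reasons):

model.eval()
# create the example input on the same device the model lives on
input_tensor = torch.rand(1, 3, 224, 224, device=device)
script_model = torch.jit.trace(model, input_tensor)
script_model.save("models/fRCNN_resnet50.pt")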

If this is not enough, what should I send you so that you can reproduce this error?
Thanks!

Hello @ptrblck, I don’t know if it helps, but I tried to make my code cleaner, so I built a class that builds every object automatically; I just pass the JSON annotations as input. So I copy/pasted my code that already worked and added all the “self.variable” prefixes. But now that I launch the training, I get the same issue as last time, with the two devices found, when I try to save the model with torch.jit.trace.

Here are the steps of my object detection code:
I defined a dataset class called ToolDataset (stored as self.dataset).
In this first class I defined my input (the image) and my output (the target, which is a dict with bboxes, labels, area, …).
Then I built a data loader, and I used the train_one_epoch function from the engine library. I pass this function my model (a Faster R-CNN), my data loader, and the device, which is cuda:0 (I printed it). The function iterates over my data loader, builds a list of images and a list of targets, and moves the values of each list to the right device.
Then it calls model(images, targets), and at this step I get the error (I pasted it at the end of this message).
I get the error even though every tensor (my images and every value of my target dictionary) returns True for tensor.is_cuda. So I really don’t understand why the error says I also have a CPU device. Here are the code and the error:

This is the code of the train function of my class:

    def train(self, num_epoch = 10, gpu = True):            
        if gpu :
            CUDA_LAUNCH_BLOCKING="1"
            #torch.set_default_tensor_type(torch.FloatTensor) 
            model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
            use_cuda = torch.cuda.is_available()
            device = torch.device("cuda:0" if use_cuda else "cpu")
            model.to(device)
            if self.multi_object_detection == False : 
                num_classes = 2 # ['Tool', 'background']
            else : 
                print("need to set a multi object detection code")

            in_features = model.roi_heads.box_predictor.cls_score.in_features
            model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
            params_name_to_update = []
            params_to_update = []
            for name,param in model.named_parameters():
                if name in list_param_to_not_update : 
                    param.requires_grad = False
                if param.requires_grad == True :
                    params_name_to_update.append(name)
                    params_to_update.append(param)
            model_parameters = filter(lambda p: p.requires_grad, model.parameters())
            #params = sum([np.prod(p.size()) for p in model_parameters])
            params = [p for p in model.parameters() if p.requires_grad]

            
            optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
            lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
            gc.collect()
            num_epochs = 5
            FILE_model_dict_gpu = "model_state_dict__gpu_lab2_and_lab7_5epoch.pth"
            list_of_list_losses = []
            print("device = ", device)
            for epoch in tqdm(range(num_epochs)):

                # Train for one epoch, printing every 10 iterations
                train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
                list_of_list_losses.append(list_losses)

                lr_scheduler.step()
                print("lr after update : ", lr_scheduler)
                torch.cuda.empty_cache()
                gc.collect()
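One detail that may matter in the train function above: model.to(device) runs before box_predictor is replaced, so the parameters of the new FastRCNNPredictor head are created on the CPU and never moved. That would explain a failure inside the box predictor even when every input tensor is on CUDA. A quick check, sketched with the model and device variables from the snippet above:

# list every parameter or buffer that is not on the expected device
for name, param in model.named_parameters():
    if param.device != device:
        print("parameter on wrong device:", name, param.device)
for name, buf in model.named_buffers():
    if buf.device != device:
        print("buffer on wrong device:", name, buf.device)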

This is the code of my function train_one_epoch:

    for i, values in tqdm(enumerate(metric_logger.log_every(data_loader, print_freq, header))):
        images, targets = values
        for image in images : 
            print("before the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        images = list(image.to(device, dtype=torch.float) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        #images = [image.cuda() for image in images]
        for image in images : 
            print("after the to(device) operation, image.is_cuda = {}".format(image.is_cuda))
        for target in targets :
            for t, dict_value in target.items():
                print("after the to(device) operation, dict_value.is_cuda = {}".format(dict_value.is_cuda))

        # Feed the training samples to the model and compute the losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
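A compact way to assert the same thing for the inputs right before the forward pass, sketched with the images, targets, and device variables used above:

# fail fast if any input tensor is still on the wrong device
assert all(img.device == device for img in images), "an image is not on the target device"
assert all(v.device == device for t in targets for v in t.values()), "a target value is not on the target device"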

and this is my error:

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-108-51a35da5b1fe> in <module>
----> 1 class_model.train()

<ipython-input-106-380d5811e994> in train(self, num_epoch, gpu)
    138 
    139                 # Train for one epoch, printing every 10 iterations
--> 140                 train_his_, list_losses, list_losses_dict = train_one_epoch(model, optimizer, self.data_loader, device, epoch, print_freq=10)
    141                 list_of_list_losses.append(list_losses)
    142                 # Compute losses over the validation set

<ipython-input-105-ade37d0481f0> in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
    517 
    518         # Feed the training samples to the model and compute the losses
--> 519         loss_dict = model(images, targets)
    520         losses = sum(loss for loss in loss_dict.values())
    521 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     95             features = OrderedDict([('0', features)])
     96         proposals, proposal_losses = self.rpn(images, features, targets)
---> 97         detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
     98         detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
     99 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py in forward(self, features, proposals, image_shapes, targets)
    752         box_features = self.box_roi_pool(features, proposals, image_shapes)
    753         box_features = self.box_head(box_features)
--> 754         class_logits, box_regression = self.box_predictor(box_features)
    755 
    756         result: List[Dict[str, torch.Tensor]] = []

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/faster_rcnn.py in forward(self, x)
    280             assert list(x.shape[2:]) == [1, 1]
    281         x = x.flatten(start_dim=1)
--> 282         scores = self.cls_score(x)
    283         bbox_deltas = self.bbox_pred(x)
    284 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self, input)
     94 
     95     def forward(self, input: Tensor) -> Tensor:
---> 96         return F.linear(input, self.weight, self.bias)
     97 
     98     def extra_repr(self) -> str:

~/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1845     if has_torch_function_variadic(input, weight):
   1846         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1847     return torch._C._nn.linear(input, weight, bias)
   1848 
   1849 

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)

Thanks for your help! If you need any other information, let me know what I should share.

Thanks for pinging and sorry for the late reply, as I thought I’d already answered.
It seems that the previously posted code already raises the error when trying to torch.jit.trace the model (not only during the saving), while torch.jit.script seems to work.
Could you verify it and if so, post an issue on GitHub?
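For reference, a minimal sketch of the scripting path suggested here, assuming model is the eval-mode Faster R-CNN from the earlier snippets:

# torchvision's detection models are written to be scriptable, so
# torch.jit.script keeps their data-dependent control flow, which a
# plain trace cannot capture
model.eval()
script_model = torch.jit.script(model)
script_model.save("models/fRCNN_resnet50.pt")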

Thanks! The code runs! But the model I saved thanks to this solution doesn’t work in my C++ code. I get this error:

QQmlApplicationEngine failed to load component
qrc:/main.qml:-1 No such file or directory

[2021-07-13 17:25:03.218873] [0x00007f35f750e000] [info]    Starting a new event log file...
[2021-07-13 17:25:03.218951] [debug] [/home/nil/ws/maestro/libMoonVision/framework/videograbber.cpp] [68] []Video FPS for '/maestroData/chole.mp4' is 30,000000
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown type name 'NoneType':
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 11
  image_std : List[float]
  size_divisible : int
  fixed_size : NoneType
               ~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchvision.models.detection.transform.GeneralizedRCNNTransform,
    images: List[Tensor],

Aborted (core dumped)

And on another note, I still have the issue mentioned in my previous message (RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)) when I call model(images, targets).
I printed images and targets and got:

images = [tensor([[[0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         [0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         [0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.0078, 0.0078, 0.0078,  ..., 0.0118, 0.0118, 0.0118],
         [0.0235, 0.0235, 0.0235,  ..., 0.0235, 0.0235, 0.0235],
         [0.0353, 0.0353, 0.0353,  ..., 0.0314, 0.0314, 0.0314]],

        [[0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         [0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         [0.0078, 0.0078, 0.0078,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.0078, 0.0078, 0.0078,  ..., 0.0039, 0.0039, 0.0039],
         [0.0235, 0.0235, 0.0235,  ..., 0.0157, 0.0157, 0.0157],
         [0.0353, 0.0353, 0.0353,  ..., 0.0235, 0.0235, 0.0235]],

        [[0.0078, 0.0078, 0.0078,  ..., 0.0118, 0.0118, 0.0118],
         [0.0078, 0.0078, 0.0078,  ..., 0.0118, 0.0118, 0.0118],
         [0.0078, 0.0078, 0.0078,  ..., 0.0118, 0.0118, 0.0118],
         ...,
         [0.0078, 0.0078, 0.0078,  ..., 0.0078, 0.0078, 0.0078],
         [0.0235, 0.0235, 0.0235,  ..., 0.0196, 0.0196, 0.0196],
         [0.0353, 0.0353, 0.0353,  ..., 0.0275, 0.0275, 0.0275]]],
       device='cuda:0')]
targets = [{'boxes': tensor([[1118.8964,    0.0000, 1368.9186,  399.3243],
        [1043.0958,  111.4863, 1332.4319,  426.1295]], device='cuda:0',
       dtype=torch.float64), 'labels': tensor([1, 1], device='cuda:0'), 'index': tensor([311], device='cuda:0'), 'area': tensor([99839.9404, 91037.6485], device='cuda:0', dtype=torch.float64), 'iscrowd': tensor([0], device='cuda:0')}]

Let me know if you need any other information. Thank you so much!