QAT: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

Hello everyone,
I am trying to quantize RetinaNet with QAT. First I wanted to quantize only some parts of the network, and only then the whole net. In order to save time I am using Detectron2, but I suppose this issue is related to PyTorch.
First of all, I tried to quantize RetinaNetHead (see the original one here - class RetinaNetHead: original retinanet in detectron2).
My implementation of RetinaNetHead, based on the original one and following the quantization tutorial:

  1. Quant and DeQuant stubs, and the corresponding forward
    q_retinanet.py:
import math
from typing import List

import torch
from torch import nn
from torch.quantization import QuantStub, DeQuantStub

from detectron2.layers import ShapeSpec
from detectron2.modeling.anchor_generator import build_anchor_generator


class Q_RetinaNetHead(nn.Module):
    """
    The head used in RetinaNet for object classification and box regression.
    It has two subnets for the two tasks, with a common structure but separate parameters.
    """

    def __init__(self, cfg, input_shape: List[ShapeSpec]):
        super().__init__()
        # fmt: off
        in_channels = input_shape[0].channels
        num_classes = cfg.MODEL.RETINANET.NUM_CLASSES
        num_convs = cfg.MODEL.RETINANET.NUM_CONVS
        prior_prob = cfg.MODEL.RETINANET.PRIOR_PROB
        num_anchors = build_anchor_generator(cfg, input_shape).num_cell_anchors
        # fmt: on
        assert (
                len(set(num_anchors)) == 1
        ), "Using different number of anchors between levels is not currently supported!"
        num_anchors = num_anchors[0]

        cls_subnet = []
        # cls_subnet.append(QuantStub())
        bbox_subnet = []
        for _ in range(num_convs):
            cls_subnet.append(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
            )
            cls_subnet.append(nn.ReLU())
            bbox_subnet.append(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
            )
            bbox_subnet.append(nn.ReLU())

        # cls_subnet.append(DeQuantStub())
        self.quant = QuantStub()  # added line
        self.cls_subnet = nn.Sequential(*cls_subnet)
        # self.cls_dequant = DeQuantStub() #added line
        self.bbox_subnet = nn.Sequential(*bbox_subnet)
        self.cls_score = nn.Conv2d(
            in_channels, num_anchors * num_classes, kernel_size=3, stride=1, padding=1
        )
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, stride=1, padding=1)

        self.dequant = DeQuantStub()  # added line
        # Initialization

        for modules in [self.cls_subnet, self.bbox_subnet, self.cls_score, self.bbox_pred]:
            for layer in modules.modules():
                if isinstance(layer, nn.Conv2d):
                    torch.nn.init.normal_(layer.weight, mean=0, std=0.01)
                    torch.nn.init.constant_(layer.bias, 0)

        # Use prior in model initialization to improve stability
        bias_value = -(math.log((1 - prior_prob) / prior_prob))
        torch.nn.init.constant_(self.cls_score.bias, bias_value)

    def forward(self, features):
        """
        Arguments:
            features (list[Tensor]): FPN feature map tensors in high to low resolution.
                Each tensor in the list correspond to different feature levels.

        Returns:
            logits (list[Tensor]): #lvl tensors, each has shape (N, AxK, Hi, Wi).
                The tensor predicts the classification probability
                at each spatial position for each of the A anchors and K object
                classes.
            bbox_reg (list[Tensor]): #lvl tensors, each has shape (N, Ax4, Hi, Wi).
                The tensor predicts 4-vector (dx,dy,dw,dh) box
                regression values for every anchor. These values are the
                relative offset between the anchor and the ground truth box.
        """
        logits = []
        bbox_reg = []
        for feature in features:
            logits.append(
                self.dequant(self.cls_score(self.cls_subnet(self.quant(feature)))))  # added: self.quant() / self.dequant()
            bbox_reg.append(self.dequant(self.bbox_pred(self.bbox_subnet(self.quant(feature)))))
        return logits, bbox_reg
  2. Fuse modules and configuration
    train_net.py:
trainer.model.head.train()
trainer.model.head.qconfig = torch.quantization.get_default_qconfig('fbgemm')
modules_to_fuse = [
    ['cls_subnet.0', 'cls_subnet.1'], ['cls_subnet.2', 'cls_subnet.3'],
    ['cls_subnet.4', 'cls_subnet.5'], ['cls_subnet.6', 'cls_subnet.7'],
    ['bbox_subnet.0', 'bbox_subnet.1'], ['bbox_subnet.2', 'bbox_subnet.3'],
    ['bbox_subnet.4', 'bbox_subnet.5'], ['bbox_subnet.6', 'bbox_subnet.7'],
]
torch.quantization.fuse_modules(trainer.model.head, modules_to_fuse, inplace=True)
torch.quantization.prepare_qat(trainer.model.head, inplace=True)

do_train(cfg, trainer)

trainer.model.head.eval()
print("Convert->")
torch.quantization.convert(trainer.model.head, inplace=True)
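
As a quick sanity check that fuse_modules and prepare_qat took effect before training, one of the fused blocks can be printed (a minimal sketch, using the trainer object above):

# Sketch: after fuse_modules + prepare_qat, the first Conv+ReLU pair should
# show up as a single fused ConvReLU2d module with the qconfig attached,
# instead of a plain nn.Conv2d followed by nn.ReLU.
print(trainer.model.head.cls_subnet[0])
print(trainer.model.head.qconfig)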

The training process completes successfully, but the last line with convert gives me an error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I have tried everything, even:

cuda = torch.device('cuda:0') 
trainer.model.to(cuda)

I also checked all the tensors, and they are all on cuda (please see the Q_RetinaNetHead file after training); a quick check is sketched below.
The entire RetinaNet architecture before training can be seen in the Q_RetinaNet file.
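
For reference, roughly how the device check can be done (a simple sketch, nothing Detectron2-specific):

# Sketch: collect the devices of all parameters and buffers of the head
devices = {p.device for p in trainer.model.head.parameters()}
devices |= {b.device for b in trainer.model.head.buffers()}
print(devices)  # expected: {device(type='cuda', index=0)}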

My questions are:

  1. How do I get rid of this error?
  2. Am I right that the QAT process was done successfully, and that convert is only like an export of the already trained model?

Best regards,
yayapa

The files Q_RetinaNet and Q_RetinaNetHead can be found here as PDFs.

  1. Can you try to move your model to CPU and see if that fixes the error? Currently, quantized kernels are only supported on CPU (see the sketch after this list).
  2. Yes, that is correct.
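
For example, a minimal sketch of that suggestion using the trainer object from above (illustrative only, not the exact code from the post):

# Sketch: move the QAT-prepared head to CPU before calling convert,
# since the quantized kernels produced by convert currently run on CPU only
trainer.model.head.eval()
trainer.model.head.to(torch.device('cpu'))
torch.quantization.convert(trainer.model.head, inplace=True)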

Thank you for the answer! Yes, if I move it to CPU (model.to(torch.device('cpu'))), it does work! Is there any workaround, or anything I can do, to transfer it to GPU?

Great to hear. Currently the convert function only works on CPU, because we do not have support for running the quantized kernels on GPU.
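
In practice this means both the converted head and its inputs have to live on CPU. A minimal sketch (the features list below is a hypothetical stand-in for the FPN outputs, not a variable from the post):

# Sketch: run the converted (int8) head on CPU with CPU inputs
cpu_features = [f.detach().to('cpu') for f in features]  # 'features' is a placeholder
with torch.no_grad():
    logits, bbox_reg = trainer.model.head(cpu_features)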

Thank you for the explanation. Will this feature appear in 1.8.0, or maybe already in the nightly build?

We are not planning to work on quantized kernel support on CUDA for v1.8, but we definitely welcome OSS contributions!

@Vasiliy_Kuznetsov Is QAT supported for CUDA deployment?