QAT: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

Hello everyone,
I am trying to quantize RetinaNet with QAT. First I wanted to quantize only some parts of the network, and only then the whole net. In order to save time I am using Detectron2, but I suppose this issue is related to PyTorch.
First of all, I tried to quantize RetinaNetHead (see the original one here - class RetinaNetHead: original retinanet in detectron2).
My implementation of RetinaNetHead, based on the original one and following the quantization tutorial:

  1. Quant and DeQuant stubs, and the corresponding forward
    q_retinanet.py:
import math
from typing import List

import torch
from torch import nn
from torch.quantization import QuantStub, DeQuantStub

from detectron2.layers import ShapeSpec
from detectron2.modeling.anchor_generator import build_anchor_generator


class Q_RetinaNetHead(nn.Module):
    """
    The head used in RetinaNet for object classification and box regression.
    It has two subnets for the two tasks, with a common structure but separate parameters.
    """

    def __init__(self, cfg, input_shape: List[ShapeSpec]):
        super().__init__()
        # fmt: off
        in_channels = input_shape[0].channels
        num_classes = cfg.MODEL.RETINANET.NUM_CLASSES
        num_convs = cfg.MODEL.RETINANET.NUM_CONVS
        prior_prob = cfg.MODEL.RETINANET.PRIOR_PROB
        num_anchors = build_anchor_generator(cfg, input_shape).num_cell_anchors
        # fmt: on
        assert (
                len(set(num_anchors)) == 1
        ), "Using different number of anchors between levels is not currently supported!"
        num_anchors = num_anchors[0]

        cls_subnet = []
        # cls_subnet.append(QuantStub())
        bbox_subnet = []
        for _ in range(num_convs):
            cls_subnet.append(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
            )
            cls_subnet.append(nn.ReLU())
            bbox_subnet.append(
                nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
            )
            bbox_subnet.append(nn.ReLU())

        # cls_subnet.append(DeQuantStub())
        self.quant = QuantStub()  # added line
        self.cls_subnet = nn.Sequential(*cls_subnet)
        # self.cls_dequant = DeQuantStub() #added line
        self.bbox_subnet = nn.Sequential(*bbox_subnet)
        self.cls_score = nn.Conv2d(
            in_channels, num_anchors * num_classes, kernel_size=3, stride=1, padding=1
        )
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, stride=1, padding=1)

        self.dequant = DeQuantStub()  # added line
        # Initialization

        for modules in [self.cls_subnet, self.bbox_subnet, self.cls_score, self.bbox_pred]:
            for layer in modules.modules():
                if isinstance(layer, nn.Conv2d):
                    torch.nn.init.normal_(layer.weight, mean=0, std=0.01)
                    torch.nn.init.constant_(layer.bias, 0)

        # Use prior in model initialization to improve stability
        bias_value = -(math.log((1 - prior_prob) / prior_prob))
        torch.nn.init.constant_(self.cls_score.bias, bias_value)

    def forward(self, features):
        """
        Arguments:
            features (list[Tensor]): FPN feature map tensors in high to low resolution.
                Each tensor in the list correspond to different feature levels.

        Returns:
            logits (list[Tensor]): #lvl tensors, each has shape (N, AxK, Hi, Wi).
                The tensor predicts the classification probability
                at each spatial position for each of the A anchors and K object
                classes.
            bbox_reg (list[Tensor]): #lvl tensors, each has shape (N, Ax4, Hi, Wi).
                The tensor predicts 4-vector (dx,dy,dw,dh) box
                regression values for every anchor. These values are the
                relative offset between the anchor and the ground truth box.
        """
        logits = []
        bbox_reg = []
        for feature in features:
            logits.append(
                self.dequant(self.cls_score(self.cls_subnet(self.quant(feature)))))  # added: self.quant() / self.dequant()
            bbox_reg.append(self.dequant(self.bbox_pred(self.bbox_subnet(self.quant(feature)))))
        return logits, bbox_reg
  2. Fuse modules and configuration
    train_net.py:
trainer.model.head.train()
trainer.model.head.qconfig = torch.quantization.get_default_qconfig('fbgemm')
modules_to_fuse = [
    ['cls_subnet.0', 'cls_subnet.1'], ['cls_subnet.2', 'cls_subnet.3'],
    ['cls_subnet.4', 'cls_subnet.5'], ['cls_subnet.6', 'cls_subnet.7'],
    ['bbox_subnet.0', 'bbox_subnet.1'], ['bbox_subnet.2', 'bbox_subnet.3'],
    ['bbox_subnet.4', 'bbox_subnet.5'], ['bbox_subnet.6', 'bbox_subnet.7'],
]
torch.quantization.fuse_modules(trainer.model.head, modules_to_fuse, inplace=True)
torch.quantization.prepare_qat(trainer.model.head, inplace=True)

do_train(cfg, trainer)

trainer.model.head.eval()
print("Convert->")
torch.quantization.convert(trainer.model.head, inplace=True)
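
As a quick sanity check that fuse_modules and prepare_qat took effect before training, one of the fused blocks can be printed (a minimal sketch, using the trainer object above):

# Sketch: after fuse_modules + prepare_qat, the first Conv+ReLU pair should
# show up as a single fused ConvReLU2d module with the qconfig attached,
# instead of a plain nn.Conv2d followed by nn.ReLU.
print(trainer.model.head.cls_subnet[0])
print(trainer.model.head.qconfig)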

The training process completes successfully, but the last line with convert gives me an error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I have tried everything, even:

cuda = torch.device('cuda:0') 
trainer.model.to(cuda)

I also checked all the tensors, and they are all on cuda (please see the Q_RetinaNetHead file after training); a quick check is sketched below.
The entire RetinaNet architecture before training can be seen in the Q_RetinaNet file.
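
For reference, roughly how the device check can be done (a simple sketch, nothing Detectron2-specific):

# Sketch: collect the devices of all parameters and buffers of the head
devices = {p.device for p in trainer.model.head.parameters()}
devices |= {b.device for b in trainer.model.head.buffers()}
print(devices)  # expected: {device(type='cuda', index=0)}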

My questions are:

  1. How do I get rid of this error?
  2. Am I right that the QAT process was done successfully, and that convert is only like an export of the already trained model?

Best regards,
yayapa

The files Q_RetinaNet and Q_RetinaNetHead can be found here as PDFs.

  1. Can you try to move your model to CPU and see if that fixes the error? Currently, quantized kernels are only supported on CPU (see the sketch after this list).
  2. Yes, that is correct.
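
For example, a minimal sketch of that suggestion using the trainer object from above (illustrative only, not the exact code from the post):

# Sketch: move the QAT-prepared head to CPU before calling convert,
# since the quantized kernels produced by convert currently run on CPU only
trainer.model.head.eval()
trainer.model.head.to(torch.device('cpu'))
torch.quantization.convert(trainer.model.head, inplace=True)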

Thank you for the answer! Yes, if I move it to CPU (model.to(torch.device('cpu'))), it does work! Is there any workaround, or anything I can do, to transfer it to GPU?

Great to hear. Currently the convert function only works on CPU, because we do not have support for running the quantized kernels on GPU.
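
In practice this means both the converted head and its inputs have to live on CPU. A minimal sketch (the features list below is a hypothetical stand-in for the FPN outputs, not a variable from the post):

# Sketch: run the converted (int8) head on CPU with CPU inputs
cpu_features = [f.detach().to('cpu') for f in features]  # 'features' is a placeholder
with torch.no_grad():
    logits, bbox_reg = trainer.model.head(cpu_features)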

Thank you for the explanation. Will this feature appear in 1.8.0, or maybe already in the nightly build?

We are not planning to work on quantized kernel support on CUDA for v1.8, but we definitely welcome OSS contributions!

@Vasiliy_Kuznetsov Is QAT supported for CUDA deployment?