Multi-GPU inference with TorchScript model is slow

I traced some classification models from torchvision and then wanted to apply DataParallel for multi-GPU inference. No error is thrown, but when more than one GPU is passed as device_ids to DataParallel, inference with the TorchScript model is much slower than with the regular PyTorch model (5x-10x slower). This happens even with batch_size=1, where effectively only one GPU is used.

If the batch size is set to 1, only a single GPU will be used, as the data cannot be chunked and sent to each device.
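
To illustrate the chunking point, here is a minimal pure-Python sketch. It assumes DataParallel splits the input along the batch dimension the way torch.chunk does (equal pieces of size ceil(batch_size / num_devices), with a smaller last piece). The helper name chunk_sizes is hypothetical and only models the split sizes; it does not touch any GPU.

```python
import math


def chunk_sizes(batch_size, num_devices):
    """Return the per-device batch sizes under a torch.chunk-style split.

    Assumption: the batch is cut into pieces of size
    ceil(batch_size / num_devices); fewer pieces than devices can result.
    """
    if batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch_size and num_devices must be positive")
    per_chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(1, 2))  # [1]       -> only one device receives data
print(chunk_sizes(8, 2))  # [4, 4]    -> both devices receive data
print(chunk_sizes(5, 4))  # [2, 2, 1] -> three of the four devices receive data
```

Under this assumption, a batch of size 1 always yields a single chunk, so the second device has nothing to process no matter how many device_ids are passed.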

Yes, I know that. I was just giving an example of the issue I was encountering with DataParallel: inference with the TorchScript model is significantly slower than with the standard PyTorch model. I tested several settings, varying batch_size and the number of GPU device_ids passed to DataParallel, across a few different architectures. With 1 device_id, or without DataParallel at all, inference speed was similar for the TorchScript and standard models. With more than 1 device_id, the TorchScript model became dramatically slower than the standard model under the same settings (5x-10x slower). I brought up the batch_size=1 example because it was curious: with 2 device_ids passed, only 1 GPU is effectively used, yet the TorchScript model was still significantly slower than the regular model. With 1 device_id passed, it was similar in speed to the regular model.
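
For anyone who wants to reproduce the comparison, here is a minimal timing sketch. It is not the exact script used above: TinyClassifier is a hypothetical stand-in for a torchvision classification model, and the names mean_latency and compare, the input shape, and the iteration counts are all assumptions. It falls back to CPU without DataParallel when no GPU is available.

```python
import time

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a torchvision classification model."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), 1))


def sync_all_gpus():
    """Wait for pending work on every visible GPU (no-op on CPU)."""
    if torch.cuda.is_available():
        for idx in range(torch.cuda.device_count()):
            torch.cuda.synchronize(idx)


def mean_latency(model, batch, n_warmup=3, n_iters=10):
    """Average seconds per forward pass, after a few warm-up passes."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(batch)
        sync_all_gpus()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        sync_all_gpus()
    return (time.perf_counter() - start) / n_iters


def compare(batch_size=8, device_ids=None):
    """Time the eager and traced model, wrapped in DataParallel if requested."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda:0" if use_cuda else "cpu")
    eager = TinyClassifier().eval().to(device)
    batch = torch.randn(batch_size, 3, 32, 32, device=device)
    traced = torch.jit.trace(eager, batch)
    if use_cuda and device_ids:
        # Same wrapper and device_ids for both variants, as in the post.
        eager = nn.DataParallel(eager, device_ids=device_ids)
        traced = nn.DataParallel(traced, device_ids=device_ids)
    return {
        "eager": mean_latency(eager, batch),
        "traced": mean_latency(traced, batch),
    }


if __name__ == "__main__":
    print("no DataParallel:", compare(batch_size=8))
    if torch.cuda.device_count() > 1:
        print("device_ids=[0]:", compare(batch_size=8, device_ids=[0]))
        print("device_ids=[0, 1]:", compare(batch_size=8, device_ids=[0, 1]))
```

The explicit synchronization is there because CUDA kernels launch asynchronously; without it, the timer could stop before the GPUs have finished. Swapping TinyClassifier for an actual torchvision model and varying batch_size and device_ids should cover the cases described above.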