[DataParallel] What happens if the forward pass uses the CPU?

I’m currently using DataParallel (not DistributedDataParallel yet, though that’s my ultimate goal).
I’m using 3 GPUs, but I doubt that the work is actually being done in parallel.


Inside the model:

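    def forward(self, input):
        # Log which device this replica's input lives on
        print('Job done in Device:', input.device)
        # ... model computation that produces `output` (omitted here) ...
        return output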

When I run the model's forward pass, I get the following printed:
Job done in Device: cuda:0
Job done in Device: cuda:1
Job done in Device: cuda:2

The problem is that I expect those 3 lines to be printed at almost the same time (since the processing is parallel).
However, there is a fairly long gap (>1 sec) between each of them, which suggests that they are not actually processed in parallel. FYI, the whole process should take ~1 sec.

I suspect that I am somehow using the CPU during the forward pass, but I am not sure if that’s even possible after wrapping the model with DataParallel.
Or is there any way to check whether the work is done in parallel?

Thanks for reading!

DataParallel has a few drawbacks compared to DistributedDataParallel, one of which is that a single process drives all devices. If you are using CPU operations inside the model, they would slow down your use case further. The CUDA kernels could still overlap, and you would see this when profiling the workload with the PyTorch profiler or, e.g., Nsight Systems.
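
As a rough sketch of what such profiling could look like with `torch.profiler`: the toy model, tensor shapes, and the `profile_forward` helper below are placeholders made up for illustration, not your actual model. The script falls back to CPU-only profiling when no GPU is available and wraps the model in DataParallel only when more than one GPU is visible.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


def profile_forward(model, batch, steps=3):
    """Profile a few forward passes and return the profiler object."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad():
        model(batch)  # warm-up so one-time setup cost is not profiled
        with profile(activities=activities) as prof:
            for _ in range(steps):
                model(batch)
            if torch.cuda.is_available():
                # Wait for all queued kernels before the profiler stops
                torch.cuda.synchronize()
    return prof


# Hypothetical stand-in model and input; replace with your own.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
batch = torch.randn(32, 64)

if torch.cuda.is_available():
    model = model.cuda()
    batch = batch.cuda()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)

prof = profile_forward(model, batch)
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(summary)

# Optionally export a timeline to inspect per-device activity visually:
# prof.export_chrome_trace("trace.json")
```

In the exported timeline, kernels from different GPUs that overlap in time indicate that the devices are working concurrently. Note that the timestamps of `print` calls inside `forward` only show when the Python code ran on the host, which is not necessarily when the GPU kernels execute, since CUDA kernels are launched asynchronously.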


OK, this gives me another reason to go for DistributedDataParallel.

ptrblck, I just want to say thank you. I always get a lot of help from your posts and answers.
I appreciate your time and effort for the PyTorch community.