How to use torch.nn.functional.conv2d with multiple GPUs?

How can I use torch.nn.functional.conv2d to compute with multiple GPUs?

Does anyone know how to do this?

You won’t be able to use this API across multiple GPUs directly. Instead, you could split the input along e.g. the batch dimension, send each chunk to a different GPU, and apply the convolution there.

Thanks for the reply. After sending each chunk to a different GPU, how do I apply the convolutions in parallel?

You could call them directly:

out1 = F.conv2d(in1, weight1)
out2 = F.conv2d(in2, weight2)

where inX and weightX are on the cuda:X device.
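
For example, a minimal sketch (the shapes, device indices, and padding here are only illustrative assumptions):

import torch
import torch.nn.functional as F

x = torch.randn(64, 3, 224, 224)      # full input batch on the CPU
in1, in2 = x.chunk(2, dim=0)          # split along the batch dimension

in1 = in1.to('cuda:0')                # one chunk per device
in2 = in2.to('cuda:1')
weight1 = torch.randn(16, 3, 3, 3, device='cuda:0')
weight2 = torch.randn(16, 3, 3, 3, device='cuda:1')

out1 = F.conv2d(in1, weight1, padding=1)   # runs on cuda:0
out2 = F.conv2d(in2, weight2, padding=1)   # runs on cuda:1

Each output stays on the device its inputs live on, so you would need to move the results back to a common device if you want to concatenate them afterwards.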

But these two operations are executed sequentially, right? How can I make them run in parallel?

The kernel launches are scheduled sequentially by the CPU, so the second launch might be slightly delayed, but the kernel execution on the two GPUs would be performed in parallel.
You would see this effect by profiling a sufficiently large workload using e.g. Nsight Systems.
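
If you just want a rough check without Nsight Systems, a simple wall-clock comparison can already hint at the overlap. This is only a sketch with made-up shapes; you would compare the combined time against running each convolution alone:

import time
import torch
import torch.nn.functional as F

in1 = torch.randn(256, 64, 128, 128, device='cuda:0')
in2 = torch.randn(256, 64, 128, 128, device='cuda:1')
weight1 = torch.randn(64, 64, 3, 3, device='cuda:0')
weight2 = torch.randn(64, 64, 3, 3, device='cuda:1')

for _ in range(5):                      # warmup iterations
    F.conv2d(in1, weight1, padding=1)
    F.conv2d(in2, weight2, padding=1)
torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')

t0 = time.perf_counter()
out1 = F.conv2d(in1, weight1, padding=1)
out2 = F.conv2d(in2, weight2, padding=1)
torch.cuda.synchronize('cuda:0')        # wait for both devices to finish
torch.cuda.synchronize('cuda:1')
t1 = time.perf_counter()
print(f'combined time: {(t1 - t0) * 1e3:.2f} ms')

If the kernels overlap, the combined time should be close to the time of a single convolution rather than the sum of both.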

Thanks for the explanation.
So out2 = F.conv2d(in2, weight2) will start before F.conv2d(in1, weight1) has finished on GPU 1, right?

BTW, what if the weights are shared between the two calls? Should I do it like this:
out1 = F.conv2d(in1, weight.cuda(0))
out2 = F.conv2d(in2, weight.cuda(1))

If the CPU is fast enough to schedule it and the actual kernel execution takes more time than the launch, then yes.
You won’t be able to see any overlap with a tiny workload, as the kernel launch overhead would be larger than the actual GPU workload, i.e. kernel1 finishes before the CPU can schedule the launch of kernel2.

Yes, you have to move the parameters to the appropriate device before executing the operation.
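
For example, assuming a single shared weight tensor created on the CPU, you could make one copy per device up front (the names and shapes below are just placeholders):

import torch
import torch.nn.functional as F

in1 = torch.randn(8, 3, 32, 32, device='cuda:0')
in2 = torch.randn(8, 3, 32, 32, device='cuda:1')

weight = torch.randn(16, 3, 3, 3)        # shared parameters on the CPU
weight_gpu0 = weight.to('cuda:0')        # one copy per device, created once
weight_gpu1 = weight.to('cuda:1')

out1 = F.conv2d(in1, weight_gpu0)        # runs on cuda:0
out2 = F.conv2d(in2, weight_gpu1)        # runs on cuda:1

Creating the per-device copies once avoids re-transferring the weight from the CPU on every forward call, which weight.cuda(0) / weight.cuda(1) would do if executed repeatedly.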

Thank you very much for the reply!