Concurrent forward pass on multiple GPUs

Let’s say I have 8 models hosted on 8 GPUs (same class, different initialization)

models = [MyModule().cuda(i) for i in range(8)]

And I have a CPU tensor

x = torch.randn(1000, 128)

If I run the forward pass for all 8 models in a for loop like this

predictions = [models[i](x.cuda(i, non_blocking=True)) for i in range(8)]

The run time is significantly slower, roughly 6x-7x longer than just running a single model on one GPU:

models[0](x.cuda(0, non_blocking=True))

Is this expected? I was under the impression that CUDA operations are asynchronous, so the calls should remain non-blocking unless the result of the forward pass is consumed somewhere else?

Context: I am trying to train 8 models (same class, different initialization) concurrently in a single-process program. So I need to run the forward and backward pass concurrently on the same batch. The 8 models are placed on 8 different GPUs.
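Roughly, each training step I have in mind looks like this (just a sketch; the SGD optimizer, MSE loss, and targets are placeholders):

import torch

optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in models]
loss_fn = torch.nn.MSELoss()

def train_step(x_cpu, y_cpu):
    # one forward/backward/step per model; work queued on different GPUs can overlap
    for i, (model, opt) in enumerate(zip(models, optimizers)):
        xi = x_cpu.cuda(i, non_blocking=True)
        yi = y_cpu.cuda(i, non_blocking=True)
        opt.zero_grad()
        loss = loss_fn(model(xi), yi)
        loss.backward()
        opt.step()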

Have you checked the relative cost of data transfer between CPU and GPU vs. the actual computation time of the model? I’m curious if the bulk of the time is spent on .cuda vs. the actual model computation on each GPU.

I haven’t. That’s a good point; I will do it later. I realized cuda(non_blocking=True) is not really non-blocking: there is a short delay before it returns. Do you have any idea how I can hide the latency of the cuda() calls? Also, what is the idiomatic way to train multiple models on different GPUs using the same input?

Many thanks!

You are right that the data transfer is indeed not fully parallel. But even after I remove the data transfer from the comparison, the computation on multiple GPUs is still significantly (~6x) longer:

model = torch.nn.Sequential(*[torch.nn.Linear(1000, 1000) for i in range(10)]).cuda(0)
x = torch.randn(100000, 1000).cuda(0)

%timeit -r1 -n1 model(x)
# 1 loop, best of 1: 820 µs per loop

models = [torch.nn.Sequential(*[torch.nn.Linear(1000, 1000) for i in range(10)]).cuda(i) for i in range(8)]
xs = [x.cuda(i) for i in range(8)]  # pre-distribute x to 8 GPUs

%timeit -r1 -n1 [model(x) for model, x in zip(models, xs)]  # is this supposed to run in parallel?
# 1 loop, best of 1: 5.17 ms per loop  (~6x longer)

Is there more to this timing script or is this the bulk of it? One issue that could be happening here is that the first run of the linear layer pulls in cuBLAS (and might do some light autotuning as well) on each GPU. It could be that these steps are synchronous and execute one at a time, incurring the serial time penalty you see here. Does the same thing happen when you run the linear layer on each GPU a few times as a “warmup” before doing the timing?

Note that you would want to call torch.cuda.synchronize() before starting the timing and again before stopping it.
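For example, a warmed-up and synchronized version of the timing could look roughly like this (a sketch reusing the models/xs from the snippet above; the warmup count is arbitrary):

import time
import torch

# warmup: run each model a few times so cuBLAS loading / tuning is out of the way
for model, x in zip(models, xs):
    for _ in range(3):
        model(x)

# drain all pending work before starting the clock
for i in range(8):
    torch.cuda.synchronize(i)

start = time.perf_counter()
out = [model(x) for model, x in zip(models, xs)]

# wait for every GPU to finish before stopping the clock
for i in range(8):
    torch.cuda.synchronize(i)

print(f"8-GPU forward: {(time.perf_counter() - start) * 1e3:.2f} ms")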

It seems you are comparing the execution of a single model on a single GPU vs. the execution of all 8 models on 8 different GPUs. Is my understanding correct, or were you comparing the 8-GPU run vs. running the 8 models sequentially on a single GPU?
I wouldn’t expect the multi-GPU setup to be slower in the latter case, but I would not expect to see perfect scaling in the former case.
While the models will be executed in parallel, the actual kernel launch overhead might be visible for tiny models. To verify it, you could create a profile with e.g. Nsight Systems and check the achieved overlap.
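If Nsight Systems isn’t convenient, the built-in torch.profiler can also show whether the per-GPU kernels overlap; a rough sketch (again assuming the models/xs from the timing snippet):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = [model(x) for model, x in zip(models, xs)]
    for i in range(8):
        torch.cuda.synchronize(i)

# open the trace in chrome://tracing or Perfetto and check whether the per-GPU streams overlap
prof.export_chrome_trace("multi_gpu_forward.json")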

I’m comparing running 1 model on one GPU vs running 8 models on 8 GPUs.
I would not expect perfect parallelization either, but the 6x-7x slowdown still surprises me. When you say tiny model, do you mean the size of the model itself is small? My batch size is actually very large. Anyway, I will do the profiling once I get a chance.
On a related note, do you know the best way to call

cuda(non_blocking=True)

to copy the same tensor to all 8 GPUs in parallel?
The .cuda() call has a significant delay even with non_blocking set to True.
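For reference, this is roughly what I am trying now; if I understand correctly, the host tensor needs to be pinned for the copies to be truly asynchronous:

# non_blocking host-to-device copies are only asynchronous if the
# source tensor lives in pinned (page-locked) host memory
x_pinned = x.pin_memory()
xs = [x_pinned.cuda(i, non_blocking=True) for i in range(8)]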

Thanks!


As ptrblck says, the kernel launch overhead can be significant here. You might try launching the kernels in parallel from a Python thread pool, something like:

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=10)
# submit the callable and its argument; calling the model inline would run it eagerly in this thread
futures = [pool.submit(models[i], x.cuda(i, non_blocking=True)) for i in range(8)]
predictions = [f.result() for f in futures]

Note that the kernel launch overheads might be visible if the actual workload is tiny (i.e. the GPU compute time is small) compared to the launch time (which is constant).
These use cases are often seen during deployment where the batch size might be set to a single sample. In this case, you might want to use CUDA Graphs (if applicable for your application) to capture the launches and replay them later to hide the launch overheads.
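For reference, a minimal capture/replay sketch with torch.cuda.CUDAGraph for a single one of the models above (static_x is a placeholder input whose storage must stay fixed and whose shape must not change between replays):

import torch

model = models[0]
static_x = xs[0].clone()  # capture requires an input tensor with fixed storage

# warmup on a side stream before capturing
s = torch.cuda.Stream(device=0)
s.wait_stream(torch.cuda.current_stream(0))
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream(0).wait_stream(s)

# capture one forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# replay: copy the next batch into the static input, then launch all captured kernels at once
static_x.copy_(xs[0])
g.replay()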

Thanks for the clarifications, CUDA Graphs is awesome.
