Slow init of GPUs

I know the init process takes more time on machines with more GPUs, but since we are calling the PyTorch script from external scripts, it has become a bottleneck for our process.

See the basic example below:

import torch

cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')
cuda3 = torch.device('cuda:3')

x0 = torch.tensor([1], device=cuda0)
x1 = torch.tensor([1], device=cuda1)
x2 = torch.tensor([1], device=cuda2)
x3 = torch.tensor([1], device=cuda3)

So the example takes:

  • ~24s on a machine with 8 GPUs. Tested with 8x2080TI and 8xA6000.
  • ~15s on a machine with 4 GPUs. Tested with 4xA6000.
  • ~5s on a machine with 2 GPUs (x2 and x3 removed from the example above). Tested on 2x3090.
  • ~8s on a machine with 8 GPUs when only one GPU is initialized (i.e. x1, x2, x3 are removed). Tested on 8xA6000.
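
A rough way to reproduce such a measurement, generalized over the GPU count (the timer placement and print are just illustrative):

import time

start = time.perf_counter()

import torch

# the CUDA context is created on the first tensor creation per device
n = torch.cuda.device_count()
xs = [torch.tensor([1], device=f'cuda:{i}') for i in range(n)]

print(f'init of {n} GPU(s) took {time.perf_counter() - start:.1f}s')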

And since we call it again and again, this becomes an issue.

My understanding is that it is this slow because of the GPU initialization process.

So I was wondering whether I can cache something, so that it is only slow the first time and subsequent calls of the script are faster?

No, you won’t be able to cache anything, as the CUDA context creation takes up some of the init time: it loads the driver, the native PyTorch kernels, and the CUDA math library kernels (cuBLAS, cuDNN, etc.), and also needs time to load the actual data onto the device.
In case you are using CUDA 11.7+, you can activate lazy module loading via CUDA_MODULE_LOADING=LAZY, which will avoid pre-loading every kernel and will instead load each one lazily once it’s needed. This will speed up the init process but will add a small overhead for each new kernel, as it has to be loaded into the CUDA context before its first execution.
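
E.g. you could export the variable in your terminal before launching the script, or set it at the top of the script before the first CUDA call (a minimal sketch):

import os
# has to be set before the CUDA context is created, i.e. before the first
# CUDA call (setting it before importing torch is the safest option)
os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

import torch
x0 = torch.tensor([1], device='cuda:0')  # context is created here with lazy loading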


Thanks. I tried CUDA_MODULE_LOADING=LAZY on one of the machines, which has CUDA 11.7, but it didn’t help.

I guess the PyTorch scripts will have to stay in memory and work like a server, receiving calls from outside instead of being loaded each time.

You would need to build PyTorch with CUDA 11.7 (or install the nightly binaries) to see the effect. Installing the CUDA toolkit 11.7 on your system while the PyTorch binary uses another CUDA runtime (e.g. 11.6) will not work. You should also see a reduction in the size of the CUDA context and could use it to double-check whether lazy loading is working.
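
Something like this could give a rough estimate of the context size (run it on an otherwise idle GPU, once with and once without lazy loading; the estimate via mem_get_info is only approximate):

import torch

_ = torch.tensor([1], device='cuda:0')    # forces the context creation
torch.cuda.synchronize()

free, total = torch.cuda.mem_get_info(0)  # global free/total device memory in bytes
reserved = torch.cuda.memory_reserved(0)  # memory held by PyTorch's caching allocator
context_mb = (total - free - reserved) / 1024**2
print(f'approx. CUDA context size: {context_mb:.0f} MB')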

I don’t fully understand the second point, so could you explain which scripts are supposed to stay in memory?


Thank you for the suggestions. It makes sense, I will try it.

Currently we have a web server which loads PHP scripts, and these then execute the Python (PyTorch) scripts via a shell-type interface. But this creates a bottleneck because each reload takes too much time, i.e. the actual work is done in 10s but the load time is another 10s.

So it will probably be much more efficient to load the Python process once and leave it in memory, where it will take requests, do the compute, and return the results.
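
Something along these lines is what I have in mind (just a rough sketch; the server framework, port, and request format are placeholders):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import torch

# pay the CUDA init cost once, at startup
device = torch.device('cuda:0')
torch.tensor([1], device=device)


class ComputeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length) or b'{}')

        # placeholder "work": sum the posted values on the GPU
        values = payload.get('values', [0])
        result = torch.tensor(values, device=device).sum().item()

        body = json.dumps({'result': result}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)


if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8000), ComputeHandler).serve_forever()

The PHP side would then send HTTP requests to this process instead of spawning a new Python process per job.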

Ah yes, thanks for the explanation.
You are right and I would also suggest trying to initialize the Python/PyTorch process once and reusing it later for new requests. I’m not sure which web server application you are using, but would it be possible to write startup and teardown methods which would keep the PyTorch process alive?


Yeah, that’s what I will have to do.