CUDA initialisation stuck on an A100 machine

I am using an A100 server on GCP with the latest NGC container from NVIDIA. However, to support DCNv2 I have to downgrade my PyTorch version to 1.4.0. Whenever I initialise a tensor on the GPU, e.g. with torch.randn(3).cuda(), the interpreter gets stuck and never finishes that command. Any help?

Hey @ptrblck, do you know if anyone can answer questions about using PyTorch CUDA features on GCP/NGC?

cc @ngimel

Solved! After close to 10 minutes the tensor gets initialised on the GPU, and from then on there is no problem.

The long startup time is most likely caused by JIT compilation of the CUDA code, which happens if your installed PyTorch version wasn’t built for compute capability 8.0 (A100).
This would be the case if you’ve installed the 1.4 binary instead of building it from source.
We are working towards building the PyTorch nightly binaries with the latest library stack and cc8.0.
For now, you could either build from source or let the JIT compiler run in the first CUDA call.
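To check whether a given install is affected, a rough sketch like the following may help. It assumes that `torch.cuda.get_arch_list()` is available (it exists only in newer PyTorch releases, so the script falls back gracefully when it is missing), and `has_native_kernels` is a hypothetical helper written for this post, not a PyTorch API. It encodes the rule that kernels compiled for `sm_XY` run natively on devices with the same major version and an equal or higher minor version; otherwise the driver has to JIT-compile from PTX on the first CUDA call.

```python
import time


def has_native_kernels(capability, arch_list):
    """Hypothetical helper: True if any precompiled sm_XY entry in arch_list
    can run natively on a device with the given (major, minor) capability."""
    major, minor = capability
    for arch in arch_list:
        # Only "sm_XY" entries are precompiled binaries; "compute_XY" is PTX.
        digits = arch[3:] if arch.startswith("sm_") else ""
        if len(digits) < 2 or not digits.isdigit():
            continue
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        # Binary compatibility holds within a major version, for minor <= device minor.
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a visible CUDA device is required for the live check.")
    else:
        capability = torch.cuda.get_device_capability(0)
        print("device capability:", capability)
        print("built against CUDA:", torch.version.cuda)

        # get_arch_list is missing in old releases, so look it up defensively.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        if get_arch_list is not None:
            archs = get_arch_list()
            print("compiled architectures:", archs)
            print("native kernels available:", has_native_kernels(capability, archs))

        # Time the first CUDA call; this is where a PTX JIT delay would show up.
        start = time.perf_counter()
        torch.randn(3).cuda()
        torch.cuda.synchronize()
        print(f"first CUDA call took {time.perf_counter() - start:.1f} s")
```

If the helper reports no native kernels for an A100 (capability `(8, 0)`), a long first call is expected; subsequent calls should be fast once the compilation has finished.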


What is the issue with DCNv2 support in the latest NGC container?


Thanks for the reply. Is it possible to build an old PyTorch version (for example 1.4) from source with CUDA 11? Or is there any plan to support old PyTorch versions on the A100?

There is no plan to change older PyTorch versions to enable CUDA 11 and thus new GPU architectures, so you would have to use the latest PyTorch version.

You could try to cherry-pick all commits mentioning CUDA 11 into an older version and try to build it.
However, while it might work, what’s your use case that requires an old PyTorch version?

Thanks for the reply. I think porting to the latest PyTorch version is the best choice.