I am using an A100 server on GCP with the latest NGC container from NVIDIA. However, for DCNv2 support I have to downgrade my PyTorch version to 1.4.0. Whenever I initialise a tensor on the GPU, e.g. torch.randn(3).cuda(), the interpreter gets stuck and never finishes that command. Any help?
hey @ptrblck, do you know if anyone can answer questions regarding using PyTorch cuda features on GCP/NGC?
Solved! After close to 10 minutes the tensor gets initialised on the GPU, and from then on there is no problem.
The long startup time is most likely caused by JIT compilation of the CUDA code, if your installed PyTorch version wasn’t built for compute capability 8.0 (A100).
This would be the case if you’ve installed the 1.4 binary instead of building it from source.
We are working towards building the PyTorch nightly binaries with the latest library stack and cc8.0.
For now, you could either build from source or let the JIT compiler run during the first CUDA call.
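To make the explanation above concrete, here is a minimal sketch of the check involved. `needs_jit_compile` is a hypothetical helper, not a PyTorch API; in a real session the inputs would come from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()` (the latter only exists in newer PyTorch builds), and the example arch list is an assumption about what a CUDA 10.x PyTorch 1.4 binary might ship.

```python
# Sketch: decide whether the installed binary ships kernels for the local
# GPU, or whether the first CUDA call triggers a slow JIT compilation.

def needs_jit_compile(device_cc, arch_list):
    """Return True if no kernel for the device's compute capability was
    compiled into the binary, so the CUDA runtime must JIT-compile on the
    first call (the multi-minute startup seen above)."""
    major, minor = device_cc
    return f"sm_{major}{minor}" not in arch_list

# Hypothetical arch list of a pre-CUDA-11 binary (no sm_80 for the A100):
old_binary_archs = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75"]

print(needs_jit_compile((8, 0), old_binary_archs))  # A100 -> True
print(needs_jit_compile((7, 0), old_binary_archs))  # V100 -> False
```

When building from source, the equivalent knob is the `TORCH_CUDA_ARCH_LIST` environment variable, which controls which `sm_xx` targets get compiled in.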
What is the issue with dcnv2 support in the latest ngc container?
Thanks for the reply. But I wonder whether it is possible to build an old PyTorch version (for example 1.4) from source with CUDA 11? Or is there any plan to support older PyTorch versions on the A100?
There is no plan to change older PyTorch versions to enable CUDA 11 and thus new GPU architectures, so you would have to use the latest PyTorch version.
You could try to cherry-pick all commits mentioning CUDA 11 onto an older version and try to build it.
However, even though that might work, what’s your use case that requires an old PyTorch version?
Thanks for the reply. I think porting to the latest version of PyTorch is the best choice.