Distributed data parallel freezes without error message

gemfield · September 3, 2020, 1:03pm

Env:

Ubuntu 18.04
Pytorch 1.6.0
CUDA 10.1

Actually, I am using Docker image gemfield/pytorch:1.6.0-devel which stated in https://github.com/DeepVAC/deepvac (same with above env), and use PyTorch DDP (by use the class DeepvacDDP in https://github.com/DeepVAC/deepvac/blob/master/deepvac/syszux_deepvac.py) to train my model, which the code worked perfect yesterday. But today when I launch the train program again, the DDP is stucked in loss.backward()， with cpu 100% and GPU 100%。
There has no code change and docker container change since yesterday, except the Ubuntu host got a system update today:

gemfield@ai03:~$ cat /var/log/apt/history.log | grep -C 3 nvidia

Start-Date: 2020-09-03  06:44:01
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-nvidia-440-5.4.0-45-generic:amd64 (5.4.0-45.49, automatic)
Upgrade: linux-modules-nvidia-440-generic-hwe-20.04:amd64 (5.4.0-42.46, 5.4.0-45.49)
End-Date: 2020-09-03  06:44:33

Obviously, the nvidia driver got update from 440.64 to 440.100, and I think these info may be useful for somebody.