Hello @ptrblck ,
I updated the drivers to 470, but the issue persists. Attaching below the output of watch nvidia-smi:
Every 2.0s: nvidia-smi                   ampere.lix.polytechnique.fr: Thu Jul 22 09:21:53 2021

Thu Jul 22 09:21:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:21:00.0 Off |                    0 |
| N/A   25C    P0    59W / 250W |   1768MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   25C    P0    58W / 250W |   1502MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   20C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5629      C   .../envs/torch1p9/bin/python     1765MiB |
|    1   N/A  N/A      5630      C   .../envs/torch1p9/bin/python     1499MiB |
+-----------------------------------------------------------------------------+
I ran the same script you suggested above, using the latest release of PyTorch. It gets stuck once again, but this time at least Ctrl+C can interrupt the script, so I don't have to kill the process to stop it.
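For reference, this is roughly the kind of minimal check I am running (my own sketch, not necessarily the exact script from the earlier post; the tensor size and device indices are just placeholders):

import torch

# Minimal sanity check between the two busy GPUs (0 and 1):
# allocate a tensor on GPU 0, copy it to GPU 1, and compare on the CPU.
x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1")                       # direct GPU-to-GPU copy
torch.cuda.synchronize()
print(torch.allclose(x.cpu(), y.cpu()))  # should print True if the copy works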