I’m trying to train a PyTorch pix2pix model. The repository has an option to speed up training with “Automatic Mixed Precision” (AMP), but when I enable it and launch training with python -m torch.distributed.launch train.py,
I get a CUDA error: “an illegal memory access was encountered”. The GPU is “pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0”.
Before it crashes, it prints roughly 100 messages like these:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
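Note that the quoted scale values are exactly what you get by halving 8.0 on every step, i.e. apex’s dynamic loss scaler is flagging every single step as an overflow. A plain-Python check (assuming halving starts from the 8.0 shown in the log; earlier messages presumably covered the larger default scales):

```python
# Reproduce the sequence of loss scales from the log by repeatedly
# halving, starting at the first value quoted above (8.0).
scale = 8.0
seen = []
for _ in range(19):  # 19 log lines are quoted above
    seen.append(scale)
    scale /= 2.0

print(seen[0])   # 8.0, matches the first quoted line
print(seen[-1])  # 3.0517578125e-05, matches the last quoted line
```

So the scaler never finds a usable scale, which suggests the overflow messages are a symptom of the underlying memory error rather than ordinary fp16 overflow.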
And finally:
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss: scaled_loss.backward()
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--fp16', '--num_D', '1', '--name', 'n', '--dataroot', './datasets/n/', '--label_nc', '0', '--no_instance', '--resize_or_crop', 'scale_width_and_crop', '--save_epoch_freq', '2', '--checkpoints_dir', '/content/drive/My Drive/checkpoints', '--tf_log']' returned non-zero exit status 1.
How can I fix this?