RuntimeError: reduce failed to synchronize: an illegal memory access was encountered

When I run the pix2pix GAN implemented by eriklindernoren with PyTorch version 0.4.1, I get the following RuntimeError:

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f0f8f5490b8>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "pix2pix.py", line 141, in <module>
    loss_GAN = criterion_GAN(pred_fake, valid)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/loss.py", line 421, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1716, in mse_loss
    return _pointwise_loss(lambda a, b: (a - b) ** 2, torch._C._nn.mse_loss, input, target, reduction)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1674, in _pointwise_loss
    return lambd_optimized(input, target, reduction)
RuntimeError: reduce failed to synchronize: an illegal memory access was encountered

Why does this error occur? Can anyone help me?

Try running your script with CUDA_LAUNCH_BLOCKING=1. That should give a more accurate description of the error.
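
For example, a minimal way to do that (using the script name from the traceback above):

# Synchronous kernel launches make the traceback point at the CUDA call
# that actually failed, instead of a later synchronization point.
CUDA_LAUNCH_BLOCKING=1 python3 pix2pix.py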

Have you solved this problem? I'm running into the same issue.

I ran into the same problem. I found that it was because my tensors were not all on the same GPU; after I moved them to the same GPU, the error went away.
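
A minimal sketch of the idea (made-up tensor shapes, not the actual pix2pix code):

import torch

# Pick one device and keep every tensor that enters the loss on it.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

criterion = torch.nn.MSELoss()

pred = torch.randn(4, 1, 16, 16, device=device)
target = torch.ones_like(pred)   # ones_like keeps the same device as pred

loss = criterion(pred, target)   # works: both tensors live on `device`
print(loss.item())

# With two GPUs, something like the following would fail instead, because
# `pred` and `target` would live on different devices (on older PyTorch
# versions often only as an opaque CUDA error like the one above):
#   target = torch.ones(4, 1, 16, 16, device="cuda:1")
#   loss = criterion(pred, target)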

Try something like this:

CUDA_VISIBLE_DEVICES=1 python *.py
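
If you prefer to pin the GPU from inside the script rather than via the environment, a rough equivalent (a sketch, not the pix2pix code itself) is:

import os

# Set this before importing torch (or at least before any CUDA call), so
# only GPU 1 is visible to the process and appears as cuda:0 in PyTorch.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # should report 1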