Hi guys, I'm getting this error in two places: "RuntimeError: cuda runtime error (59) : device-side assert triggered".
- when calculating accuracy:
_, pred = outputs.topk(1, 1, True)
pred = pred.t()
correct = pred.eq(targets.view(1, -1))
n_correct_elems = correct.float().sum().data[0]
I tried changing the last line to this:
n_correct_elems = correct.float().sum().item()
It still didn't help. The exact error is:
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
File "main.py", line 141, in <module>
train_logger, train_batch_logger)
File "/path/train.py", line 37, in train_epoch
acc = calculate_accuracy(outputs, targets)
File "/path/utils.py", line 58, in calculate_accuracy
n_correct_elems = correct.float().sum().item()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/THCReduceAll.cuh:317
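The assertion `t >= 0 && t < n_classes` in the log comes from the NLL loss kernel and says that some target label `t` lies outside the range [0, n_classes). A minimal sketch of how that could be checked on the CPU before the loss is computed, assuming the labels are plain integers (for example from `targets.cpu().tolist()`) and `n_classes` is the width of the model output (for example `outputs.size(1)`). The helper name `find_bad_labels` and the sample values are hypothetical:

```python
def find_bad_labels(labels, n_classes):
    """Return (index, label) pairs for labels outside [0, n_classes)."""
    return [(i, int(t)) for i, t in enumerate(labels)
            if not 0 <= int(t) < n_classes]


# Hypothetical batch: 4 classes, so the valid labels are 0..3.
example_labels = [0, 3, 4, -1, 2]
bad = find_bad_labels(example_labels, n_classes=4)
print(bad)  # [(2, 4), (3, -1)]
```

If this returns anything for a real batch, the labels in the dataset (or the number of output classes in the model) would be the first thing to look at, for instance labels that start at 1 instead of 0.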
To move forward, I skipped this function (just hard-coded a number in its place), and then it failed in csv.writer instead (I checked that the file path is correct).
The exact error log is:
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f80c09ac748>>
Traceback (most recent call last):
File "path/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "path/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "path/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "path/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "path/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "path/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "main.py", line 141, in <module>
train_logger, train_batch_logger)
File "path/train.py", line 55, in train_epoch
'lr': optimizer.param_groups[0]['lr']
File "path/utils.py", line 41, in log
self.logger.writerow(write_values)
File "path/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 57, in __repr__
return torch._tensor_str._str(self)
File "path/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 256, in _str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "path/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 82, in __init__
copy = torch.empty(tensor.size(), dtype=torch.float64).copy_(tensor).view(tensor.nelement())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp:70
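The two tracebacks end in different Python lines (`.item()` in one, the tensor `__repr__` called from `writerow` in the other), but both logs start with the same `ClassNLLCriterion.cu` assertion. CUDA kernels are launched asynchronously, so the Python line in the traceback may only be where the error surfaced, not where it was caused. A small sketch, assuming the standard `CUDA_LAUNCH_BLOCKING` environment variable, that forces synchronous launches so the traceback should point at the call that actually failed:

```python
import os

# Must be set before CUDA is initialised, so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and the training code only after this point
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python main.py`. This slows training down, so it is only meant for debugging.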
I should say that I debugged this on Windows and the program ran there. Now I'm running it on a remote machine and it's killing me: every step gives a different error. I'm close to giving up, please help.
Thanks.