I’m using a DataLoader and looping through my training data in the usual way:
for epoch in range(num_epochs):
    for training_batch_idx, training_batch in enumerate(dataloader):
        # forward/backward propagation code
Everything is fine during the first epoch. In the second epoch, when backward()
is called for the first time, I get the following error:
THCudaCheck FAIL file=/data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.10_1488757768560/work/torch/lib/THCUNN/generic/PReLU.cu line=79 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "trainer.py", line 115, in <module>
    err.backward()
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/nn/_functions/thnn/activation.py", line 53, in backward
    1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.10_1488757768560/work/torch/lib/THCUNN/generic/PReLU.cu:79
The error points to some PReLU code. However, if I replace all the PReLU layers in my net with ReLU, I still get an illegal memory access error; it just points somewhere else:
THCudaCheck FAIL file=/data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.10_1488757768560/work/torch/lib/THC/generic/THCTensorMath.cu line=26 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "trainer.py", line 115, in <module>
    err.backward()
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/nn/_functions/batchnorm.py", line 60, in backward
    grad_bias = bias.new(bias.size()).zero_()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.10_1488757768560/work/torch/lib/THC/generic/THCTensorMath.cu:26
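Because CUDA kernels are launched asynchronously, the location named in the traceback is not necessarily where the bad access happens, which would be consistent with the error moving from PReLU to batchnorm. A minimal sketch of forcing synchronous launches so the trace is more likely to point at the failing op, assuming the variable is set before any CUDA work starts (putting it at the very top of trainer.py is my assumption, not something already in the script):

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so an
# illegal memory access is reported at the call that caused it rather
# than at some later, unrelated CUDA call.
# Assumption: this runs before the first CUDA call in the process,
# e.g. as the first lines of the training script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

The same effect can be had by running the script as `CUDA_LAUNCH_BLOCKING=1 python trainer.py`. Training will be slower with this enabled, so it is only meant for debugging.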
Any thoughts on what might cause an error like this?