I am using pinned memory to speed up training. In my program, a separate thread moves the training data into pinned memory. Occasionally, I encounter the following error in the middle of training:
THCudaCheck FAIL file=/py/conda-bld/pytorch_1493677666423/work/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=11 : invalid argument
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "./train.py", line 43, in pin_memory
    data["xs"] = [x.pin_memory() for x in data["xs"]]
  File "./train.py", line 43, in <listcomp>
    data["xs"] = [x.pin_memory() for x in data["xs"]]
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/site-packages/torch/tensor.py", line 78, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/site-packages/torch/storage.py", line 84, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (11) : invalid argument at /py/conda-bld/pytorch_1493677666423/work/torch/lib/THC/THCCachingHostAllocator.cpp:258
Any idea why this would happen? How can I debug it?
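
One way to narrow this down might be to have the pinning thread report which element failed instead of dying silently. The following is only a rough sketch, not the actual code from train.py: the name pin_worker, the two queues, and the None sentinel are hypothetical, and the only thing taken from the traceback above is that each batch is a dict whose "xs" entry is a list of objects with a pin_memory() method.

```python
import queue
import threading
import traceback


def pin_worker(in_queue, out_queue, pin_fn=lambda x: x.pin_memory()):
    """Pin every item in data["xs"] for each batch read from in_queue.

    Hypothetical helper: results are forwarded as ("ok", data); on failure
    the index of the offending item and the formatted traceback are
    forwarded as ("error", index, traceback_text) and the worker stops.
    """
    while True:
        data = in_queue.get()
        if data is None:  # sentinel (an assumption): no more batches
            return
        pinned = []
        for i, x in enumerate(data["xs"]):
            try:
                pinned.append(pin_fn(x))
            except Exception:
                # Report which element failed so it can be inspected later.
                out_queue.put(("error", i, traceback.format_exc()))
                return
        data["xs"] = pinned
        out_queue.put(("ok", data))


def start_pin_thread(in_queue, out_queue):
    """Start pin_worker in a daemon thread and return the thread object."""
    thread = threading.Thread(target=pin_worker, args=(in_queue, out_queue))
    thread.daemon = True
    thread.start()
    return thread
```

The training loop would then read from out_queue and, on an "error" entry, print the index and traceback (and possibly the size of the offending tensor) before stopping. This does not explain the invalid argument error itself; it only makes the failing input easier to identify.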