pin_memory() CUDA runtime error

I am using pinned memory to speed up training. In my program, a separate thread moves the training data into pinned memory (a minimal sketch of this setup follows the traceback below). Occasionally, I get the following error in the middle of training:

THCudaCheck FAIL file=/py/conda-bld/pytorch_1493677666423/work/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=11 : invalid argument
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "./train.py", line 43, in pin_memory
    data["xs"] = [x.pin_memory() for x in data["xs"]]
  File "./train.py", line 43, in <listcomp>
    data["xs"] = [x.pin_memory() for x in data["xs"]]
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/site-packages/torch/tensor.py", line 78, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/home/heilaw/.conda/envs/pytorch/lib/python3.5/site-packages/torch/storage.py", line 84, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (11) : invalid argument at /py/conda-bld/pytorch_1493677666423/work/torch/lib/THC/THCCachingHostAllocator.cpp:258

Any idea why this would happen? How can I debug it?
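
For context, here is a minimal sketch of the pinning-thread setup I described above (the worker function and queue names are illustrative, not my actual training code):

import queue
import threading

import torch

def pin_memory_worker(in_queue, out_queue):
    # Illustrative worker: takes prepared batches and pins their tensors so
    # that the later host-to-device copies can be asynchronous.
    while True:
        data = in_queue.get()
        if data is None:  # sentinel to stop the worker
            break
        data["xs"] = [x.pin_memory() for x in data["xs"]]
        out_queue.put(data)

in_queue, out_queue = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
threading.Thread(target=pin_memory_worker, args=(in_queue, out_queue), daemon=True).start()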

If you can give me a script to reproduce this, what I would do is put a printf at https://github.com/pytorch/pytorch/blob/master/torch/lib/THC/THCCachingHostAllocator.cpp#L258 and see what the size or ptr values are when the failure occurs. From there, I would backtrack to see why those failing values are being generated.

It’s an ongoing project, so I cannot share it with you, but I will see if I can write a script for you to reproduce the error. Thanks!

I haven’t had a chance to write a self-contained script that reproduces this, but I wanted to chime in to say that I resolved the issue by making the tensor contiguous before pinning it (I had taken a slice earlier, so I was pinning a non-contiguous view).
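
In other words, something along these lines fixed it for me (a minimal sketch, not my actual code; the tensor and shapes are made up):

import torch

x = torch.randn(64, 128)
view = x[:, :64]                         # a column slice is a non-contiguous view
pinned = view.contiguous().pin_memory()  # making it contiguous before pinning avoided the error for me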

I ran into the same problem. Because I had a variable batch size and limited memory on the GPU, I had to split each batch into chunks no larger than a maximum allowed batch size. Because split only creates views of the tensor and I still wanted to pin the memory, I had to make each chunk contiguous.

z = torch.cat([
    self.feature_extractor(
        batch.contiguous().to(device=device)  # push batch to device
    )['flatten']  # extract the output of the flatten layer
    for batch in torch.split(x, max_batch_size)  # loop over chunks
])
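
If you also want to pin each chunk explicitly before the copy, the same idea applies (a variant sketch of the snippet above, not the exact code I use; pin_memory and non_blocking=True are standard PyTorch calls):

chunks = [
    c.contiguous().pin_memory()  # split returns views, so make each chunk contiguous before pinning
    for c in torch.split(x, max_batch_size)
]
z = torch.cat([
    self.feature_extractor(
        c.to(device=device, non_blocking=True)  # async copy is possible because the source is pinned
    )['flatten']  # extract the output of the flatten layer
    for c in chunks
])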