I am encountering the notorious CUDA error 59 (device-side assert triggered).
The error occurs after some normal iterations.
I do not think I have a label mismatch problem.
I am using h5py to read data in the DataLoader, and h5py has previously been blamed for multiprocessing problems.
Setting the number of workers to 1 and adding
import torch
import torch.multiprocessing
torch.multiprocessing.set_start_method('spawn')
does not solve my problem.
Is there a replacement for h5py that preserves the speed, or did I use numpy the wrong way?
The data loader serves ResNet image features and bag-of-words (BoW) text representations.
I tried replacing h5py with numpy.load, but data loading became very slow.
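If the slowdown comes from numpy.load reading the whole array into memory, memory-mapping the file may help. Below is a minimal sketch, assuming the features are stored as a single .npy file. The file name, array shape, and the get_item helper are made up for illustration and are not taken from the linked code.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a precomputed feature file (shape and dtype are assumptions).
tmp_dir = tempfile.mkdtemp()
feature_path = os.path.join(tmp_dir, "features.npy")
np.save(feature_path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file instead of reading it all into memory,
# so only the rows that are actually indexed get read from disk.
features = np.load(feature_path, mmap_mode="r")


def get_item(index):
    # Copy one row out of the memory map into a regular in-memory array.
    return np.array(features[index])


row = get_item(2)
print(row)
```

In a Dataset, the body of get_item would sit in __getitem__, so each sample touches only its own slice of the file.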
Is there any advice on data loading?
https://github.com/Cyanogenoid/pytorch-vqa/blob/master/data.py
I used the code linked above and got the following (this is the trace I get with the CUDA_LAUNCH_BLOCKING=1 trick):
/pytorch/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [57,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
torch.Size([100, 23, 1024])
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu line=17 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "training.py", line 39, in <module>
    o = model(q=q,v=v)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/leon_gao19/10707/10707_Project/model/baseline.py", line 16, in forward
    embedding = torch.sum(embeds,dim=1)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/_functions/reduce.py", line 21, in forward
    return input.sum(dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu:17
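The assertion srcIndex < srcSelectDimSize in indexSelectLargeIndex generally means that an index passed to an index-select operation (for example an embedding lookup) is not smaller than the size of the dimension being indexed. Below is a minimal sketch of a CPU-side check that could run before the forward pass, assuming the question input holds integer token ids and the embedding table size is known. The names find_out_of_range, vocab_size, and token_ids are hypothetical and do not come from the code in the trace.

```python
def find_out_of_range(indices, num_embeddings):
    """Return the indices that a table with num_embeddings rows cannot look up."""
    return [i for i in indices if not 0 <= i < num_embeddings]


# Hypothetical values: a vocabulary of 1000 words and one batch of token ids.
vocab_size = 1000
token_ids = [3, 17, 999, 1000, 42]

# Valid ids run from 0 to vocab_size - 1, so 1000 is reported as out of range.
bad = find_out_of_range(token_ids, vocab_size)
print(bad)
```

For a tensor q, something like q.view(-1).tolist() should produce a flat list of ids that can be passed to this check. Running it on every batch would show whether a rare out-of-vocabulary id explains why the failure only appears after some normal iterations.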