I am encountering the notorious CUDA error 59 (device-side assert triggered).
The error occurs after some normal iterations.
I do not think I have a label mismatch problem.
I am using h5py to read data in my DataLoader, and h5py has previously been blamed for multiprocessing problems.
Setting the number of workers to 1 and adding

    import torch
    import torch.multiprocessing
    torch.multiprocessing.set_start_method('spawn')

does not solve my problem.
Is there a replacement for h5py that preserves the speed, or did I use NumPy the wrong way?
The data loader serves ResNet image features and bag-of-words (BoW) text representations.
I tried replacing h5py with numpy.load, but loading became very slow.
Is there any advice on data loading?
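For reference, here is a minimal sketch of a numpy.load variant that avoids reading the whole file on every access. It assumes the features are stored as a single .npy array; the file name, the array shape, and the `FeatureStore` class are hypothetical and only illustrate the idea. Whether a full reload per item is what makes numpy.load slow in my case is itself an assumption.

```python
import os
import tempfile

import numpy as np

# Hypothetical feature file: 100 rows of 2048-dimensional float32 features.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.random.rand(100, 2048).astype(np.float32))


class FeatureStore:
    """Dataset-like wrapper that memory-maps the .npy file instead of loading it."""

    def __init__(self, path):
        self.path = path
        self.features = None  # opened lazily, so each worker process maps the file itself

    def _open(self):
        if self.features is None:
            # mmap_mode="r" maps the file read-only; rows are read from disk on demand.
            self.features = np.load(self.path, mmap_mode="r")
        return self.features

    def __len__(self):
        return self._open().shape[0]

    def __getitem__(self, idx):
        # Copy a single row out of the memory map into a regular array.
        return np.array(self._open()[idx])


store = FeatureStore(path)
row = store[3]
print(len(store), row.shape, row.dtype)
```

An object with `__len__` and `__getitem__` like this can be passed to a DataLoader as a dataset; opening the map lazily keeps the open file handle out of the object until a worker first touches it.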
I used the code above and got the following trace (obtained with the CUDA_LAUNCH_BLOCKING=1 trick):
/pytorch/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [57,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
torch.Size([100, 23, 1024])
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu line=17 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "training.py", line 39, in <module>
    o = model(q=q,v=v)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/leon_gao19/10707/10707_Project/model/baseline.py", line 16, in forward
    embedding = torch.sum(embeds,dim=1)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
  File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/_functions/reduce.py", line 21, in forward
    return input.sum(dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathReduce.cu:17
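
The `srcIndex < srcSelectDimSize` assertion is raised by an index-select kernel, which is what an embedding lookup runs, and it fires when an index is not smaller than the number of rows being indexed. So one thing worth ruling out is an input word index that falls outside the embedding table. Below is a minimal sketch of a check that could run on the CPU before the lookup. `check_indices` is a hypothetical helper, the vocabulary size of 5000 is made up, and the NumPy array is only a stand-in for the real batch. The helper relies on nothing but `.min()` and `.max()`, so a CPU tensor of indices should work the same way.

```python
import numpy as np


def check_indices(indices, num_embeddings):
    """Raise a readable error if any index falls outside [0, num_embeddings)."""
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= num_embeddings:
        raise IndexError(
            "index range [%d, %d] is outside the embedding table of size %d"
            % (lo, hi, num_embeddings)
        )


# Stand-in batch of word indices with the (100, 23) shape implied by the trace.
batch = np.zeros((100, 23), dtype=np.int64)
check_indices(batch, num_embeddings=5000)  # in range, so nothing is raised

# A single out-of-range index (here equal to the hypothetical table size) is reported.
batch[0, 0] = 5000
try:
    check_indices(batch, num_embeddings=5000)
except IndexError as err:
    print("bad batch:", err)
```

Running such a check on each batch before it is moved to the GPU would show which iteration produces the bad index, which would fit the error appearing only after some normal iterations.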