CUDA runtime error 59 using h5py

I am encountering the notorious CUDA error 59.
The error occurs after some normal iterations.
I do not think I have a label mismatch problem.
I am using h5py to read data for the DataLoader, and h5py has previously been blamed for multiprocessing problems.
Setting the number of workers to 1 and adding

    import torch
    import torch.multiprocessing
    torch.multiprocessing.set_start_method('spawn')

does not solve my problem.
Is there a replacement for h5py that preserves the speed, or did I use numpy the wrong way?
The data loader is for ResNet image features and text bag-of-words (BoW) representations.

I tried replacing h5py with numpy.load, but data loading became very slow.
Is there any advice on data loading?
I used the code above and got the following (this is the trace I get using the CUDA_LAUNCH_BLOCKING=1 trick):

/pytorch/torch/lib/THC/ void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [57,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
torch.Size([100, 23, 1024])
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../generic/ line=17 error=59 : device-side assert triggered
Traceback (most recent call last):
File "", line 39, in <module>
  o = model(q=q,v=v)
File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/nn/modules/", line 224, in __call__
  result = self.forward(*input, **kwargs)
File "/home/leon_gao19/10707/10707_Project/model/", line 16, in forward
  embedding = torch.sum(embeds,dim=1)
File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/", line 476, in sum
  return Sum.apply(self, dim, keepdim)
File "/home/cs231n/myVE35/lib/python3.5/site-packages/torch/autograd/_functions/", line 21, in forward
  return input.sum(dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at / 
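
For what it's worth, the assertion `srcIndex < srcSelectDimSize` in the trace comes from an index-select kernel, which is the kind of kernel an embedding lookup runs. It usually means some index is negative or is greater than or equal to the number of rows in the table being indexed. Because CUDA kernels run asynchronously, the Python traceback may point at a later operation (here torch.sum) rather than the lookup itself. Below is a minimal sketch of a check that could be run on each batch before it reaches the model. It assumes that q holds integer word indices read from the h5py file; check_indices and the vocabulary size of 1000 are hypothetical names and values, not taken from the code in this thread.

```python
import numpy as np


def check_indices(indices, num_embeddings):
    """Raise ValueError if any index falls outside [0, num_embeddings)."""
    arr = np.asarray(indices)
    if arr.size == 0:
        return  # nothing to check in an empty batch
    lo, hi = int(arr.min()), int(arr.max())
    if lo < 0 or hi >= num_embeddings:
        raise ValueError(
            "index range [%d, %d] is outside [0, %d)" % (lo, hi, num_embeddings)
        )


# Hypothetical batch of word indices for a vocabulary of 1000 words.
batch = np.array([[1, 5, 999], [0, 2, 7]])
check_indices(batch, 1000)  # valid batch, so this passes silently
```

Calling this inside the training loop on each batch (converted to a numpy array on the CPU) should identify the offending batch, if the indices are indeed the cause.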

Yazhi_Gao, did you resolve the problem? Thanks.