RuntimeError: cublas runtime error : library not initialized at

I tried the solution from issue 1375, i.e. sudo rm -r ~/.nv, but I still get the same error as before.
The error occurs when I call the model’s forward() function, not when I move the input tensors to the GPU.
I am using Python 3.5, PyTorch 0.3.1, and cuDNN 7.0.5. My machine has two GPUs, and I use the snippet below to select one:

import os
opt_gpu = 0
# Environment variable values must be strings, so convert the index.
os.environ["CUDA_VISIBLE_DEVICES"] = str(opt_gpu)
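
For reference, a minimal sketch of selecting a GPU through the environment. It assumes the variable is set before the first CUDA call (ideally before importing torch), since changing it later has no effect on an already initialized process. select_gpu is a hypothetical helper name, not part of PyTorch.

```python
import os


def select_gpu(gpu_index):
    """Expose only one physical GPU to CUDA (hypothetical helper)."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    # os.environ only accepts str values; assigning an int raises TypeError.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this before any CUDA work starts. The selected GPU is then
# visible to the process as device 0, whatever its physical index.
visible = select_gpu(0)
```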

The detailed errors are:

RuntimeError: cublas runtime error : library not initialized at /opt/conda/conda-bld/pytorch_1518241081361/work/torch/lib/THC/THCGeneral.c:405
/opt/conda/conda-bld/pytorch_1518241081361/work/torch/lib/THC/THCTensorScatterGather.cu:176: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [0,0,0], thread: [106,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1518241081361/work/torch/lib/THC/THCTensorScatterGather.cu:176: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [0,0,0], thread: [360,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1518241081361/work/torch/lib/THC/THCTensorScatterGather.cu:176: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [1,0,0], thread: [80,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.

Any advice would be appreciated. Thanks.
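
A note on reading the log: the device-side assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` comes from a scatter kernel, which suggests that some index passed to a scatter call falls outside the target dimension. Because CUDA kernels run asynchronously, the cuBLAS error may be a later symptom of that failure and not the root cause. Rerunning on the CPU, or with CUDA_LAUNCH_BLOCKING=1 set, usually gives a more precise traceback. Below is a minimal sketch of the same bounds check in plain Python; find_bad_indices is a hypothetical helper, and the sample values are made up for illustration.

```python
def find_bad_indices(indices, dim_size):
    """Return the index values that violate 0 <= value < dim_size."""
    return [v for v in indices if not (0 <= v < dim_size)]


# Illustrative values only: class labels used as scatter indices for a
# one-hot encoding with 5 classes. Labels 5 and -1 are out of range.
labels = [0, 3, 5, 2, -1]
bad = find_bad_indices(labels, 5)
print(bad)  # prints [5, -1]
```

With real tensors, the same idea would be to compare the index tensor's min() and max() against the size of the target along the scatter dimension before calling forward().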

I got the same error as you. Can you tell me how to fix it?