RuntimeError: shmem_size <= sharedMemPerBlock INTERNAL ASSERT FAILED

When running the code from GitHub - CSUBioGroup/DeepLncLoc: A deep learning-based lncRNA subcellular localization predictor, I got the following error:
Traceback (most recent call last):
  File "train.py", line 34, in <module>
    model.cv_train(dataClass, trainSize=1, batchSize=batchSize, stopRounds=-1, earlyStop=10,
  File "D:\GitHub\DeepLncLoc\model\DL_ClassifierModel.py", line 24, in cv_train
    res = self.train(dataClass,trainSize,batchSize,epoch,stopRounds,earlyStop,saveRounds,optimType,lr,weightDecay,
  File "D:\GitHub\DeepLncLoc\model\DL_ClassifierModel.py", line 57, in train
    loss = self._train_step(X,Y, optimizer)
  File "D:\GitHub\DeepLncLoc\model\DL_ClassifierModel.py", line 175, in _train_step
    loss.backward(retain_graph=True)
  File "C:\ProgramData\Anaconda3\envs\newenv\lib\site-packages\torch\tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\ProgramData\Anaconda3\envs\newenv\lib\site-packages\torch\autograd\__init__.py", line 97, in backward
    Variable._execution_engine.run_backward(
RuntimeError: shmem_size <= sharedMemPerBlock INTERNAL ASSERT FAILED at C:/w/1/s/tmp_conda_3.8_075542/conda/conda-bld/pytorch_1579852615070/work/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu:668, please report a bug to PyTorch.
I am using a GTX 1060 graphics card with 6 GB of VRAM and 16 GB of system RAM.
I tried PyTorch 1.4 with CUDA 9.2 and PyTorch 1.8 with CUDA 10.2, but the version does not seem to be the problem.
Could anyone help me figure out how to decrease the shmem_size, or suggest another way to get rid of this error? Thank you!
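In case it helps, the traceback points at AdaptiveAveragePooling.cu during backward, so I suspect the failure comes from an adaptive average pooling layer whose backward kernel requests more shared memory than my GPU allows per block. Below is a minimal sketch of the kind of workaround I am considering: wrapping just that pooling op so it runs on the CPU while the rest of the model stays on the GPU. The class name, output size, and the usage line are placeholders of mine, not code from the DeepLncLoc repo.

```python
# A sketch of a possible workaround (not from the DeepLncLoc code):
# run only the adaptive average pooling on the CPU to sidestep the CUDA
# kernel's per-block shared-memory limit.
import torch
import torch.nn as nn

class CpuAdaptiveAvgPool1d(nn.Module):
    """Adaptive average pooling that detours through the CPU."""
    def __init__(self, output_size):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(output_size)

    def forward(self, x):
        device = x.device
        # Pool on the CPU (forward and backward for this op run there),
        # then move the result back to the original device.
        return self.pool(x.cpu()).to(device)

# Hypothetical usage: swap the pooling layer inside the model, e.g.
# model.pooling = CpuAdaptiveAvgPool1d(output_size=...)
```

Would something like this be reasonable, or is there a cleaner way to reduce the shared memory the kernel needs (for example, a smaller pooling output size or batch size)?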