Calling nn.functional.kl_div triggers RuntimeError: too many resources requested for launch

The code runs fine on my desktop with an NVIDIA K40, but when I move it to a server with an NVIDIA K80, this error occurs. I find it very confusing. Could someone advise me on what I should do? Thanks in advance! Here is the error log:

Traceback (most recent call last):
  File "train-hg-asn.py", line 605, in <module>
    main()
  File "train-hg-asn.py", line 145, in main
    optimizer_hg, aug, optimizer_aug, dropout, optimizer_dropout, epoch, visualizer, idx, opt)
  File "train-hg-asn.py", line 329, in train
    loss_scale = F.kl_div(logsoftmax_pred_scale_distri, grnd_scale_distri) * grnd_scale_distri.size(1)
  File "/gpu/homedirs/zt53/python-env3/local/lib/python2.7/site-packages/torch/nn/functional.py", line 507, in kl_div
    return _functions.thnn.KLDivLoss(size_average)(input, target)
  File "/gpu/homedirs/zt53/python-env3/local/lib/python2.7/site-packages/torch/nn/_functions/thnn/auto.py", line 41, in forward
    output, *self.additional_args)
RuntimeError: after cudaLaunch in triple_chevron_launcher::launch(): too many resources requested for launch

Can you tell me the sizes of the tensors that are passed to nn.functional.kl_div?
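One way to collect that information is to print a short summary of both arguments right before the failing call. The helper below is only a sketch: `describe` is a made-up name, not part of PyTorch, and it assumes nothing beyond the `.size()`, `.is_cuda`, and `.is_contiguous()` members that tensors expose (the last two are looked up defensively in case the object lacks them).

```python
def describe(name, t):
    """Return a one-line summary of a tensor-like object.

    Hypothetical debugging helper, not a PyTorch API.
    """
    # .size() returns the dimensions; convert to a plain tuple for printing.
    size = tuple(t.size())
    # Fall back to None when the attribute or method is not available.
    is_cuda = getattr(t, "is_cuda", None)
    contiguous = t.is_contiguous() if hasattr(t, "is_contiguous") else None
    return "%s: size=%s cuda=%s contiguous=%s" % (name, size, is_cuda, contiguous)
```

In the script from the traceback, this could be called as print(describe("input", logsoftmax_pred_scale_distri)) and print(describe("target", grnd_scale_distri)) on the line just before F.kl_div, so the output from the K40 and K80 machines can be compared side by side.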

The same error happens for me, even though the tensor sizes are the same as in earlier runs of the same program that worked properly with a different learning rate and initialization. I have no idea why it occurs or what the source of the error is:
RuntimeError: after cudaLaunch in triple_chevron_launcher::launch(): too many resources requested for launch
Thanks.