PyTorch training unstable with different batch sizes

I have tried to run this PyTorch code []

for an event classification task on 8 K80 GPUs. It runs well with the batch size set to 32, but with the batch size set to 16 or 8 it fails randomly at some epoch and step.

I wonder whether any of you have had a similar issue and how you solved it. The torch version I use is 1.4.0, since updating to PyTorch 1.6.0 produces an error with this code, which is a known issue.

Does the code above have any suspicious parts?

Thank you guys for the help!

What is the known issue preventing the update to 1.6, and what kind of errors are you getting?