I tried to run this PyTorch code [https://github.com/oudalab/eeqa/blob/master/code/run_trigger_qa.py]
for an event classification task on 8 K80 GPUs. It runs fine with the batch size set to 32, but with the batch size set to 16 or 8 it fails at a random epoch and step.
Has anyone run into a similar issue, and how did you solve it? I am using torch 1.4.0, because this code throws an error after upgrading to PyTorch 1.6.0, which is a known issue.
Does the code above have any suspicious parts?
Thank you guys for the help!