I tried to catch the exception so that the model would move to the next GPU automatically, but it does not work.
Have a look at the FairSeq example on how to recover from OOM errors.
They simply skip the batch and try to continue training. You could try adapting this example so that, instead of skipping, it moves your model to the next GPU.
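To make the idea concrete, here is a rough sketch of how that pattern might be adapted. It is not the FairSeq code. The names `is_oom`, `run_with_gpu_fallback`, `step` and `on_switch` are all made up for illustration, and the sketch assumes the OOM surfaces as a `RuntimeError` whose message contains "out of memory". The loop itself is plain Python, so the PyTorch-specific work (moving the model, clearing the cache) is left to the callbacks you pass in. Note that the recovery happens after the `except` block has finished, since the exception object keeps a reference to the failing stack frame and can stop tensors from being freed.

```python
def is_oom(exc):
    """Heuristic check (assumption): OOM shows up as a RuntimeError
    whose message contains 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def run_with_gpu_fallback(step, batches, devices, on_switch=None):
    """Run step(batch, device) for every batch.

    On an OOM error, advance to the next entry in `devices`, call
    on_switch(new_device) if given, and retry the same batch.
    Returns (results, device_in_use_at_the_end).
    """
    device_idx = 0
    results = []
    for batch in batches:
        while True:
            oom = False
            try:
                results.append(step(batch, devices[device_idx]))
            except RuntimeError as exc:
                if not is_oom(exc):
                    raise  # unrelated error: do not swallow it
                oom = True  # only set a flag; recover outside the except block
            if not oom:
                break  # batch succeeded, go on to the next one
            # Recovery runs here, after the except block has exited, so the
            # exception no longer holds on to the failing stack frame.
            if device_idx + 1 >= len(devices):
                raise RuntimeError("out of memory on every available device")
            device_idx += 1
            if on_switch is not None:
                on_switch(devices[device_idx])
    return results, devices[device_idx]
```

In a real training script, `step` would move the batch to `device` and run the forward/backward pass, and `on_switch` would do something like `model.to(new_device)` followed by `torch.cuda.empty_cache()`. Keep in mind that an optimizer created before the switch may still hold state on the old GPU, so you might need to rebuild it after moving the model.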