I use torch.multiprocessing to launch distributed training, but some batches may raise a CUDA out of memory error. I just want to skip these batches. When training on a single GPU, I can successfully skip them with a try/except block.
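For reference, here is a minimal sketch of the single-GPU skipping I'm describing (the model, optimizer, and data here are just placeholders, not my actual training code):

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer/data, only to illustrate the skipping logic.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = (torch.randn(32, 1024, device="cuda") for _ in range(100))

for batch in loader:
    try:
        optimizer.zero_grad()
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
    except RuntimeError as e:
        # CUDA OOM surfaces as a RuntimeError whose message contains
        # "out of memory"; skip this batch and release cached memory.
        if "out of memory" in str(e):
            optimizer.zero_grad()
            torch.cuda.empty_cache()
            continue
        raise
```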
But this doesn't work in the distributed case: the training just hangs. I suspect it is caused by the communication between the different processes.
I'd appreciate it if someone could help me.