When training with DDP and SyncBatchNorm (one process per GPU), the whole training job blocks when I catch a GPU OOM exception in one process. What should I do?
My code is below: when an OOM exception occurs in one process, I just skip that batch so the training phase can continue.
```python
for i, (inputs, targets) in enumerate(train_loader):
    try:
        # do forward and backprop
        ...
    except RuntimeError as e:
        if 'out of memory' in str(e):
            print('| WARNING: ran out of memory, skipping this batch.')
            if hasattr(torch.cuda, 'empty_cache'):
                torch.cuda.empty_cache()
            optimizer.zero_grad()
        else:
            raise
```
However, when one process catches the exception and skips the batch, the other processes block, presumably because they are still waiting in collective operations (DDP's gradient all-reduce in backward, SyncBatchNorm's statistics sync in forward) that the OOM rank never enters.
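Would a collective skip flag work here? Below is a rough, untested sketch of what I mean (`model`, `criterion`, `optimizer`, `train_loader`, and `device` are placeholders from the usual DDP setup): each rank raises a flag on OOM, the flags are summed with `all_reduce`, and every rank skips the batch if any flag is set.

```python
import torch
import torch.distributed as dist

# model, criterion, optimizer, train_loader, and device are assumed to come
# from the usual DDP setup; they are placeholders in this sketch.
for i, (inputs, targets) in enumerate(train_loader):
    oom = torch.zeros(1, device=device)  # this rank's OOM flag
    try:
        outputs = model(inputs)          # SyncBatchNorm syncs stats here
        loss = criterion(outputs, targets)
        loss.backward()                  # DDP all-reduces gradients here
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise
        print('| WARNING: rank %d ran out of memory' % dist.get_rank())
        oom.fill_(1)
        torch.cuda.empty_cache()

    # Make the skip decision collective: if any rank hit OOM, all ranks skip,
    # so no rank is left waiting alone in a later collective.
    dist.all_reduce(oom, op=dist.ReduceOp.SUM)
    if oom.item() > 0:
        optimizer.zero_grad()
        continue

    optimizer.step()
    optimizer.zero_grad()
```

But I suspect this still hangs if the OOM is raised while the other ranks are already inside a collective (SyncBatchNorm's forward-time sync or the backward all-reduce), since my flag `all_reduce` would then be mismatched against their pending operation. So I'm not sure this fully solves the problem.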