Distributed training with Apex fails at opt_level O0

Hi, I am training a UNet on a single machine with 2 GPUs. The code works fine when using only PyTorch's distributed package (with DistributedDataParallel). However, when training with Apex, I get strange errors if I set opt_level='O0', while training with opt_level='O1' is always fine. The error always happens on process 1 at scaled_loss.backward(). Any suggestion is appreciated. Thanks.
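For context, here is a minimal sketch of my setup (the stand-in model, dummy data, and hyperparameters are placeholders for illustration, not my actual code; the real script trains a UNet):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from apex import amp
from apex.optimizers import FusedAdam


def train(rank, world_size, opt_level):
    # One process per GPU, spawned via mp.spawn below.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model standing in for the UNet.
    model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda(rank)
    optimizer = FusedAdam(model.parameters(), lr=1e-3)

    # O0 is pure FP32; O1 patches selected ops to run in FP16.
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model = DDP(model, device_ids=[rank])
    criterion = torch.nn.MSELoss()

    for _ in range(10):
        inputs = torch.randn(2, 3, 64, 64, device=f"cuda:{rank}")
        targets = torch.randn(2, 3, 64, 64, device=f"cuda:{rank}")
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()  # the error below is raised here on rank 1
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(train, args=(2, "O0"), nprocs=2, join=True)
```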

Error message:

Traceback (most recent call last):
  File "main_V4.py", line 339, in <module>
    mp.spawn(YYY_Proj, args=(2,opt), nprocs=2, join=True)
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/ZZZ/Documents/YYY_XXX/Apex_Distributed/main_V4.py", line 191, in YYY_Proj
    scaled_loss.backward()
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError

+ set +x
/home/ZZZ/anaconda3/envs/Apex/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Could you post the complete RuntimeError?
We also recommend using the native mixed-precision training support via torch.cuda.amp, which is available in the nightly binaries and on current master, if updating is an option for you.
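For reference, the usage pattern looks roughly like this (the linear model, dummy data, and hyperparameters below are just placeholders to keep the sketch self-contained):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = torch.nn.Linear(16, 16).to(device)             # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
scaler = GradScaler()

for _ in range(10):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 16, device=device)
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()          # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                 # unscale gradients; skip the step if inf/nan shows up
    scaler.update()                        # adjust the scale factor for the next iteration
```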

Thanks for your reply. The strange thing is that what I posted is the entire error message I got.

It’s a bit weird that you are seeing this issue with opt_level='O0', since O0 is pure FP32 training and shouldn’t do anything.
Are you calling half() on any tensors or models? Also, I understand that your code is running fine without using apex at all?

I didn't call half() on any tensors or models. The code runs fine without using apex at all, or with apex and opt_level='O1'.

Please note that when using apex, I switched from torch's Adam optimizer to FusedAdam from apex.optimizers. I will test whether the error goes away when I use torch's Adam instead (roughly the change sketched below).
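This is the swap I plan to test; the placeholder model and learning rate are just for illustration, and everything else (amp.initialize, DDP, the training loop) stays unchanged:

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder model

# current setup:
# from apex.optimizers import FusedAdam
# optimizer = FusedAdam(model.parameters(), lr=1e-3)

# what I will try instead:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```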

I am attaching a screenshot of the error message (sorry, I had to black out the user information). The exact time at which this error happens seems to be random: sometimes it only appears after 100 epochs.