DataParallel crashes the system

I have two 1080Ti GPUs.

If I train my network on one GPU, everything seems to work fine. I use something like:

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=200, shuffle=True, num_workers=1, pin_memory=True) for x in ['train', 'val', 'test']}

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

model_ft.to(device)

When I wrap the same model in DataParallel and use the same data, the system crashes in the first epoch.

My code is:

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=200, shuffle=True, num_workers=1, pin_memory=True) for x in ['train', 'val', 'test']}

if torch.cuda.device_count() > 1:
    device = torch.device("cuda")
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model_ft = nn.DataParallel(model_ft)

model_ft.to(device)

I can make DataParallel work if I reduce the batch size to something as small as 40, but that doesn't seem to speed up my training.

Any thoughts?

Thanks!
S

How does it crash? Does it run out of memory?

System reboots without any warning!

I don’t think there are any memory issues; I was tracking nvidia-smi. There aren’t any power issues either.
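For reference, this is roughly the kind of tracking I mean (a sketch; the file name, the 1-second interval, and the query fields are arbitrary choices). Logging to a file means the last readings taken before a sudden reboot are still on disk afterwards:

import os
import subprocess
import time

# Append one nvidia-smi reading per second to a CSV file so the samples
# taken just before a crash survive the reboot.
with open("gpu_stats.csv", "a") as f:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,index,memory.used,power.draw,temperature.gpu",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True)
        f.write(out.stdout)
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard reboot
        time.sleep(1.0)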

PS: I am using PyTorch 0.4 with Ubuntu 16.04

I have also had this issue today. Did you happen to work out what was causing it?

No, I still find that training on one GPU without DataParallel works better for me.

Ah ok… It must be possible; I have read some papers where the researchers trained on 3 and 4 GPUs (evidenced in their code), so it must be possible, unless they had problems with it too. I initially thought it was system specs (PSU, thermal throttling, etc.), but when I looked at all that it seemed fine. One of life’s mysteries, I guess.

I am sure it is possible.

I tried upgrading to Ubuntu 18.04, changed the slots of my GPUs, etc. It never really worked. But I was able to afford another PC and a hard disk (rich employer :wink: ), so running two different models on two PCs is faster for me than using DataParallel.

Let me know if you manage to solve this issue somehow.

What happens if you run two models concurrently on separate GPUs? If that also crashes your system, it might be PSU-related; I had a similar issue when both GPUs were under heavy load, and a new PSU fixed it.
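Something along these lines should be enough to load both cards at once (a rough sketch; the matrix size and iteration count are arbitrary, just big enough to keep the GPUs busy for a few minutes):

import threading
import torch

def burn(device_str, size=8192, iters=2000):
    # Keep one GPU busy with large matrix multiplications.
    dev = torch.device(device_str)
    a = torch.randn(size, size, device=dev)
    b = torch.randn(size, size, device=dev)
    for _ in range(iters):
        a = a @ b
    # .item() blocks until all queued kernels on this device have finished.
    return a.sum().item()

# Load all visible GPUs at the same time; if the machine reboots here
# but survives the same loop on a single GPU, the PSU is a likely suspect.
threads = [threading.Thread(target=burn, args=("cuda:{}".format(i),))
           for i in range(torch.cuda.device_count())]
for t in threads:
    t.start()
for t in threads:
    t.join()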

Thanks Scott!
It looks like my issue is the same, maybe a faulty PSU; I have ordered a new one.

I really appreciate your help!

S

Hey all, I have encountered the same issue on a dual-GPU setup. I am using PyTorch 1.0 and have experienced random crashes. One observation, as @Sn_T noted: if I increase the batch size beyond what the total GPU memory allows, the system definitely crashes instead of throwing a CUDA out-of-memory error.

Other times I have experienced crashes after having to force-terminate the training midway and restart. Just a side note: num_workers = 8.

Was a faulty PSU really the problem? I am using “700 Watt be quiet! Pure Power 10 CM Modular 80+”

https://www.mindfactory.de/product_info.php/700-Watt-be-quiet--Pure-Power-10-CM-Modular-80--Silver_1138276.html

Has anyone found the source of the crash? Thanks in advance.