DataParallel crashes the system

I have two 1080Ti GPUs.

If I train my network on one GPU, everything seems to work fine. I use something like:

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=200, shuffle=True, num_workers=1, pin_memory=True) for x in ['train', 'val', 'test']}

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

model_ft.to(device)

When I wrap the same model in DataParallel and use the same data, the system crashes in the first epoch.

My code is:

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=200, shuffle=True, num_workers=1, pin_memory=True) for x in ['train', 'val', 'test']}

if torch.cuda.device_count() > 1:
    device = torch.device("cuda")
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model_ft = nn.DataParallel(model_ft)

model_ft.to(device)

I can make DataParallel work if I reduce the batch size to something as small as 40, but that doesn't seem to speed up my training.

Any thoughts?

Thanks!
S

How does it crash? Does it run out of memory?

System reboots without any warning!

I don’t think there are any memory issues; I was tracking nvidia-smi. There aren’t any power issues either.
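For reference, this is roughly the kind of tracking I mean (a sketch; the file name, the 1-second interval, and the query fields are arbitrary choices). Logging to a file means the last readings taken before a sudden reboot are still on disk afterwards:

import os
import subprocess
import time

# Append one nvidia-smi reading per second to a CSV file so the samples
# taken just before a crash survive the reboot.
with open("gpu_stats.csv", "a") as f:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,index,memory.used,power.draw,temperature.gpu",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True)
        f.write(out.stdout)
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard reboot
        time.sleep(1.0)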

PS: I am using PyTorch 0.4 with Ubuntu 16.04

I have also had this issue today. Did you happen to work out what was causing it?

No, I still find that training on one GPU without DataParallel works better for me.

Ah ok… It must be possible; I have read some papers where the researchers trained on 3 and 4 GPUs (evidenced in their code), so it must be possible, unless they had problems with it too. I initially thought it was system specs (PSU, thermal throttling, etc.), but when I looked at all that it seemed fine. One of life’s mysteries, I guess.

I am sure it is possible.

I tried upgrading to Ubuntu 18.04, changed the slots of my GPUs, etc. It never really worked. But I was able to afford another PC and a hard disk (rich employer :wink: ), so running two different models on two PCs is faster for me than using DataParallel.

Let me know if you manage to solve this issue somehow.

What happens if you run two models concurrently on separate GPUs? If that also crashes your system, it might be PSU-related; I had a similar issue when both GPUs were under heavy load, and a new PSU fixed it.
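Something along these lines should be enough to load both cards at once (a rough sketch; the matrix size and iteration count are arbitrary, just big enough to keep the GPUs busy for a few minutes):

import threading
import torch

def burn(device_str, size=8192, iters=2000):
    # Keep one GPU busy with large matrix multiplications.
    dev = torch.device(device_str)
    a = torch.randn(size, size, device=dev)
    b = torch.randn(size, size, device=dev)
    for _ in range(iters):
        a = a @ b
    # .item() blocks until all queued kernels on this device have finished.
    return a.sum().item()

# Load all visible GPUs at the same time; if the machine reboots here
# but survives the same loop on a single GPU, the PSU is a likely suspect.
threads = [threading.Thread(target=burn, args=("cuda:{}".format(i),))
           for i in range(torch.cuda.device_count())]
for t in threads:
    t.start()
for t in threads:
    t.join()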

Thanks Scott!
It looks like my issue is the same, maybe a faulty PSU; I have ordered a new one.

I really appreciate your help!

S

Hey all, I have encountered the same issue on a dual-GPU setup. I am using PyTorch 1.0 and have experienced random crashes. One observation, as @Sn_T noted: if I increase the batch size beyond what the total GPU memory allows, the system definitely crashes instead of throwing a CUDA out-of-memory error.

Other times I have experienced crashes after having to force-terminate the training midway and restart. Just a side note: num_workers = 8.

Was a faulty PSU really the problem? I am using “700 Watt be quiet! Pure Power 10 CM Modular 80+”

https://www.mindfactory.de/product_info.php/700-Watt-be-quiet--Pure-Power-10-CM-Modular-80--Silver_1138276.html

Has anyone found the source of the crash? Thanks in advance.