Running two different jobs in parallel on the same GPUs locked up my GPUs

hbkunn · March 21, 2017, 5:14am

First, I was training from a regular Python file on 4 GPUs using DataParallel.

Then I loaded the saved parameters in an IPython notebook over SSH while the previous job was still running.

When I loaded them on a single GPU instead of using DataParallel, it said that the weights don’t exist. So instead I used DataParallel on the same GPUs, just like in the training process, and then the problem occurred.
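A likely cause of the missing-weight message (an assumption, since the exact error isn't quoted here): parameters saved from a model wrapped in DataParallel have keys that start with `module.`, so an unwrapped model can't find them. A minimal sketch of renaming the keys so the checkpoint can be loaded without DataParallel; `strip_module_prefix` is a hypothetical helper name, not a PyTorch function:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key.

    Keys that do not start with the prefix are copied unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned
```

Usage would look roughly like `model.load_state_dict(strip_module_prefix(torch.load(path)))`, where `path` points to your own checkpoint file. That way the second process never needs to touch the GPUs the training job is using.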

The IPython kernel froze, and I immediately killed the job. Then my GPU was locked up like the picture shows. I can restart any job, but it only shows 1MB of memory whatever I try.

I’ve run into the same problem before. A reboot fixes it, but my peers are also using this remote server. What can I do ;-( I searched through ‘ps -ef’ but still cannot find the relevant jobs that caused the problem.

What have I done ;-(.

apaszke · March 21, 2017, 2:40pm

So the problem is that the NVIDIA libraries we’re using for inter-GPU communication in DataParallel do some funky stuff, and they can leave the driver in an inconsistent state. Just remember to never launch multiple DataParallel jobs that share some of the GPUs (it’s OK to run one job on GPUs 0 and 1 and another on GPUs 2 and 3).
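One way to keep jobs on disjoint GPUs is to set `CUDA_VISIBLE_DEVICES` in each process before CUDA is initialised. A minimal sketch, assuming a 4-GPU machine; `visible_devices_env` and `jobs_share_gpus` are hypothetical helper names used only for illustration:

```python
import os


def visible_devices_env(gpu_ids):
    """Build a CUDA_VISIBLE_DEVICES value from a list of physical GPU indices."""
    return ",".join(str(i) for i in gpu_ids)


def jobs_share_gpus(job_a, job_b):
    """Return True if two jobs' GPU lists have at least one GPU in common."""
    return bool(set(job_a) & set(job_b))


# Set this before anything initialises CUDA in the process
# (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_env([0, 1])

# Inside this process the two visible GPUs are numbered 0 and 1, so the
# model would then be wrapped as:
#   model = torch.nn.DataParallel(model, device_ids=[0, 1])
# A second job would use visible_devices_env([2, 3]) instead.
```

With this, a notebook started while training is running can only see the GPUs you hand it, so it can't accidentally launch DataParallel on the ones the training job occupies.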

hbkunn · March 21, 2017, 3:35pm

I have rebooted my server, but some strange errors still occur occasionally,

like ‘RuntimeError: cuda runtime error (4) : unspecified launch failure at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:18’.

And when I run ‘nvidia-smi’, sometimes it’s extremely slow…

Thanks for your reply. I won’t make the same mistake again.

hbkunn · March 22, 2017, 2:18am

e… it seems like a serious problem.

The unspecified launch failure keeps occurring about 1 or 2 hours after I launch my code. I can’t find any related solution. Should I reinstall my NVIDIA driver…?

smth · March 22, 2017, 5:26am

Is your GPU getting too hot? The failure showing up 1 to 2 hours after launching your code sounds like that might be the problem (Unspecified Launch Failure can sometimes be caused by that).
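To check this while the job runs, you could poll `nvidia-smi` for temperatures. A rough sketch: the query flags are standard `nvidia-smi` options, but the helper names and the 85 °C limit are arbitrary assumptions for illustration, and the parser assumes every field is numeric:

```python
import subprocess

# Standard nvidia-smi query: one "index, temperature" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: degrees_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temperatures():
    """Run nvidia-smi and return the current temperature of each GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_temperatures(output)


def hot_gpus(temps, limit_c=85):
    """Return the sorted indices of GPUs at or above limit_c (assumed limit)."""
    return sorted(i for i, t in temps.items() if t >= limit_c)
```

Calling `hot_gpus(read_temperatures())` every minute or so during training and logging the result would show whether the temperature climbs before the failure appears.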

yzhu · March 22, 2017, 5:48am

Maybe try a complete power cycle? Shut down the machine for a few seconds?

cjmcmurtrie · March 22, 2017, 6:10pm

Not sure why but this comment was addressed to me? I received an email notification.

hbkunn · March 27, 2017, 7:56am

Sorry for my late reply!
I restarted my system again and the problem is solved, I think. I can’t reproduce it now.
Though I don’t think it was due to the temperature: our school’s GPUs are housed in a dedicated area where two air conditioners are running. But still, thanks for your replies. I admire your group’s work!

hbkunn · March 27, 2017, 7:57am

Yes, I restarted the whole system, and the problem seems to be solved!
Thank you!