Running two different jobs in parallel on the same GPUs locked up my GPUs

First, I was training from a regular Python script on 4 GPUs using DataParallel.

Then, while that job was still running, I loaded the saved parameters in an IPython notebook over SSH.

When I loaded them on a single GPU instead of through DataParallel, it reported that the weights don’t exist. So instead I used DataParallel on the same GPUs as the training process, and that is when the problem occurred.
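A note on the "weight doesn't exist" error: a model wrapped in DataParallel keeps the real network under `.module`, so the keys in its saved `state_dict` carry a `module.` prefix that a plain single-GPU model does not expect. Assuming that is what happened here, a minimal sketch of a workaround is to strip the prefix before loading. The helper name `strip_module_prefix` and the file name in the comments are hypothetical.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key.

    Hypothetical helper: assumes the checkpoint was saved from a
    DataParallel-wrapped model, whose parameter names start with "module.".
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only drop the prefix when it is actually present at the start.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Intended use (not executed here; "checkpoint.pth" is a placeholder name):
#   state = torch.load("checkpoint.pth", map_location="cpu")
#   model.load_state_dict(strip_module_prefix(state))
```

Loading with `map_location="cpu"` first and moving the model to a GPU afterwards also avoids touching the GPUs that the training job is using.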

IPython froze, and I immediately killed the job. Now my GPUs are locked up, as the picture shows. I can restart any job, but it only shows 1 MB of memory whatever I try.

I’ve run into the same problem before. A reboot fixes it, but my peers are also using this remote server. What can I do ;-( I searched through ‘ps -ef’ but still cannot find the relevant jobs that caused the problem.
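If ‘ps -ef’ shows nothing obvious, one thing that may help is looking for processes that still hold the NVIDIA device files open. The sketch below is Linux-only and walks `/proc` for open file descriptors that point at `/dev/nvidia*`. The function name is hypothetical, you only see other users' processes when running as root, and finding nothing does not prove the driver is healthy.

```python
import os


def pids_holding_nvidia_devices(proc_root="/proc"):
    """Map PID -> set of /dev/nvidia* paths that the process has open.

    Hypothetical helper for Linux. Processes whose fd directory cannot be
    read (other users' processes, or ones that just exited) are skipped.
    """
    holders = {}
    if not os.path.isdir(proc_root):
        return holders
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # permission denied or process already gone
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith("/dev/nvidia"):
                holders.setdefault(int(entry), set()).add(target)
    return holders


if __name__ == "__main__":
    for pid, devices in sorted(pids_holding_nvidia_devices().items()):
        print(pid, sorted(devices))
```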

What have I done ;-(.

The problem is that the NVIDIA libraries we’re using for inter-GPU communication in DataParallel do some funky stuff, and they can leave the driver in an inconsistent state. Just remember to never launch multiple DataParallel jobs that share some of the GPUs (it’s OK to run one job on GPUs 0 and 1 and another on GPUs 2 and 3).
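One way to keep jobs on disjoint GPUs is to restrict what each process can see through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below sets it from Python. The helper name is hypothetical, and it only takes effect if it runs before CUDA is initialized in the process, so it is safest to call it before importing torch.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU ids to this process.

    Hypothetical helper: must run before CUDA is initialized
    (in practice, before `import torch`).
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Second job: use only physical GPUs 2 and 3, leaving 0 and 1 to the first job.
restrict_visible_gpus([2, 3])

# Inside this process the two visible GPUs are renumbered 0 and 1, so a
# DataParallel wrapper here would use device_ids=[0, 1].
```

The same effect can be had from the shell by setting the variable when launching the script, which avoids any ordering concerns inside the code.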

I have rebooted my server, but some strange errors still occur occasionally,

like ‘RuntimeError: cuda runtime error (4) : unspecified launch failure at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:18’.

And when I run ‘nvidia-smi’, it is sometimes extremely slow…

Thanks for your reply :slight_smile: I won’t make the same mistake again.

Er… it seems like a serious problem.

The unspecified launch failure consistently occurs about 1 or 2 hours after I launch my code. I can’t find any related solution. Should I reinstall my NVIDIA driver…?

Is your GPU becoming too hot? A failure that occurs 1 to 2 hours after launching your code sounds like that might be the problem (an unspecified launch failure can sometimes be caused by overheating).
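To check the temperature while the job runs, something like the following sketch could be left polling in another terminal. It shells out to `nvidia-smi` with its CSV query output; the two function names are hypothetical, and the parser assumes every GPU reports a plain integer temperature (a card that reports something like `[N/A]` would make it raise).

```python
import subprocess


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into a {gpu_index: degrees_C} dict."""
    temps = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_gpu_temperatures():
    """Query nvidia-smi once and return the current temperature per GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_temperatures(output)


# Intended use (requires nvidia-smi on the PATH):
#   print(read_gpu_temperatures())
```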

Maybe try a complete power cycle? Shut down the machine for a few seconds?

Not sure why but this comment was addressed to me? I received an email notification.

Sorry for my late reply!
I restarted my system again and I think the problem is solved. I can’t reproduce it now.
I don’t think it was due to the temperature, though: our school’s GPUs are housed in a dedicated area with two air conditioners running. But thanks for your reply anyway. I admire your group’s work!

Yes, I restarted the whole system, and the problem seems to be solved!
Thank you!