Server automatically reboots when cudnn.benchmark is set to True

zed · April 20, 2017, 9:01am

When I train the ResNet-18 model with the PyTorch ImageNet example
https://github.com/pytorch/examples/blob/master/imagenet/main.py
I see these two lines:
import torch.backends.cudnn as cudnn
cudnn.benchmark = True
When I run the program, the server (which has 4 GTX 1080 GPUs) automatically reboots after several seconds. After I set the value to False, everything is fine.
So what do these lines mean, and what happens when the value is set to True?

smth · April 26, 2017, 11:06pm

This can happen if you have hardware power issues. cuDNN benchmark mode pushes the GPUs to their limits, and they might be tripping the power supply.
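
For context: setting cudnn.benchmark = True asks cuDNN to try several convolution algorithms for each new input shape and reuse the fastest one, which tends to help most when input sizes stay fixed. Below is a minimal sketch of how one might time the effect. The helper name time_conv and the layer and batch sizes are made up for illustration, and the code falls back to the CPU (where the flag has no effect) if no GPU is available.

```python
import time

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn


def time_conv(benchmark, iters=5, batch=2, size=32):
    """Return the average seconds per forward pass of one conv layer.

    Hypothetical helper for illustration; sizes are arbitrary.
    """
    # The flag only influences cuDNN (CUDA) convolutions.
    cudnn.benchmark = benchmark
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    conv = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
    x = torch.randn(batch, 3, size, size, device=device)

    with torch.no_grad():
        # Warm-up call; with benchmark=True on a GPU, the algorithm
        # search for this input shape happens here.
        conv(x)
        if device.type == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
        if device.type == "cuda":
            # Wait for queued GPU work before stopping the clock.
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    for flag in (False, True):
        print(f"benchmark={flag}: {time_conv(flag):.6f} s per pass")
```

Any speedup depends on the model, the input shapes, and the GPU, so the numbers from a toy layer like this are only a rough indication.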

zed · April 27, 2017, 6:42am

Thanks for your reply.
The server has 1 x Intel Core i7-6900K CPU, 4 x NVIDIA GTX 1080 GPUs, and a power supply unit rated at 1600 W.
Is the power supply sufficient? And how much speedup will I get if I set cudnn.benchmark=True?
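
A rough way to sanity-check the power question is to add up nominal TDP figures. The sketch below assumes 180 W per GTX 1080 (reference board) and 140 W for the i7-6900K; these are vendor TDP numbers, not measurements from this server, and the function name is made up for illustration.

```python
# Nominal TDP figures (assumptions taken from vendor specs, not measured).
GPU_TDP_W = 180     # GTX 1080, reference board
CPU_TDP_W = 140     # Core i7-6900K
PSU_RATED_W = 1600  # rating quoted in the post


def nominal_load_w(num_gpus, gpu_tdp=GPU_TDP_W, cpu_tdp=CPU_TDP_W):
    """Sum of nominal GPU and CPU power; ignores every other component."""
    return num_gpus * gpu_tdp + cpu_tdp


load = nominal_load_w(4)
headroom = PSU_RATED_W - load
print(f"nominal load: {load} W, headroom: {headroom} W")
```

On paper this leaves a wide margin, but the sum ignores the motherboard, drives, and fans, and it says nothing about short power spikes, per-rail limits, or a degraded PSU, any of which could still cause a reboot under sudden full load.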

Eric_K · March 21, 2018, 6:44am

Hi, zed

Did you figure out how much speedup you get when you set cudnn.benchmark=True?
I am facing the same problem.

Thank you!

zed · March 21, 2018, 8:24am

Hi, Eric

I didn’t test how much speedup I could get, since I eventually fixed my problem.

I found that my code was generating wrong labels. At first, I set each label to a number from 1 to the number of classes, but it seems that PyTorch only handles labels from 0 to num_classes-1. After fixing that, everything works fine.
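
A minimal sketch of that kind of fix, assuming the labels arrive as a plain list of 1-based integers. The function name to_zero_based is made up for illustration; PyTorch classification losses such as nn.CrossEntropyLoss expect class indices in the range 0 to num_classes-1.

```python
def to_zero_based(labels, num_classes):
    """Shift 1-based labels (1..num_classes) to 0-based (0..num_classes-1).

    Raises ValueError if any shifted label falls outside the valid range,
    so bad labels are caught before they reach the loss function.
    """
    shifted = [label - 1 for label in labels]
    bad = [label for label in shifted if not 0 <= label < num_classes]
    if bad:
        raise ValueError(
            f"labels out of range [0, {num_classes - 1}] after shifting: {bad}"
        )
    return shifted


if __name__ == "__main__":
    print(to_zero_based([1, 2, 3], num_classes=3))
```

Running a check like this once over the dataset is cheap and fails with a readable message instead of a crash deep inside training.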

I’d be glad if my reply helps you.

Eric_K · March 21, 2018, 1:04pm

Thank you very much!!