System reboots when increasing batch size


(liyukun) #1

I tried to implement SSD using the PyTorch code from https://github.com/amdegroot/ssd.pytorch, and something strange happened. When I set batch_size to 1, 2, or 4, CPU utilization is nearly 99% (the GPUs are used normally). When batch_size is increased to 8 or more, the system crashes and reboots automatically. So I can only train with a batch size of 4, but I end up with a lower mAP of 72.74 (compared with the author's 77.43). I think the batch size matters a lot, but I cannot increase it because of this problem.

So I am really confused: why does it use so much CPU when the GPUs are doing the work? How can I solve this?
When I train other PyTorch networks, this weird thing doesn't happen (about 1% CPU utilization).
Hardware: 8x 1080Ti, 56 cores, Ubuntu 16.04.
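In case it's relevant, I'm considering capping PyTorch's intra-op thread pool to see whether that's where the CPU load comes from (the thread count of 4 here is just a guess, and this wouldn't affect DataLoader workers):

```python
import os
import torch

# Limit CPU parallelism before any heavy ops run.
# OMP_NUM_THREADS covers OpenMP-backed kernels; set_num_threads covers
# PyTorch's own intra-op thread pool. The value 4 is an arbitrary guess.
os.environ["OMP_NUM_THREADS"] = "4"
torch.set_num_threads(4)

print(torch.get_num_threads())
```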


#2

Could the increased batch size also increase the power draw of your GPUs, which might crash a too-weak or faulty PSU?
Could you create a dummy model that uses all GPUs at high utilization?
Did you notice anything else, e.g. your memory filling up?
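To test the PSU theory, a quick load generator along these lines might help. This is just a sketch (the `stress_gpus` helper and the sizes are made up); run `nvidia-smi -l 1` in another terminal while it runs to watch the power draw:

```python
import torch
import torch.nn as nn

def stress_gpus(steps: int = 100, size: int = 4096) -> int:
    """Run large matmuls on every visible GPU to generate sustained load.

    Falls back to CPU when no GPU is present so the script still runs.
    Returns the number of devices exercised.
    """
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
    models = [nn.Linear(size, size).to(d) for d in devices]
    for _ in range(steps):
        for model, d in zip(models, devices):
            x = torch.randn(64, size, device=d)
            model(x).sum().backward()  # forward + backward keeps the device busy
    return len(devices)

if __name__ == "__main__":
    print(f"stressed {stress_gpus()} device(s)")
```

If the machine also reboots under this synthetic load, the PSU or cooling is the likely culprit rather than anything in the SSD code.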