Tmux session dies when DataParallel is used

I have a server with 4 GPUs. When I train a model, I connect to the server over ssh, start a tmux session, and run the training inside it. If I use just 1 GPU for training, there is no problem. But if I train on all 4 GPUs using DataParallel, the connection to the server sometimes drops and I cannot reconnect for 1–2 minutes. After I reconnect, the tmux session is gone. The disconnection does not happen at a regular interval: training can run fine for over 24 hours, but the disconnection can also appear within an hour.
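For reference, here is a minimal sketch of the kind of setup I mean (the model and data here are placeholders, not my actual training code):

```python
import torch
import torch.nn as nn

# Placeholder model; my real model is larger.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Single-GPU training works fine:
# model = model.cuda()

# Training on all 4 GPUs is where the disconnects appear:
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10000):
    inputs = torch.randn(64, 512).cuda()          # dummy batch
    targets = torch.randint(0, 10, (64,)).cuda()  # dummy labels
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```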

I think the session is being killed for some unknown reason, because the server does not appear to reboot (there is no record of a reboot).

I also tried connecting over ssh without X11 forwarding, but the disconnection still occurs.

I’m using pytorch 1.4 and tmux 2.6.

Does anyone have any ideas?

Based on the description, I think the server might indeed be rebooting, e.g. due to an underpowered PSU when all four GPUs are under load. The next time you lose the ssh connection, try pinging the server: if the node is rebooting, you should see the pings time out until it comes back up.
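If you want to capture this automatically, here is a small sketch you could run on your local machine (the hostname is a placeholder, and the `ping` flags assume a Linux client):

```python
import subprocess
import time
from datetime import datetime

HOST = "my-gpu-server"  # placeholder: replace with your server's hostname or IP

# Ping the server every few seconds and log its status, so you can
# correlate outages with the times the tmux session disappears.
while True:
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", HOST],  # one ping, 2-second timeout
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    status = "up" if result.returncode == 0 else "DOWN"
    print(f"{datetime.now().isoformat()} {HOST} is {status}", flush=True)
    time.sleep(5)
```

A gap of "DOWN" entries lasting a minute or two, lining up with the lost tmux session, would be strong evidence that the machine is going down rather than just dropping the ssh connection.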