I have a server with 4 GPUs. When I train a model, I connect to the server over ssh, create a tmux session, and run training inside that session. If I train on just 1 GPU, there is no problem. But if I train on all 4 GPUs using DataParallel, my connection to the server is sometimes dropped and I cannot reconnect for 1–2 minutes. After I reconnect, the tmux session is gone. The disconnection does not happen at regular intervals: training can run fine for over 24 hours, but sometimes the disconnection occurs within an hour.
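For reference, my multi-GPU setup looks roughly like this (the model and input here are just placeholders, not my actual training code):

```python
import torch
import torch.nn as nn

# Stand-in for the real model; the actual architecture doesn't matter
# for reproducing the disconnection issue.
model = nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # Wrap the model so each batch is split across all visible GPUs.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(8, 16, device=device)  # dummy batch
out = model(x)
print(out.shape)  # torch.Size([8, 4])
```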
I suspect the session is being killed for an unknown reason, because the server itself is not rebooting (there is no record of a reboot).
I also tried connecting over ssh without X11 forwarding, but the disconnection still occurs.
I'm using PyTorch 1.4 and tmux 2.6.
Does anyone have an idea what might be causing this?