RuntimeError: [/home/coulombc/wheels_builder/tmp.17382/python-3.8/torch/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer []: srun: error: : task 16: Exited with exit code 1

I have written a model with an attention encoder. Since training was taking a long time, I distributed the data and ran the code on 6 nodes, each with 4 GPUs, on a remote cluster. I had tested the code twice before and it worked; the only change I made this time was increasing the latent dimension of the encoder. After running it this time, I got the error message below (a simplified sketch of my distributed setup follows the log):

Train-loss: 7.711e-01; Valid-loss: 3.047e+01; LR: 2.000e-04:   2%|▏         | 14/750 [40:49<28:58:21, 141.71s/it]
Train-loss: 7.711e-01; Valid-loss: 3.047e+01; LR: 2.000e-04:   2%|▏         | 15/750 [43:08<28:46:44, 140.96s/it]
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:462: RuntimeWarning: All-NaN slice encountered
  data_max = np.nanmax(X, axis=0)
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:461: RuntimeWarning: All-NaN slice encountered
  data_min = np.nanmin(X, axis=0)
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:462: RuntimeWarning: All-NaN slice encountered
  data_max = np.nanmax(X, axis=0)
Traceback (most recent call last):
  File "training_vrnn.py", line 618, in <module>
    main()
  File "training_vrnn.py", line 586, in main
    df = run_train(modelstate=modelstate,
  File "training_vrnn.py", line 229, in run_train
    train(epoch)  # model, train_options, loader_train, optimizer, epoch, lr)
  File "training_vrnn.py", line 155, in train
    scaler.scale(loss_).backward(retain_graph=True, inputs=list(modelstate.model.parameters()))
  File "/home/dm_control/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/dm_control/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [/home/coulombc/wheels_builder/tmp.17382/python-3.8/torch/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.70.12.103]:46811
srun: error: bg11203: task 5: Exited with exit code 1
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:462: RuntimeWarning: All-NaN slice encountered
  data_max = np.nanmax(X, axis=0)
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:461: RuntimeWarning: All-NaN slice encountered
  data_min = np.nanmin(X, axis=0)
/home/dm_control/lib/python3.8/site-packages/sklearn/preprocessing/_data.py:462: RuntimeWarning: All-NaN slice encountered
  data_max = np.nanmax(X, axis=0)
Traceback (most recent call last):
  File "training_vrnn.py", line 618, in <module>
    main()
  File "training_vrnn.py", line 586, in main
    df = run_train(modelstate=modelstate,
  File "training_vrnn.py", line 229, in run_train
    train(epoch)  # model, train_options, loader_train, optimizer, epoch, lr)
  File "training_vrnn.py", line 155, in train
    scaler.scale(loss_).backward(retain_graph=True, inputs=list(modelstate.model.parameters()))
  File "/home/dm_control/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/dm_control/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [/home/coulombc/wheels_builder/tmp.17382/python-3.8/torch/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.70.12.103]:4791
srun: error: bg11202: task 3: Exited with exit code 1
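
For context, I launch one task per GPU with srun and initialise torch.distributed with the Gloo backend (the backend named in the error), then train with DistributedDataParallel and a GradScaler. Here is a simplified sketch of that setup; the stand-in model, optimiser, data, and Slurm-to-rank mapping are illustrative, not my exact code:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative sketch, not my exact code: srun starts one task per GPU,
# and Slurm's environment variables give the rank and world size.
dist.init_process_group(
    backend="gloo",                              # the backend named in the error
    init_method="env://",                        # MASTER_ADDR/MASTER_PORT are set in the job script
    rank=int(os.environ["SLURM_PROCID"]),
    world_size=int(os.environ["SLURM_NTASKS"]),  # 6 nodes x 4 GPUs = 24 tasks
)

local_rank = int(os.environ["SLURM_LOCALID"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 1).cuda()             # stand-in for my attention-encoder model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                              # stand-in for my training loop
    u = torch.randn(32, 8).cuda()
    y = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss_ = torch.nn.functional.mse_loss(model(u), y)
    # The call from the traceback: during backward(), DDP also all-reduces
    # the gradients over Gloo's TCP connections, which is where
    # "Connection closed by peer" is raised.
    scaler.scale(loss_).backward(retain_graph=True,
                                 inputs=list(model.parameters()))
    scaler.step(optimizer)
    scaler.update()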

The line data_max = np.nanmax(X, axis=0) appears in the log because I am training on time-series data of different lengths: I normalise the data with MinMaxScaler from the sklearn library, and I replace the zero-padding values with NaN so that they are excluded from the computation.
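
To illustrate the preprocessing (toy data, not my real pipeline): MinMaxScaler's fit internally calls np.nanmin/np.nanmax, which are exactly the lines shown in the warnings above.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy illustration of my preprocessing, not the real pipeline.
# Series of different lengths, zero-padded to the same length:
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0,  0.0]])    # the 0.0 in the second column is padding

X[X == 0.0] = np.nan           # mask the padding so it is excluded from scaling

scaler = MinMaxScaler()        # fit() uses np.nanmin/np.nanmax per column
X_scaled = scaler.fit_transform(X)

# np.nanmax only warns "All-NaN slice encountered" when an entire slice is
# NaN, and it then returns NaN - which suggests some slices in my data
# contain nothing but padding.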

I would appreciate any thoughts on why I received the above error message.