torch.distributed.launch hangs

Hi,

I am trying to leverage parallelism with distributed training, but my process seems to hang or run into some sort of deadlock.
So I ran the code snippet below to test it, and it hangs as well. Can anyone please help with this?


import os
import torch
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'
os.environ['RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

torch.distributed.init_process_group(backend='nccl')  # hangs here

I’m running my code in a Jupyter notebook on a Kubeflow server, and here are the versions that I’m using:

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] Could not collect

Please suggest how to proceed further.

Thanks :slightly_smiling_face:

I changed torch from 1.7.1 to 1.7.1+cu101, as shown below:

Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.7.1+cu101
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] Could not collect

After this, torch.distributed.init_process_group(backend='nccl') worked, but it hung again with the full script. Below is the complete log:

2021-02-18 19:00:28.946359: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

fastai-c2-0:13993:13993 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1

Note: when we created this VM, we named it fastai-c2-0, but fastai has nothing to do with this issue, as I’m not using it at all.

I’m using this command to launch from the notebook:

!python -m torch.distributed.launch --nproc_per_node=1 ./Deepspeed.py --output_dir ./out_dir/results --overwrite_output_dir --do_train \
--do_eval --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --learning_rate 3e-5 --weight_decay 0.01 \
--num_train_epochs 1 --load_best_model_at_end

Here is the simple script that I’m using:

from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments, HfArgumentParser
import pandas as pd
import numpy as np
import torch
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['NCCL_DEBUG']='INFO'
os.environ['NCCL_DEBUG_SUBSYS']='ALL'
os.environ['NCCL_IB_DISABLE']='1'
os.environ['NCCL_SOCKET_IFNAME']='eth0'

tok = RobertaTokenizerFast.from_pretrained('/home/jovyan/models/roberta-large/')
model = RobertaForSequenceClassification.from_pretrained('/home/jovyan/models/roberta-large/', num_labels=2)

df_full = pd.read_csv('IMDB_Dataset.csv')
print("loaded df")
df_full = df_full.sample(frac=1).reset_index(drop=True)
df_req =  df_full.head(1000)
df_train = df_req.head(800)
df_eval = df_req.tail(200)
train_text, train_labels_raw = df_train.review.values.tolist(), df_train.sentiment.values.tolist()
val_text, val_labels_raw = df_eval.review.values.tolist(), df_eval.sentiment.values.tolist()


train_encodings = tok(train_text, padding=True, truncation=True, max_length=512)
val_encodings = tok(val_text, padding=True, truncation=True, max_length=512)
train_labels = [1 if i=='positive' else 0 for i in train_labels_raw]
val_labels = [1 if i=='positive' else 0 for i in val_labels_raw]


class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)

print("Encoding done")


parser = HfArgumentParser(TrainingArguments)
train_args = parser.parse_args_into_dataclasses()
print('parser and args created')


trainer = Trainer(
             model=model,
             args=train_args[0],
             train_dataset=train_dataset,
             eval_dataset=val_dataset
             )
if train_args[0].do_train:
    print('------------TRAINING-------------')
    trainer.train() 
if train_args[0].do_eval:
    print('------------EVALUATING-------------')
    trainer.evaluate()

Could someone please suggest how to proceed further?

I also tried with the nightly build (1.9.0.dev20210218+cu101), but it hung again after processing two more lines:

2021-02-18 19:28:13.170701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
parser and args created
------------TRAINING-------------
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:14431:14431 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

fastai-c2-0:14431:14431 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:14431:14431 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1

I used the same script for both versions.

Am I making any mistake? Please let me know.

Could you try adding the following code to your script?

torch.distributed.init_process_group(backend='YOUR BACKEND',
                                     init_method='env://')
torch.cuda.set_device(args.local_rank)  # before your code runs

Also, run python -m torch.distributed.launch --help to learn more about how to use torch.distributed.launch.
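For example, a minimal sketch of that pattern (not your original script; torch.distributed.launch passes --local_rank to the script, so it has to be parsed):

import argparse
import torch

# torch.distributed.launch sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the
# environment and passes --local_rank on the command line.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args, _ = parser.parse_known_args()

torch.cuda.set_device(args.local_rank)  # pin this process to its GPU first
torch.distributed.init_process_group(backend='nccl', init_method='env://')
print('initialized, local_rank =', args.local_rank)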

@Yanli_Zhao, I suggested that @Saichandra_Pandraju run this simple test:

echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("nccl")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py

and it hangs in his Kubeflow environment, whereas it should work just fine.

This is the simplest reproducible script I can think of that doesn’t involve any other parts.

For context, I have been trying to help @Saichandra_Pandraju in this issue: Model Parallelism for Bert Models · Issue #10151 · huggingface/transformers · GitHub, but no matter what I suggested, init just hangs there.

DeepSpeed on a single GPU with ZeRO-Offload still needs the distributed environment due to its implementation, which is why we are trying to figure out this problem. And I don’t think gloo, which does work for @Saichandra_Pandraju, will work with DeepSpeed, as it lacks features that nccl provides.
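As a sanity check, the simplest way to confirm the hang is NCCL-specific is the same one-process init with the gloo backend (just a sketch for comparison; as said above, DeepSpeed itself can’t use gloo):

import os
import torch

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '9994')

# gloo reportedly initializes fine in this environment, while nccl hangs
torch.distributed.init_process_group(backend='gloo', init_method='env://',
                                     rank=0, world_size=1)
print('gloo init OK')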


But reading his last follow-up, once he matched the CUDA version of PyTorch to the system-wide one, the basic launcher now works. It’s odd that he needed to match the two, as PyTorch’s distributed shouldn’t be affected by the system-wide CUDA install.

But then it gets stuck at a later stage; I will investigate the rest now.

If you’ve installed the PyTorch binaries, the shipped CUDA runtime should be used by NCCL.
However, based on the output, it seems that TF is loaded again (which created other issues in the past, but we never got an update from TF):

2021-02-18 19:28:13.170701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

Do you know if TF is using the locally installed CUDA runtime? If so, I would assume that you need to install PyTorch with the matching CUDA runtime, since TF also imports the system-wide installation.
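Something along these lines would show both sides of the comparison (a sketch; it assumes nvcc from the local toolkit is on the PATH):

import subprocess
import torch

print('torch:', torch.__version__)
print('CUDA used to build PyTorch:', torch.version.cuda)
print('NCCL bundled with PyTorch:', torch.cuda.nccl.version())
# system-wide toolkit, i.e. the runtime TF would pick up
print(subprocess.run(['nvcc', '--version'], stdout=subprocess.PIPE).stdout.decode())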


Good observation, @ptrblck! TF most likely uses a system-wide CUDA runtime - at least it did last time I checked.

USE_TF=0 should prevent TF from loading in the current incarnation of transformers; it’s very possible that this was at least a partial culprit.
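For example (a sketch of the idea), set it at the very top of the script, before anything from transformers is imported:

import os
os.environ['USE_TF'] = '0'  # must be set before the transformers import below

from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments, HfArgumentParser

or, equivalently, prefix the launch command with USE_TF=0.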

The issue got fully resolved using NCCL_SOCKET_IFNAME=lo as reported here based on this thread.
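Concretely, that means switching the interface in the env-var block of the script above (or prefixing the launch command with NCCL_SOCKET_IFNAME=lo):

import os

os.environ['NCCL_IB_DISABLE'] = '1'
os.environ['NCCL_SOCKET_IFNAME'] = 'lo'  # loopback instead of 'eth0'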