I changed torch from 1.7.1 to 1.7.1+cu101, as shown below:
Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.7.1+cu101
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] Could not collect
After this, torch.distributed.init_process_group(backend='nccl') worked, but the script still hangs. Below is the complete log:
2021-02-18 19:00:28.946359: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
fastai-c2-0:13993:13993 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
Note: when we created this VM, we named it fastai-c2-0, but fastai has nothing to do with this issue, as I'm not using it at all.
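To pick the warnings out of a longer NCCL log, a small filter like the following can help. This is only a sketch: the regular expression is my guess at the line layout based on the lines above (host:pid:tid, device index, optional source location, level, message), and `nccl_warnings` is a hypothetical helper name, not part of NCCL or PyTorch.

```python
import re

# Assumed layout, inferred from the log above:
#   host:pid:tid [dev] [file:line ]NCCL LEVEL message
NCCL_LINE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?:(?P<src>\S+:\d+) )?NCCL (?P<level>INFO|WARN) (?P<msg>.*)$"
)


def nccl_warnings(log_text):
    """Return (source_location, message) for every NCCL WARN line."""
    hits = []
    for line in log_text.splitlines():
        match = NCCL_LINE.match(line.strip())
        if match and match.group("level") == "WARN":
            hits.append((match.group("src"), match.group("msg")))
    return hits
```

On the log above, the only WARN line is the libibverbs one; the later "Using network Socket" line suggests NCCL then carried on with plain sockets.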
I'm using this command to launch the script from a notebook:
!python -m torch.distributed.launch --nproc_per_node=1 ./Deepspeed.py --output_dir ./out_dir/results --overwrite_output_dir --do_train \
--do_eval --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --learning_rate 3e-5 --weight_decay 0.01 \
--num_train_epochs 1 --load_best_model_at_end
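As far as I understand, with PyTorch 1.7.x and without `--use_env`, torch.distributed.launch starts each worker with a `--local_rank=N` argument and exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in its environment. The sketch below only reads those values back so they can be printed at the top of a script; `read_launch_context` is a hypothetical helper name and is not part of PyTorch or transformers.

```python
import argparse
import os


def read_launch_context(argv, environ=os.environ):
    """Collect the values a worker receives from torch.distributed.launch."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # The launcher appends --local_rank=N unless --use_env is given.
    parser.add_argument("--local_rank", type=int, default=-1)
    known, _unknown = parser.parse_known_args(argv)
    return {
        "local_rank": known.local_rank,
        "rank": int(environ.get("RANK", -1)),
        "world_size": int(environ.get("WORLD_SIZE", -1)),
        "master_addr": environ.get("MASTER_ADDR"),
        "master_port": environ.get("MASTER_PORT"),
    }
```

Calling `print(read_launch_context(sys.argv[1:]))` (after `import sys`) at the start of the script would show whether the rendezvous settings are what one expects before any NCCL call is made.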
Here is the simple script that I'm using:
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments, HfArgumentParser
import pandas as pd
import numpy as np
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['NCCL_DEBUG']='INFO'
os.environ['NCCL_DEBUG_SUBSYS']='ALL'
os.environ['NCCL_IB_DISABLE']='1'
os.environ['NCCL_SOCKET_IFNAME']='eth0'
tok = RobertaTokenizerFast.from_pretrained('/home/jovyan/models/roberta-large/')
model = RobertaForSequenceClassification.from_pretrained('/home/jovyan/models/roberta-large/', num_labels=2)
df_full = pd.read_csv('IMDB_Dataset.csv')
print("loaded df")
df_full = df_full.sample(frac=1).reset_index(drop=True)
df_req = df_full.head(1000)
df_train = df_req.head(800)
df_eval = df_req.tail(200)
train_text = df_train.review.values.tolist()
train_labels_raw = df_train.sentiment.values.tolist()
val_text = df_eval.review.values.tolist()
val_labels_raw = df_eval.sentiment.values.tolist()
train_encodings = tok(train_text, padding=True, truncation=True, max_length=512)
val_encodings = tok(val_text, padding=True, truncation=True, max_length=512)
train_labels = [1 if i=='positive' else 0 for i in train_labels_raw]
val_labels = [1 if i=='positive' else 0 for i in val_labels_raw]
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
print("Encoding done")
parser = HfArgumentParser(TrainingArguments)
train_args = parser.parse_args_into_dataclasses()
print('parser and args created')
trainer = Trainer(
    model=model,
    args=train_args[0],
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
if train_args[0].do_train:
    print('------------TRAINING-------------')
    trainer.train()
if train_args[0].do_eval:
    print('------------EVALUATING-------------')
    trainer.evaluate()
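One way to find out where a script like this is stuck is the standard library's faulthandler module, which can dump the Python stack of every thread after a timeout. The following is a minimal sketch that I have not run inside the job above; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second default are my own choices for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300, file=sys.stderr):
    """Dump all thread tracebacks every timeout_s seconds until disarmed."""
    # timeout_s=300 is an arbitrary assumption; pick something longer than
    # the slowest step that is expected to finish normally.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump once the guarded section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` just before constructing the Trainer and `disarm_hang_watchdog()` after `trainer.evaluate()` should print a traceback showing the line each thread is blocked on if the process stalls.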
Could someone please suggest how to proceed further?