Hi,
I am trying to wrap the Transformer model with DistributedDataParallel(), but I am running into the error below:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'torchtext.data.example.Example'>
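From what I can tell, this TypeError is raised by the DataLoader's default collate function, which can only batch tensors, numpy arrays, numbers, dicts, and lists; any other object in the dataset, such as a torchtext Example, triggers it. A minimal sketch that reproduces the message (the Example stand-in class is hypothetical, just to mimic the unsupported type):

import torch

class Example:  # stand-in for torchtext.data.example.Example
    pass

# default_collate meets an object it cannot batch and raises:
# TypeError: default_collate: batch must contain tensors, numpy arrays, ...
loader = torch.utils.data.DataLoader([Example(), Example()], batch_size=2)
batch = next(iter(loader))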
Code:
import math
import os
import time

import torch
import torch.nn as nn
import torchtext
from torchtext.data.utils import get_tokenizer

def train_1(args):
    # init_process_group
    # rank = args.nr * args.gpus + gpu
    rank = int(os.environ['LOCAL_RANK'])
    gpu = torch.device(f'cuda:{rank}')
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(gpu)

    TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"),
                                init_token='<sos>',
                                eos_token='<eos>',
                                lower=True)
    train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
    TEXT.build_vocab(train_txt)
    batch_size = 20
    eval_batch_size = 10

    sampler = torch.utils.data.distributed.DistributedSampler(train_txt)
    loader = torch.utils.data.DataLoader(train_txt, shuffle=(sampler is None),
                                         sampler=sampler)

    bptt = 35
    ntokens = len(TEXT.vocab.stoi)  # the size of vocabulary
    emsize = 200   # embedding dimension
    nhid = 200     # the dimension of the feedforward network model in nn.TransformerEncoder
    nlayers = 2    # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
    nhead = 2      # the number of heads in the multi-head attention models
    dropout = 0.2  # the dropout value
    # TransformerModel is the model class from the nn.Transformer tutorial
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(gpu)

    # DDP
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    criterion = nn.CrossEntropyLoss()
    lr = 5.0  # learning rate
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

    def train():
        model.train()  # Turn on the train mode
        total_loss = 0.
        start_time = time.time()
        ntokens = len(TEXT.vocab.stoi)
        for batch, (data, targets) in enumerate(loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output.view(-1, ntokens), targets)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()

            total_loss += loss.item()
            log_interval = 200
            if batch % log_interval == 0 and batch > 0:
                cur_loss = total_loss / log_interval
                elapsed = time.time() - start_time
                # `epoch` comes from the outer training loop (not shown)
                print('| epoch {:3d} | {:5d}/{:5d} batches | '
                      'lr {:02.2f} | ms/batch {:5.2f} | '
                      'loss {:5.2f} | ppl {:8.2f} | device{:3d}'.format(
                          epoch, batch, len(loader),
                          scheduler.get_lr()[0],
                          elapsed * 1000 / log_interval,
                          cur_loss, math.exp(cur_loss),
                          torch.cuda.current_device()))
                total_loss = 0
                start_time = time.time()
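For reference, the nn.Transformer tutorial this code appears to follow never passes the WikiText2 splits to a DataLoader: WikiText2 is a LanguageModelingDataset whose single Example holds the entire corpus, so default_collate has nothing sensible to batch. The tutorial instead flattens the corpus into one tensor with batchify() and slices bptt-sized chunks with get_batch(). A sketch under that assumption, reusing TEXT, gpu, bptt, and batch_size from above:

def batchify(data, bsz):
    # Numericalize the single WikiText2 example and trim the tail so the
    # token stream divides evenly into bsz columns.
    data = TEXT.numericalize([data.examples[0].text])
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    # Shape (nbatch, bsz): each column is an independent token stream.
    return data.view(bsz, -1).t().contiguous().to(gpu)

def get_batch(source, i):
    # Inputs are positions i..i+seq_len; targets are the same positions
    # shifted one token to the right.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

train_data = batchify(train_txt, batch_size)

The training loop would then iterate over range(0, train_data.size(0) - 1, bptt) instead of a DataLoader, which sidesteps default_collate entirely; under DDP each rank would additionally need its own shard of the token stream.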