How to fix "TypeError: zip argument #1 must support iteration" when training on multiple GPUs

Describe the bug

I am building a custom PyTorch layer on top of a Hugging Face model and training it with the Trainer API.

When I run on a single GPU, it trains fine. But when I train on multiple GPUs, it throws this error:

TypeError: zip argument #1 must support iteration

Training Code

bert_model = BertForTokenClassification.from_pretrained( model_checkpoint,id2label=id2label,label2id=label2id)

class BERT_CUSTOM(nn.Module):
    def __init__(self, bert_model,id2label,num_labels):
        super(BERT_CUSTOM, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.25)
        self.classifier = nn.Linear(768, num_labels)
        self.crf = CRF(num_labels, batch_first = True)
    def forward(self, input_ids, attention_mask,  labels=None, token_type_ids=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = torch.stack((outputs[1][-1], outputs[1][-2], outputs[1][-3], outputs[1][-4])).mean(dim=0)
        sequence_output = self.dropout(sequence_output)
        emission = self.classifier(sequence_output) # [32,256,21] logits
        if labels is not None:
            loss = -self.crf(log_soft(emission, 2), labels, mask=attention_mask.type(torch.uint8), reduction='mean')
            prediction = self.crf.decode(emission, mask=attention_mask.type(torch.uint8))
            return [loss, prediction]
        else:
            prediction = self.crf.decode(emission, mask=attention_mask.type(torch.uint8))
            prediction=[id2label[k] for k in prediction]
            return prediction

Training API

model = BERT_CUSTOM(bert_model, id2label,num_labels=len(label2id))

args = TrainingArguments(

trainer = Trainer(




Hi ptrblck, could you please help?

Here is the data creation code:

train_ex ={'texts':[x[0] for x in train_set],'tag_names':[x[1] for x in train_set]}
train_data = tokenize_and_align_labels(train_ex,label2id)

class MyDataset(
    def __init__(self, examples):
        self.encodings = examples        
        self.labels = examples['labels']
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item
    def __len__(self):
        return len(self.labels)

I then use this train_data directly with the model.

I’m not deeply familiar with the HuggingFace Trainer class, but could you post the full stacktrace here, please?

Hi @ptrblck

Traceback (most recent call last):
  File "", line 263, in <module>
  File "/opt/conda/lib/python3.7/site-packages/transformers/", line 1531, in train
  File "/opt/conda/lib/python3.7/site-packages/transformers/", line 1775, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/", line 2523, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/", line 2555, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  [Previous line repeated 1 more time]
TypeError: zip argument #1 must support iteration

Thanks for the stacktrace! I don’t see any obvious issue but it seems an internal DataParallel call fails. I would recommend updating PyTorch and transformers to the latest release to see if you are still running into the same issue.
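One possible reading of the repeated gather_map frames, offered as an assumption and not a confirmed diagnosis: DataParallel's gather step recurses into whatever forward returns, and the `zip(*outputs)` line in the trace fails once it reaches plain Python ints, for example inside a nested list of predicted tag ids. The sketch below is a simplified pure-Python stand-in for that recursion, not the actual PyTorch source; `gather_map_sketch` and the sample replica outputs are hypothetical, and floats stand in for tensors.

```python
def gather_map_sketch(outputs):
    """Simplified stand-in for the recursive gather seen in the traceback."""
    out = outputs[0]
    if isinstance(out, float):
        # Stand-in for the tensor case: the real code would merge tensors here.
        return list(outputs)
    # Same pattern as the traceback line: regroup matching items across replicas.
    return type(out)(map(gather_map_sketch, zip(*outputs)))


# Hypothetical per-GPU outputs shaped like [loss, prediction], where
# prediction is a list of variable-length lists of Python ints.
replica_0 = [0.7, [[1, 2, 0], [3, 1]]]
replica_1 = [0.5, [[2, 2], [0, 1, 1]]]

# Containers that bottom out in "tensors" (floats here) gather without trouble.
merged_losses = gather_map_sketch([[0.7], [0.5]])

# Containers that bottom out in plain ints end with zip() applied to ints.
try:
    gather_map_sketch([replica_0, replica_1])
    failed_with_type_error = False
except TypeError:
    failed_with_type_error = True
```

In this sketch the failing call recurses four levels deep (replica outputs, predictions, one tag sequence, individual ints) before zip is handed ints, which lines up with the four gather_map frames in the stacktrace.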
I also see you’ve cross-posted the issue in the HF discussion board but didn’t receive any responses yet.
A related issue can be found in the FastAI discussion board where a user claimed:

Turns out multi-gpu works perfectly with fastai. All one has to do is model=torch.nn.DataParallel(model) and then pass this model to the Learner object.
The error was because I was passing a list instead of tensors which was causing the error reported.
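Following the FastAI report above, one way to keep the outputs of forward as tensors is to pack the decoded tag sequences into a single padded tensor before returning them. This is a minimal sketch, assuming the decode step yields a list of variable-length lists of ints; `decoded_to_tensor` and the `-100` padding value are hypothetical choices, and whether this alone resolves the multi-GPU error is untested.

```python
import torch


def decoded_to_tensor(decoded, seq_len, pad_value=-100, device=None):
    """Pack variable-length tag sequences into one (batch, seq_len) LongTensor."""
    out = torch.full((len(decoded), seq_len), pad_value,
                     dtype=torch.long, device=device)
    for row, tags in enumerate(decoded):
        # Copy the real tags; positions past len(tags) keep the padding value.
        out[row, :len(tags)] = torch.tensor(tags, dtype=torch.long, device=device)
    return out


# Hypothetical decode output for a batch of two sequences, padded to length 4.
decoded = [[1, 2, 0], [3, 1]]
prediction = decoded_to_tensor(decoded, seq_len=4)
```

In the model above, forward could then return the loss together with `decoded_to_tensor(prediction, input_ids.size(1), device=input_ids.device)` instead of the raw list. Padding every replica's output to the same sequence length keeps the shapes compatible when the per-GPU results are concatenated along the batch dimension.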