I’m using a BERT model from the transformers package (by huggingface). I now have access to more than one GPU, so I’m wrapping my model in nn.DataParallel() to take advantage of them.
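For context, the wrapping looks roughly like this (the checkpoint name, label count, and device handling here are illustrative, not my exact code):

import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

# Illustrative setup -- checkpoint and num_labels are placeholders
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model across all visible GPUs
model.to(device)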
I get the following warning:

UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
The warning is generated by this bit of code:
train_output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
This returns a tuple whose first element is the loss. I extract the loss and take the mean, since I’m spreading each batch across 2 GPUs and therefore receive 2 losses.
loss = train_output[0]
loss = loss.mean()
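As a sanity check, on my setup the loss before the .mean() comes back as a length-2 vector (one scalar per GPU) rather than a 0-dim tensor:

print(train_output[0].shape)  # torch.Size([2]) on 2 GPUs -- one loss per replica
print(loss)                   # a 0-dim scalar after .mean(), ready for .backward()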
One thing I want to understand is why the forward pass gives me this warning. I’m assuming that nn.DataParallel is doing something to the inputs that wasn’t happening when I was on a single GPU.
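My rough mental model (and I may be wrong) is that the gather step on the output side is doing something like the following with the per-replica scalar losses, which would explain the exact wording of the warning:

import torch

# Each replica returns a 0-dim (scalar) loss. Zero-dim tensors can't be
# concatenated along dim 0, so they are unsqueezed to shape (1,) first and
# then concatenated into a vector -- hence "will instead unsqueeze and
# return a vector". The loss values below are purely illustrative.
per_gpu_losses = [torch.tensor(0.69), torch.tensor(0.71)]
gathered = torch.cat([l.unsqueeze(0) for l in per_gpu_losses], dim=0)
print(gathered)         # tensor([0.6900, 0.7100])
print(gathered.mean())  # tensor(0.7000) -- the single loss I then backprop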