nn.DataParallel - "Was asked to gather along dimension 0, but all input tensors were scalars"

I’m using a BERT model from the transformers package (by huggingface). I’ve now got access to more than 1 GPU and so am wrapping my model in nn.DataParallel() to take advantage of them.
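For context, the wrapping looks roughly like this (a minimal sketch; the checkpoint name and label count are placeholders, not my exact setup):

import torch.nn as nn
from transformers import BertForSequenceClassification

# Placeholder checkpoint / label count for illustration.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model = nn.DataParallel(model)  # replicate across all visible GPUs
model.to("cuda")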

I get the following warning: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
This is generated by this bit of code:
train_output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

This returns a tuple containing the loss. I extract the loss and take the mean, as I’m spreading the batches across 2 GPUs and so receive 2 losses.

loss = train_output[0]  # first element of the output tuple is the loss
loss = loss.mean()      # average the per-GPU losses into a single scalar

One thing I want to understand is why the forward pass is giving me this warning. I’m assuming that nn.DataParallel is doing something to the inputs that wasn’t happening when I was using a single GPU.

As the warning says, nn.DataParallel will chunk the data along dim 0 (the batch dimension) and send each chunk to the corresponding device.
E.g. if you are dealing with an input of shape [16, 3, 100] and two GPUs, each GPU will receive an input of shape [8, 3, 100].
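Here is a small reproduction of the warning (a minimal sketch; ToyLossModel is a made-up module standing in for your BERT model, and it assumes two GPUs are visible):

import torch
import torch.nn as nn

class ToyLossModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(100, 1)

    def forward(self, x):
        # x is the per-GPU chunk, e.g. [8, 3, 100] with two GPUs
        return self.linear(x).mean()  # 0-dim tensor, i.e. a scalar loss per replica

model = nn.DataParallel(ToyLossModel()).to("cuda")
x = torch.randn(16, 3, 100, device="cuda")

out = model(x)     # each replica returns a scalar; gathering them triggers the warning
print(out.shape)   # torch.Size([2]) with two GPUs -- one loss per device
loss = out.mean()  # reduce the per-GPU losses to a single scalar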

b_input_ids doesn’t seem to have the right shape, so could you check what type and shape you are using at the moment?

So my initial input is a dataset of around 800 rows and 64 columns (tokenised texts of length 64).

I’m using a batch size of 64, and my data consists of tokenised texts of length 64, so when I print the size of b_input_ids, I get [64, 64].

Any ideas what PyTorch is doing to the batch, given it doesn’t have this third dimension like in your example?
