RuntimeError on backward() on custom loss -- something to do with reductions?

I’m getting the following error when I try to backpropagate an error:

RuntimeError: The expanded size of the tensor (36) must match the existing size (3) at non-singleton dimension 0

(The 3 above is because I have it running in parallel on 3 GPUs. On a single GPU that becomes a 1.)

I suspect this has something to do with reductions, but I don’t quite understand why…

The above error only occurs when I add a new component to my error. That component is defined in a function that takes as input a batch of 36 variables, mutates them a bit (preserving their variable state), computes their distance from some ground truth, and returns a scalar: the average of the distances in the batch. It’s just an additional loss term.

The error occurs when I try to backpropagate as

one = torch.ones(num_gpus).cuda().double()

# ...compute err...
err = network(input)
# err.backward(one) <-- no problems here!

err = err + my_function(other_input, gt)

err.backward(one) # <--problems!

Note that everything is scalar by the time the backward computation is done:

print(err.shape)
>> ()

How do the loss functions that accept batches of variables address this? There doesn’t seem to be anything particularly obvious in the source code (here or here). What is being “expanded” here and why is it causing this error?

I suspect this is because I am applying the function to a tensor of BATCH_SIZE x SAMPLE_ERROR rather than using a reduction that applies the function to each sample independently within the batch (a la reduce=True in the built-in losses). How can I resolve this?

Did you define a custom backward() function? (Note that defining your own backward function is not required in most cases because autograd takes care of it.) If you didn’t, then this is most likely a PyTorch bug.

@richard I did not. It shouldn’t be required in this case, though, as the operations are all primitives (+, -, etc.) or built-in PyTorch functions. I am doing a nearest-neighbor search using PyFLANN (offline), then using those indices to sample my dataset, and subsequently finding the Euclidean distance between data points. Ultimately I take the average over all of them.

The operations I use are torch.index_select(), torch.sqrt(), torch.mean() and torch.sum(), along with +, -, and ** (power).
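For readers following along, here is a minimal sketch of a loss term built from only those operations. It is not the poster's function (that was never shared); the name nn_distance_term, the tensor shapes, and the random neighbor indices standing in for the PyFLANN result are all assumptions made for illustration.

```python
import torch

def nn_distance_term(points, reference, nn_idx):
    """Mean Euclidean distance between each point and its precomputed neighbor.

    points:    (B, D) tensor that requires grad
    reference: (N, D) tensor of candidate neighbors
    nn_idx:    (B,) LongTensor of neighbor indices (hypothetical stand-in
               for the result of an offline nearest-neighbor search)
    """
    neighbors = torch.index_select(reference, 0, nn_idx)    # (B, D)
    sq_dist = torch.sum((points - neighbors) ** 2, dim=1)   # (B,)
    dist = torch.sqrt(sq_dist + 1e-12)                      # small eps keeps the gradient finite at 0
    return torch.mean(dist)                                 # 0-dim tensor

torch.manual_seed(0)
points = torch.randn(36, 3, dtype=torch.double, requires_grad=True)
reference = torch.randn(100, 3, dtype=torch.double)
nn_idx = torch.randint(0, 100, (36,))

term = nn_distance_term(points, reference, nn_idx)
term.backward()
print(term.shape, points.grad.shape)
```

In this sketch the term is reduced to a 0-dim tensor and backward() runs without a gradient argument, which suggests these operations alone are not enough to trigger the error.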

If absolutely necessary I can share the function here if that helps (although I’m avoiding posting it here as it’s research related).

Given that err is a scalar, you simply can’t pass three values to err.backward(). Passing in a single value should be enough. PyTorch should backpropagate it properly.
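A small sketch of that point, using a made-up scalar err rather than the poster's network: the gradient passed to backward() has to have the same shape as the tensor it is called on, so a 0-dim err accepts either no argument or a 0-dim gradient.

```python
import torch

w = torch.ones(4, dtype=torch.double, requires_grad=True)

def make_err():
    # Hypothetical stand-in for the poster's err: any 0-dim tensor will do.
    return (2.0 * w).sum()

# A gradient whose shape differs from err's shape is rejected.
try:
    make_err().backward(torch.ones(3, dtype=torch.double))
except RuntimeError as exc:
    print("rejected:", exc)

# For a 0-dim err, backward() needs no argument (implicit gradient of 1.0).
# make_err().backward(torch.ones_like(make_err())) would be the explicit equivalent.
make_err().backward()
print(w.grad)
```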

That said, if I understand you correctly, you are running one model in parallel on 3 GPUs, in which case you will need to somehow synchronise the updates between the three of them, otherwise you will have three models that diverge from each other.

@jpeg729 When you run one model in parallel on 3 GPUs, the error is duplicated, I think. I pass 3 values because an error message told me it was necessary ;-).

But you’re right about synchronizing the updates. I take the mean (torch.mean()) of the 3 outputs.
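A rough sketch of that pattern, run on the CPU with a length-3 tensor standing in for the three per-GPU outputs (the values and the extra term are invented for illustration): reduce the per-replica errors to a 0-dim tensor first, then add the extra scalar term, and call backward() with no gradient argument.

```python
import torch

num_replicas = 3  # stands in for the number of GPUs
w = torch.ones(num_replicas, dtype=torch.double, requires_grad=True)

# Hypothetical per-replica errors, shape (3,), and an extra scalar term, shape ().
per_replica_err = w * torch.tensor([1.0, 2.0, 3.0], dtype=torch.double)
extra_term = (w ** 2).mean()

err = per_replica_err.mean() + extra_term   # reduce first, then add: shape ()
err.backward()                              # no gradient argument needed
print(err.shape, w.grad)
```

Whether this matches what happens in the original code is a guess, since the network and my_function were not posted.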

Did you mean that you take the mean of the 3 sets of gradients before applying the update?

If you meant to say that you take the mean of the 3 model outputs, then I am fairly certain that won’t work because PyTorch is smart enough to know where the 3 parts came from and backpropagate accordingly. In other words, if model1 produces a large loss on sample1, and model2 produces a small loss on sample2, then model1 will get larger gradients even if you average the losses before backpropagating.
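To make the routing claim concrete, here is a toy sketch with two independent scalar "models" (w1 and w2 are invented for the example). It only illustrates that averaging losses does not equalize the gradients of separate parameters; it says nothing about how DataParallel treats its replicas.

```python
import torch

# Two independent scalar "models", each seeing its own sample.
w1 = torch.tensor(1.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)

loss1 = (w1 * 1.0 - 5.0) ** 2   # far from its target: large loss
loss2 = (w2 * 1.0 - 1.5) ** 2   # close to its target: small loss

# Average the two losses, then backpropagate once.
torch.stack([loss1, loss2]).mean().backward()
print(w1.grad, w2.grad)
```

Each parameter receives the gradient of its own loss scaled by 1/2, so w1 gets the larger update even though the losses were averaged.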

I take the mean of the model outputs. It hasn’t caused any issues so far: the network converges and training seems to be working. It’s only when I add a custom term to the loss that things get wonky.

This type of setup is a little too far outside what I am used to, I’d better shut up before I say something stupid.

Hopefully someone else can help.

@jpeg729 No worries. I get your point: the three parallel networks ought to have the same gradients to train identically, but I think PyTorch’s DataParallel module automatically sums the gradients before backprop. I will check that out.

All of this being said, this exchange is tangential to my original question.

Anyone know why torch.mean(A), where A is a BATCH_SIZE x 1 variable with requires_grad=True, doesn’t seem to stay reduced on backward()?
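As a point of reference, here is a minimal sketch of what torch.mean does in isolation (random data, not the poster's tensors). The forward result is 0-dim; on backward the scalar gradient is spread back over the input's shape, with each element receiving 1/BATCH_SIZE. That spreading step is the kind of "expansion" a size-mismatch message could refer to, though this sketch does not reproduce the error itself.

```python
import torch

BATCH_SIZE = 36
A = torch.randn(BATCH_SIZE, 1, dtype=torch.double, requires_grad=True)

m = torch.mean(A)   # forward: reduced to shape ()
m.backward()        # backward: the gradient 1.0 is spread back to A's shape

print(m.shape, A.grad.shape)
```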

Hi, Marc,

I got the same error during loss.backward().
RuntimeError: The size of tensor a (501) must match the size of tensor b (434) at non-singleton dimension 0

Have you solved this error? Any hint?