I have a generic network with no random elements in its structure (e.g. no dropout), so that if I forward a given input image through the network, zero the gradients, and repeat the forward pass with the same input, I get the same result (same gradient vector, output, …).
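(For concreteness, this determinism can be checked directly with a tiny placeholder model: two forward/backward passes on the same input, zeroing gradients in between, give bitwise-identical gradients on CPU.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)        # placeholder network, no dropout/randomness
x = torch.randn(1, 4)

model.zero_grad()
model(x).sum().backward()
g1 = [p.grad.clone() for p in model.parameters()]

model.zero_grad()
model(x).sum().backward()      # identical forward/backward repeated
g2 = [p.grad.clone() for p in model.parameters()]

print(all(torch.equal(a, b) for a, b in zip(g1, g2)))  # True
```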
Now let's say we have a batch of N elements (data, label) and I perform the following experiment:
forward the whole batch (using reduction='sum' in my criterion), call backward to generate the corresponding gradient, and save it in a second object (that we'll refer to as Batch_Grad)
output = model(data)
loss = criterion(output, torch.reshape(label, (-1,)))
loss.backward()
Batch_Grad = [p.grad.clone() for p in model.parameters()]
reset the gradient (e.g. model.zero_grad())
repeat the first point, feeding the batch's elements one by one, and after each backward collect the corresponding element's gradient (resetting the gradient each time afterwards)
for i in range(len(label)):
    # repeat the procedure of point 1. for each data[i] input
Sum together the gradient vectors of the previous point, corresponding to each element of the given batch, into a single object (that we'll refer to as Single_Grad)
compare the objects of point 4. and 1. (Batch_Grad and Single_Grad)
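The whole experiment can be sketched like this (the linear model, data, and criterion below are placeholders; any deterministic network would do):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                          # placeholder network
criterion = nn.CrossEntropyLoss(reduction='sum') # per-sample losses are summed
data = torch.randn(8, 4)                         # batch of N=8 samples
label = torch.randint(0, 3, (8,))

# point 1: whole-batch gradient
model.zero_grad()
criterion(model(data), torch.reshape(label, (-1,))).backward()
Batch_Grad = [p.grad.clone() for p in model.parameters()]

# points 3-4: per-sample gradients, accumulated manually
Single_Grad = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(label)):
    model.zero_grad()
    criterion(model(data[i:i+1]), label[i:i+1]).backward()
    for g, p in zip(Single_Grad, model.parameters()):
        g += p.grad

# point 5: compare -- equal only up to float32 round-off
for bg, sg in zip(Batch_Grad, Single_Grad):
    print(torch.allclose(bg, sg, atol=1e-4))
```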
Following the above procedure I find that the tensors from points 1. and 4. (Batch_Grad and Single_Grad) are equal only if the batch size (N) is equal to 1; they differ for N>1.
With the method of points 3. and 4. I'm manually summing the gradients associated with single-image propagation (which, as pointed out in the above comment, are equal to the ones calculated automatically by SGD with N=1). Since the automatic SGD approach (point 1.) is also expected to perform the same sum: why do I observe this difference?
From the answer to this post I understood that a possible reason is that in point 1. I might not be doing a mini-batch forward, but instead an iterative method where the elements of the batch are processed not in parallel but one after the other.
I want to understand if this is my case: since I'm using torch.optim.SGD(), does it perform the mini-batch or the iterative method?
I don't think the answer from the cross-post is correct, as your code doesn't show any optimizer.step() usage, so I assume you are purely comparing the gradients computed using the entire dataset vs. single samples.
If so, I would expect the final difference to show a relative error of ~1e-5 or ~1e-6, which is expected due to the limited floating-point precision and is caused by a different order of operations.
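(This order-of-operations effect is easy to see without any network at all: summing the same float32 numbers in two different orders already disagrees at roughly that level. A small NumPy illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000).astype(np.float32)  # 10k positive float32 values

s_forward = np.float32(0.0)
for v in x:                                # accumulate in one order
    s_forward += v
s_backward = np.float32(0.0)
for v in x[::-1]:                          # same numbers, reversed order
    s_backward += v

# small but typically nonzero: pure float32 round-off
rel_err = abs(float(s_forward) - float(s_backward)) / abs(float(s_forward))
print(rel_err)
```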
Thank you for the answer! I have the same doubt indeed.
I assume you are purely comparing the gradients computed via using the entire dataset vs. single samples.
If I consider a batch size of 2 elements, for example, I get, for each single component of the gradient, an error of ~1e-5 or ~1e-6, as you predicted.
May I ask how you guessed the order of magnitude of the difference in the two cases? (Sorry for the stupid question.)
Is it possible to switch to double precision for the gradient components in order to verify that this is the right source of error? (In that case the order of magnitude of the difference should decrease and, in particular, it should be possible to predict the new order of magnitude, as you just did for single precision.)
Your approach for testing should be fine. If you really want to use double precision during the entire training, you could certainly check if moving the transformation to the Dataset (or keeping the data directly in float64) would work.
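(A minimal sketch of that float64 sanity check, again with a placeholder model: cast both model and data with .double() and the batch-vs-summed discrepancy should shrink from ~1e-6 to roughly machine epsilon for float64, ~1e-15.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3).double()                 # float64 parameters
criterion = nn.CrossEntropyLoss(reduction='sum')
data = torch.randn(8, 4, dtype=torch.float64)    # float64 inputs
label = torch.randint(0, 3, (8,))

# whole-batch gradient
model.zero_grad()
criterion(model(data), label).backward()
batch_grad = [p.grad.clone() for p in model.parameters()]

# manually summed per-sample gradients
single_grad = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(label)):
    model.zero_grad()
    criterion(model(data[i:i+1]), label[i:i+1]).backward()
    for g, p in zip(single_grad, model.parameters()):
        g += p.grad

# in float64 the max absolute difference is near machine epsilon
for bg, sg in zip(batch_grad, single_grad):
    print((bg - sg).abs().max().item())
```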