[newbie] - only minibatches of size 1 work, but accumulating gradients still work?

roro2882 · May 21, 2022, 6:05pm

I am trying to implement a neural network approximating the logical XOR function, however, the network only converge when using a batch size of 1.

I don’t understand why : when I use gradient accumulation with multiple minibatches of size 1, the convergence is very smooth, but minibatches of size 2 or more don’t work at all.

This issue arise, whatever the learning rate, and I have the same issue with another problem(more complex) than XOR.

I join my code for reference:

import numpy as np
import torch.nn as nn
import torch
import torch.optim as optim
import copy

#very simple network
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2,3,True)
        self.fc1 = nn.Linear(3,1, True)

    def forward(self, x):
        x = torch.sigmoid(self.fc(x))
        x = self.fc1(x)
        return x

def data(n): # return n sets of random XOR inputs and output
    inputs = np.random.randint(0,2,2*n)
    inputs = np.reshape(inputs,(-1,2))
    outputs = np.logical_xor(inputs[:,0], inputs[:,1])
    return torch.tensor(inputs, dtype = torch.float32),torch.tensor(outputs, dtype = torch.float32)


N = 4
net = Net() # first network, is updated with minibatches of size N
net1 = copy.deepcopy(net) # second network, updated with N minibatches of size 1
inputs = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype = torch.float32)
labels = torch.tensor([0,1,1,0], dtype = torch.float32)
optimizer = optim.SGD(net.parameters(), lr=0.01)
optimizer1 = optim.SGD(net1.parameters(), lr=0.01)
running_loss = 0
running_loss1 = 0
for epoch in range(25000):  # loop over the dataset multiple times
    # get the inputs; data is a list of [inputs, labels]
    input, labels = data(N)

    # zero the parameter gradients
    optimizer.zero_grad()
    optimizer1.zero_grad()
    # forward + backward + optimize
    loss1_total = 0
    for i in range(N):
        outputs1 = net1(input[i])
        loss1 = (outputs1-labels[i]).pow(2)/N # I divide by N to get the effective mean
        loss1.backward()
        loss1_total += loss1.item()


    outputs = net(input)
    loss = (outputs-labels).pow(2).mean()
    loss.backward()
    
    # optimization
    optimizer.step()
    optimizer1.step()

    # print statistics
    running_loss += loss.item()
    running_loss1 += loss1_total
    if epoch % 1000 == 999:    # print every 1000 mini-batches
        print(f'[{epoch + 1},  loss: {running_loss/1000 :.3f}, loss1: {running_loss1/1000 :.3f}')
        running_loss1 = 0.0
        running_loss = 0.0
        
print('Finished Training')
 # exemples of data and outputs for reference ; network 2 always converge to the sub-optimal point(0.5,0.5)
datatest = data(4)
outputs = net(datatest[0])
outputs1 = net1(datatest[0])
inputs = datatest[0]
labels = datatest[1]
print("input",inputs)
print("target",labels)
print("net output",outputs)
print("net output",outputs1)

[EDIT] Improved readability and updated the code

result :

[1000,  loss: 0.259, loss1: 0.258
[2000,  loss: 0.252, loss1: 0.251
[3000,  loss: 0.251, loss1: 0.250
[4000,  loss: 0.252, loss1: 0.250
[5000,  loss: 0.251, loss1: 0.249
[6000,  loss: 0.251, loss1: 0.247
[7000,  loss: 0.252, loss1: 0.246
[8000,  loss: 0.251, loss1: 0.244
[9000,  loss: 0.252, loss1: 0.241
[10000,  loss: 0.251, loss1: 0.236
[11000,  loss: 0.252, loss1: 0.230
[12000,  loss: 0.252, loss1: 0.221
[13000,  loss: 0.250, loss1: 0.208
[14000,  loss: 0.251, loss1: 0.193
[15000,  loss: 0.251, loss1: 0.175
[16000,  loss: 0.251, loss1: 0.152
[17000,  loss: 0.252, loss1: 0.127
[18000,  loss: 0.251, loss1: 0.099
[19000,  loss: 0.251, loss1: 0.071
[20000,  loss: 0.251, loss1: 0.048
[21000,  loss: 0.251, loss1: 0.029
[22000,  loss: 0.251, loss1: 0.016
[23000,  loss: 0.250, loss1: 0.008
[24000,  loss: 0.251, loss1: 0.004
[25000,  loss: 0.251, loss1: 0.002

Finished Training

input tensor([[1., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])
target tensor([1., 0., 0., 0.])
net output tensor([[0.4686],
        [0.4472],
        [0.4472],
        [0.4472]], grad_fn=<AddmmBackward0>)
net1 output tensor([[0.9665],
        [0.0193],
        [0.0193],
        [0.0193]], grad_fn=<AddmmBackward0>)

Please, could you explain me why this strange phenomena is appearing ? I searched for a long time on the net, without success…

Excuse me if my question is not well formatted.

EDIT :
I found, comparing accumulated gradients of size 1 minibatches and gradients from minibatches of size N, that the computed gradients are mostly the same, only small(but noticeable) differences appear probably due to approximation errors, so my implementation looks fine at first sight. I still don’t get where does this strong convergence property of minibatches of size 1 come from.

KFrank · May 22, 2022, 12:57am

Hi Roro!

roro2882:

I don’t understand why : when I use gradient accumulation with multiple minibatches of size 1, the convergence is very smooth, but minibatches of size 2 or more don’t work at all.
…
        outputs1 = net1(input[i])
        loss1 = (outputs1-labels[i]).pow(2)/N # I divide by N to get the effective mean
...
    outputs = net(input)
    loss = (outputs-labels).pow(2).mean()

At issue is that your outputs and labels don’t have the same shape.
This is usually a bad idea.

In the loss1 case, outputs1 has shape [1] and labels[i] has shape
[], i.e., is a so-called zero-dimensional tensor. Such a tensor is basically
a scalar, and it turns out that it can often be used interchangeably with
a one-dimensional tensor of length 1 (i.e., a tensor of shape [1]), so in
this case you get away with what might have been a mistake.

However, in the loss case, outputs has shape [4, 1] while labels
has shape [4]. Even though both tensors contain 4 elements, this time
the difference in the shapes is a mistake. Consider:

>>> torch.ones (1) - torch.ones (1)[0]
tensor([0.])
>>> torch.ones (4, 1) - torch.ones (4)
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

That is, because of broadcasting, in the loss case, outputs-labels turns
into a tensor of shape [4, 4], so your calculated loss is incorrect.

This simplest fix will be to replace outputs with outputs.squeeze(),
killing off the trailing singleton dimension:

>>> torch.ones (4, 1).squeeze() - torch.ones (4)
tensor([0., 0., 0., 0.])

(For completeness and “symmetry,” I would probably also replace
outputs1 with outputs1.squeeze(), but you don’t need to.)

Best.

K. Frank

roro2882 · May 22, 2022, 3:41pm

Thank you a lot for your answer,
I feel very ashamed because pytorch was giving me warnings about it all this time. I guess that I learned something at least, to read warnings .
Have a nice day !