# [newbie] - only minibatches of size 1 work, but accumulating gradients still works?

I am trying to implement a neural network that approximates the logical XOR function. However, the network only converges when I use a batch size of 1.

I don’t understand why: when I use gradient accumulation over multiple minibatches of size 1, convergence is very smooth, but minibatches of size 2 or more don’t work at all.

This issue arises regardless of the learning rate, and I see the same issue with another problem that is more complex than XOR.

I attach my code for reference:

```python
import copy

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


# very simple network
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 3, True)
        self.fc1 = nn.Linear(3, 1, True)

    def forward(self, x):
        x = torch.sigmoid(self.fc(x))
        x = self.fc1(x)
        return x


def data(n):  # return n sets of random XOR inputs and outputs
    inputs = np.random.randint(0, 2, 2 * n)
    inputs = np.reshape(inputs, (-1, 2))
    outputs = np.logical_xor(inputs[:, 0], inputs[:, 1])
    return (torch.tensor(inputs, dtype=torch.float32),
            torch.tensor(outputs, dtype=torch.float32))


N = 4
net = Net()  # first network, updated with minibatches of size N
net1 = copy.deepcopy(net)  # second network, updated with N minibatches of size 1
inputs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
labels = torch.tensor([0, 1, 1, 0], dtype=torch.float32)
optimizer = optim.SGD(net.parameters(), lr=0.01)
optimizer1 = optim.SGD(net1.parameters(), lr=0.01)
running_loss = 0
running_loss1 = 0
for epoch in range(25000):  # loop over the dataset multiple times
    # get the inputs; data returns a tuple (inputs, labels)
    input, labels = data(N)

    # zero the parameter gradients
    optimizer.zero_grad()
    optimizer1.zero_grad()

    # forward + backward + optimize
    loss1_total = 0
    for i in range(N):
        outputs1 = net1(input[i])
        loss1 = (outputs1 - labels[i]).pow(2) / N  # I divide by N to get the effective mean
        loss1.backward()
        loss1_total += loss1.item()

    outputs = net(input)
    loss = (outputs - labels).pow(2).mean()
    loss.backward()

    # optimization
    optimizer.step()
    optimizer1.step()

    # print statistics
    running_loss += loss.item()
    running_loss1 += loss1_total
    if epoch % 1000 == 999:  # print every 1000 mini-batches
        print(f'[{epoch + 1},  loss: {running_loss/1000 :.3f}, loss1: {running_loss1/1000 :.3f}')
        running_loss1 = 0.0
        running_loss = 0.0

print('Finished Training')
# examples of data and outputs for reference; the network trained with
# minibatches of size N always converges to the sub-optimal point (0.5, 0.5)
inputs, labels = data(4)
outputs = net(inputs)
outputs1 = net1(inputs)
print("input", inputs)
print("target", labels)
print("net output", outputs)
print("net1 output", outputs1)
```

[EDIT] Improved readability and updated the code

Result:

```
[1000,  loss: 0.259, loss1: 0.258
[2000,  loss: 0.252, loss1: 0.251
[3000,  loss: 0.251, loss1: 0.250
[4000,  loss: 0.252, loss1: 0.250
[5000,  loss: 0.251, loss1: 0.249
[6000,  loss: 0.251, loss1: 0.247
[7000,  loss: 0.252, loss1: 0.246
[8000,  loss: 0.251, loss1: 0.244
[9000,  loss: 0.252, loss1: 0.241
[10000,  loss: 0.251, loss1: 0.236
[11000,  loss: 0.252, loss1: 0.230
[12000,  loss: 0.252, loss1: 0.221
[13000,  loss: 0.250, loss1: 0.208
[14000,  loss: 0.251, loss1: 0.193
[15000,  loss: 0.251, loss1: 0.175
[16000,  loss: 0.251, loss1: 0.152
[17000,  loss: 0.252, loss1: 0.127
[18000,  loss: 0.251, loss1: 0.099
[19000,  loss: 0.251, loss1: 0.071
[20000,  loss: 0.251, loss1: 0.048
[21000,  loss: 0.251, loss1: 0.029
[22000,  loss: 0.251, loss1: 0.016
[23000,  loss: 0.250, loss1: 0.008
[24000,  loss: 0.251, loss1: 0.004
[25000,  loss: 0.251, loss1: 0.002

Finished Training

input tensor([[1., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])
target tensor([1., 0., 0., 0.])
net output tensor([[0.4686],
        [0.4472],
        [0.4472],
net1 output tensor([[0.9665],
        [0.0193],
        [0.0193],
```

Could you please explain why this strange phenomenon appears? I searched the net for a long time, without success…

Excuse me if my question is not well formatted.

EDIT:
Comparing the accumulated gradients from size-1 minibatches with the gradients from minibatches of size N, I found that they are mostly the same. Only small (but noticeable) differences appear, probably due to approximation errors, so my implementation looks fine at first sight. I still don’t understand where this strong convergence property of size-1 minibatches comes from.
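A comparison of this kind could be sketched roughly as follows. This is only an illustration: the `nn.Sequential` model is a stand-in for the `Net` class above, the fixed XOR table replaces the random `data(N)` batches, and the name `max_diff` is made up. The two loss expressions are copied from the posted training loop.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the Net class above: Linear(2, 3) -> sigmoid -> Linear(3, 1).
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1))
net1 = copy.deepcopy(net)

inputs = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = torch.tensor([0., 1., 1., 0.])
N = inputs.shape[0]

# Gradients accumulated over N minibatches of size 1.
for i in range(N):
    loss1 = (net1(inputs[i]) - labels[i]).pow(2) / N
    loss1.backward()

# Gradients from one minibatch of size N, with the loss written as in the post.
loss = (net(inputs) - labels).pow(2).mean()
loss.backward()

# Largest absolute difference between the two gradients, per parameter.
max_diff = {
    name: (p.grad - p1.grad).abs().max().item()
    for (name, p), p1 in zip(net.named_parameters(), net1.parameters())
}
for name, d in max_diff.items():
    print(f"{name}: {d:.2e}")
```

As a rule of thumb, float32 rounding usually accounts for relative differences on the order of 1e-7. Differences much larger than that suggest the two losses are not the same function, rather than approximation error.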

Hi Roro!

At issue is that your `outputs` and `labels` don’t have the same shape.
This is usually a bad idea.

In the `loss1` case, `outputs1` has shape `[1]` and `labels[i]` has shape
`[]`, i.e., is a so-called zero-dimensional tensor. Such a tensor is basically
a scalar, and it turns out that it can often be used interchangeably with
a one-dimensional tensor of length `1` (i.e., a tensor of shape `[1]`), so in
this case you get away with what might have been a mistake.
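The size-1 case can be checked directly. This is a minimal sketch with made-up values: the tensor `outputs1` below only stands in for the network output for one sample, and `label_i` is an illustrative name for `labels[i]`.

```python
import torch

outputs1 = torch.tensor([0.7])            # shape [1], like the network output for one sample
labels = torch.tensor([0., 1., 1., 0.])
label_i = labels[1]                       # indexing a 1-D tensor gives a zero-dimensional tensor

print(outputs1.shape)   # torch.Size([1])
print(label_i.shape)    # torch.Size([])

diff = outputs1 - label_i                 # the scalar broadcasts against the length-1 tensor
print(diff.shape)       # torch.Size([1])
```

The result keeps shape `[1]`, so no extra elements are created and the per-sample loss comes out as intended.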

However, in the `loss` case, `outputs` has shape `[4, 1]` while `labels`
has shape `[4]`. Even though both tensors contain 4 elements, this time
the difference in the shapes is a mistake. Consider:

```
>>> torch.ones (1) - torch.ones (1)
tensor([0.])
>>> torch.ones (4, 1) - torch.ones (4)
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
```

That is, because of broadcasting, in the `loss` case, `outputs-labels` turns
into a tensor of shape `[4, 4]`, so your calculated loss is incorrect.
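To make the effect on the loss concrete, here is a minimal sketch (the helper names `broadcast_loss` and `intended_loss` are made up for illustration). Averaging over the `[4, 4]` tensor compares every output with every label. That average equals the mean of `(output_i - mean(labels))**2` plus the variance of the labels, so it is smallest when every output equals the mean label. This would be consistent with the network settling near 0.5 and the loss plateauing around 0.25.

```python
import torch

labels = torch.tensor([0., 1., 1., 0.])


def broadcast_loss(outputs):
    # outputs has shape [4, 1], labels has shape [4]: the difference has shape [4, 4]
    return (outputs - labels).pow(2).mean()


def intended_loss(outputs):
    # drop the trailing dimension first, so the difference has shape [4]
    return (outputs.squeeze(1) - labels).pow(2).mean()


perfect = labels.unsqueeze(1)        # shape [4, 1], exactly the XOR targets
constant = torch.full((4, 1), 0.5)   # shape [4, 1], always predicts 0.5

print(broadcast_loss(perfect).item(), intended_loss(perfect).item())    # 0.5 0.0
print(broadcast_loss(constant).item(), intended_loss(constant).item())  # 0.25 0.25
```

Under the broadcast loss, the perfect XOR predictions score worse (0.5) than a constant 0.5 (0.25), so gradient descent on that loss moves away from the solution you want.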

The simplest fix is to replace `outputs` with `outputs.squeeze()`,
killing off the trailing singleton dimension:

```
>>> torch.ones (4, 1).squeeze() - torch.ones (4)
tensor([0., 0., 0., 0.])
```

(For completeness and “symmetry,” I would probably also replace
`outputs1` with `outputs1.squeeze()`, but you don’t need to.)
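A few equivalent ways to line up the shapes, as a sketch with random stand-in values for `outputs`. One caveat worth knowing: a bare `squeeze()` removes every singleton dimension, so with a batch of size 1 it would also drop the batch dimension. Passing the dimension explicitly avoids that.

```python
import torch

outputs = torch.rand(4, 1)                # stand-in for net(input), shape [4, 1]
labels = torch.tensor([0., 1., 1., 0.])   # shape [4]

a = outputs.squeeze(1) - labels           # drop only dimension 1 -> shape [4]
b = outputs - labels.unsqueeze(1)         # add a trailing dimension to labels -> shape [4, 1]
c = outputs.view(-1) - labels             # flatten the outputs -> shape [4]
print(a.shape, b.shape, c.shape)

# With a batch of size 1, a bare squeeze() also removes the batch dimension.
single = torch.rand(1, 1)
print(single.squeeze().shape)    # torch.Size([])
print(single.squeeze(1).shape)   # torch.Size([1])
```

All three differences hold the same four values, so `.pow(2).mean()` gives the same loss for each.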

Best.

K. Frank

I feel very ashamed because PyTorch was giving me warnings about it all this time. I guess I learned something at least: to read warnings.
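For anyone who wants such warnings to be impossible to miss, a sketch along these lines may help. It assumes the loss is computed with the built-in `F.mse_loss`, which in commonly used PyTorch versions warns when the input and target sizes differ. A hand-written `(outputs - labels).pow(2).mean()` broadcasts silently. The flag `mismatch_raised` is just an illustrative name.

```python
import warnings

import torch
import torch.nn.functional as F

outputs = torch.rand(4, 1)
labels = torch.tensor([0., 1., 1., 0.])

# Record warnings instead of printing them, to inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    F.mse_loss(outputs, labels)           # shapes [4, 1] vs [4]

for w in caught:
    print(w.category.__name__, w.message)

# Promote warnings to errors so a shape mismatch stops the run.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        F.mse_loss(outputs, labels)
        mismatch_raised = False
    except Warning:
        mismatch_raised = True
print(mismatch_raised)
```

Running the whole script with `python -W error script.py` has a similar effect without touching the code.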