I use AlexNet on the MNIST dataset with BCEWithLogitsLoss (reduction is SUM) and no optimizer steps. I feed the same data samples to two identical copies of the model and compare the loss and gradients. With all 32 samples in a single batch and one call to loss.backward(), the summed loss is 248.0272. When I split the same samples into multiple smaller batches and compute the loss and call loss.backward() on each of them sequentially, the summed loss is 245.8620. Without reduction, the two losses still differ for each class.

Here is my code. I would appreciate it if you could try to reproduce this.

I also tried feeding the same batch into both models; in that case there is no difference in the loss or the gradients.

```
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor, Compose
from torchvision.models import resnet18
from torch.utils.data import DataLoader, TensorDataset
import torch
from torch.nn.functional import one_hot
import numpy as np
import copy
import random


def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True


setup_seed(20)
model = resnet18(pretrained=False, num_classes=10)
model1 = copy.deepcopy(model)  # identical copy used for the small batches
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = torch.nn.BCEWithLogitsLoss(reduction='sum')
mnist = MNIST('./data/', download=True, transform=Compose([ToTensor()]))
dataloader = DataLoader(mnist, batch_size=32, shuffle=False)
dataloader1 = DataLoader(mnist, batch_size=2, shuffle=False)
grads_set = set()
data_list = []

# One full batch of 32 samples through `model`.
for data in dataloader:
    data, label = data
    data = data.repeat(1, 3, 1, 1)  # grayscale -> 3 channels
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model(data)
    loss = loss_func(output, y_one_hot)
    print(loss)
    loss = torch.sum(loss)
    loss.backward()
    grads = []
    data_list = data.numpy()
    break
print()

# The same 32 samples as 16 batches of 2 through `model1`.
num = 0
grads = []
data_list1 = []
loss_sum = 0
for data in dataloader1:
    data, label = data
    data = data.repeat(1, 3, 1, 1)
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model1(data)
    loss = loss_func(output, y_one_hot)
    loss = torch.sum(loss)
    loss.backward()
    loss_sum = loss_sum + loss.detach().numpy()
    if len(grads) == 0:
        for name, params in model1.named_parameters():
            grads.append(params.grad.data.numpy())
    else:
        for idx, (name, params) in enumerate(model1.named_parameters()):
            grads[idx] += params.grad.data.numpy()
    data_list1.extend(data.numpy())
    num += 1
    if num == 16:
        break
print(loss_sum)

# Compare the mean gradient of each parameter between the two models.
grad_mean = []
for name, param in model.named_parameters():
    grad_mean.append(np.mean(param.grad.data.numpy()))
grad_mean1 = []
for name, param in model1.named_parameters():
    grad_mean1.append(np.mean(param.grad.data.numpy()))
print(np.array(grad_mean) - np.array(grad_mean1))

# Reset the gradients of `model` and repeat the full-batch pass.
opt.zero_grad()
for data in dataloader:
    data, label = data
    data = data.repeat(1, 3, 1, 1)
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model(data)
    loss = loss_func(output, y_one_hot)
    loss.backward()
    grads = []
    data_list = data.numpy()
    break
grad_mean = []
for name, param in model.named_parameters():
    grad_mean.append(np.mean(param.grad.data.numpy()))
print(np.array(grad_mean) - np.array(grad_mean1))
```

Is this caused by pooling, dropout, or batch normalization?

If the model uses dropout (or other layers with random behavior) I would expect to see a larger difference in the output, so I assume you are running into the expected limited floating point precision due to a different order of operations as seen e.g. here:

```
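import torch

x = torch.randn(100, 100, 100)
y1 = x.sum()                  # one reduction over all elements
y2 = x.sum(0).sum(0).sum(0)   # same sum, different order of operations
print(y1 - y2)
# e.g. tensor(6.1035e-05); the exact value depends on the random input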
```

After removing the Dropout layer, the loss values and gradients of the two methods are almost the same. Thanks for your help!
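To illustrate why Dropout breaks this comparison, here is a minimal sketch. It assumes a small hypothetical stand-in model (`toy`) and random data instead of the network and MNIST samples from the question. In train mode every forward pass samples a fresh dropout mask, so the full batch and the small batches see different masks; in eval mode Dropout is the identity and the two losses should agree up to floating point error.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in model, not the network from the question.
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 4))
loss_func = nn.BCEWithLogitsLoss(reduction='sum')

x = torch.randn(32, 8)
y = torch.randint(0, 2, (32, 4)).float()


def full_and_split_loss(model):
    """Return (loss on the full batch, summed loss over batches of 2)."""
    with torch.no_grad():
        full = loss_func(model(x), y).item()
        split = sum(loss_func(model(xb), yb).item()
                    for xb, yb in zip(x.split(2), y.split(2)))
    return full, split


toy.train()  # Dropout active: a new random mask on every forward pass
full_train, split_train = full_and_split_loss(toy)

toy.eval()   # Dropout disabled: the forward pass is deterministic
full_eval, split_eval = full_and_split_loss(toy)

print('train mode:', full_train, split_train)
print('eval mode: ', full_eval, split_eval)
```

The same `eval()` switch also makes BatchNorm layers use their running statistics instead of the statistics of the current batch, which otherwise depend on the batch size.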

- As @ptrblck mentioned, you might have to put the network into eval mode to fix BN.
- There is a small thing in your observation code that actually modifies the gradients: .numpy() doesn’t make a copy but returns an array that shares memory with the tensor, and you then += more into it. Also, if you accumulate the gradients across batches (by not calling zero_grad), you don’t have to accumulate them manually, otherwise you count the earlier ones multiple times. If you comment out the accumulation code and just compare the p.grad of the parameters afterwards, they seem to be the same.
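The memory sharing mentioned in the second point can be seen in a small sketch. The tensor `w` and the variable names are made up for illustration; the point is only that an in-place `+=` on the array returned by `.numpy()` also changes the gradient, while a cloned snapshot does not.

```python
import torch

w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()            # w.grad is now tensor([2., 2., 2.])

view = w.grad.numpy()               # shares memory with w.grad, no copy
view += 1.0                         # in-place add also changes w.grad

snapshot = w.grad.detach().clone().numpy()  # independent copy
snapshot += 1.0                     # w.grad is unaffected this time

print(w.grad)                       # tensor([3., 3., 3.])
print(snapshot)                     # [4. 4. 4.]
```

In the question's loop, `grads[idx]` appears to alias `params.grad` in the same way, so `grads[idx] += params.grad.data.numpy()` would add the gradient to itself.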

Thanks for your reply! BN and Dropout can change the output of a layer, which is why I got different loss values. In my observation code, the summed loss is only used to compare the loss on the full batch with the loss across the small batches; it makes no contribution to the gradients. You reminded me that *opt.zero_grad()* is there to avoid accumulating grads! In my experiment now, the remaining absolute error in the grads is caused by limited floating point precision. When the reduction of *BCEWithLogitsLoss* is *mean*, the average error is around *1e-10*, and it is around *1e-8* when the reduction is *sum*.
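For anyone who wants to check this without the manual accumulation code, the comparison can be sketched as follows. This assumes a small hypothetical model without Dropout or BN (`net`) and random data, not the setup from the question; the helper `max_grad_diff` is made up for illustration. With *sum* the gradients of the small batches simply add up in `.grad`; with *mean* each small-batch loss has to be rescaled by its share of the full batch first.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical model without Dropout or BN, so batch size does not change the output.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
net_split = copy.deepcopy(net)

x = torch.randn(32, 8)
y = torch.randint(0, 2, (32, 4)).float()


def max_grad_diff(reduction):
    """Largest absolute gradient difference between full-batch and split-batch passes."""
    loss_func = nn.BCEWithLogitsLoss(reduction=reduction)
    net.zero_grad()
    net_split.zero_grad()

    loss_func(net(x), y).backward()              # one pass over all 32 samples

    for xb, yb in zip(x.split(2), y.split(2)):   # 16 passes over 2 samples each
        # For 'mean', rescale so the small-batch losses add up to the full-batch mean.
        scale = len(xb) / len(x) if reduction == 'mean' else 1.0
        (loss_func(net_split(xb), yb) * scale).backward()  # .grad accumulates

    return max((p.grad - q.grad).abs().max().item()
               for p, q in zip(net.parameters(), net_split.parameters()))


print('sum: ', max_grad_diff('sum'))
print('mean:', max_grad_diff('mean'))
```

The printed differences should be small but generally nonzero, since the order of the floating point additions differs between the two passes; the exact magnitudes depend on the model, data, and hardware.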