Use the same data, but get different loss

I am using AlexNet on the MNIST dataset; the loss is BCEWithLogitsLoss with reduction='sum', and no optimizer step is taken. I feed the same data samples to two identical copies of the model to compute loss and gradients. When I pass all 32 samples as a single batch and call loss.backward(), the summed loss is 248.0272. When I split the same samples into several smaller batches and compute loss and loss.backward() sequentially, the summed loss is 245.8620. The unreduced, per-class losses also differ between the two runs.

Here is my code; it would be very kind of you to try to reproduce this.
I have already checked that feeding the same single batch to both models gives no difference in loss or gradients.

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor, Compose
from torchvision.models import resnet18
from torch.utils.data import DataLoader, TensorDataset
import torch
from torch.nn.functional import one_hot
import numpy as np
import copy
import random


def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True



setup_seed(20)

model = resnet18(pretrained=False, num_classes=10)
model1 = copy.deepcopy(model)

opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_func = torch.nn.BCEWithLogitsLoss(reduction='sum')

mnist = MNIST('./data/', download=True, transform=Compose([ToTensor()]))

dataloader = DataLoader(mnist, batch_size=32, shuffle=False)
dataloader1 = DataLoader(mnist, batch_size=2, shuffle=False)



grads_set = set()
data_list = []
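# (1) one full batch of 32 samples through `model`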
for data in dataloader:
    data, label = data
    data = data.repeat(1, 3, 1, 1)
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model(data)
    loss = loss_func(output, y_one_hot)
    print(loss)
    loss = torch.sum(loss)
    loss.backward()

    grads = []
    data_list = data.numpy()

    break

print()
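# (2) the same 32 samples as 16 mini-batches of 2 through `model1`, accumulating loss and gradients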
num = 0
grads = []
data_list1 = []
loss_sum = 0
for data in dataloader1:
    data, label = data
    data = data.repeat(1, 3, 1, 1)
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model1(data)
    loss = loss_func(output, y_one_hot)
    loss = torch.sum(loss)
    loss.backward()
    loss_sum = loss_sum + loss.detach().numpy()
    if len(grads) == 0:
        for name, params in model1.named_parameters():
            grads.append(params.grad.data.numpy())
    else:
        for idx, (name, params) in enumerate(model1.named_parameters()):
            grads[idx] += params.grad.data.numpy()
    data_list1.extend(data.numpy())
    num += 1
    if num == 16:
        break

print(loss_sum)
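# compare the per-parameter mean gradients of the two models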
grad_mean = []
for name, param in model.named_parameters():
    grad_mean.append(np.mean(param.grad.data.numpy()))

grad_mean1 = []
for name, param in model1.named_parameters():
    grad_mean1.append(np.mean(param.grad.data.numpy()))

print(np.array(grad_mean) - np.array(grad_mean1))

opt.zero_grad()
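# (3) the full batch through `model` again, after zeroing its gradients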
for data in dataloader:
    data, label = data
    data = data.repeat(1, 3, 1, 1)
    print(label)
    y_one_hot = one_hot(label, 10).float()
    output = model(data)
    loss = loss_func(output, y_one_hot)
    loss.backward()

    grads = []
    data_list = data.numpy()

    break


grad_mean = []
for name, param in model.named_parameters():
    grad_mean.append(np.mean(param.grad.data.numpy()))


print(np.array(grad_mean) - np.array(grad_mean1))

Is this caused by pooling, dropout, or batch normalization?

If the model uses dropout (or other layers with random behavior) I would expect to see a larger difference in the output, so I assume you are running into the expected limited floating point precision due to a different order of operations as seen e.g. here:

x = torch.randn(100, 100, 100)
y1 = x.sum()
y2 = x.sum(0).sum(0).sum(0)
print(y1 - y2)
> tensor(6.1035e-05)
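
Repeating the same check in double precision should shrink the difference by several orders of magnitude, which would be consistent with a summation-order effect rather than anything model-specific (a quick sketch, same shapes as above):

x = torch.randn(100, 100, 100, dtype=torch.float64)
y1 = x.sum()
y2 = x.sum(0).sum(0).sum(0)
# the gap should now be roughly 1e-12 or smaller instead of ~1e-5
print(y1 - y2)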

After removing the Dropout layer, the loss values and gradients of the two methods are almost the same. Thanks for your help!

  1. As @ptrblck mentioned, you might have to put the network into eval mode to fix BN.
  2. There is a small issue in your observation code that actually modifies the gradients: .numpy() doesn't make a copy but returns a view of the gradient's storage, and you then += into it. Also, if you let the gradients accumulate across batches (by not calling zero_grad), you don't have to accumulate them manually, or you end up counting the earlier batches multiple times. If you comment out the manual accumulation and just compare the p.grad of the parameters afterwards, they seem to be the same; see the sketch below for a safe way to copy gradients.
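
A minimal sketch of the copying pattern, reusing model1 from the code above (the list names here are just for illustration):

# take detached copies so later backward() calls or in-place ops can't change them
grads = [p.grad.detach().clone() for p in model1.parameters()]

# if numpy arrays are needed, copy explicitly instead of aliasing the gradient storage
grads_np = [p.grad.detach().numpy().copy() for p in model1.parameters()]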

Thanks for your reply! BN and Dropout change the layer outputs, which is why I got different loss values. In my observation code, the summed loss is only used to compare the loss of the full batch with the loss accumulated over the small batches; it does not contribute to the gradients. You reminded me that opt.zero_grad() is needed to avoid accumulating gradients! In my experiment now, the remaining absolute error of the gradients comes from the limits of floating-point precision: with reduction='mean' in BCEWithLogitsLoss the average error is around 1e-10, and with reduction='sum' it is around 1e-8.
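
A quick way to see why 'sum' shows a larger absolute gap than 'mean' is to compare the two reductions on a deterministic toy model (a rough sketch using a plain Linear layer instead of the network above, so BN and Dropout play no role):

import torch

torch.manual_seed(0)
toy = torch.nn.Linear(10, 10)  # deterministic toy model, no BN/Dropout
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32, 10)).float()

for reduction in ['mean', 'sum']:
    loss_func = torch.nn.BCEWithLogitsLoss(reduction=reduction)
    full = loss_func(toy(x), y)  # one batch of 32 samples
    # the same 32 samples as 16 mini-batches of 2
    parts = sum(loss_func(toy(x[i:i + 2]), y[i:i + 2]) for i in range(0, 32, 2))
    if reduction == 'mean':
        parts = parts / 16  # each part already averages over its own mini-batch
    print(reduction, (full - parts).abs().item())

Because the summed losses are on the order of a few hundred while the mean losses are below one, the same relative rounding error turns into a larger absolute gap for reduction='sum'.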