Does MSE_loss have to use size_average?

Hi,

My model uses a sigmoid on the last layer together with MSE_loss, but it doesn't converge and the loss doesn't decrease during training. So I ran some tests with small snippets.
In the first test:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        return F.sigmoid(out)

net = Net(1000, 1)

# initialize all weights and biases uniformly in [-0.1, 0.1]
for name, param in net.named_parameters():
    if "weight" in name or "bias" in name:
        param.data.uniform_(-0.1, 0.1)

optimizer = torch.optim.SGD(net.parameters(), lr=0.5, momentum=0.9)

input_net = torch.randn(100, 100, 1000)
target = torch.ones(100, 100)
mask = torch.randn(100, 100).ge(0.5)


for epoch in range(1000):
    optimizer.zero_grad()

    outputs = []
    for i in range(input_net.size(0)):
        output = net(input_net[i])
        outputs += [output.squeeze(1)]

    outputs = torch.stack(outputs)
    # reduce=False keeps the per-element losses; the mask then selects a subset of them
    loss = F.mse_loss(outputs, target, reduce=False)[mask]
    total_loss = loss.sum()

    print(total_loss)
    total_loss.backward()
    optimizer.step()

With this snippet, total_loss barely decreases, which matches the behavior of the model I mentioned above.
Then I made one change:

class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        return F.sigmoid(out)

net = Net(1000, 1)

for name, param in net.named_parameters():
    if "weight" in name or "bias" in name:
        param.data.uniform_(-0.1, 0.1)

optimizer = torch.optim.SGD(net.parameters(), lr=0.5, momentum=0.9)

input_net = torch.randn(100, 100, 1000)
target = torch.ones(100, 100)
mask = torch.randn(100, 100).ge(0.5)


for epoch in range(1000):
    optimizer.zero_grad()

    outputs = []
    for i in range(input_net.size(0)):
        output = net(input_net[i])
        outputs += [output.squeeze(1)]

    outputs = torch.stack(outputs)
    loss = F.mse_loss(outputs, target, reduce=False)[mask]
    total_loss = loss.sum() / mask.sum().float()   # change: average over the number of masked elements

    print(total_loss)
    total_loss.backward()
    optimizer.step()

In this version I average the loss over the masked elements, and the loss decreases rapidly.

I don't fully understand why this change makes such a difference. Can anyone explain it?

And if I don't size-average, what can I do to make the model converge?

Because your losses are most likely in a different range (sum vs. mean), you would have to change your learning rate accordingly to get the same weight updates.
In your first example your learning rate might just be too high for the high loss values.
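
To make the scaling concrete, here is a minimal sketch (my own illustration using the newer reduction= argument, not code from this thread): with a sum-reduced loss the gradient is N times larger than with a mean-reduced loss over the same N elements, so dividing the learning rate by N should give roughly the same update.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# two copies of the same tiny model, just to compare gradients
model_sum = nn.Linear(10, 1)
model_mean = nn.Linear(10, 1)
model_mean.load_state_dict(model_sum.state_dict())

x = torch.randn(8, 10)
y = torch.ones(8, 1)
n = y.numel()

# sum-reduced MSE: gradient is n times larger
F.mse_loss(torch.sigmoid(model_sum(x)), y, reduction='sum').backward()

# mean-reduced MSE
F.mse_loss(torch.sigmoid(model_mean(x)), y, reduction='mean').backward()

# the gradients differ exactly by the factor n, so
# lr_for_sum = lr_for_mean / n gives the same SGD step
print(torch.allclose(model_sum.weight.grad, model_mean.weight.grad * n))  # expected: True

If I read your first snippet correctly, its summed loss is mask.sum() times larger than the averaged one, so a learning rate of about 0.5 / mask.sum() with the summed loss should behave like lr=0.5 with the averaged loss.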

Thanks for your reply.
I tried the first example with several different learning rates, but the results are still bad.
What's wrong with it?