Hi,

My model uses a sigmoid on the last layer and MSE loss, but it does not converge and the loss does not decrease during training. To investigate, I ran some tests with the snippets below.

In the first test:

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        return torch.sigmoid(out)


net = Net(1000, 1)
# Initialize all weights and biases uniformly in [-0.1, 0.1].
for name, param in net.named_parameters():
    if "weight" in name or "bias" in name:
        param.data.uniform_(-0.1, 0.1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5, momentum=0.9)
input_net = torch.randn(100, 100, 1000)
target = torch.ones(100, 100)
mask = torch.randn(100, 100).ge(0.5)  # boolean mask: only these entries count toward the loss
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = []
    for i in range(input_net.size(0)):
        output = net(input_net[i])
        outputs += [output.squeeze(1)]
    outputs = torch.stack(outputs)
    # Per-element loss, then keep only the masked entries.
    loss = F.mse_loss(outputs, target, reduction="none")[mask]
    total_loss = loss.sum()  # summed, not averaged
    print(total_loss)
    total_loss.backward()
    optimizer.step()
```

In this snippet, total_loss barely decreases, which is similar to what I see in my actual model.

Then I made one change:

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        return torch.sigmoid(out)


net = Net(1000, 1)
# Initialize all weights and biases uniformly in [-0.1, 0.1].
for name, param in net.named_parameters():
    if "weight" in name or "bias" in name:
        param.data.uniform_(-0.1, 0.1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5, momentum=0.9)
input_net = torch.randn(100, 100, 1000)
target = torch.ones(100, 100)
mask = torch.randn(100, 100).ge(0.5)  # boolean mask: only these entries count toward the loss
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = []
    for i in range(input_net.size(0)):
        output = net(input_net[i])
        outputs += [output.squeeze(1)]
    outputs = torch.stack(outputs)
    # Per-element loss, then keep only the masked entries.
    loss = F.mse_loss(outputs, target, reduction="none")[mask]
    total_loss = loss.sum() / mask.sum().float()  # change: average loss
    print(total_loss)
    total_loss.backward()
    optimizer.step()
```

In this snippet, I averaged the loss over the masked elements, and with that change the loss decreased rapidly.
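
To make the difference between the two snippets concrete, here is a minimal sketch. It assumes the only difference is the constant factor mask.sum(), and it uses a smaller, hypothetical layer size (and a hypothetical helper weight_grad) so it runs quickly. It compares the gradient of the summed loss with the gradient of the averaged loss on the same weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small sizes, chosen only so the check runs quickly.
fc = nn.Linear(20, 1)
x = torch.randn(50, 20)
target = torch.ones(50)
mask = torch.randn(50).ge(0.5)
mask[0] = True  # guarantee at least one selected element


def weight_grad(average):
    """Return the weight gradient of the masked MSE loss (summed or averaged)."""
    fc.zero_grad()
    out = torch.sigmoid(fc(x)).squeeze(1)
    loss = F.mse_loss(out, target, reduction="none")[mask].sum()
    if average:
        loss = loss / mask.sum().float()
    loss.backward()
    return fc.weight.grad.clone()


grad_sum = weight_grad(average=False)
grad_mean = weight_grad(average=True)
n_selected = mask.sum().item()

# The summed-loss gradient should be n_selected times the averaged-loss gradient.
same_direction = torch.allclose(grad_sum, grad_mean * n_selected, rtol=1e-4, atol=1e-6)
print(n_selected, same_direction)
```

If that holds, the two snippets point in the same direction and differ only in step size. In my snippets about 30% of the 100 x 100 entries pass the mask, so with lr=0.5 the summed version would take steps roughly 3000 times larger, which may be what pushes the sigmoid into saturation.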

I don't fully understand why this change makes such a difference. Can anyone explain it?

If I don't average the loss, what should I do to make the model converge?
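
One option I can think of, as a sketch rather than a confirmed fix: keep the summed loss but divide the learning rate by the number of selected elements. Assuming plain torch.optim.SGD with momentum and no weight decay, this should produce the same updates as averaging the loss. The sizes and the train helper below are hypothetical and only serve to compare the two setups:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small problem, only for comparing the two setups.
x = torch.randn(200, 20)
target = torch.ones(200)
mask = torch.randn(200).ge(0.5)
mask[0] = True  # guarantee at least one selected element
n_selected = mask.sum().item()

base = nn.Linear(20, 1)  # shared starting point for both runs


def train(average, lr, steps=20):
    """Train a copy of `base` with either the averaged or the summed masked loss."""
    net = copy.deepcopy(base)
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    for _ in range(steps):
        opt.zero_grad()
        out = torch.sigmoid(net(x)).squeeze(1)
        loss = F.mse_loss(out, target, reduction="none")[mask].sum()
        if average:
            loss = loss / n_selected
        loss.backward()
        opt.step()
    return net.weight.detach().clone()


w_mean = train(average=True, lr=0.5)               # averaged loss, original lr
w_sum = train(average=False, lr=0.5 / n_selected)  # summed loss, rescaled lr

# Both runs should end at (numerically) the same weights.
print(torch.allclose(w_mean, w_sum, rtol=1e-3, atol=1e-4))
```

If the two runs match, then summing versus averaging is purely a question of learning-rate scale, and the summed version needs a much smaller lr than 0.5.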