Simply adding an nn.Linear to a module changes SGD behavior

I have observed what I regard as strange/undesirable behavior. This was observed with Python 3.7.4; I have not verified other versions.

In essence, simply adding an nn.Linear module which is not used anywhere in the network changes the behavior of SGD and Adam.

Here’s a simple example showing the behavior:

import torch
from torch import nn, optim
import numpy as np
import torch.utils.data as data_utils
from torch.nn.parameter import Parameter


class test(nn.Module):
    def __init__(self):
        super(test, self).__init__()
        self.a = Parameter(torch.zeros(1,1),requires_grad=True)
#        self.nuisance = torch.nn.Linear(10,10,bias=True)

    def forward(self, x):
        return self.a*x

    def loss_function(self, x):
        return torch.sum(self.forward(x))

def get_loader():
    x_train = torch.rand(1024,1)
    y_train = torch.rand(1024,1)

    train = data_utils.TensorDataset(x_train, y_train)
    train_loader = data_utils.DataLoader(train, batch_size=128, shuffle=True)
    return train_loader

device = "cpu"
model = test().to(device)
train_loader = get_loader()
optimizer = optim.SGD(model.parameters(),lr=0.0001)

def train():

    # set the model to train mode
    model.train()

    train_loss = 0

    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()
        loss_total = model.loss_function(data)
        loss_total.backward()
        optimizer.step()
        train_loss += loss_total.item()
        print(loss_total.item() / len(data))

    train_loss /= len(train_loader.dataset)

    return train_loss

if __name__ == "__main__":
    train()

Commenting/uncommenting the “nuisance” Linear module results in a different printout. With “nuisance” commented out:


while with “nuisance” added to the network:


In my opinion, this is highly undesirable. In my original (much more complex) example, this behavior resulted in markedly different values for the optimized loss function.

What is happening here?


This happens because constructing the module initialises its weights and bias, which consumes values from the global random number generator. Any later use of the generator (for example, the `torch.rand` calls in `get_loader`) then produces different results, because the RNG is in a different state than it would have been without the extra module.
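A minimal way to verify this (my own snippet, not from the original post): seed the generator, optionally construct an `nn.Linear`, and compare the next random draw.

```python
import torch
from torch import nn


def next_rand(with_nuisance: bool) -> float:
    # Reset the RNG so both runs start from the same state
    torch.manual_seed(0)
    if with_nuisance:
        # Constructing the layer initialises its weights and bias,
        # consuming random numbers from the global generator
        _ = nn.Linear(10, 10, bias=True)
    # This draw depends on how many numbers were already consumed
    return torch.rand(1).item()


a = next_rand(False)
b = next_rand(True)
print(a, b)  # the two draws differ: nn.Linear advanced the RNG state
```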

Thanks!! Your diagnosis does make a lot of sense. I was working under the assumption that there was some kind of static graph analysis even before things like initialization ever took place. Learn something new every day!
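A follow-up workaround (my suggestion, not stated in the thread): re-seed the generator after model construction, so data generation no longer depends on how many parameters the model happened to initialise.

```python
import torch
from torch import nn

torch.manual_seed(0)
model_a = nn.Linear(1, 1)                   # consumes some RNG draws

torch.manual_seed(0)
model_b = nn.Sequential(nn.Linear(1, 1),
                        nn.Linear(10, 10))  # consumes more RNG draws

# Re-seeding here decouples data generation from model initialisation
torch.manual_seed(42)
x_a = torch.rand(4, 1)
torch.manual_seed(42)
x_b = torch.rand(4, 1)
print(torch.equal(x_a, x_b))  # identical data despite different models
```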