What's the difference between these two simple nn models?

Two nn models are defined below; for simplicity, let’s call them LeNet1 and LeNet2. From the code snippets, we can see that their forward functions are exactly the same; the only difference is the order in which the layers are defined in the class initialization method. LeNet2 simply declares them out of the usual order.

Since the forward functions are identical, I’d expect these two networks to generate the same output when fed the same input. However, the experimental results do not support that.

Any thoughts on this? Much appreciated.

import os
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class LeNet1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # output: (1, 6, 24, 24)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)              # flatten to (batch, 16*4*4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class LeNet2(nn.Module):
    def __init__(self):
        super().__init__()
        # Same layers as LeNet1, declared out of order:
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc3 = nn.Linear(84, 10)
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.fc2 = nn.Linear(120, 84)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def seed_everything(seed=123456):
    # Seed all relevant PRNGs for reproducibility.
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

seed_everything()
inp = torch.randn(1, 3, 28, 28)
test1 = LeNet1()
test2 = LeNet2()

test1(inp), test2(inp)

Outputs:
(tensor([[ 0.1197, -0.0631,  0.0227, -0.0620,  0.0760,  0.0856,  0.0775, -0.0713,
      -0.0762,  0.0417]], grad_fn=<AddmmBackward>),
 tensor([[-0.0264,  0.0233,  0.0904, -0.0755,  0.0279, -0.0459,  0.0838, -0.0263,
      -0.0738,  0.0075]], grad_fn=<AddmmBackward>))

I was also trying to print out their network architectures to see if there was any minor difference I had missed. However, the result shows they have exactly the same network graph.
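
One quick way to confirm that the discrepancy lies in the parameter values rather than in the architecture is to compare the weights directly (a minimal check using the test1 and test2 instances constructed above):

# The module structure is identical, but the parameter values are not:
print(torch.equal(test1.conv1.weight, test2.conv1.weight))  # False
print(torch.equal(test1.fc1.weight, test2.fc1.weight))      # False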

I don’t know if you are seeding your script, but even for the same architecture you would have to set the seed before the initialization of each model.
However, in your case this would still lead to different results, as the order of the layer initializations is different. Instead, you should transfer the state_dict of one model to the other to get the same results (this works regardless of the declaration order, because the state_dict entries are matched by attribute name, not by position):

model1 = LeNet1()
model2 = LeNet2()
model2.load_state_dict(model1.state_dict())  # copy model1's parameters into model2

x = torch.randn(1, 3, 28, 28)
output1 = model1(x)
output2 = model2(x)
print(output1, '\n', output2)
> tensor([[ 0.0253, -0.0044, -0.0149, -0.0370, -0.1327,  0.0809,  0.1206, -0.0214,
         -0.0096,  0.0013]], grad_fn=<AddmmBackward>) 
 tensor([[ 0.0253, -0.0044, -0.0149, -0.0370, -0.1327,  0.0809,  0.1206, -0.0214,
         -0.0096,  0.0013]], grad_fn=<AddmmBackward>)
print(output1 == output2)
> tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.uint8)

@ptrblck, thanks for your response.

It is true that we can get exactly the same outputs by transferring the state of one model to the other. However, my real confusion, specifically for my example, is why the order of the layer initializations leads to different weight/bias values. My understanding is that there shouldn’t be any dependency between the parameter initializations of two different layers. For example, conv2’s initialization doesn’t depend on any outputs from conv1. Any thoughts?

Best
GR

You are right: the initialization does not depend on the outputs.
However, you usually initialize your layers using some “random” numbers.
These random numbers are created by a pseudo-random number generator (PRNG).
We can seed the PRNG so that, after seeding, we always get the same sequence of “random” numbers.

Now let’s create a small dummy model with just two layers, one conv layer and one linear layer.
By changing the order of these layers, we can create two models like in your example:

# model1:
- seed the PRNG
- init conv layer
- init linear layer

# model2:
- seed the PRNG
- init linear layer
- init conv layer

Both models will have the desired initialization, e.g. xavier_uniform. However, their parameters won’t have exactly the same values.
The reason is that the PRNG was called in a different order.

Have a look at this small example:

torch.manual_seed(0)
print(torch.empty(5).uniform_())
print(torch.empty(5).normal_())

# Same results: same seed, same call order
torch.manual_seed(0)
print(torch.empty(5).uniform_())
print(torch.empty(5).normal_())

# Different results: same seed, but the call order is swapped
torch.manual_seed(0)
print(torch.empty(5).normal_())
print(torch.empty(5).uniform_())
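
To tie this back to the layers themselves, here is a minimal sketch along the same lines (reusing the torch / nn imports from the snippets above): the construction order is swapped, so each layer consumes a different slice of the random stream.

torch.manual_seed(0)
conv_a = nn.Conv2d(3, 6, 5)   # consumes the first values of the stream
lin_a = nn.Linear(10, 10)     # consumes the following values

torch.manual_seed(0)
lin_b = nn.Linear(10, 10)     # now the linear layer draws first
conv_b = nn.Conv2d(3, 6, 5)   # and the conv layer gets a different slice

# Both pairs follow the correct init distribution, but the values differ:
print(torch.equal(conv_a.weight, conv_b.weight))  # False
print(torch.equal(lin_a.weight, lin_b.weight))    # False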

@ptrblck

Great answer. It helps a lot.

So we can conclude that the two models (LeNet1 and LeNet2) provided in my example are essentially identical in terms of model architecture. LeNet1 might be the conventional way to define the model, but that doesn’t mean LeNet2 is logically wrong. Is my understanding correct? Please help confirm.

As a side topic, why does calling the PRNG in a different order lead to different values? Thanks,


Yes, both models are identical.
Even though the exact parameter values might differ, you should get approximately the same training results using these models.

After seeding the PRNG you’ll get the same sequence of random numbers.
The layer that gets initialized first receives the first “random” numbers of the sequence, while the second layer receives the subsequent ones.
Now if you change the layer order (like changing the function call order in my example), the random-number assignment changes as well. My example is deliberately simple; in your models the layers also have different numbers of parameters, so each one consumes a different amount of the sequence.
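
If you ever want the initialization to be independent of the declaration order without copying a state_dict, one possible approach (just a sketch; the seeded helper below is hypothetical, and the imports from the snippets above are assumed) is to re-seed the PRNG immediately before constructing each layer, so every layer always draws from the same stream position:

def seeded(make_layer, seed):
    # Reset the PRNG right before construction, so this layer always
    # consumes the same slice of the random stream regardless of order.
    torch.manual_seed(seed)
    return make_layer()

conv_a = seeded(lambda: nn.Conv2d(3, 6, 5), seed=0)
fc_a = seeded(lambda: nn.Linear(16 * 4 * 4, 120), seed=1)

# Reversed construction order, same per-layer seeds:
fc_b = seeded(lambda: nn.Linear(16 * 4 * 4, 120), seed=1)
conv_b = seeded(lambda: nn.Conv2d(3, 6, 5), seed=0)

print(torch.equal(conv_a.weight, conv_b.weight))  # True
print(torch.equal(fc_a.weight, fc_b.weight))      # True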


@ptrblck

Thanks for your detailed explanation. I’m pretty clear now.
