Unused model parameters affect optimization for Adam

dem123456789 · October 24, 2018, 11:06am

They are in the state_dict().

ptrblck · October 24, 2018, 11:09am

Sorry! Now I get the issue.
Your parameters are all properly registered but unused.
In the case where unused parameters (never called in forward) are in the model, the performance is worse compared to the plain model without the unnecessary parameters. Let me try to debug it.

dem123456789 · October 24, 2018, 11:11am

Thank you so much.
Sorry for the confusion I caused.
I am not sure whether this phenomenon is real or just an illusion.
I use Adam as the optimizer and PyTorch 0.4.1.

ptrblck · October 24, 2018, 2:14pm

I created a small code snippet comparing two models. One model uses all its modules, while the other one has some unused modules.
The code passes for 0.4.1 and 1.0.0.dev20181014.
You can find the code here.

Let me know if you can spot any differences between your implementation and mine.

dem123456789 · October 24, 2018, 4:22pm

I modified your code a little and I can reproduce the error with nn.ModuleList().
I also tried SGD, and it shows the same error.
Can you help me verify this?
You can find the code here

ptrblck · October 24, 2018, 4:46pm

Thanks for the code update!
The difference in your implementation is due to the order in which the layers are instantiated. Since you are creating the unused conv layers before the linear layer, the PRNG receives additional calls and the weights of the linear layer will differ.
If you change the __init__ method of MyModelUnused to

        super(MyModelUnused, self).__init__()
        self.conv_list = nn.ModuleList()
        self.conv_list.append(nn.Conv2d(3, 6, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(6, 12, 3, 1, 1))
        self.pool1 = nn.MaxPool2d(2)
        self.pool2 = nn.MaxPool2d(2)
        self.fc = nn.Linear(12*6*6, 2)
        self.conv_list.append(nn.Conv2d(12, 24, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(24, 12, 3, 1, 1))

, you’ll get the same results again.
Alternatively, you could set the seed before instantiating each layer.
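
A minimal sketch of the per-layer seeding idea, assuming simplified stand-ins for MyModel and MyModelUnused (these are not the gist's classes). The helper seeded() and the seed values are hypothetical and chosen only for illustration. Because every layer reseeds the PRNG right before it is created, the order of instantiation should no longer matter:

```python
import torch
import torch.nn as nn


def seeded(seed, make_layer):
    """Hypothetical helper: reseed the global PRNG, then build one layer."""
    torch.manual_seed(seed)
    return make_layer()


class MyModel(nn.Module):
    # Simplified stand-in, not the model from the gist.
    def __init__(self):
        super().__init__()
        self.conv = seeded(0, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        self.fc = seeded(1, lambda: nn.Linear(6 * 6 * 6, 2))


class MyModelUnused(nn.Module):
    # Same layers plus one unused conv layer that is created first.
    def __init__(self):
        super().__init__()
        self.unused = seeded(2, lambda: nn.Conv2d(6, 12, 3, 1, 1))
        self.conv = seeded(0, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        self.fc = seeded(1, lambda: nn.Linear(6 * 6 * 6, 2))


model_a = MyModel()
model_b = MyModelUnused()

# The shared layers were built from the same seeds, so they should match.
same_conv = torch.equal(model_a.conv.weight, model_b.conv.weight)
same_fc = torch.equal(model_a.fc.weight, model_b.fc.weight)
print(same_conv, same_fc)
```

This gets verbose quickly for nested models, so it is mainly useful for small debugging scripts.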

dem123456789 · October 24, 2018, 4:59pm

Can you explain what PRNG is? I am still confused. Why does the order of initialization matter?
I have never seen any documentation on the order of module initialization. Should I always initialize the ModuleList first? Is there a principled way of doing this so that I can avoid this kind of problem?

ptrblck · October 24, 2018, 5:07pm

Sorry for not being clear enough.
By PRNG I mean the Pseudorandom Number Generator.
The ordering just matters for the sake of debugging, as we are dealing with pseudorandom numbers.

In order to compare the weights and gradients, we should make sure both models have the same parameters.
One way would be to initialize one model and copy the parameters into the other.
Another way is to seed the PRNG for both models and just sample the same “random” numbers.

You can think about seeding the random number generation as setting a start value. All “random” numbers will be the same after setting the same seed:

torch.manual_seed(2809)
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

torch.manual_seed(2809)
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

Although torch.randn returns “random” values, re-seeding with the same value reproduces exactly the same sequence of numbers in the successive calls.
Now if you add the unused layers before the linear layer, the PRNG will get an additional call to sample the parameters of these layers, which will influence the linear layer parameters.
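
As a small illustration of this effect (a sketch with made-up layer sizes, not the code from the gist), creating a conv layer before the linear layer shifts the values the linear layer receives, while creating it afterwards leaves them untouched:

```python
import torch
import torch.nn as nn

# Reference: only the linear layer is created after seeding.
torch.manual_seed(2809)
fc_only = nn.Linear(4, 2)

# An extra layer created first consumes random numbers from the PRNG.
torch.manual_seed(2809)
_unused_first = nn.Conv2d(3, 6, 3)
fc_after_conv = nn.Linear(4, 2)

# An extra layer created afterwards does not affect the linear layer.
torch.manual_seed(2809)
fc_before_conv = nn.Linear(4, 2)
_unused_last = nn.Conv2d(3, 6, 3)

shifted = not torch.equal(fc_only.weight, fc_after_conv.weight)
unchanged = torch.equal(fc_only.weight, fc_before_conv.weight)
print(shifted, unchanged)
```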

Usually, you don’t have to think about these issues. As I said, it’s just to debug your issue.

dem123456789 · October 24, 2018, 5:13pm

Thank you so much.
I moved the module_list after all other modules and it works.
If I understand correctly, since this only affects the PRNG, it should not cause a performance issue. However, the performance will still differ slightly between runs because of the PRNG, even if we seed in advance.

ptrblck · October 25, 2018, 8:47am

Are you getting the same or comparable results now?
I still think the unused parameters are no problem for the optimizer and your results should be comparable.
Note that I performed the tests on CPU and seeded unusually often just for the sake of debugging.
If you don’t properly seed or use the GPU with non-deterministic operations, you will get slight differences.

dem123456789 · October 25, 2018, 10:59am

After I moved conv_list behind all other modules, it passes the test.
However, I am not sure this is valid in other cases.
For example, I have an encoder and a decoder, each of which can have a variable depth, and a large wrapper module around both. Even if I move the initialization of conv_list to the end inside the encoder and the decoder themselves, it might still cause problems because of the wrapper module.

For the seed before instantiating each layer, could you give me an example?

I understand that I will get differences, but I am wondering whether this effect degrades the performance in general. If it only affects debugging, it might not be a big issue.

ptrblck · October 25, 2018, 11:53am

I think the seeding approach is getting cumbersome for more complex use cases.
Let’s just initialize the model with unused modules and load the parameters in the “complete” model.
Could you add this code to the gist and compare the results again?

def copy_params(modelA, modelB):
    # Copy every entry of modelB's state_dict that also exists in modelA
    modelA_dict = modelA.state_dict()
    modelB_dict = modelB.state_dict()
    equal_dict = {k: v for k, v in modelB_dict.items() if k in modelA_dict}
    modelA.load_state_dict(equal_dict)

modelA = MyModel()
modelB = MyModelUnused()

copy_params(modelA, modelB)

# Check weights for equality
check_params(modelA, modelB)
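
For readers without access to the gist, here is a self-contained sketch of the same idea. The two tiny models are simplified stand-ins, and check_params below is a hypothetical reimplementation of the helper used above (the original lives in the gist and may differ):

```python
import torch
import torch.nn as nn


class MyModel(nn.Module):
    # Simplified stand-in for the "complete" model.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)


class MyModelUnused(nn.Module):
    # Same model plus a registered layer that forward would never call.
    def __init__(self):
        super().__init__()
        self.unused = nn.Linear(4, 4)
        self.fc = nn.Linear(4, 2)


def copy_params(modelA, modelB):
    # Load every entry of modelB that also exists in modelA into modelA.
    modelA_dict = modelA.state_dict()
    equal_dict = {k: v for k, v in modelB.state_dict().items() if k in modelA_dict}
    modelA.load_state_dict(equal_dict)


def check_params(modelA, modelB):
    # Hypothetical stand-in: True if all shared entries are exactly equal.
    dict_b = modelB.state_dict()
    return all(
        torch.equal(v, dict_b[k])
        for k, v in modelA.state_dict().items()
        if k in dict_b
    )


modelA = MyModel()
modelB = MyModelUnused()
copy_params(modelA, modelB)
params_match = check_params(modelA, modelB)
print(params_match)
```

Copying avoids any dependence on the order of instantiation, which is why it scales better than seeding for nested models.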

dem123456789 · October 26, 2018, 9:37am

The results are the same if we copy the parameters.
I found another problem. If I have two module_lists, each initialized with a different method, moving them behind the fc layer still produces errors.
You can see the code here

ptrblck · October 26, 2018, 11:52am

I’ve checked your code and it seems you are copying the parameters between the models and then applying the seeding approach afterwards. Just remove the second approach, as it’s not a good fit anymore given how your models are constructed.
Remove these lines and you’ll get the same results:

torch.manual_seed(2809)
modelA = MyModel()
torch.manual_seed(2809)
modelB = MyModelUnused()

dem123456789 · October 26, 2018, 12:05pm

So if I do not manually seed, there will not be any problems?

ptrblck · October 26, 2018, 12:11pm

No, the manual seed is not the issue. I just used it in my first example to show that the optimizer does not have any problems optimizing a model with unused parameters.
Even if we copy all parameters between models, the optimizer works identically.
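
A rough way to check this claim yourself, assuming two tiny stand-in models with made-up sizes and random data (not the gist's models): both start from the same fc parameters, are trained for a few Adam steps on the same batch, and should end up with matching weights. The unused layer never receives a gradient, so the optimizer skips it.

```python
import torch
import torch.nn as nn


class Used(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


class WithUnused(Used):
    def __init__(self):
        super().__init__()
        self.unused = nn.Linear(4, 4)  # registered but never called in forward


torch.manual_seed(0)
model_a = Used()
model_b = WithUnused()
# Give both models identical parameters for the shared layer.
model_b.fc.load_state_dict(model_a.fc.state_dict())
unused_before = model_b.unused.weight.detach().clone()

x = torch.randn(8, 4)
y = torch.randn(8, 2)

for model in (model_a, model_b):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(5):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

same_fc = torch.allclose(model_a.fc.weight, model_b.fc.weight)
unused_untouched = torch.equal(unused_before, model_b.unused.weight)
no_grad_for_unused = model_b.unused.weight.grad is None
print(same_fc, unused_untouched, no_grad_for_unused)
```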

So back to your original question. The discrepancy you are observing is not due to some unused parameters in your model. However, if your whole training procedure is very sensitive to the initialization, and therefore to the seeding as well, you might get these results.

To debug the problem I would suggest to try the following:

1. Compare the results of your “good” model using several random seeds at the beginning of your script. If the accuracy stays approximately the same, it should be alright. If you see an accuracy drop for different seeds, I would suggest using some weight init functions to see if we can stabilize the performance.
2. Create your good model and save the state_dict after initializing the model. Create another script using your model containing unused layers and load the state_dict for the common layers. Then train this model and see how its performance compares to the initial model.

I still doubt that the optimizer is causing this issue.
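
The second suggestion could look roughly like the sketch below. The model classes are simplified stand-ins, and an in-memory buffer replaces the checkpoint file that two separate scripts would share. Passing strict=False to load_state_dict is one way to load only the common layers; it is an assumption about how you might do it, not the only option.

```python
import io

import torch
import torch.nn as nn


class Good(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)


class WithUnused(nn.Module):
    def __init__(self):
        super().__init__()
        self.unused = nn.Linear(4, 4)  # not present in the saved state_dict
        self.fc = nn.Linear(4, 2)


# "Script 1": save the state_dict right after initialization.
good = Good()
buffer = io.BytesIO()  # stands in for a file path such as a .pt file
torch.save(good.state_dict(), buffer)

# "Script 2": load the common layers into the model with unused layers.
buffer.seek(0)
model = WithUnused()
result = model.load_state_dict(torch.load(buffer), strict=False)

# With strict=False, entries missing from the checkpoint are reported
# instead of raising an error.
print(result.missing_keys)
print(result.unexpected_keys)
```

After this, both models start training from the same values for the shared layers, so any remaining difference cannot come from the initialization.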

dem123456789 · October 26, 2018, 12:18pm

I think the PRNG is the reason. We can specify our own initialization, but I believe this kind of behaviour is somewhat hidden and undesirable.

forkkr · July 5, 2020, 6:39am

Hi @ptrblck, I have just checked that a simple model gives the same results with or without unused registered parameters. That’s great. But I got confused when I tried to check whether the model is updating its parameters. To do this, I used the following lines:

        a = list(model.parameters())[0].clone()
        loss.backward()
        optimizer.step()
        b = list(model.parameters())[0].clone()
        print(torch.equal(a.data, b.data))

What should the output be when the model updates its parameters? It should be ‘False’, because the parameters are updated after loss.backward() and optimizer.step(), so a and b should not be the same. When I run the model without any unused parameters, it gives ‘False’. But when there are unused parameters, it gives ‘True’. Could you please tell me what the reason is? I could not figure it out myself.

ptrblck · July 6, 2020, 1:39am

Could you check, which parameter you are comparing via:

list(model.parameters())[0].clone()

If the first parameter (accessed via [0]) is the unused parameter, it will not be updated, so your check printing ‘True’ is the expected behavior.
You could use dict(model.named_parameters()) to also get the name of the parameter besides the values.
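
A minimal sketch of such a check, assuming a toy model in which the unused layer is registered first (so it is the parameter at index 0) and a plain SGD optimizer. It compares every named parameter before and after a single step:

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.unused = nn.Linear(4, 4)  # registered first, never used in forward
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot all parameters by name before the update.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# A parameter counts as updated if its values differ from the snapshot.
changed = {
    name: not torch.equal(before[name], p.detach())
    for name, p in model.named_parameters()
}
for name, flag in changed.items():
    print(name, "updated" if flag else "unchanged")
```

Checking by name avoids relying on the position of a parameter in model.parameters(), which follows the registration order.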

forkkr · July 6, 2020, 3:18am

Thanks @ptrblck. Got the point.