Unused model parameters affect optimization for Adam

I recently encountered a situation where some of the model parameters are not updated during certain iterations. The unused parameters are those that are not in the computation graph (after backward(), their gradients are None).

I find that the training result is different when the model does not contain those unused parameters. The only reason I can think of is the Adam optimizer, and maybe other adaptive learning rate optimizers, because these optimizers take all model parameters as input when initialized, but I only update part of them.

Does anyone know how I should resolve this issue? I am not sure whether the reason I propose above is correct.


Could you post a small code snippet showing how these “unused” parameters are created?
I would like to reproduce this issue and debug a bit as currently I can only speculate about the reason.

For example, I have a maximum number of nn.Conv2d layers. However, the forward pass may use only the first few layers in the computation. How many layers to use is an input of the forward pass. In other words, the depth of the model is dynamic. For now, I have only tested with a fixed depth that is smaller than the maximum depth. The nn.Conv2d layers are created in a for loop and stored in a list, which I then wrap in nn.ModuleList. I am not sure if this phenomenon really exists.
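For readers following along, a minimal sketch of this kind of setup might look like the following. The class name DynamicDepthNet, the layer sizes, and the depth argument are assumptions made for illustration, not the original poster's model. It only shows that layers skipped in forward stay registered but end up with a gradient of None after backward().

```python
import torch
import torch.nn as nn


class DynamicDepthNet(nn.Module):
    """Hypothetical model: holds max_depth conv layers, uses only the first `depth`."""

    def __init__(self, max_depth=4, channels=3):
        super().__init__()
        layers = [nn.Conv2d(channels, channels, 3, 1, 1) for _ in range(max_depth)]
        self.conv_list = nn.ModuleList(layers)

    def forward(self, x, depth):
        # Only the first `depth` layers take part in the computation graph.
        for conv in self.conv_list[:depth]:
            x = torch.relu(conv(x))
        return x


model = DynamicDepthNet(max_depth=4)
out = model(torch.randn(2, 3, 8, 8), depth=2)
out.mean().backward()

# Collect the names of parameters that received no gradient.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)

# All layers are still registered, used or not.
num_registered = len(model.state_dict())
print(num_registered)
```

With depth=2, the parameters of the third and fourth conv layers should be the ones listed as unused, while all of them still appear in the state_dict.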

As long as you store the Modules in a ModuleList they should be properly registered.
Are these modules missing in the state_dict?
If so, could you post the model definition?

They are in the state_dict().

Sorry! Now I get the issue.
Your parameters are all properly registered but unused.
In the case where unused parameters (never called in forward) are in the model, the performance is worse compared to the plain model without the unnecessary parameters. Let me try to debug it.

Thank you so much.
Sorry for the confusion I caused.
I am not sure whether this phenomenon exists or is just an illusion.
I use Adam as the optimizer and PyTorch 0.4.1.

I created a small code snippet comparing two models. One model uses all its modules, while the other one has some unused modules.
The code passes for 0.4.1 and 1.0.0.dev20181014.
You can find the code here.

Let me know if you can spot any differences between your implementation and mine.

I modified your code a little bit and I can reproduce the error with nn.ModuleList().
I also tried SGD, and it shows the same error.
Can you help me verify this?
You can find the code here

Thanks for the code update!
The difference in your implementation is due to the order in which the layers are instantiated. Since you are creating the unused conv layers before the linear layer, the PRNG gets additional calls and the weights of the linear layer will differ.
If you change the __init__ method of MyModelUnused to

        super(MyModelUnused, self).__init__()
        self.conv_list = nn.ModuleList()
        self.conv_list.append(nn.Conv2d(3, 6, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(6, 12, 3, 1, 1))
        self.pool1 = nn.MaxPool2d(2)
        self.pool2 = nn.MaxPool2d(2)
        self.fc = nn.Linear(12*6*6, 2)
        self.conv_list.append(nn.Conv2d(12, 24, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(24, 12, 3, 1, 1))

you’ll get the same results again.
Alternatively, you could set the seed before instantiating each layer.
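As a rough sketch of the "seed before each layer" idea: the helper seeded and the two small models below are hypothetical and only meant to illustrate the mechanism. Each layer gets its own fixed seed, so an extra layer created in between no longer shifts the random numbers used for the layers after it.

```python
import torch
import torch.nn as nn


def seeded(seed, make_layer):
    """Hypothetical helper: reset the global PRNG, then build one layer."""
    torch.manual_seed(seed)
    return make_layer()


class Plain(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = seeded(0, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        self.fc = seeded(1, lambda: nn.Linear(6 * 6 * 6, 2))


class WithUnused(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = seeded(0, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        # Extra layer created before fc. It consumes random numbers,
        # but fc is re-seeded right afterwards, so fc is unaffected.
        self.unused = seeded(2, lambda: nn.Conv2d(6, 12, 3, 1, 1))
        self.fc = seeded(1, lambda: nn.Linear(6 * 6 * 6, 2))


a, b = Plain(), WithUnused()
same_conv = torch.equal(a.conv.weight, b.conv.weight)
same_fc = torch.equal(a.fc.weight, b.fc.weight)
print(same_conv, same_fc)
```

This works regardless of the order of the layers in __init__, but it means choosing and tracking one seed per layer, which is only worthwhile for debugging.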


Can you explain what PRNG is? I am still confused. Why does the ordering of the initialization matter?
I have never seen documentation on the ordering of module initialization. Should I always initialize the ModuleList first? Is there a disciplined way of doing this so that I can avoid this kind of problem?

Sorry for not being clear enough.
By PRNG I mean the Pseudorandom Number Generator.
The ordering just matters for the sake of debugging, as we are dealing with pseudorandom numbers.

In order to compare the weights and gradients, we should make sure both models have the same parameters.
One way would be to initialize one model and copy the parameters into the other.
Another way is to seed the PRNG for both models and just sample the same “random” numbers.

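You can think about seeding the random number generation as setting a start value. All “random” numbers will be the same after setting the same seed. For example, seeding, sampling twice with torch.randn, then re-seeding with the same value and sampling twice again gives: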

> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

Although we call torch.randn, we get the same “random” numbers in the successive calls.
Now if you add the unused layers before the linear layer, the PRNG will get an additional call to sample the parameters of these layers, which will influence the linear layer parameters.
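A small sketch of both effects, assuming an arbitrary seed value (SEED = 0 here is not the seed used for the output above) and made-up layer sizes:

```python
import torch
import torch.nn as nn

SEED = 0  # arbitrary value chosen for this sketch

torch.manual_seed(SEED)
first = torch.randn(5)

torch.manual_seed(SEED)
second = torch.randn(5)

# Same seed and same call sequence: identical samples.
repeatable = torch.equal(first, second)

# Build a linear layer right after seeding.
torch.manual_seed(SEED)
fc_a = nn.Linear(4, 2)

# Build the same layer after an extra layer has consumed random numbers.
torch.manual_seed(SEED)
_extra = nn.Conv2d(3, 6, 3, 1, 1)
fc_b = nn.Linear(4, 2)

# The extra layer shifted the PRNG state, so fc_b starts from other weights.
shifted = not torch.equal(fc_a.weight, fc_b.weight)
print(repeatable, shifted)
```

Both flags should come out as True: the seed fixes the sequence, and anything that draws from the sequence earlier changes what later layers receive.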

Usually, you don’t have to think about these issues. As I said, it’s just to debug your issue.

Thank you so much.
I moved the module_list after all other modules and it works.
If I understand it correctly, since it only affects the PRNG, it should not create a performance issue. However, the performance will differ slightly between runs because of the PRNG, even if we seed in advance.

Are you getting the same or comparable results now?
I still think the unused parameters are no problem for the optimizer and your results should be comparable.
Note that I performed the tests on CPU and seeded unusually often just for the sake of debugging.
If you don’t properly seed or use the GPU with non-deterministic operations, you will get slight differences.
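One possible way to reduce these run-to-run differences is sketched below. The helper name seed_everything is an assumption, and the cuDNN flags only matter when running on a GPU. Some GPU operations may remain non-deterministic even with these settings.

```python
import random

import torch


def seed_everything(seed):
    """Hypothetical helper: seed the PRNGs and request deterministic cuDNN kernels."""
    random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; it is ignored if CUDA is unavailable.
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
a = torch.randn(3)
seed_everything(0)
b = torch.randn(3)

same_samples = torch.equal(a, b)
print(same_samples)
```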

After I moved conv_list behind all other modules, it passes the test.
However, I’m not sure this is valid in other cases.
For example, I have an encoder and a decoder, and either of them can have a variable depth. Then I have a large wrapper module for the encoder and decoder. Even if I move the initialization of conv_list within the encoder and the decoder themselves, there might still be problems because of the wrapper module.

Could you give me an example of seeding before instantiating each layer?

I understand I will get differences, but I am wondering whether this effect degrades the performance in general. If it only affects debugging, it might not be a big issue.

I think the seeding approach is getting cumbersome for more complex use cases.
Let’s just initialize the model with unused modules and load the parameters in the “complete” model.
Could you add this code to the gist and compare the results again?

def copy_params(modelA, modelB):
    # Copy every parameter of modelB that also exists in modelA into modelA
    modelA_dict = modelA.state_dict()
    modelB_dict = modelB.state_dict()
    equal_dict = {k: v for k, v in modelB_dict.items() if k in modelA_dict}
    modelA.load_state_dict(equal_dict)

modelA = MyModel()
modelB = MyModelUnused()

copy_params(modelA, modelB)

# Check weights for equality
check_params(modelA, modelB)

The results are the same if we copy the parameters.
I found another problem. If I have two module_lists, each initialized with a different method, moving the module_list behind the fc layer still gives errors.
You can see the code here

I’ve checked your code and it seems you are copying the parameters between the models and then applying the seeding approach afterwards. Just remove the second approach, as it’s no longer a good fit for the way your models are constructed.
Remove these lines and you’ll get the same results:

modelA = MyModel()
modelB = MyModelUnused()

So if I do not manually seed, there will not be any problems?

No, the manual seed is not the issue. I’ve just used it in my first example to show that the optimizer does not have any problems optimizing a model with unused parameters.
Even if we copy all parameters between models, the optimizer works identically.
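A compact sketch of that check is below. The class Net, the layer sizes, and the random data are assumptions for illustration, not the code from the gist. Both models start from the same fc weights, only one carries an unused layer, and both are trained with Adam on the same batch.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self, with_unused):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        if with_unused:
            # Registered but never called in forward.
            self.unused = nn.Linear(4, 4)

    def forward(self, x):
        return self.fc(x)


torch.manual_seed(0)
plain = Net(with_unused=False)
extra = Net(with_unused=True)
# Copy the shared parameters so both models start from the same weights.
extra.fc.load_state_dict(plain.fc.state_dict())
unused_before = extra.unused.weight.detach().clone()

opt_plain = torch.optim.Adam(plain.parameters(), lr=1e-2)
opt_extra = torch.optim.Adam(extra.parameters(), lr=1e-2)

x, y = torch.randn(16, 4), torch.randn(16, 2)
for _ in range(20):
    for model, opt in ((plain, opt_plain), (extra, opt_extra)):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

# The shared layer should end up with matching weights in both models.
same_fc = torch.allclose(plain.fc.weight, extra.fc.weight)
# Parameters whose gradient is None are skipped by the optimizer step.
unused_untouched = torch.equal(extra.unused.weight, unused_before)
print(same_fc, unused_untouched)
```

The optimizer skips any parameter whose .grad is None, so the unused layer neither changes nor influences the update of fc.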

So back to your original question. The discrepancy you are observing is not due to some unused parameters in your model. However, if your whole training procedure is very sensitive to the initialization, and therefore to the seeding as well, you might get these results.

To debug the problem I would suggest trying the following:

  • Compare the results of your “good” model using several random seeds at the beginning of your script. If the accuracy stays approximately the same, it should be alright. If you see an accuracy drop for different seeds, I would suggest using some weight init functions to see if we can stabilize the performance.
  • Create your good model and save the state_dict after initializing the model. Create another script using your model containing unused layers and load the state_dict for the common layers. Then train this model and see how its performance compares to the initial model.
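The second suggestion could be sketched roughly like this. The classes Good and WithUnused are hypothetical stand-ins for the real models, and an in-memory buffer replaces the file you would normally write between the two scripts. Passing strict=False lets load_state_dict skip the layers that have no counterpart in the saved state_dict.

```python
import io

import torch
import torch.nn as nn


class Good(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)


class WithUnused(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 4)  # never used in forward


# "Script 1": create the good model and save its freshly initialized weights.
good = Good()
buffer = io.BytesIO()  # stands in for a file on disk
torch.save(good.state_dict(), buffer)

# "Script 2": load those weights into the model with unused layers.
buffer.seek(0)
model = WithUnused()
result = model.load_state_dict(torch.load(buffer), strict=False)

# strict=False reports the keys that were not found in the checkpoint.
print(result.missing_keys)
shared_equal = torch.equal(model.fc.weight, good.fc.weight)
print(shared_equal)
```

For the first suggestion, the same training script can simply be run in a loop over a handful of seeds, calling torch.manual_seed(seed) at the start of each run and comparing the final accuracies.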

I still doubt that the optimizer is causing this issue.