Unused model parameters affect optimization for Adam

dem123456789 · October 19, 2018, 6:17am

I recently encountered a situation where some of the model parameters are not updated during certain iterations. The unused parameters are those that are not in the computation graph (after backward(), the gradients of those unused parameters are None).

I find that the training result is different when the model does not have those unused parameters. The only reason I can think of is the Adam optimizer, and maybe other optimizers with adaptive learning rates, because those optimizers take all model parameters as input when initialized, but I only update part of them.

Does anyone know how I should resolve this issue? I am not sure whether the reason I propose above is correct.

ptrblck · October 19, 2018, 1:36pm

Could you post a small code snippet showing how these “unused” parameters are created?
I would like to reproduce this issue and debug a bit as currently I can only speculate about the reason.

dem123456789 · October 24, 2018, 10:59am

For example, I have a maximum number of nn.Conv2d layers. However, the forward pass may use only the first few layers in its computation. The number of layers to use is an input to the forward pass. In other words, the depth of the model is dynamic. For now, I have only tested with a fixed depth that is smaller than the maximum depth. The nn.Conv2d layers are created in a for loop and stored in a list, which I then wrap in nn.ModuleList. I am not sure whether this phenomenon really exists.
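
A minimal sketch of such a setup, for illustration only: the class name DynamicDepthNet, the layer sizes, and the depth argument are assumptions, not the actual model from this thread. It shows that the layers beyond the chosen depth are registered but end up with a gradient of None after backward().

```python
import torch
import torch.nn as nn


class DynamicDepthNet(nn.Module):
    """Hypothetical model: max_depth conv layers, only the first `depth` are used."""

    def __init__(self, max_depth=4, channels=8):
        super().__init__()
        # Build the layers in a loop, then wrap the list so they are registered.
        layers = [
            nn.Conv2d(3 if i == 0 else channels, channels, 3, 1, 1)
            for i in range(max_depth)
        ]
        self.convs = nn.ModuleList(layers)

    def forward(self, x, depth):
        # Only the first `depth` layers take part in the computation graph.
        for conv in self.convs[:depth]:
            x = torch.relu(conv(x))
        return x


net = DynamicDepthNet()
out = net(torch.randn(2, 3, 16, 16), depth=2)
out.mean().backward()

# Layers that were skipped in forward never receive a gradient.
grads_none = [conv.weight.grad is None for conv in net.convs]
print(grads_none)
```

All four layers still appear in net.state_dict(), which matches the observation that the parameters are registered but unused.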

ptrblck · October 24, 2018, 11:01am

As long as you store the Modules in a ModuleList they should be properly registered.
Are these modules missing in the state_dict?
If so, could you post the model definition?

dem123456789 · October 24, 2018, 11:06am

They are in the state_dict().

ptrblck · October 24, 2018, 11:09am

Sorry! Now I get the issue.
Your parameters are all properly registered but unused.
If I understand you correctly, when unused parameters (never called in forward) are in the model, the performance you observe is worse than with the plain model without the unnecessary parameters. Let me try to debug it.

dem123456789 · October 24, 2018, 11:11am

Thank you so much.
Sorry for the confusion I caused.
I am not sure whether this phenomenon exists or is just an illusion.
I use Adam as the optimizer and PyTorch 0.4.1.

ptrblck · October 24, 2018, 2:14pm

I created a small code snippet comparing two models. One model uses all its modules, while the other one has some unused modules.
The code passes for 0.4.1 and 1.0.0.dev20181014.
You can find the code here.

Let me know if you can spot any differences between your implementation and mine.

dem123456789 · October 24, 2018, 4:22pm

I modified your code a little, and I can reproduce the error with nn.ModuleList().
I also tried SGD, and it shows the same error.
Can you help me verify this?
You can find the code here

ptrblck · October 24, 2018, 4:46pm

Thanks for the code update!
The difference in your implementation is due to the order in which the layers are instantiated. Since you are creating the unused conv layers before the linear layer, the PRNG receives additional calls and the weights of the linear layer will differ.
If you change the __init__ method of MyModelUnused to

        super(MyModelUnused, self).__init__()
        self.conv_list = nn.ModuleList()
        self.conv_list.append(nn.Conv2d(3, 6, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(6, 12, 3, 1, 1))
        self.pool1 = nn.MaxPool2d(2)
        self.pool2 = nn.MaxPool2d(2)
        self.fc = nn.Linear(12*6*6, 2)
        self.conv_list.append(nn.Conv2d(12, 24, 3, 1, 1))
        self.conv_list.append(nn.Conv2d(24, 12, 3, 1, 1))

, you’ll get the same results again.
Alternatively, you could set the seed before instantiating each layer.
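
A minimal sketch of the per-layer seeding idea, under stated assumptions: the helper seeded(), the class names ModelA and ModelB, the seed values, and the layer sizes are all made up for illustration. Each layer is built right after reseeding the global PRNG with its own fixed seed, so its initial weights no longer depend on which layers were created before it.

```python
import torch
import torch.nn as nn


def seeded(seed, make_layer):
    """Hypothetical helper: reseed the global PRNG, then build one layer."""
    torch.manual_seed(seed)
    return make_layer()


class ModelA(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = seeded(1, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        self.fc = seeded(2, lambda: nn.Linear(6 * 6 * 6, 2))


class ModelB(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = seeded(1, lambda: nn.Conv2d(3, 6, 3, 1, 1))
        # Extra (unused) layer created before fc, with its own seed.
        self.extra = seeded(3, lambda: nn.Conv2d(6, 12, 3, 1, 1))
        self.fc = seeded(2, lambda: nn.Linear(6 * 6 * 6, 2))


a, b = ModelA(), ModelB()
same_conv = torch.equal(a.conv.weight, b.conv.weight)
same_fc = torch.equal(a.fc.weight, b.fc.weight) and torch.equal(a.fc.bias, b.fc.bias)
print(same_conv, same_fc)
```

The shared layers start from identical weights even though ModelB creates an additional layer in between.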

dem123456789 · October 24, 2018, 4:59pm

Can you explain what PRNG is? I am still confused. Why does the ordering of the initialization matter?
I have never seen any documentation on the ordering of module initialization. Should I always initialize the ModuleList first? Is there a principled way of doing this so that I can avoid this kind of problem?

ptrblck · October 24, 2018, 5:07pm

Sorry for not being clear enough.
By PRNG I mean the Pseudorandom Number Generator.
The ordering just matters for the sake of debugging, as we are dealing with pseudorandom numbers.

In order to compare the weights and gradients, we should make sure both models have the same parameters.
One way would be to initialize one model and copy the parameters into the other.
Another way is to seed the PRNG for both models and just sample the same “random” numbers.

You can think about seeding the random number generation as setting a start value. All “random” numbers will be the same after setting the same seed:

torch.manual_seed(2809)
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

torch.manual_seed(2809)
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

Although we call torch.randn each time, we get the same sequence of “random” numbers after setting the same seed again.
Now if you add the unused layers before the linear layer, the PRNG will get an additional call to sample the parameters of these layers, which will influence the linear layer parameters.
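
As a small illustration of this effect (the layer sizes here are arbitrary and not taken from the gist), the same seed gives the same linear weights only if nothing else draws from the PRNG in between:

```python
import torch
import torch.nn as nn

# Linear layer created directly after seeding.
torch.manual_seed(2809)
fc_first = nn.Linear(8, 2)

# Same seed, but a conv layer is created first and consumes random numbers.
torch.manual_seed(2809)
extra = nn.Conv2d(3, 6, 3, 1, 1)
fc_after = nn.Linear(8, 2)

# Same seed again, with the original creation order.
torch.manual_seed(2809)
fc_again = nn.Linear(8, 2)

differs = not torch.equal(fc_first.weight, fc_after.weight)
repeats = torch.equal(fc_first.weight, fc_again.weight)
print(differs, repeats)
```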

Usually, you don’t have to think about these issues. As I said, it’s just to debug your issue.

dem123456789 · October 24, 2018, 5:13pm

Thank you so much.
I moved the module_list after all other modules and it works.
If I understand it correctly, since this only affects the PRNG, it should not create a performance issue. However, the performance will differ slightly for each run because of the PRNG, even if we seed in advance.

ptrblck · October 25, 2018, 8:47am

Are you getting the same or comparable results now?
I still think the unused parameters are no problem for the optimizer and your results should be comparable.
Note that I performed the tests on CPU and seeded unusually often just for the sake of debugging.
If you don’t properly seed or use the GPU with non-deterministic operations, you will get slight differences.
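
A minimal sketch of that claim, assuming a reasonably recent PyTorch build; the toy model ToyUnused and its sizes are made up for illustration. A parameter whose .grad is None after backward() is skipped by Adam's step(), so it is not modified.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyUnused(nn.Module):
    """Hypothetical toy model: `unused` is registered but never called in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)


model = ToyUnused()
used_before = model.used.weight.detach().clone()
unused_before = model.unused.weight.detach().clone()

# The optimizer receives all parameters, including the unused ones.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()

grad_is_none = model.unused.weight.grad is None
optimizer.step()

used_changed = not torch.equal(used_before, model.used.weight)
unused_changed = not torch.equal(unused_before, model.unused.weight)
print(grad_is_none, used_changed, unused_changed)
```

Only the used layer is updated; the unused layer keeps its initial weights.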

dem123456789 · October 25, 2018, 10:59am

After I moved conv_list behind all other modules, it passes the test.
However, I am not sure this is valid in other cases.
For example, I have an encoder and a decoder, and either of them can have variable depth. Then I have a large wrapper module for the encoder and decoder. Even if I move the initialization of conv_list to the end inside the encoder and the decoder themselves, it might still cause problems because of the wrapper module.

Could you give me an example of seeding before instantiating each layer?

I understand that I will get differences, but I am wondering whether this effect degrades the performance in general. If it only affects debugging, it might not be a big issue.

ptrblck · October 25, 2018, 11:53am

I think the seeding approach is getting cumbersome for more complex use cases.
Let’s just initialize the model with unused modules and load its parameters into the “complete” model.
Could you add this code to the gist and compare the results again?

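def copy_params(modelA, modelB):
    # Copy every state_dict entry of modelB that also exists in modelA into modelA
    modelA_dict = modelA.state_dict()
    modelB_dict = modelB.state_dict()
    # Keep only the keys both models share (modelB's extra, unused layers are dropped)
    equal_dict = {k: v for k, v in modelB_dict.items() if k in modelA_dict}
    modelA.load_state_dict(equal_dict)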



modelA = MyModel()
modelB = MyModelUnused()

copy_params(modelA, modelB)

# Check weights for equality
check_params(modelA, modelB)
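
The definition of check_params lives in the gist and is not shown in this thread. The version below is only a hypothetical sketch of what such a check could look like; the stand-in classes Small and SmallUnused are also assumptions. It compares every state_dict entry the two models share by name.

```python
import torch
import torch.nn as nn


def check_params(modelA, modelB):
    """Hypothetical check: return the shared keys and those whose tensors differ."""
    dictA, dictB = modelA.state_dict(), modelB.state_dict()
    shared = [k for k in dictA if k in dictB]
    mismatched = [k for k in shared if not torch.equal(dictA[k], dictB[k])]
    return shared, mismatched


# Stand-in models, not the ones from the gist.
class Small(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)


class SmallUnused(nn.Module):
    def __init__(self):
        super().__init__()
        self.extra = nn.Linear(4, 4)  # never used, created before fc
        self.fc = nn.Linear(4, 2)


small, small_unused = Small(), SmallUnused()
_, before = check_params(small, small_unused)

# Same idea as copy_params: load only the entries both models share.
small_keys = small.state_dict().keys()
small.load_state_dict(
    {k: v for k, v in small_unused.state_dict().items() if k in small_keys}
)
shared, after = check_params(small, small_unused)
print(shared, before, after)
```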

dem123456789 · October 26, 2018, 9:37am

The results are the same if we copy the parameters.
I found another problem. If I have two module_lists, each initialized with a different method, moving the module_lists behind the fc layer still gives errors.
You can see the code here

ptrblck · October 26, 2018, 11:52am

I’ve checked your code, and it seems you are copying the parameters between the models and then applying the seeding approach afterwards. Just remove the second approach, as it’s no longer a good fit for the way your models are constructed.
Remove these lines and you’ll get the same results:

torch.manual_seed(2809)
modelA = MyModel()
torch.manual_seed(2809)
modelB = MyModelUnused()

dem123456789 · October 26, 2018, 12:05pm

So if I do not manually seed, there will not be any problems?

ptrblck · October 26, 2018, 12:11pm

No, the manual seed is not the issue. I just used it in my first example to show that the optimizer does not have any problems optimizing a model with unused parameters.
Even if we copy all parameters between models, the optimizer works identically.

So back to your original question. The discrepancy you are observing is not due to some unused parameters in your model. However, if your whole training procedure is very sensitive to the initialization, and therefore to the seeding as well, you might get these results.

To debug the problem, I would suggest trying the following:

1. Compare the results of your “good” model using several random seeds at the beginning of your script. If the accuracy stays approximately the same, it should be alright. If you see an accuracy drop for different seeds, I would suggest using some weight init functions to see if we can stabilize the performance.
2. Create your good model and save the state_dict after initializing the model. Create another script using your model containing unused layers and load the state_dict for the common layers. Then train this model and see how the performance compares to the initial model.
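
The second suggestion could be sketched roughly as follows, assuming a recent PyTorch version in which load_state_dict(strict=False) returns the missing and unexpected keys. The classes Good and WithUnused and the file name good_init.pt are placeholders, and both steps are shown in one script for brevity.

```python
import os
import tempfile

import torch
import torch.nn as nn


class Good(nn.Module):
    """Placeholder for the "good" model."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


class WithUnused(nn.Module):
    """Placeholder for the model containing an unused layer."""

    def __init__(self):
        super().__init__()
        self.extra = nn.Linear(4, 4)  # registered but never called
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


# Step 1: save the initial state_dict of the good model.
good = Good()
path = os.path.join(tempfile.mkdtemp(), "good_init.pt")
torch.save(good.state_dict(), path)

# Step 2: load it into the model with unused layers; strict=False tolerates
# the keys that exist only in the larger model.
candidate = WithUnused()
result = candidate.load_state_dict(torch.load(path), strict=False)
missing = sorted(result.missing_keys)

# Both models now start from identical weights in the common layers.
x = torch.randn(3, 4)
same_output = torch.equal(good(x), candidate(x))
print(missing, same_output)
```

From here, both models can be trained with the same data order and optimizer settings and their results compared.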

I still doubt that the optimizer is causing this issue.