Understanding nn.Module.parameters()

I am reading in the book Deep Learning with PyTorch that calling the nn.Module.parameters() method will collect the parameters of the submodules defined in the module’s __init__ constructor. To understand and help visualize the process, I would like to use an ensemble example from ptrblck:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB, nb_classes=10):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        # Remove last linear layer
        self.modelA.fc = nn.Identity()
        self.modelB.fc = nn.Identity()
        
        # Create new classifier
        self.classifier = nn.Linear(2048+512, nb_classes)
        
    def forward(self, x):
        x1 = self.modelA(x.clone())  # clone to make sure x is not changed by inplace methods
        x1 = x1.view(x1.size(0), -1)
        x2 = self.modelB(x)
        x2 = x2.view(x2.size(0), -1)
        x = torch.cat((x1, x2), dim=1)
        
        x = self.classifier(F.relu(x))
        return x

In this nn.Module, both self.modelA = modelA and self.modelB = modelB are assigned in the __init__ constructor. Therefore, calling MyEnsemble.parameters() would return the parameters of MyEnsemble, modelA, and modelB, and autograd would calculate gradients w.r.t. all of them?

Unless, of course, requires_grad=False is set for the parameters of self.modelA and self.modelB, in which case autograd would not calculate gradients w.r.t. the parameters of these models and only w.r.t. those of MyEnsemble?

Is this thinking correct? Please correct my language if it is off. I am trying to learn as best I can and all help is appreciated.

David A.

Yes, you are correct that the gradients won’t be calculated for parameters that use requires_grad=False. However, model.parameters() would still return all parameters if you are not filtering them out.
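To make this concrete, here is a minimal sketch (the Parent class and its small Linear submodules are illustrative, not from the thread) showing that parameters() walks the whole module tree, and that freezing a submodule only stops gradient computation without removing its parameters from the iterator:

```python
import torch
import torch.nn as nn

# Minimal sketch: a parent module with two submodules assigned in __init__.
class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.modelA = nn.Linear(4, 3)      # registered as a submodule
        self.modelB = nn.Linear(4, 3)
        self.classifier = nn.Linear(6, 2)

model = Parent()

# parameters() recurses into submodules:
# 3 Linear layers x (weight + bias) = 6 parameter tensors.
print(len(list(model.parameters())))  # 6

# Freezing modelA does not remove its parameters from parameters();
# it only stops autograd from computing gradients for them.
for p in model.modelA.parameters():
    p.requires_grad = False

print(len(list(model.parameters())))  # still 6
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # 4
```

Filtering as in the last line is what you would pass to the optimizer if you only want to train the unfrozen parameters.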


Ok. Thank you for the clarification.

So if I call Module.parameters().grad, I will be able to see the gradients? And I should call optimizer.zero_grad() each epoch to clear the gradients so they don’t accumulate? (It doesn’t matter where you call zero_grad(), right?)

EDIT: Also, what do you think is the area of PyTorch most people have trouble with? My focus is Computer Vision. Is there a concept/area in PyTorch that I can focus my energy on where you think people commonly make mistakes?

Thanks

Module.parameters().grad won’t work directly, since parameters() returns a generator; you would have to create a list or iterate over it:

for param in Module.parameters():
    print(param.grad)

The usual workflow is to zero out the gradients after each iteration, if you don’t want to accumulate the gradients.

As long as zero_grad() is not called after the backward() call and before optimizer.step(), it should be OK.

Do you mean here in the discussion board or in the general framework usage?

That makes sense.

In the general framework usage. I see you are very knowledgeable about PyTorch. Where would you recommend someone focus their energy when learning PyTorch? Any areas you see more mistakes than others that I could focus on?

Thank you for taking the time for the responses.


When you are trying to learn PyTorch, I would suggest picking an interesting (and personal) project you could spend some time on. E.g. if you are interested in photography and would like to experiment with some style transfer approaches, this would be a great way to learn more about GANs and other architectures. You would learn the framework just by working on the project. :wink:

On the other hand, if you would like to contribute to PyTorch, I would recommend having a look at the usability / simple-fixes or misc category here.
Also, the good first issue label is useful for finding some starter PRs. The Contribution Guide is a good way to get started.

Generally, I would say that a lot of new users have some trouble with language models, e.g. how the shapes are used in RNNs, and where and when to detach() the activations.
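As a rough sketch of both points (the sizes and the chunked loop here are made up for illustration), a truncated-BPTT-style loop detaches the hidden state between chunks so backprop doesn’t reach into earlier ones:

```python
import torch
import torch.nn as nn

# Illustrative RNN; sizes are arbitrary.
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

h = torch.zeros(1, 4, 20)  # hidden state: (num_layers, batch, hidden_size)
for _ in range(3):         # loop over chunks of a long sequence
    x = torch.randn(4, 5, 10)   # input: (batch, seq_len, input_size)
    out, h = rnn(x, h)          # out: (batch, seq_len, hidden_size)
    h = h.detach()  # cut the graph so backprop stops at this chunk boundary

print(out.shape)  # torch.Size([4, 5, 20])
```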
