An nn.Module is the class made to hold your NN model, including its parameters. But only the parameters that are actually used during the forward pass will accumulate gradients.
Usually you’ll need to define:

- the __init__ method, where you register the parameters and buffers and define the sub-modules, which can then be accessed during the forward pass through attributes;
- the forward method, which produces the output tensor of the module.

After the forward call, you’ll compute a loss from the output and finally perform the backward pass. Gradients are computed during the backward pass by the autograd mechanism, based on the computational graph associated with the loss (see this tutorial for more information on this mechanism). Every tensor that requires gradients (such as the model parameters) and that was used during the forward pass will accumulate a gradient. When you call the step method of your optimizer, the parameters tracked by the optimizer are updated based on the accumulated gradients.
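To see just the accumulation part with a bare tensor, independent of any module, here is a minimal sketch (only an illustration, not part of the module example below):

import torch

w = torch.randn(3, requires_grad=True)  # a tensor that requires gradients
loss = (w * 2).sum()                    # "forward" computation using w
loss.backward()                         # backward pass populates w.grad
print(w.grad)                           # tensor([2., 2., 2.])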
Here is a little example:
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Tensor not registered, you should avoid doing that
        self.tensor = torch.randn(1)
        # Tensor registered as a buffer
        self.register_buffer("registered_buffer", torch.randn(1))
        # Parameter (parameters are automatically registered)
        self.parameter = nn.Parameter(torch.randn(1))
        # Sub-module (convolution layer)
        self.conv = torch.nn.Conv1d(1, 1, 1)
        # Parameter registered but unused in forward
        self.parameter_unused = nn.Parameter(torch.randn(1))

    def forward(self, x):
        # Registered buffer used in forward
        x = x + self.registered_buffer
        # Registered parameter used in forward
        x = x + self.parameter
        # Sub-module used in forward
        x = self.conv(x)
        return x
Now, if you do a training iteration as follows:
import torch.optim as optim

# Model init
model = MyModule()
# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)

#### TRAINING ITERATION
# Set gradients to zero (to perform at the beginning of every training iteration)
optimizer.zero_grad()
# Get input data and target
dummy_input = torch.randn(1, 1, 1)
dummy_target = torch.randn(1, 1, 1)
# Get model prediction
model_out = model(dummy_input)
# Loss computation
loss = (model_out - dummy_target) ** 2
loss = loss.sum()
# Perform backward pass
loss.backward()
# Update parameters
optimizer.step()
Then only the tensors (and the tensors of the sub-modules) that were used during the forward pass and that require gradients get a gradient during the backward pass and are updated during the step call. So, in this example:
- model.tensor obviously won’t get any gradient;
- model.registered_buffer will not get any gradient, because a buffer doesn’t require gradients (by default);
- model.parameter will get a gradient and be updated, as parameters require gradients (by default) and this one is used during the forward pass;
- model.parameter_unused won’t get any gradient, because it was not used during the forward pass;
- model.conv will get gradients and be updated (its weight and bias), because this sub-module is used during the forward pass and the parameters it registers require gradients.
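You can verify this directly by inspecting the .grad attributes right after loss.backward() (a quick check, not part of the original snippet; the exact values will differ):

print(model.registered_buffer.grad)  # None -> buffers don't require gradients by default
print(model.parameter.grad)          # a tensor -> requires grad and used in forward
print(model.parameter_unused.grad)   # None -> requires grad but unused in forward
print(model.conv.weight.grad)        # a tensor -> the conv sub-module was used in forward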
Last thing: you can see I commented that you should not use an unregistered tensor, as I did in this example with
self.tensor = torch.randn(1) in the model constructor. If you use an unregistered tensor you will run into trouble later, for instance when loading/saving your parameters, or when changing the dtype and device of your module, as only the registered buffers, parameters, and sub-modules are tracked.
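For instance, here is a small sketch of what goes wrong, assuming the MyModule defined above (not part of the original answer):

print(model.state_dict().keys())
# contains 'parameter', 'parameter_unused', 'registered_buffer', 'conv.weight', 'conv.bias'
# but no 'tensor' -> the unregistered tensor is never saved or loaded

model.double()                        # converts registered parameters and buffers to float64
print(model.parameter.dtype)          # torch.float64
print(model.registered_buffer.dtype)  # torch.float64
print(model.tensor.dtype)             # torch.float32 -> the unregistered tensor is left behind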
That was a long answer, hope you don’t get more confused ^^