An nn.Module is the class meant to hold your NN model, including its parameters. However, only the parameters actually used during the forward pass will accumulate gradients.
Usually you'll define the __init__ method, where you register the parameters and buffers and define the sub-modules; all of these can then be accessed as attributes during the forward pass.
After a forward call produces an output tensor, you compute a loss from that output and finally perform the backward pass. Gradients are computed during the backward pass by the autograd mechanism, based on the computational graph associated with the loss (see this tutorial for more information on this mechanism). Every tensor that requires gradients (such as the model parameters) and that was used during the forward pass accumulates a gradient. When you call the step method of your optimizer, the parameters tracked by the optimizer are updated based on those accumulated gradients.
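To see that accumulation in isolation before the full module example, here is a minimal sketch with a bare tensor (the variable names are mine, not part of the example below):

import torch

w = torch.randn(3, requires_grad=True)  # a tensor that requires gradients
y = (w * 2).sum()                       # "forward": w is used in the computation
y.backward()                            # backward pass: gradients flow back to w
print(w.grad)                           # tensor([2., 2., 2.])

y2 = (w * 2).sum()
y2.backward()
print(w.grad)                           # tensor([4., 4., 4.]) -- gradients accumulate until you zero them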
Here is a little example:
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor, not registered -- you should avoid doing this.
        self.tensor = torch.randn(1)
        # Registered tensor (a.k.a. buffer)
        self.register_buffer('registered_buffer', torch.randn(1))
        # Registered parameter (nn.Parameter attributes are registered automatically)
        self.parameter = nn.Parameter(torch.randn(1))
        # Sub-module (convolution layer)
        self.conv = nn.Conv1d(1, 1, 1)
        # Registered parameter that is never used in forward
        self.parameter_unused = nn.Parameter(torch.randn(1))

    def forward(self, x):
        # registered buffer used in forward
        x = x + self.registered_buffer
        # registered parameter used in forward
        x = x + self.parameter
        # sub-module used in forward
        x = self.conv(x)
        return x
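If you instantiate this module, you can check what actually got registered. A quick sketch (the printed names correspond to the attributes above):

m = MyModule()
print([name for name, _ in m.named_parameters()])
# ['parameter', 'parameter_unused', 'conv.weight', 'conv.bias']
print([name for name, _ in m.named_buffers()])
# ['registered_buffer']
print(list(m.state_dict().keys()))
# ['parameter', 'parameter_unused', 'registered_buffer', 'conv.weight', 'conv.bias']
# Note: 'tensor' appears nowhere -- it is just a plain attribute.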
Now, if you do a training iteration as follows:
#### INITIALIZATION
from torch import optim

# Model init
model = MyModule()
# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)

#### TRAINING ITERATION
# Zero the gradients (to do at the beginning of every training iteration)
optimizer.zero_grad()
# Get input data and target
dummy_input = torch.randn(1, 1, 1)
dummy_target = torch.randn(1, 1, 1)
# Get model prediction
model_out = model(dummy_input)
# Loss computation
loss = (model_out - dummy_target) ** 2
loss = loss.sum()
# Perform backward pass
loss.backward()
# Update parameters
optimizer.step()
Then only the tensors (including those of the sub-modules) that were used during the forward pass and that require gradients get a gradient during the backward pass, and only those are updated during the step call. So, in this example:
- model.tensor won't get any gradient, obviously.
- model.registered_buffer will not get any gradient, because a buffer doesn't require gradients (by default).
- model.parameter will get a gradient and be updated, as parameters require gradients (by default) and this one is used during the forward pass.
- model.parameter_unused won't get any gradient, because it was not used during the forward pass.
- model.conv will get gradients and be updated, because this sub-module is used in forward; the parameters it holds and that get updated are model.conv.weight and model.conv.bias.
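You can verify all of this by inspecting the .grad attributes after loss.backward() (a small sketch continuing the training snippet above):

print(model.tensor.grad)             # None -- plain tensor, no gradient tracking
print(model.registered_buffer.grad)  # None -- buffers don't require gradients by default
print(model.parameter.grad)          # a tensor -- used in forward, so it accumulated a gradient
print(model.parameter_unused.grad)   # None -- registered, but never used in forward
print(model.conv.weight.grad)        # a tensor -- the conv sub-module was used in forward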
One last thing: you can see I commented that you should not use an unregistered tensor, as I did in this example with self.tensor = torch.randn(1) in the model constructor. If you use an unregistered tensor you will run into trouble later, for instance when saving/loading your parameters or when changing the dtype or device of your module, because only registered buffers, parameters, and sub-modules are tracked.
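For instance (a small sketch of what I mean, using the module above):

model = MyModule()
# The plain tensor is missing from the state_dict, so it is neither saved nor loaded:
print('tensor' in model.state_dict())             # False
print('registered_buffer' in model.state_dict())  # True

# Casting (or moving) the module only affects registered parameters, buffers and sub-modules:
model.double()
print(model.parameter.dtype)          # torch.float64
print(model.registered_buffer.dtype)  # torch.float64
print(model.tensor.dtype)             # torch.float32 -- left behind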
That was a long answer, hope you don’t get more confused ^^