How to interpret the ".grad" tensor in the optimizer

Using a simple example, after initializing the model:

import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim


torch.manual_seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)

# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
print('Inputs Dim:', input_dim) # i.e. input_dim = 2

num_data, output_dim = Y.shape
print('Output Dim:', output_dim) 
print('No. of Data:', num_data) # i.e. num_data = 4

hidden_dim = 5
learning_rate= 0.3
model = nn.Sequential(
            # Use nn.Linear to get our simple perceptron.
            nn.Linear(input_dim, hidden_dim),
            # Use nn.Sigmoid to get our sigmoid non-linearity.
            nn.Sigmoid(),
            # Second layer neurons.
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.L1Loss()

Before the first backward pass, the parameters in the optimizer's param_groups don't have any .grad tensors yet, e.g. this returns None:

optimizer.param_groups[0]['params'][0].grad
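
As a quick cross-check (a sketch using the model defined above; model[0] is the first nn.Linear in the nn.Sequential), reading the same parameter through the model gives the same result:

print(model[0].weight.grad)           # None
print(next(model.parameters()).grad)  # None as well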

After the backward pass:

predictions = model(X)
loss = criterion(predictions, Y)
loss.backward()

optimizer.param_groups[0]['params'][0].grad

Now the parameters in the optimizer's param_groups have populated .grad tensors, e.g. optimizer.param_groups[0]['params'][0].grad now returns:

tensor([[ 0.0002,  0.0002],
        [-0.0005,  0.0003],
        [-0.0000,  0.0000],
        [-0.0000, -0.0002],
        [ 0.0003, -0.0001]])
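
As a side note (a small check using the objects above), the shape of that .grad matches the first Linear layer's weight matrix, hidden_dim x input_dim = 5 x 2, because it is the gradient of the loss with respect to that weight:

first_weight = optimizer.param_groups[0]['params'][0]
print(first_weight.shape)       # torch.Size([5, 2])
print(first_weight.grad.shape)  # torch.Size([5, 2])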

I understand that these values are the ones that are added during the .step() function.

The .grad tensors must come from loss.backward(), but I don't see any interaction between the L1Loss object and the SGD optimizer object, so the tensors returned by model.parameters() must be the ones keeping these backward values.

But how are these values from the .grad tensors obtained?

The two lines predictions = model(X) and loss = criterion(predictions, Y) create a chain of operations linking your model's parameters to the loss applied to the output.
When you call loss.backward(), the autograd engine walks this graph backwards, computes the gradient of the loss with respect to each weight, and populates each weight's .grad attribute.
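
Here is a minimal sketch of the same mechanism on a single scalar weight (the names w, x and y are only illustrative and not part of your example); the absolute-error loss mirrors your nn.L1Loss:

import torch

w = torch.tensor([2.0], requires_grad=True)  # a "weight"
x = torch.tensor([3.0])                      # an input
y = torch.tensor([1.0])                      # a target
loss = (w * x - y).abs().mean()              # same idea as nn.L1Loss

print(w.grad)    # None -- nothing has been computed yet
loss.backward()  # autograd walks the graph from loss back to w
print(w.grad)    # tensor([3.]) == x * sign(w*x - y)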

So these values are added during the .backward() call and NOT the .step() call.

The role of the .step() call is to update the weights according to the optimizer you have defined.
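
To make that concrete, here is a rough check you can run on your example right after loss.backward(); it assumes the plain SGD you defined (no momentum or weight decay) and that the gradients from the backward pass above are still in place, in which case .step() applies w <- w - lr * w.grad:

with torch.no_grad():
    before = [p.detach().clone() for p in model.parameters()]

optimizer.step()  # update the weights in place

with torch.no_grad():
    for old, p in zip(before, model.parameters()):
        manual = old - learning_rate * p.grad  # .grad is left untouched by .step()
        print(torch.allclose(p, manual))       # True for every parameter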

Does that mean that all tensors are passed by reference between model.parameters(), the criterion, and the optimizer?
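
A quick identity check (a sketch using the objects from the example above) illustrates what this question is getting at: the optimizer stores references to the very same Parameter tensors that live inside the model, so gradients written by .backward() are immediately visible to .step():

first_model_param = next(model.parameters())
first_optim_param = optimizer.param_groups[0]['params'][0]
print(first_model_param is first_optim_param)  # True -- same object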