How does the optimizer work behind the scenes?

In advance, I apologize for the rookieness of my question. I am a bit new to PyTorch.

I have implemented a simple Unet model like this:

import torch
import torch.nn as nn

class Unet(nn.Module):

    def __init__(self):
        super(Unet, self).__init__()

        # Down hill 1
        self.conv1 = nn.Conv3d(1, 2, kernel_size=3, stride=1)
        self.conv2 = nn.Conv3d(2, 2, kernel_size=3, stride=1)

        # Down hill 2
        self.conv3 = nn.Conv3d(2, 4, kernel_size=3, stride=1)
        self.conv4 = nn.Conv3d(4, 4, kernel_size=3, stride=1)

        # Up hill 1
        self.upConv1 = nn.Conv3d(4, 2, kernel_size=3, stride=1)
        self.upConv2 = nn.Conv3d(2, 2, kernel_size=3, stride=1)

        # Up hill 2
        self.upConv3 = nn.Conv3d(2, 1, kernel_size=3, stride=1)
        self.upConv4 = nn.Conv3d(1, 1, kernel_size=3, stride=1)

        self.mp = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)
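
    # The original post omits forward(); this is a hypothetical minimal
    # version (my assumption, and note it has no skip connections, so it
    # is not a true U-Net) added only so the later snippets can run:
    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.mp(torch.relu(self.conv2(x)))
        x = torch.relu(self.conv3(x))
        x = self.mp(torch.relu(self.conv4(x)))
        x = torch.relu(self.upConv1(x))
        x = torch.relu(self.upConv2(x))
        x = torch.relu(self.upConv3(x))
        return self.upConv4(x)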

From this I get the impression that I have 8 sets of weights, one set for each conv layer:

conv1.weight.data
conv2.weight.data
....
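
As a quick sanity check (a minimal sketch, assuming the Unet class above has been instantiated as unet), you can enumerate all registered parameters. Note that each conv layer also registers a bias tensor by default, so there are 16 parameter tensors in total:

unet = Unet()
for name, param in unet.named_parameters():
    # prints e.g. conv1.weight (2, 1, 3, 3, 3) and conv1.bias (2,)
    print(name, tuple(param.shape))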

The documentation tells me I can update these weights through optim:

import torch.optim as optim

optimizer = optim.Adam(unet.parameters(), lr=0.1)

optimizer.zero_grad()           # clear any gradients left over from a previous step
pred = unet(x)                  # calling the module runs forward()
loss = MyLossFunction(pred, y)
loss.backward()                 # compute d(loss)/d(param) for every parameter
optimizer.step()                # update the parameters from their gradients

For some reason I find it hard to believe that this approach will update all 8 sets of weights correctly.

Theoretically, the weights need to be optimized using the derivative of the loss function with respect to each weight. I can't seem to find any connection between the loss function and the optimizer in that code, so how on earth can the optimizer figure out the derivative? The optimizer only seems to have knowledge of the model parameters, not of the loss function.

This is correct! The optimizer does not need to know anything more about the model, only which parameters it should update using their gradients.
The gradients, on the other hand, are computed during the loss.backward() call.
Since the loss was calculated from your input, the model, and the target, autograd has recorded the whole chain of operations during the forward pass, and the backward call uses this recorded graph to compute the gradient of the loss with respect to every parameter.
backward() stores each gradient in the corresponding parameter's .grad attribute, and optimizer.step() then just updates every parameter it was given using its update formula.
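
To make that connection visible, here is a minimal sketch (assuming unet, x, y, optimizer, and MyLossFunction from above). The gradients live on the parameters themselves, which is exactly where the optimizer looks; conceptually, a plain SGD step reads them as shown below (Adam applies a more elaborate formula, but it reads the same .grad attributes):

optimizer.zero_grad()
loss = MyLossFunction(unet(x), y)

print(unet.conv1.weight.grad)        # None before the first backward call
loss.backward()
print(unet.conv1.weight.grad.shape)  # torch.Size([2, 1, 3, 3, 3]): filled in by autograd

# What optimizer.step() does conceptually, illustrated with plain SGD
# (Adam additionally keeps running statistics, but the data flow is the same):
with torch.no_grad():
    for param in unet.parameters():
        param -= 0.1 * param.grad    # lr * gradient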