Model isn't learning during training

I am building a version of AlphaZero and am having the problem that my model never learns. model.training is True, and the loss is calculated from the NN outputs without detaching them.

The model weights never change their values while the model is being trained.

I’m not sure what the error is.

def train_model(self):
    '''
    We calculate and store the loss between the MCTS and NN values/policies.
    Also, the model is trained according to the loss calculated.
    '''
    value_loss = torch.mean((self.values["MCTS"] - self.values["NN"])**2)
    policy_loss = - sum([torch.dot(self.policies["MCTS"][i,:], torch.log(self.policies["NN"][i,:])) for i in range(self.episodes)]) / self.episodes
    total_loss = value_loss + policy_loss
                                                
    self.optimizer.zero_grad()
    total_loss.backward()
    self.optimizer.step()
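
One sanity check I can add right after total_loss.backward() (just a sketch, assuming self.model is the network whose parameters the optimizer updates) is printing whether gradients actually reach the parameters:

# Sketch: print the largest gradient magnitude per parameter (None means no gradient arrived)
for name, param in self.model.named_parameters():
    print(name, None if param.grad is None else param.grad.abs().max().item())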

For reference, this is the model, where channels = 1, filters = 64, layers = 4, board_squares = 9, value_dense = 64, and policy_dense = 9.

class ResNet(nn.Module):
    def __init__(self, channels=config.CHANNELS, filters=config.FILTERS, board_size=config.BOARD_SQUARES, policy_dense=config.POLICY_DENSE, value_dense=config.VALUE_DENSE):
        super(ResNet, self).__init__()
        self.conv0 = nn.Conv2d(channels, filters, kernel_size=3, stride=1, padding=1)
        self.conv1 = nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(filters, 1, kernel_size=1, stride=1)
        self.conv3 = nn.Conv2d(filters, 2, kernel_size=1, stride=1)
        self.relu = nn.ReLU()
        self.batch_norm1 = nn.BatchNorm2d(filters)
        self.fc1 = nn.Linear(board_size * 1, value_dense)
        self.fc2 = nn.Linear(value_dense, 1)
        self.fc3 = nn.Linear(board_size * 2, policy_dense)
        print("Model is initialised.")

    def relu_bn(self, x):
        x = self.relu(x)
        x = self.batch_norm1(x)
        return x

    def residual_block(self, x):
        y = x
        x = self.conv1(x)
        x = self.relu_bn(x)
        x = self.conv1(x)
        x += y
        x = self.relu_bn(x)
        return x

    def convolution_block(self, x):
        x = self.conv0(x)
        x = self.relu_bn(x)
        return x

    def value_head(self, x):
        x = self.conv2(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = (x - torch.mean(x)) / torch.var(x, unbiased=False)
        x = self.fc2(x)
        a = nn.Tanh()
        x = a(x)
        return x

    def policy_head(self, x):
        x = self.conv3(x)
        x = torch.flatten(x, 1)
        x = self.fc3(x)
        a = nn.Softmax(dim=-1)
        x = a(x)
        return x

    def forward(self, x):
        x = self.convolution_block(x)
        for _ in range(config.LAYERS):
            x = self.residual_block(x)
        vh = self.value_head(x)
        ph = self.policy_head(x)
        return vh, ph
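
To make the shapes concrete, a dummy forward pass like this sketch (assuming a 3x3 board, so board_squares = 9, and channels = 1) should return an [N, 1] value and an [N, 9] policy:

import torch

model = ResNet()
dummy = torch.randn(4, 1, 3, 3)
value, policy = model(dummy)
print(value.shape, policy.shape)  # expecting torch.Size([4, 1]) and torch.Size([4, 9])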

Could you explain why you are detaching this activation in your residual blocks here:

x += y.detach()

This usage would mean that e.g. the internal conv1 layer won’t be trained.
Could this explain the issue you are seeing?
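
As a small toy sketch (plain tensors, not your model) of what detaching a branch does: the detached branch won't contribute any gradient to the tensors that created it.

import torch

# Toy sketch: gradients only flow through the non-detached branch
y = torch.randn(3, requires_grad=True)
x = 2 * y
out = (x + y.detach()).sum()
out.backward()
print(y.grad)  # tensor([2., 2., 2.]) instead of tensor([3., 3., 3.])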

Thanks for spotting this. I’ve changed the model (edited in the post); however, my model still isn’t learning. The model is trained using the loop below:

conv_layer = list(self.model.parameters())
for _ in range(5):
    self.simulation()
    self.mcts_values()
    train_model(self) # The model is trained in this fn
    print(list(self.model.parameters()) == conv_layer)
    print("")

This just prints “True” 5 times. I’m assuming this means that the model weights aren’t changing? The learning rate I’ve set is 0.01 with an Adam optimizer:

self.optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

No, since you are storing references to the parameters instead of their actual values.
To compare parameters you would need to clone them, as seen in this example:

model = nn.Conv2d(3, 3, 3)

conv_layer = list(model.parameters())
w0 = model.weight.clone()
optimizer = torch.optim.Adam(model.parameters(), lr=1.)

for _ in range(5):
    optimizer.zero_grad()
    out = model(torch.randn(1, 3, 224, 224))
    out.mean().backward()
    optimizer.step()
    print(list(model.parameters()) == conv_layer)
    print((model.weight - w0).abs().max())
    
# output
# True
# tensor(1.0000, grad_fn=<MaxBackward1>)
# True
# tensor(2.0012, grad_fn=<MaxBackward1>)
# True
# tensor(2.8943, grad_fn=<MaxBackward1>)
# True
# tensor(2.9914, grad_fn=<MaxBackward1>)
# True
# tensor(2.9810, grad_fn=<MaxBackward1>)
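
If you want to check all parameters of your ResNet instead of a single layer, a sketch like this should also work:

# Sketch: snapshot every parameter, run the training step(s), then check which ones changed
params_before = [p.detach().clone() for p in model.parameters()]
# ... run your training step(s) here ...
changed = [not torch.equal(p, p0) for p, p0 in zip(model.parameters(), params_before)]
print(changed)  # expect True entries once the optimizer updates the model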

When I try this

conv_layer = list(self.model.parameters())
w0 = self.model.weight.clone()
for _ in range(5):
    self.simulation()
    self.mcts_values()
    train_model(self) # The model is trained in this fn
    print(list(self.model.parameters()) == conv_layer)
    print((self.model.weight - w0).abs().max())
    print("")

I get back: “AttributeError: ‘ResNet’ object has no attribute ‘weight’”

My minimal code snippet uses a single nn.Conv2d layer (model = nn.Conv2d(3, 3, 3)), so you would need to adapt the code and access a valid parameter via a registered module in your ResNet.
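
E.g. something like this sketch should work with your model:

# Sketch: clone the weight of a registered submodule before training ...
w0 = self.model.conv1.weight.detach().clone()
# ... and compare against it after the training iterations
print((self.model.conv1.weight - w0).abs().max())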

I have done that with the following code:

conv_layer = list(self.model.parameters())
w0 = self.model.conv1.weight.data[0][0].clone()
print(w0, end="\n\n")
for _ in range(5):
    self.simulation()
    self.mcts_values()
    train_model(self) # The model is trained in this fn
    print(list(self.model.parameters()) == conv_layer)
    print((self.model.conv1.weight.data[0][0] - w0).abs().max())
    print("")
print(self.model.conv1.weight.data[0][0])

and I’m getting back:

tensor([[ 0.0026,  0.0127, -0.0113],
        [ 0.0270, -0.0065, -0.0003],
        [ 0.0280, -0.0367,  0.0033]])

True
tensor(0.)

True
tensor(0.)

True
tensor(0.)

True
tensor(0.)

True
tensor(0.)

tensor([[ 0.0026,  0.0127, -0.0113],
        [ 0.0270, -0.0065, -0.0003],
        [ 0.0280, -0.0367,  0.0033]])

P.S. I have followed your lead and set the LR to 1.

Could you post a minimal and executable code snippet reproducing this issue, please?
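
Something along these lines would work (an untested skeleton, assuming the ResNet definition and config values from your post; the random inputs and targets are just placeholders):

import torch

model = ResNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

boards = torch.randn(8, 1, 3, 3)                        # fake input batch (1 channel, 3x3 board)
target_values = torch.randn(8, 1)                       # fake MCTS values
target_policies = torch.softmax(torch.randn(8, 9), -1)  # fake MCTS policies

w0 = model.conv0.weight.detach().clone()

values, policies = model(boards)
value_loss = torch.mean((target_values - values) ** 2)
policy_loss = -(target_policies * torch.log(policies)).sum(dim=1).mean()
total_loss = value_loss + policy_loss

optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print((model.conv0.weight - w0).abs().max())  # should be non-zero after the step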