Confusion regarding trainable layer parameters

Hi all,

I want to pre-train some layers with an autoencoder for my task. The forward pass is described below:

def forward(self, InputData):

    ConvTensor = self.Conv1D_L1(InputData)
    ReLUTensor = self.LeakyReLU_L2(ConvTensor)
    BatchNormTensor = self.BatchNorm1D_L3(ReLUTensor)
    MaxPooledTensor, Indices = self.MaxPool1D_L4(BatchNormTensor)
    MaxUnPooledTensor = self.MaxUnPool1D_L5(MaxPooledTensor, Indices)
    Output = self.Conv1DTransposed_L6(MaxUnPooledTensor)

    return Output

I am only interested in the first four layers. To my knowledge, only layer 1 and layer 3 have trainable parameters, so I should only need to extract those; however, I get 6 tensors when I call Model.parameters(). I have consulted the documentation regarding trainable parameters for the other layers, but I couldn't understand it and have yet to find any relevant information. Hence my questions:

  1. When I extract the parameters of the trained layers, should I extract the parameters of only layers 1 and 3 or should I extract the parameters of layers 1-4?

  2. Would it be sufficient to just use Layer.parameters() or should I separately extract the weights and biases using Layer.weight and Layer.bias?

If I try to recreate your model with plain PyTorch modules, I get the expected parameters from layers 0, 2, and 5 (the Conv1d, BatchNorm1d, and ConvTranspose1d layers):

model = nn.Sequential(
    nn.Conv1d(1, 1, 1),
    nn.LeakyReLU(),
    nn.BatchNorm1d(1),
    nn.MaxPool1d(2),
    nn.MaxUnpool1d(2),
    nn.ConvTranspose1d(1, 1, 1))

dict(model.named_parameters())
> {'0.weight': Parameter containing:
 tensor([[[0.1611]]], requires_grad=True),
 '0.bias': Parameter containing:
 tensor([-0.0869], requires_grad=True),
 '2.weight': Parameter containing:
 tensor([1.], requires_grad=True),
 '2.bias': Parameter containing:
 tensor([0.], requires_grad=True),
 '5.weight': Parameter containing:
 tensor([[[-0.5314]]], requires_grad=True),
 '5.bias': Parameter containing:
 tensor([0.6917], requires_grad=True)}

so I assume your custom implementations might use additional parameters.
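
As a quick check on your side, you could print the parameter names directly on your model instance (assuming it is called Model) to see which layers the 6 tensors belong to:

print([name for name, param in Model.named_parameters()])

With the standard modules and their default settings this should show a weight and a bias each for the Conv1d, BatchNorm1d, and ConvTranspose1d layers, i.e. 6 tensors in total.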

Hi ptrblck,

Thanks for the help. I was using list(Model.parameters()) to get the parameters, but it gave 6 tensors, hence my confusion as to whether the ReLU or the MaxPooling layers had trainable parameters. I tried your method as well as Model.state_dict() and was able to get the parameters for the Conv1D, BatchNorm and Conv1DTransposed layers; however, when I use Model.state_dict(), the tensors have their requires_grad attribute set to False. If I load those layers, can I load them as-is, or do I still have to load them explicitly inside no_grad() together with code that sets their requires_grad attribute to False?

This is expected, as you would use the state_dict to initialize a model, thus these tensors don’t store any Autograd history.

You can just load the state_dict via model.load_state_dict(state_dict) without the no_grad() guard.
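
A minimal sketch (the file name is just a placeholder and Net stands for your model class):

state_dict = torch.load("pretrained_layers.pth")  # or the dict you kept in memory
model = Net()
model.load_state_dict(state_dict)  # no no_grad() guard needed

# if the state_dict only contains a subset of the layers, you can ignore the missing keys:
# model.load_state_dict(state_dict, strict=False)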

This is expected, as you would use the state_dict to initialize a model, thus these tensors don’t store any Autograd history.

Would these tensors then have their requires_grad flag set to True after calling Model.train() prior to starting the training loop for the main model? It seems that calling Model.train() would set the requires_grad flag for all the parameters to True.

I’ve seen the autograd mechanics documentation as well as several other posts regarding freezing layer weights but all of them only mention either setting the requires_grad flag for the parameters that you wish to freeze to False, or inputting only the parameters that you wish to train into your optimizer instance.
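
For reference, the two patterns I've seen are roughly as follows (a sketch using the layer names from my autoencoder's forward pass above; the learning rate is just an example):

# 1) freeze the pretrained layer by flag
for param in Model.Conv1D_L1.parameters():
    param.requires_grad = False

# 2) or only pass the trainable parameters to the optimiser instance
Optimiser = torch.optim.Adam(
    (param for param in Model.parameters() if param.requires_grad), lr = 0.001)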

Loading the state_dict does not change the requires_grad attribute of the model parameters and they will keep their old value:

model = nn.Linear(1, 1)
for name, param in model.named_parameters():
    print(name, param.requires_grad)
> weight True
bias True

sd = model.state_dict()
model.load_state_dict(sd)
for name, param in model.named_parameters():
    print(name, param.requires_grad)
> weight True
bias True
    
for param in model.parameters():
    param.requires_grad = False
for name, param in model.named_parameters():
    print(name, param.requires_grad)
> weight False
bias False

model.load_state_dict(sd)
for name, param in model.named_parameters():
    print(name, param.requires_grad)
> weight False
bias False

Also, note that model.train() and model.eval() do not change the requires_grad attribute and thus do not freeze parameters.
It changes the behavior of some layers, such as dropout (which will be disabled during eval()) and batchnorm (which uses the running stats instead of the input batch stats).
To freeze/unfreeze parameters you would need to change the requires_grad attribute.
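
As a quick sanity check (a minimal sketch):

model = nn.BatchNorm1d(3)
model.eval()
print([param.requires_grad for param in model.parameters()])
> [True, True]

model.train()
print([param.requires_grad for param in model.parameters()])
> [True, True]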

But when I have a model as such:

class Net(torch.nn.Module):

    def __init__(self):

        super().__init__()

        self.Linear_L1 = torch.nn.Linear(10, 5)

    def forward(self, InputData):

        Output = self.Linear_L1(InputData)

        return Output

Model = Net()

Calling Model.state_dict() returns tensors with the requires_grad flag set to False by default. When I pass it to my training loop:

LossMetric = torch.nn.MSELoss()
Optimiser = torch.optim.Adam(Model.parameters(), lr = LearningRate)

Model.train()

for Epoch in range(NumberOfEpochs):

    for Iteration, Data in enumerate(TrainLoader):

        TrainX, TrainY = Data

        Optimiser.zero_grad()

        Output = Model(TrainX)

        Loss = LossMetric(Output, TrainY)

        Loss.backward()
        Optimiser.step() 

I am able to train the model and can observe the loss decreasing. Hence, would it be safe to assume that the instance of Optimiser is the one setting the requires_grad flag of all of the model’s parameters to True?

No, that’s not the case.
Model.parameters(), which is passed to the optimizer, will return the parameters with their requires_grad attribute. Neither the optimizer nor the parameters() call will change it.
In your code snippet you are not using the state_dict method, so I’m unsure if I misunderstand the question.

In your code snippet you are not using the state_dict method, so I’m unsure if I misunderstand the question.

I’ll elaborate more on my question.

If I used the same code as above, calling state_dict() on the created instance of the main model would give me its weights with their requires_grad attribute set to False, as expected.

However, if I use the training loop from my previous reply on the instance of the main model (regardless of whether the pretrained weights have been loaded or not), the weights of all of its layers do get updated during the training cycles, so my thinking is that neither setting the requires_grad attribute of the pretrained layers to False nor passing only the parameters of the layers that require training to the optimiser instance would actually freeze the weights of the pretrained layers.

Hence, the rephrased question would be: Assuming that I have already loaded the pretrained layer weights into the main model, how do I ensure that the weights for the pretrained layers stay frozen?

Set their requires_grad attribute to False after creating the model.
As given in my previous code snippet: the loading of a state_dict doesn’t change the requires_grad attributes. As long as you set them once to False and do not recreate the model or any layers, they will keep this value.
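
For your model, a sketch could look like this (using the layer names from your first post; requires_grad_(False) sets the flag on every parameter of the submodule):

Model.Conv1D_L1.requires_grad_(False)
Model.BatchNorm1D_L3.requires_grad_(False)

for name, param in Model.named_parameters():
    print(name, param.requires_grad)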

As long as you set them once to False and do not recreate the model or any layers, they will keep this value.

Does this mean that when an instance of the model is created, all of its parameters have their requires_grad attribute set to True by default?

Yes, you can verify it by creating a new model instance and checking it via:

model = nn.Linear(1, 1)
for name, param in model.named_parameters():
    print(name, param.requires_grad)
> weight True
bias True

I see. Thank you for your patience and kind clarification. :grinning:

@ptrblck I’ve loaded the pretrained weights, set the requires_grad attribute of the pretrained layers to False and then proceeded to train the main model. I called Model.state_dict() on the trained main model and visualised the weights but it appears that the weights of the pretrained layers were altered.

Shown below are the first 5 weights of one of the pretrained layers before and after training the main model.

Before training:
-6.1031e-02, -4.0205e-02, -8.9633e-02, -1.9241e-02, -1.2121e-02

After training:
-6.5355e-02, -4.4506e-02, -9.4377e-02, -2.1566e-02, -1.1395e-02

It seems that setting the requires_grad attribute of the pretrained layers to False did not work as intended. Should I be using Optimiser = torch.optim.Adam(Model.parameters()) or Optimiser = torch.optim.Adam(Model.(Layer to be trained).parameters())?

Could you post an executable code snippet to reproduce this issue?

I’ve checked the weights before and after calling Model.load_state_dict() on the instance of the main model and the weights were loaded correctly.

I then used:

Model.Conv1D_L1.requires_grad = False

Model.BatchNorm1D_L3.requires_grad = False

LossMetric = torch.nn.CrossEntropyLoss()

Optimiser = torch.optim.Adam(Model.parameters())

After that, I fed the instance of the main model to the training loop as defined below:

def TrainModel(TrainLoader, Model, NumberOfEpochs, Optimiser, LossMetric):

    if(torch.cuda.is_available()):

        Model.to("cuda:0")
        UseGPU = True

    else:

        UseGPU = False

    TrainingLoss = []

    for Epoch in range(NumberOfEpochs):

        TrainingRunningLoss = 0

        Model.train()

        for Iteration, Data in enumerate(TrainLoader):

            TrainX, TrainY = Data

            if(UseGPU == True):

                TrainX = TrainX.to("cuda:0")
                TrainY = TrainY.to("cuda:0")

            Optimiser.zero_grad()

            Output = Model(TrainX)

            Loss = LossMetric(Output, TrainY)

            TrainingRunningLoss += Loss.item()

            Loss.backward()
            Optimiser.step()

        TrainingLoss.append(TrainingRunningLoss)

    return Model, TrainingLoss

I think I’ve found the issue. It appears that the instance of the optimiser needs to be

Optimiser = torch.optim.Adam(Model.(Layer to be trained).parameters())

and it seems that passing all parameters of the model to the optimiser instance would set the requires_grad attribute of all the layers to True. This means that one should only pass the parameters of the layers to be trained to their optimiser instance.

No, that’s not the case as seen here:

import torch
from torchvision import models

# Setup
model = models.resnet18()
for name, param in model.named_parameters():
    print(name, param.requires_grad)

# Freeze fc params
model.fc.weight.requires_grad = False
model.fc.bias.requires_grad = False
for name, param in model.named_parameters():
    print(name, param.requires_grad)

# Create optimizer and check, if layers were "unfrozen"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for name, param in model.named_parameters():
    print(name, param.requires_grad)

Again, feel free to post an executable code snippet, which we could use to debug. :wink:

Here’s the full code to reproduce the issue.

Model and function definitions:

import torch
import matplotlib.pyplot as plt

class Conv1DAE(torch.nn.Module):

    def __init__(self):

        super().__init__()

        self.Conv1D_L1 = torch.nn.Conv1d(7, 32, 48, 1)
        self.LeakyReLU_L2 = torch.nn.LeakyReLU()
        self.BatchNorm1D_L3 = torch.nn.BatchNorm1d(32)
        self.MaxPool1D_L4 = torch.nn.MaxPool1d(4, 4, 2, return_indices = True)
        self.MaxUnPool1D_L5 = torch.nn.MaxUnpool1d(4, 4)
        self.Conv1DTransposed_L6 = torch.nn.ConvTranspose1d(32, 7, 45, 1)

    def forward(self, InputData):

        ConvTensor = self.Conv1D_L1(InputData)
        ReLUTensor = self.LeakyReLU_L2(ConvTensor)
        BatchNormTensor = self.BatchNorm1D_L3(ReLUTensor)
        MaxPooledTensor, Indices = self.MaxPool1D_L4(BatchNormTensor)
        MaxUnPooledTensor = self.MaxUnPool1D_L5(MaxPooledTensor, Indices)
        Output = self.Conv1DTransposed_L6(MaxUnPooledTensor)

        return Output

class ConvNet(torch.nn.Module):

    def __init__(self):

        super().__init__()

        self.Conv1D_L1 = torch.nn.Conv1d(7, 32, 48, 1)
        self.LeakyReLU_L2 = torch.nn.LeakyReLU()
        self.BatchNorm1D_L3 = torch.nn.BatchNorm1d(32)
        self.MaxPool1D_L4 = torch.nn.MaxPool1d(4, 4, 2)
        self.Flattening_L5 = torch.nn.Flatten()
        self.Linear_L6 = torch.nn.Linear(800, 400)
        self.LeakyReLu_L7 = torch.nn.LeakyReLU()
        self.DropOut_L8 = torch.nn.Dropout(0.2)
        self.FinalLayer = torch.nn.Linear(400, 5)

    def forward(self, InputData):

        ConvHeadTensor = self.Conv1D_L1(InputData)
        ReLUTensor1 = self.LeakyReLU_L2(ConvHeadTensor)
        BatchNormTensor = self.BatchNorm1D_L3(ReLUTensor1)
        MaxPoolTensor = self.MaxPool1D_L4(BatchNormTensor)
        FlattenedTensor = self.Flattening_L5(MaxPoolTensor)
        LinearTensor1 = self.Linear_L6(FlattenedTensor)
        ReLUTensor2 = self.LeakyReLu_L7(LinearTensor1)
        DropOutTensor = self.DropOut_L8(ReLUTensor2)

        Output = self.FinalLayer(DropOutTensor)

        return Output

def TrainAE(TrainLoader, TestLoader, AutoEncoder, NumberOfEpochs, LearningRate = 0.001):

    LossMetric = torch.nn.MSELoss()
    Optimiser = torch.optim.Adam(AutoEncoder.parameters(), lr = LearningRate)

    if(torch.cuda.is_available()):

        Device = "cuda:0"
        AutoEncoder.to(Device)
        UseGPU = True

    else:

        UseGPU = False

    TrainingLoss = []
    ValidationLoss = []

    for Epoch in range(NumberOfEpochs):

        TrainingRunningLoss = 0

        AutoEncoder.train()

        for Iteration, Data in enumerate(TrainLoader):

            TrainX, _ = Data

            if(UseGPU == True):

                TrainX = TrainX.to(Device)

            Optimiser.zero_grad()

            Output = AutoEncoder(TrainX)

            Loss = LossMetric(Output, TrainX)

            TrainingRunningLoss += Loss.item()

            Loss.backward()
            Optimiser.step()

        TrainingLoss.append(TrainingRunningLoss)
        print("Epoch: {0}/{1}, Training Loss: {2}".format((Epoch + 1), NumberOfEpochs, round(TrainingRunningLoss, 5)))

        ValidationRunningLoss = 0

        AutoEncoder.eval()

        with torch.no_grad():

            for Iteration, Data in enumerate(TestLoader):

                TestX, _ = Data

                if(UseGPU == True):

                    TestX = TestX.to(Device)

                Predictions = AutoEncoder(TestX)

                Loss = LossMetric(Predictions, TestX)

                ValidationRunningLoss += Loss.item()

        ValidationLoss.append(ValidationRunningLoss)
        print("Epoch: {0}/{1}, Validation Loss: {2}".format((Epoch + 1), NumberOfEpochs, round(ValidationRunningLoss, 5)))

    plt.title("Plot of Auto Encoder losses")
    plt.plot(TrainingLoss)
    plt.plot(ValidationLoss)
    plt.show()
    plt.clf()

    AutoEncoder.to("cpu")

    AutoEncoderParameters = AutoEncoder.state_dict()

    del(AutoEncoderParameters["Conv1DTransposed_L6.weight"]) 
    del(AutoEncoderParameters["Conv1DTransposed_L6.bias"])

    return AutoEncoderParameters

def TrainModel(TrainLoader, TestLoader, Model, NumberOfEpochs, Optimiser, LossMetric):

    if(torch.cuda.is_available()):

        Device = "cuda:0"
        Model.to(Device)
        UseGPU = True

    else:

        UseGPU = False

    TrainingLoss = []
    ValidationLoss = []

    for Epoch in range(NumberOfEpochs):

        TrainingRunningLoss = 0

        Model.train()

        for Iteration, Data in enumerate(TrainLoader):

            TrainX, TrainY = Data

            if(UseGPU == True):

                TrainX = TrainX.to(Device)
                TrainY = TrainY.to(Device)

            Optimiser.zero_grad()

            Output = Model(TrainX)

            Loss = LossMetric(Output, TrainY)

            TrainingRunningLoss += Loss.item()

            Loss.backward()
            Optimiser.step()

        TrainingLoss.append(TrainingRunningLoss)
        print("Epoch: {0}/{1}, Training Loss: {2}".format((Epoch + 1), NumberOfEpochs, TrainingRunningLoss))

        ValidationRunningLoss = 0

        Model.eval()

        with torch.no_grad():

            for Iteration, Data in enumerate(TestLoader):

                TestX, TestY = Data

                if(UseGPU == True):

                    TestX = TestX.to(Device)
                    TestY = TestY.to(Device)

                Predictions = Model(TestX)

                Loss = LossMetric(Predictions, TestY)

                ValidationRunningLoss += Loss.item()

        ValidationLoss.append(ValidationRunningLoss)
        print("Epoch: {0}/{1}, Validation Loss: {2}".format((Epoch + 1), NumberOfEpochs, ValidationRunningLoss))


    plt.title("Plot of training and validation losses")
    plt.plot(TrainingLoss)
    plt.plot(ValidationLoss)
    plt.show()
    plt.clf()

    Model.to("cpu")

    return Model

Weight loading along with print statements for visual checks:

AutoEncoder = Conv1DAE()

AutoEncoderParameters = TrainAE(TrainLoader, TestLoader, AutoEncoder, 325, LearningRate = 0.0000064)

MainNetwork = ConvNet()

MainNetworkParameters = MainNetwork.state_dict()

print(MainNetworkParameters["BatchNorm1D_L3.bias"])

print(AutoEncoderParameters["BatchNorm1D_L3.bias"])

MainNetworkParameters.update(AutoEncoderParameters)

MainNetwork.load_state_dict(MainNetworkParameters)

UpdatedMainNetworkParameters = MainNetwork.state_dict()

print(UpdatedMainNetworkParameters["BatchNorm1D_L3.bias"])

Setting the requires_grad attribute of the pretrained layers to False:

MainNetwork.Conv1D_L1.requires_grad = False

MainNetwork.BatchNorm1D_L3.requires_grad = False

Training of the main network:

LossMetric = torch.nn.CrossEntropyLoss()

Optimiser = torch.optim.Adam(MainNetwork.parameters(), lr = 0.0000001)

MainNetwork = TrainModel(TrainLoader, TestLoader, MainNetwork, 350, Optimiser, LossMetric)

Checking of the main network’s weights post training:

TrainedMainNetworkParameters = MainNetwork.state_dict()

print(TrainedMainNetworkParameters["BatchNorm1D_L3.bias"])

Thanks for the code.
You are checking the state_dict again instead of the parameters in the model directly as shown in my code snippets.
The state_dict will return tensors without any Autograd history, so their requires_grad attribute will always be False.
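
To compare the flags (and the values) you would iterate the parameters directly, e.g.:

for name, param in MainNetwork.named_parameters():
    print(name, param.requires_grad)

before and after training, since the tensors returned by MainNetwork.state_dict() will show requires_grad=False regardless of how the model was configured.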