Transfer learning of a pretrained network - am I doing it wrong?

Hello dear pytorch community,

I have two binary classification tasks that I would like to train a model on.
The model I am using for that is a normalization-free ResNet-26 architecture.

Task #1 is fairly easy to learn: after training for a couple of epochs, the accuracy and loss are all fairly reasonable. Let's say that Task #1 is to differentiate between A and B.

Task #2, however, is much harder. Even after 50 epochs of training, the test accuracy doesn't surpass 70%, and overall the model is overfitting (i.e., 98% accuracy in training vs. 70% accuracy in test).
Task #2 is to differentiate between A and C.

Since both tasks use A as an input, and both classify A against some other input, I thought it would be a good idea to train a model on Task #1 and then use the pretrained weights on Task #2.

Now here is the point where I am a bit lost. When I load the model pretrained on Task #1 and evaluate it directly on Task #2, it does not succeed, meaning chance-level accuracy. A colleague told me that I might need to fine-tune the model first.
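
For completeness, the loading step looks roughly like this (a minimal sketch; the checkpoint path and the build_nf_resnet26 constructor are placeholders for my actual setup):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder constructor and checkpoint path from the Task #1 run
model = build_nf_resnet26(num_classes=2)
model.load_state_dict(torch.load("taskA_vs_B.pt", map_location=device))
model.to(device)
model.eval()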

So I did some googling and ended up with the following setup.

I fine-tune the pretrained model for 30 epochs using this code:

import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

fine_tune_epochs = 30
criterion = nn.CrossEntropyLoss()  # assuming cross-entropy, as in the original training
optimizer = optim.Adam(model.parameters(), lr=0.0001)

for epoch in range(fine_tune_epochs):
    model.train()
    running_loss = 0.0
    train_correct = 0
    train_total = 0

    for inputs, labels in tqdm(train_loader, desc="Fine-tuning", leave=False):
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()

        # Forward pass, loss, backward pass, parameter update
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Accumulate loss and accuracy statistics
        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        train_total += labels.size(0)
        train_correct += (predicted == labels).sum().item()

    print(f"Epoch [{epoch+1}/{fine_tune_epochs}], "
          f"Loss: {running_loss/len(train_loader):.4f}, "
          f"Accuracy: {100 * train_correct / train_total:.2f}%")

This actually gives me training accuracy of around 78% and, more importantly, 75% accuracy on the test set.

To me this suggests that some transfer learning is happening, because I need far less training time and less data to achieve better accuracy (and loss) than when I train a model on the task from scratch.

But I am wondering whether I am merely retraining the model itself. I'm not sure this setup makes much sense, results aside.

Furthermore, I also tried freezing the fully-connected layer, but this actually just worsened the performance. Is there maybe a way to retrain only the layers that are associated with input B?

That is, leave the layers that have learned the representations of input A untouched and retrain only the layers associated with the other class…?
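
For reference, the conventional freezing pattern seems to be the opposite of what I tried: freeze the backbone and retrain only the head. A minimal sketch, assuming the final layer is called model.fc (the attribute name depends on the actual NF-ResNet-26 implementation):

# Freeze everything, then unfreeze only the final classification layer
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

# Give the optimizer only the parameters that are still trainable
optimizer = optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.0001
)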

I would appreciate any help and feedback a lot. Please let me know if I can provide more information to paint a clearer picture.

My main questions are: Is the setup I'm using reasonable at all, and either way, how could I use more sophisticated freezing strategies?

All the best

Hey, so I have a few questions:

  1. Can you be more specific about what you mean by "task"? From what you have described, taking the model that was trained to differentiate between A and B and then training it to differentiate between A and C is transfer learning in itself. The model used to differentiate between A and B is, technically, a pre-trained model.
  2. What kind of model are you using? It sounds like you could just combine classes A, B, and C into one dataset and train the model on that (see the sketch below).
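
Something along these lines (a minimal sketch; dataset_AB and dataset_AC are placeholders for your two existing datasets, assumed to already use consistent labels A=0, B=1, C=2):

import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader

# Merge the two task datasets into a single 3-class dataset
combined = ConcatDataset([dataset_AB, dataset_AC])
loader = DataLoader(combined, batch_size=32, shuffle=True)

# Swap the binary head for a 3-class head ("fc" is a placeholder name)
model.fc = nn.Linear(model.fc.in_features, 3)
criterion = nn.CrossEntropyLoss()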

ANSWERS:

You are 100% correct: you are simply retraining the model.

The reason freezing the layers made performance worse is that, at that point, you entirely changed the mapping of everything before them. This would be akin to telling someone how to get to LA from NY, but deciding it would be easier to tell them how to get to LA from San Diego and assuming that they already know how to get to San Diego.

Unfortunately, you don’t get to pick and choose layers based on performance. Everything is connected.

I would say no. I would rather suggest you train the model on both tasks at the same time. This is called multi-task learning: essentially, you optimize the model to perform its best on both tasks jointly. You can even weight the loss from each task based on which one you want the model to perform better on (see the sketch below).
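
A minimal sketch of that idea, assuming a shared backbone (e.g. your NF-ResNet-26 trunk with the classifier removed) and hypothetical per-task heads, labels, and loss weights:

import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared backbone with one classification head per task."""
    def __init__(self, backbone, feature_dim):
        super().__init__()
        self.backbone = backbone                   # shared feature extractor
        self.head_ab = nn.Linear(feature_dim, 2)   # Task #1: A vs. B
        self.head_ac = nn.Linear(feature_dim, 2)   # Task #2: A vs. C

    def forward(self, x):
        features = self.backbone(x)
        return self.head_ab(features), self.head_ac(features)

# Inside the training loop: weighted sum of the two task losses.
# This assumes a batch that carries labels for both tasks; in practice
# you may alternate batches from each task's own loader instead.
criterion = nn.CrossEntropyLoss()
w_ab, w_ac = 0.3, 0.7  # arbitrary weights; lean on the task you care about more

out_ab, out_ac = model(inputs)
loss = w_ab * criterion(out_ab, labels_ab) + w_ac * criterion(out_ac, labels_ac)
loss.backward()
optimizer.step()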

CONCLUSION:
If you can give us more information about the tasks, the data, etc., that would be extremely helpful. It's hard to give you an answer without really knowing what kind of problem we're actually solving, beyond your interpretation of it.