Transfer learning with different inputs


I have used the transfer learning example provided on the website and it works pretty well. Now I would like to apply a pre-trained model, such as ResNet18, to two different modalities, and then fuse the two models at the FC layer to continue training. Any thoughts on how this could be implemented here? The network parameters could either be shared or not shared. Thanks.


What do you mean by “different modalities”?
As far as I understand you would like to get two pre-trained models and concat their activations at some point?

To clarify: for instance, one input is the raw RGB image, and the other input is the depth image.

Yes, concatenation at an early stage or a late stage should be fine. Thanks.

Ah ok, I see.
So I suppose the inputs have different numbers of channels, i.e. the image has 3 channels while the depth map has 1 channel?
I created a small code snippet:

import torch
import torch.nn as nn
from torchvision import models

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # ResNet18 backbone for the RGB image, with the final fc layer removed
        image_modules = list(models.resnet18().children())[:-1]
        self.modelA = nn.Sequential(*image_modules)

        # ResNet18 backbone for the depth image; a 3x3 conv first maps the
        # single depth channel to the 3 channels the backbone expects
        depth_modules = list(models.resnet18().children())[:-1]
        self.modelB = nn.Sequential(nn.Conv2d(1, 3, 3, 1, 1),
                                    *depth_modules)
        # each backbone yields 512 features -> 1024 after concatenation
        self.fc = nn.Linear(1024, 1)

    def forward(self, image, depth):
        a = self.modelA(image)
        b = self.modelB(depth)
        # flatten both feature maps and concatenate along the feature dim
        x = torch.cat((a.view(a.size(0), -1), b.view(b.size(0), -1)), dim=1)
        x = self.fc(x)
        x = torch.sigmoid(x)
        return x

x_image = torch.randn(1, 3, 224, 224)
x_depth = torch.randn(1, 1, 224, 224)

model = MyModel()
output = model(x_image, x_depth)

Does it suit your needs?


Alternatively, you could add the depth information as a fourth channel and edit the first layer of resnet18 so that it takes 4 input channels instead of three.

to steal from @ptrblck’s nice example:

x_image = torch.randn(1, 3, 224, 224)
x_depth = torch.randn(1, 1, 224, 224)

input = torch.cat((x_image, x_depth), dim=1) # RGBD input

model = models.resnet18()
model.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)

output = model(input)

I’m not sure which would work better for your purposes but this saves a lot of parameters vs the siamese method.


What changes need to be made to the data loader here?

Basically none. Your Dataset should provide samples containing 4 channels.

But a custom class has to be written for the data loader, if I am right?

You would write a custom Dataset and just pass it to the DataLoader.
Here is a small example:

from PIL import Image
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, image_paths, targets, transform=None):
        self.image_paths = image_paths
        self.targets = targets
        self.transform = transform

    def __getitem__(self, index):
        x =[index])
        y = self.targets[index]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.image_paths)

dataset = MyDataset(image_paths, targets)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
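For the 4-channel case specifically, the `__getitem__` would load both the RGB image and the depth map and concatenate them. A minimal sketch, using random tensors in place of real image/depth files (the `RGBDDataset` name and synthetic data are mine, not from the thread):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RGBDDataset(Dataset):
    """Pairs each RGB image with its depth map and returns a 4-channel sample."""
    def __init__(self, images, depths, targets):
        self.images = images    # tensor of shape (N, 3, H, W)
        self.depths = depths    # tensor of shape (N, 1, H, W)
        self.targets = targets

    def __getitem__(self, index):
        # stack along the channel dim -> (4, H, W)
        x = torch.cat((self.images[index], self.depths[index]), dim=0)
        return x, self.targets[index]

    def __len__(self):
        return len(self.targets)

images = torch.randn(8, 3, 224, 224)
depths = torch.randn(8, 1, 224, 224)
targets = torch.randint(0, 2, (8,))

dataset = RGBDDataset(images, depths, targets)
loader = DataLoader(dataset, batch_size=4)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([4, 4, 224, 224])
```

With real data you would load the two files (e.g. with PIL), apply your transforms, and do the same `torch.cat` before returning the sample.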