Halfway Fusion with two datasets

I am attempting to create a model for halfway fusion using visual and thermal data. The following is the model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 53 * 53, 157)
        self.fc2 = nn.Linear(157 + 157, 84)  # takes the concatenated features of both branches
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x1, x2):
        # visual branch
        x1 = self.pool(F.relu(self.conv1(x1)))
        x1 = self.pool(F.relu(self.conv2(x1)))
        x1 = x1.view(x1.size(0), -1)
        x1 = F.relu(self.fc1(x1))

        # thermal branch (shares conv1, conv2 and fc1 with the visual branch)
        x2 = self.pool(F.relu(self.conv1(x2)))
        x2 = self.pool(F.relu(self.conv2(x2)))
        x2 = x2.view(x2.size(0), -1)
        x2 = F.relu(self.fc1(x2))

        # halfway fusion: concatenate the features of both branches
        x3 = torch.cat((x1, x2), dim=1)

        x3 = F.relu(self.fc2(x3))
        x3 = self.fc3(x3)
        return x3
net = Net().to(device)

for epoch in range(1):
    running_loss = 0.0
    # iterate over corresponding visual/thermal batches in parallel
    for i, (vs_data, th_data) in enumerate(zip(vs_trainloader, th_trainloader)):
        vs_images, vs_labels, vs_bbox = vs_data
        th_images, th_labels, th_bbox = th_data
        vs_images, vs_labels = vs_images.to(device), vs_labels.to(device)
        th_images, th_labels = th_images.to(device), th_labels.to(device)

        optimizer.zero_grad()
        outputs = net(vs_images, th_images)
        loss = criterion(outputs, labels)  # <-- the line I am unsure about
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 30 == 29:    # print every 30 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 30))
            running_loss = 0.0

print('Finished Training Trainset')

The issue I am having is with the loss = criterion(outputs, labels) line. How can I make sure that the loss is calculated properly for the two different datasets? Each image in both datasets has its own corresponding label.

Any advice would be greatly appreciated. Thank you in advance.

If each dataset has its own label tensor, the concatenation seems to be wrong.
Currently you are fusing the activations of both images so that your model has only a single output layer. Could you explain your use case a bit?
Maybe processing both images separately without the fusion would be the right approach?

I am trying to design a halfway fusion technique, where I fuse the features from the two images (visual and thermal) and build a classifier on top of them. The technique requires that the feature extraction takes place within the network. Something like the image below.

I’m sorry if this is a bit confusing. I am new to Deep Learning and still learning to use PyTorch.

It seems the output should be bounding boxes for pedestrians using both images as inputs.
In that case you would only have the bbox output and a single loss function.
The gradients in the shared layers will be accumulated.

I think your approach should be right.

Sorry, I do not fully understand your reply. Could you, if possible, give me a basic example of what you mean? Or do you know of any existing examples that I could look at?

Sorry for not being clear enough.
What I meant is that your model architecture should generally work as you would expect.
As far as I understand your use case, you would like to predict bounding boxes using two different image modalities. I assume you only have one set of bounding boxes for each image pair, i.e. the boxes have the same location in the visible and IR image. Is that correct?

If so, all shared layers (Multispectral Pedestrian Detection layer, Multispectral Feature Fusion layer) will get gradients, which will be calculated using a single loss based on both image inputs.
The Visible and Infrared Feature Extraction layers will also get valid gradients corresponding to the different image modalities.
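Here is a rough sketch just to show the gradient flow (the tiny placeholder layers and the bbox target are made up, not your actual architecture): a single loss on the fused output produces gradients in the fusion block and in both feature extractors.

import torch
import torch.nn as nn

# placeholder blocks, just to illustrate the gradient flow
vis_features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # visible branch
ir_features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())   # infrared branch
fusion = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 4))          # fused head, e.g. a bbox output

vis_img = torch.randn(2, 3, 8, 8)
ir_img = torch.randn(2, 3, 8, 8)
target_box = torch.randn(2, 4)  # one set of boxes per image pair

fused = torch.cat((vis_features(vis_img), ir_features(ir_img)), dim=1)
loss = nn.functional.smooth_l1_loss(fusion(fused), target_box)  # single loss
loss.backward()

# all three blocks received gradients from the single loss
print(vis_features[0].weight.grad.abs().sum(),
      ir_features[0].weight.grad.abs().sum(),
      fusion[1].weight.grad.abs().sum())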

What confuses me is the statement “each image for both datasets will have its own corresponding labels”. Could you explain it a bit? Are the bounding box locations different for the visible and IR image?

Also, let me know if I completely misunderstood your use case. :wink:

Sorry, I must not have been clear when I stated my use case initially. I do have bounding boxes for the colour and thermal images in different csv files. So I have two separate csv files: one for the colour images and one for the thermal images. Each csv file contains the name and location of the image, the label and the bounding box values.

So, for example, colour image1 will have its location and label in the colour csv file, and thermal image1 will have its location and label in the thermal csv file.

However, for what I'm attempting to design, I was going to use only the images and their corresponding labels, and add the bounding box values later.

So just to clarify, the information for the colour and thermal images is kept in separate csv files. I hope that makes sense, and I hope I haven't over-explained it. If it doesn't make sense, perhaps I can post a snippet of the csv files?
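In case it helps, this is a simplified sketch of how I imagine reading one of the csv files into a dataset (the column names and transforms are just placeholders, not my actual code):

import pandas as pd
import torch
from torch.utils.data import Dataset
from PIL import Image

class SingleModalityDataset(Dataset):
    def __init__(self, csv_path, transform=None):
        # each row: image path, class label, bounding box values
        self.df = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(row['image_path']).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(row['label'], dtype=torch.long)
        bbox = torch.tensor([row['x'], row['y'], row['w'], row['h']], dtype=torch.float)
        return image, label, bbox  # matches the (images, labels, bbox) tuples in the training loop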

Thanks for the explanation!

In that case I'm not sure the fusion layer would be the best approach.
If your use case involved just a single target, I think the fused-layer approach could be a good idea (at least in my opinion :wink: ).
Both feature extraction blocks would work separately on the two image modalities, and the fusion layers would use these features to learn the label and, later, the bounding boxes. Basically, the “classifier” part (fused layers) would be able to select the necessary features from the feature extractors.

However, you are apparently dealing with separate labels and bounding box coordinates.
If you feed the features from the feature extractors to the fused block, you would have to separate them again afterwards to use two (or four) different output layers (label + bbox for each modality). Alternatively, you could try a multi-label approach with a single large output layer, which should learn to predict both labels, but I'm not sure if that's the best way here.
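To illustrate the first option, here is a rough sketch with separate classification heads on top of the fused features (the layer sizes are arbitrary, and bbox heads could be added in the same way):

import torch
import torch.nn as nn

class TwoHeadFusionNet(nn.Module):
    def __init__(self, feat_dim=157, num_classes=10):
        super(TwoHeadFusionNet, self).__init__()
        self.fc_fused = nn.Linear(feat_dim * 2, 84)
        self.head_vis = nn.Linear(84, num_classes)  # label for the visible image
        self.head_ir = nn.Linear(84, num_classes)   # label for the thermal image

    def forward(self, feat_vis, feat_ir):
        fused = torch.relu(self.fc_fused(torch.cat((feat_vis, feat_ir), dim=1)))
        return self.head_vis(fused), self.head_ir(fused)

# one loss per head, summed into a single backward pass
model = TwoHeadFusionNet()
criterion = nn.CrossEntropyLoss()
feat_vis, feat_ir = torch.randn(4, 157), torch.randn(4, 157)
vis_labels, ir_labels = torch.randint(0, 10, (4,)), torch.randint(0, 10, (4,))

out_vis, out_ir = model(feat_vis, feat_ir)
loss = criterion(out_vis, vis_labels) + criterion(out_ir, ir_labels)
loss.backward()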

Does it make sense to you, or did I misunderstand something?

Thank you so much for taking the time to understand my problem and for the advice you have given me. I will try your suggestion of the multi-label approach, as that makes more sense for what I am trying to do. I will let you know how I get on.

Again, thank you for all your advice.

I know this problem was posted a while back, but I am currently working on sensor/feature fusion. I used torch.cat to fuse the output features of certain layers of my network. However, I have recently seen an example where the fusion was achieved by simply adding the two output features. Something like this:

def forward(self, input):
    conv_1 = self.conv_1(input)
    conv_2 = self.conv_2(conv_1)
    res = conv_1 + conv_2  # fusion by element-wise addition instead of torch.cat
    conv_3 = self.conv_3(res)
    return conv_3
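
Just to make the difference concrete, here is a minimal comparison of the two fusion styles (the shapes are made up):

import torch

feat_a = torch.randn(1, 16, 32, 32)
feat_b = torch.randn(1, 16, 32, 32)

# concatenation stacks the feature maps along the channel dimension
fused_cat = torch.cat((feat_a, feat_b), dim=1)  # shape: [1, 32, 32, 32]

# addition keeps the shape, but both feature maps have to match exactly
fused_add = feat_a + feat_b                     # shape: [1, 16, 32, 32]

print(fused_cat.shape, fused_add.shape)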

Could someone confirm which method would be better for my problem? Thanks in advance.