Multimodal fusion: semantic segmentation and forecasting

For my binary segmentation model, I have two inputs: images and sensor data (a history of numeric measurements) for each image.

The numeric data are time series.

Assuming I have

70 entries for each day

sequence length = 20 (days)

image batch size = 10

images size = torch.Size([10, 3, 256, 256])

sensors_data size = torch.Size([10, 20, 70])
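
For concreteness, dummy inputs matching these shapes (a sketch; the variable names are illustrative):

import torch

images = torch.randn(10, 3, 256, 256)   # [batch, channels, height, width]
sensors_data = torch.randn(10, 20, 70)  # [batch, seq_len = 20 days, 70 entries per day]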

For semantic segmentation I am using a U-Net encoder/decoder;
for the numeric data I am using an RNN. The idea is to take the output of the RNN as a weight that modulates the segmentation.
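
For reference, a sketch of the module pieces the forward function below assumes (the class name, layer sizes, and the plain nn.RNN are placeholders, not the actual model):

import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, encoder, decoder, input_dim=70, hidden_dim=64, layer_dim=1):
        super().__init__()
        self.encoder = encoder  # U-Net encoder
        self.decoder = decoder  # U-Net decoder, producing [B, 2, 256, 256] logits
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)  # default layout: [seq, batch, features]
        self.fc = nn.Linear(hidden_dim, 1)  # one scalar forecast per sequence
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")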

So my forward function is as follows:

def forward(self, images, x):

    encoded = self.encoder(images)
    decoded = self.decoder(encoded)
    Fuse = torch.clone(decoded)  # [10, 2, 256, 256]

    h0 = torch.zeros(self.layer_dim, 1, self.hidden_dim).requires_grad_(True).to(self.device)
    myoutput = torch.empty(0).to(self.device)  # accumulates one forecast per image

    # loop over image batches
    for i in range(x.size(1)):

        # loop over days in each image batch
        for j in range(x.size(0)):
            out, h0 = self.rnn(x[j:j+1, i:i+1, :], h0.detach())
            out = out[:, -1, :]

        out = self.fc(out)
        last = torch.clone(out[0])
        myoutput = torch.cat((myoutput, last), 0)

    # fusion part: threshold the forecasts into per-image weights
    Weights = torch.clone(myoutput)
    Weights = torch.where(Weights > 35, 0.7, 0.3)
    Weights = Weights.view(Weights.size(0), 1, 1, 1)

    Fuse[:, 1, None] = torch.mul(Fuse[:, 1, None], Weights)

    return decoded, myoutput, Fuse
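
To make the broadcasting in the fusion step explicit, a standalone sketch (the values are illustrative; it assumes self.fc outputs one scalar per image):

import torch

decoded = torch.randn(10, 2, 256, 256)            # segmentation logits: [batch, classes, H, W]
myoutput = torch.randn(10) * 50                   # one forecast per image (illustrative)

Weights = torch.where(myoutput > 35, 0.7, 0.3)    # hard threshold: [10]
Weights = Weights.view(Weights.size(0), 1, 1, 1)  # [10, 1, 1, 1]

Fuse = decoded.clone()
Fuse[:, 1, None] = Fuse[:, 1, None] * Weights     # scales the foreground channel per image
print(Fuse.shape)                                 # torch.Size([10, 2, 256, 256])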


in the training loop:

        logits, sensors, fusion = model(images, x)

        optimizer1.zero_grad()
        optimizer2.zero_grad()
        optimizer3.zero_grad()

        loss1 = F.cross_entropy(logits, labels)
        loss0 = Funcloss0(sensors, labels_sensors)
        loss2 = F.cross_entropy(fusion, labels_fusion)

        loss0.backward(retain_graph=True)
        loss1.backward(retain_graph=True)
        loss2.backward(retain_graph=False)
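
Side note: since the three backward calls just accumulate gradients into the same graph, an equivalent pattern (a sketch; the step() calls are assumptions, as the snippet above stops before them) is to sum the losses and call backward once, which avoids retain_graph:

        loss = loss0 + loss1 + loss2
        loss.backward()

        optimizer1.step()
        optimizer2.step()
        optimizer3.step()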

PS: I kept the non-fused output to compare results.

The problem is that the predictions of the RNN are very bad and I don’t know why. Can you spot the mistake? Does this seem logical?

I would recommend checking the shapes of all tensors, as it seems you might be indexing the RNN inputs/outputs in the wrong way.
E.g. here:

    # loop over image batches
    for i in range(x.size(1)):
        # loop over days in each image batch
        for j in range(x.size(0)):
            out, h0 = self.rnn(x[j:j+1, i:i+1, :], h0.detach())
            out = out[:, -1, :]

it seems x is packed as [seq_len, batch_size, features]. While out should have the same dimension order, you are indexing it in dim1 (the batch dimension), whereas I guess you want to use the last time step?
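
If x is instead meant as [batch_size, seq_len, features] (which the posted sensors_data shape [10, 20, 70] suggests), the double loop would not be needed at all. A minimal sketch, assuming the RNN is created with batch_first=True and placeholder sizes:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=70, hidden_size=64, batch_first=True)
fc = nn.Linear(64, 1)

x = torch.randn(10, 20, 70)   # [batch, seq_len, features]
out, h_n = rnn(x)             # out: [batch, seq_len, hidden] = [10, 20, 64]
last = out[:, -1, :]          # last time step for every sample: [10, 64]
myoutput = fc(last)           # one scalar per image: [10, 1]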

I don’t fully understand this comment.
Indexing out in dim0 (the seq_len dimension) should give you the desired time step for each sample.
It sounds as if you are now concerned about the data loading itself; could you explain the issue a bit more?
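
To illustrate what I mean for the default (non-batch_first) layout, a small sketch:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=70, hidden_size=64)  # default expects [seq_len, batch, features]
x = torch.randn(20, 10, 70)                  # 20 days, batch of 10, 70 entries per day
out, h_n = rnn(x)                            # out: [20, 10, 64]
last = out[-1]                               # indexing dim0 gives the last time step: [10, 64]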