Semi multi-task learning problem

Hi all,
I'm dealing with a problem that is similar to multi-task learning, with one difference: in multi-task problems, the mean of the per-branch losses is backpropagated, (loss1 + loss2 + … + lossN)/N. In my problem, however, each branch should backpropagate its loss only into its own part of the model, i.e., loss1 is backpropagated into the part of the model belonging to branch 1, which I named paral_1, and likewise paral_2, …, paral_N are dedicated to their respective branches. How do I handle the separate backpropagation for each branch? I defined N = 10 different dataloaders, one per task. Whenever a specific dataloader is active, only the related part of the model is trainable (requires_grad=True) and the unrelated parts are frozen (requires_grad=False). In this way I can train all branches (N = 10 in my case) simultaneously.

Having said that, when there is ONLY one dataloader (N = 1), the model trains well (high accuracy). However, when I increase the number of dataloaders, and hence the number of branches, the model does not train at all (almost zero accuracy for every branch). I would expect each branch to train just as if it were trained separately: the dataloaders are independent of each other and, with the freezing technique described above, each branch is responsible only for its own part of the model (unrelated parts are frozen during training), so the parameters of each branch are updated independently. So why does the model not train (almost zero accuracy) when the branch parameters are updated in parallel at backpropagation time, while it trains fine (high accuracy) when the number of branches is set to one?
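To make the intent concrete, here is a minimal sketch of what I am trying to achieve (the function name, the loop structure, and the assumption that the model returns one output per branch are simplified stand-ins, not my actual code, which is further down):

import torch.nn as nn

def train_branches(model, optimizer, dataloaders, num_branches=10):
    # one dataloader per branch; each branch owns model.layer4.paral_k and model.layer4.fck
    criterion = nn.CrossEntropyLoss()
    for k, loader in enumerate(dataloaders):
        for inputs, labels in loader:
            # freeze every parallel part except the one belonging to branch k
            for b in range(num_branches):
                trainable = (b == k)
                for p in getattr(model.layer4, 'paral_%d' % b).parameters():
                    p.requires_grad = trainable
                for p in getattr(model.layer4, 'fc%d' % b).parameters():
                    p.requires_grad = trainable
            outputs = model(inputs)                  # assumes one output per branch
            loss_k = criterion(outputs[k], labels)   # loss of branch k only
            optimizer.zero_grad()
            loss_k.backward()                        # should only reach the unfrozen branch-k parameters
            optimizer.step()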
I hope I could convey the problem. The full training function below is what I suspect is causing it, but I cannot see where the issue is:

def train_kd(model, optimizer, dataloader, ...):
    """Train the model on `num_steps` batches

    # set model to training mode
    model.train()
 

    # dataloader_list = [dataloader[i] for i in range(len(dataloader))]
    dataloader_list = [dataloader[6]]
    batches = []
    for i, batches in enumerate(zip(*dataloader_list)):


        loss_functions = [
            net.loss_fn_kd0,  # loss0
            net.loss_fn_kd1,  # loss1
            net.loss_fn_kd2,  # loss2
            net.loss_fn_kd3,  # loss3
            net.loss_fn_kd4,  # loss4
            net.loss_fn_kd5,  # loss5
            net.loss_fn_kd6,  # loss6
            net.loss_fn_kd7,  # loss7
            net.loss_fn_kd8,  # loss8
            net.loss_fn_kd9   # loss9
            
        ]
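        # note: these per-branch loss functions are defined here but never called in the loop below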
        # ten per-branch losses, each starting as a fresh zero-valued leaf tensor
        loss_score = [Variable(torch.tensor(0.0), requires_grad=True)
                      for _ in range(10)]
		
  		
        for j, (train_batch, labels_batch) in enumerate(batches, 6):
  		

            require_grad_list = [False] * 10
            require_grad_list[j] = True
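            # only the j-th branch is left trainable for this batch; all others are frozen below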
            # convert to torch Variables
            train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)
            # move to GPU if available
            if params.cuda:
                train_batch = train_batch.cuda(async=True)
                labels_batch = labels_batch.cuda(async=True)

            output_batch  = model(train_batch)			
            # get one batch output from teacher_outputs list
            if params.cuda:
                output_batch = output_batch.cuda(async=True)
                
            output_teacher_batch = torch.from_numpy(teacher_outputs[i])
            if params.cuda:
                output_teacher_batch = output_teacher_batch.cuda(async=True)
            output_teacher_batch = Variable(output_teacher_batch, requires_grad=False)

            
            #############################################
            #############multi-task learning#############
            #############################################

            # freeze/unfreeze each parallel block and its fc head in layer4:
            # only branch j (require_grad_list[j] == True) stays trainable
            for k in range(10):
                paral_k = getattr(model.layer4, 'paral_%d' % k)
                fc_k = getattr(model.layer4, 'fc%d' % k)
                for child in paral_k.children():
                    for param in child.parameters():
                        param.requires_grad = require_grad_list[k]
                for param in fc_k.parameters():
                    param.requires_grad = require_grad_list[k]
					
            optimizer.zero_grad()
            # print('one pass')			
            loss_score[j].backward()
            # print('two pass')						
            optimizer.step()
            # print('---------------------------------')

                
        # Evaluate summaries only once in a while
        if i % params.save_summary_steps == 0:
            # extract data from torch Variable, move to cpu, convert to numpy arrays
            output_batch = output_batch.data.cpu().numpy()
            labels_batch = labels_batch.data.cpu().numpy()
            .....
            # compute all metrics on this batch
            summary_batch = {metric:metrics[metric](output_batch, labels_batch, params)
                             for metric in metrics}

I also used nn.CrossEntropyLoss()(outputs, labels) to calculate the loss for each branch.
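For completeness, this is roughly how the loss of a single branch is computed (a simplified, self-contained sketch; branch_logits and the shapes are stand-ins, not variables from the code above):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
# stand-ins for one branch's output and its task labels
branch_logits = torch.randn(8, 5, requires_grad=True)   # (batch_size, num_classes)
labels = torch.randint(0, 5, (8,))                       # (batch_size,)
branch_loss = criterion(branch_logits, labels)
branch_loss.backward()   # gradients only reach tensors created with requires_grad=True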