Hi all,

I’m dealing with a problem that is similar to multi-task learning, with one difference:

In multi-task problems, the mean of the losses from all branches is backpropagated: (loss1 + loss2 + … + lossN)/N. In my problem, however, each branch should backpropagate its loss only into its own part of the model. That is, loss1 is backpropagated into the part of the model belonging to branch 1, which I named paral_1, and in the same way paral_2, …, paral_N are each dedicated to their own branch.

How do I handle the separate backpropagation for each branch? I defined N=10 different dataloaders, one per task. Whenever a specific dataloader is active, only the related part of the model is trainable (requires_grad=True), and the unrelated parts of the model are frozen (requires_grad=False). In this way I can train all branches (N=10 in my case) simultaneously.
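In symbols, the difference between the two schemes looks roughly like this. The notation is introduced here only for illustration: $\theta_i$ stands for the parameters of branch i (paral_i and its fc layer), $\mathcal{L}_i$ for loss i, and $\eta$ for the learning rate of a plain gradient-descent step, which is an assumption about the optimizer.

```latex
% Standard multi-task learning: one averaged loss is backpropagated
\mathcal{L}_{\mathrm{MTL}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i

% What I am aiming for: each loss updates only its own branch
\theta_i \leftarrow \theta_i - \eta \, \nabla_{\theta_i} \mathcal{L}_i ,
\qquad i = 1, \dots, N
```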

**Having said that**, when there is ONLY one dataloader (N=1), the model trains well (high accuracy). However, when I increase the number of dataloaders, and with it the number of branches, the model does not train at all (almost zero accuracy for each branch).

I would expect each branch to train as if it were trained separately, even as the number of branches increases: the dataloaders are independent of each other, and with the freezing technique explained above each branch is in charge of its own part of the model (note that the unrelated parts are frozen during training), so the parameters of each branch are updated independently. So why does the model not train (almost zero accuracy) when the parameters of the branches are updated in parallel at backpropagation time, while it trains fine (high accuracy) when the number of branches is set to one?

I hope I have conveyed the problem clearly. The following piece of code is what I suspect is causing it, but I cannot see where the issue is:

```
def train_kd(model, optimizer, dataloader, ...):
    """Train the model on `num_steps` batches"""
    # set model to training mode
    model.train()
    # dataloader_list = [dataloader[i] for i in range(len(dataloader))]
    dataloader_list = [dataloader[6]]
    batches = []
    for i, batches in enumerate(zip(*dataloader_list)):
        loss_functions = [
            net.loss_fn_kd0,  # loss0
            net.loss_fn_kd1,  # loss1
            net.loss_fn_kd2,  # loss2
            net.loss_fn_kd3,  # loss3
            net.loss_fn_kd4,  # loss4
            net.loss_fn_kd5,  # loss5
            net.loss_fn_kd6,  # loss6
            net.loss_fn_kd7,  # loss7
            net.loss_fn_kd8,  # loss8
            net.loss_fn_kd9   # loss9
        ]
        loss0 = torch.tensor(0.0)
        loss1 = torch.tensor(0.0)
        loss2 = torch.tensor(0.0)
        loss3 = torch.tensor(0.0)
        loss4 = torch.tensor(0.0)
        loss5 = torch.tensor(0.0)
        loss6 = torch.tensor(0.0)
        loss7 = torch.tensor(0.0)
        loss8 = torch.tensor(0.0)
        loss9 = torch.tensor(0.0)
        loss_score = [
            Variable(loss0, requires_grad=True),
            Variable(loss1, requires_grad=True),
            Variable(loss2, requires_grad=True),
            Variable(loss3, requires_grad=True),
            Variable(loss4, requires_grad=True),
            Variable(loss5, requires_grad=True),
            Variable(loss6, requires_grad=True),
            Variable(loss7, requires_grad=True),
            Variable(loss8, requires_grad=True),
            Variable(loss9, requires_grad=True)
        ]
        for j, (train_batch, labels_batch) in enumerate(batches, 6):
            require_grad_list = [False] * 10
            require_grad_list[j] = True
            # move to GPU if available
            if params.cuda:
                # convert to torch Variables
                train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)
                train_batch, labels_batch = train_batch.cuda(async=True), labels_batch.cuda(async=True)
            output_batch = model(train_batch)
            # get one batch output from teacher_outputs list
            if params.cuda:
                output_batch = output_batch.cuda(async=True)
            output_teacher_batch = torch.from_numpy(teacher_outputs[i])
            if params.cuda:
                output_teacher_batch = output_teacher_batch.cuda(async=True)
            output_teacher_batch = Variable(output_teacher_batch, requires_grad=False)
            #############################################
            #############multi-task learning#############
            #############################################
            ## layer4 freezing parallel 0
            for child in model.layer4.paral_0.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[0]
            for param in model.layer4.fc0.parameters():
                param.requires_grad = require_grad_list[0]
            ## layer4 freezing parallel 1
            for child in model.layer4.paral_1.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[1]
            for param in model.layer4.fc1.parameters():
                param.requires_grad = require_grad_list[1]
            ## layer4 freezing parallel 2
            for child in model.layer4.paral_2.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[2]
            for param in model.layer4.fc2.parameters():
                param.requires_grad = require_grad_list[2]
            ## layer4 freezing parallel 3
            for child in model.layer4.paral_3.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[3]
            for param in model.layer4.fc3.parameters():
                param.requires_grad = require_grad_list[3]
            ## layer4 freezing parallel 4
            for child in model.layer4.paral_4.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[4]
            for param in model.layer4.fc4.parameters():
                param.requires_grad = require_grad_list[4]
            ## layer4 freezing parallel 5
            for child in model.layer4.paral_5.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[5]
            for param in model.layer4.fc5.parameters():
                param.requires_grad = require_grad_list[5]
            ## layer4 freezing parallel 6
            for child in model.layer4.paral_6.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[6]
            for param in model.layer4.fc6.parameters():
                param.requires_grad = require_grad_list[6]
            ## layer4 freezing parallel 7
            for child in model.layer4.paral_7.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[7]
            for param in model.layer4.fc7.parameters():
                param.requires_grad = require_grad_list[7]
            ## layer4 freezing parallel 8
            for child in model.layer4.paral_8.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[8]
            for param in model.layer4.fc8.parameters():
                param.requires_grad = require_grad_list[8]
            ## layer4 freezing parallel 9
            for child in model.layer4.paral_9.children():
                for param in child.parameters():
                    param.requires_grad = require_grad_list[9]
            for param in model.layer4.fc9.parameters():
                param.requires_grad = require_grad_list[9]
            optimizer.zero_grad()
            # print('one pass')
            loss_score[j].backward()
            # print('two pass')
            optimizer.step()
            # print('---------------------------------')
            # Evaluate summaries only once in a while
            if i % params.save_summary_steps == 0:
                # extract data from torch Variable, move to cpu, convert to numpy arrays
                output_batch = output_batch.data.cpu().numpy()
                labels_batch = labels_batch.data.cpu().numpy()
                .....
                # compute all metrics on this batch
                summary_batch = {metric: metrics[metric](output_batch, labels_batch, params)
                                 for metric in metrics}
```
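Since the indexing in the loop above changes when I switch between one and several dataloaders, here is a minimal pure-Python sketch of how `zip(*dataloader_list)` and `enumerate(batches, start)` pair each batch with a branch index. Plain lists stand in for the real dataloaders, and `branch_schedule` and `fake_loaders` are hypothetical names used only for this illustration.

```python
def branch_schedule(loaders, first_branch=0):
    """Yield (step, branch_index, batch) in the order the training loop visits them."""
    # zip(*loaders) draws one batch from every loader per step and stops
    # as soon as the shortest loader is exhausted.
    for step, batches in enumerate(zip(*loaders)):
        # enumerate(batches, first_branch) numbers the batches of this step
        # starting at first_branch instead of 0.
        for branch, batch in enumerate(batches, first_branch):
            yield step, branch, batch


# Hypothetical stand-ins for 10 dataloaders with 2 batches each.
fake_loaders = [[f"loader{k}-batch{b}" for b in range(2)] for k in range(10)]

# Single-loader case as in the post: only loader 6 is used, numbering starts at 6.
single = list(branch_schedule([fake_loaders[6]], first_branch=6))

# All loaders at once: numbering starts at 0 so that branch k receives loader k.
multi = list(branch_schedule(fake_loaders, first_branch=0))

for step, branch, batch in single:
    print(step, branch, batch)
```

With a single loader the branch index is always 6, matching `enumerate(batches, 6)`. If all ten loaders were zipped while the start value stayed at 6, the indices would run from 6 to 15 and no longer line up with the ten entries of `require_grad_list`, so the start value would have to be 0 in that case.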

I also used nn.CrossEntropyLoss()(outputs, labels) to calculate the loss for each branch.
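To make the behaviour I expect concrete, here is a minimal sketch of one training step on a toy model. Everything in it is an assumption for illustration and not my real network: `TinyBranches`, `set_active_branch`, the shared `trunk`, the `heads` list and the tensor sizes are all hypothetical. It computes the loss from the active branch's output with nn.CrossEntropyLoss and checks that only that branch's head changes after optimizer.step().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyBranches(nn.Module):
    """Hypothetical stand-in: a shared trunk followed by one head per branch."""

    def __init__(self, n_branches=3, in_dim=8, n_classes=4):
        super().__init__()
        self.trunk = nn.Linear(in_dim, 16)
        self.heads = nn.ModuleList(
            [nn.Linear(16, n_classes) for _ in range(n_branches)]
        )

    def forward(self, x, branch):
        return self.heads[branch](torch.relu(self.trunk(x)))


def set_active_branch(model, active):
    """Freeze every head except the active one (the trunk stays trainable here)."""
    for idx, head in enumerate(model.heads):
        for param in head.parameters():
            param.requires_grad = (idx == active)


model = TinyBranches()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(5, 8)            # fake input batch
y = torch.randint(0, 4, (5,))    # fake class labels

before = [head.weight.detach().clone() for head in model.heads]

active = 1
set_active_branch(model, active)
optimizer.zero_grad()
loss = criterion(model(x, active), y)  # loss is built from the model output
loss.backward()
optimizer.step()

# True where a head's weights differ from their values before the step.
changed = [
    not torch.equal(old, head.weight.detach())
    for old, head in zip(before, model.heads)
]
print(changed)
```

Note that in this sketch the tensor passed to backward() is computed from the model output, so autograd connects it to the model parameters. The shared trunk is left trainable, which means every branch's step also updates it. Whether that matches the real layer1 to layer3 setup is an assumption.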