Questions about loss propagation and about selecting from a distribution

Question 1: If I have multiple separate output head layers in a network, with the head in use selected either by passing the head index as a network input or by a one-hot encoded matrix (I have done both implementations), is loss propagated according to the head? For example, if there were two heads and head two were never used in the forward pass for a batch, would the weights exclusive to head two be unchanged?

To give more specific context: if both heads are Q-value predictors for different tasks, can I simply sum the losses for the tasks together (each calculated by a comparison to the relevant task's rewards), and does torch automatically make the weight changes proportionally? If head one's values were perfect and head two's values were awful, would summing the losses affect the heads equally or not?
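Roughly, this is the kind of setup I mean (the dimensions, names, and fake targets below are made up just for illustration):

import torch
import torch.nn as nn

# hypothetical two-head setup: a shared trunk with one Q-value head per task
shared = nn.Linear(4, 16)
head1 = nn.Linear(16, 2)                 # task 1 Q values
head2 = nn.Linear(16, 2)                 # task 2 Q values

states = torch.randn(8, 4)               # fake batch of 8 states
targets1 = torch.randn(8, 2)             # fake Q targets for task 1
targets2 = torch.randn(8, 2)             # fake Q targets for task 2

features = torch.relu(shared(states))
loss1 = nn.functional.mse_loss(head1(features), targets1)
loss2 = nn.functional.mse_loss(head2(features), targets2)
(loss1 + loss2).backward()               # does each head only receive its own task's gradient?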

Question 2: If I have a distribution constructed from a softmax output, such as torch.distributions.Categorical, but I sometimes want to be able to choose actions deterministically (as in DDPG), how can I select the highest-probability output from the distribution rather than sampling randomly according to the probabilities?

To question 1:
The computation graph is only created for the operations that were used in the forward pass.
Autograd will thus only calculate gradients for the parameters that were involved in these calculations.
However, if you are using e.g. an optimizer with running estimates (such as Adam), all parameters with valid running stats will be updated.
Example:

import torch
import torch.nn as nn
import torch.optim as optim

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # head 0
        self.fc2 = nn.Linear(1, 1)  # head 1
        
    def forward(self, x, idx):
        # only the selected head participates in the forward pass
        if idx == 0:
            x = self.fc1(x)
        elif idx == 1:
            x = self.fc2(x)
        return x

model = MyModel()
optimizer = optim.Adam(model.parameters(), lr=1.)

x = torch.randn(1, 1)
output = model(x, idx=0)
output.backward()
print('fc1.weight.grad ', model.fc1.weight.grad)
print('fc2.weight.grad', model.fc2.weight.grad)

print('Before 1st optimization')
print(model.fc1.weight)
print(model.fc2.weight)

optimizer.step()
optimizer.zero_grad()

print('After')
print(model.fc1.weight)
print(model.fc2.weight)

output = model(x, idx=1)
output.backward()
print('fc1.weight.grad ', model.fc1.weight.grad)
print('fc2.weight.grad ', model.fc2.weight.grad)

print('Before 2nd optimization')
print(model.fc1.weight)
print(model.fc2.weight)

optimizer.step()
print('After')
print(model.fc1.weight)
print(model.fc2.weight)

Output:
fc1.weight.grad  tensor([[-0.4842]])
fc2.weight.grad None
Before 1st optimization
Parameter containing:
tensor([[-0.7616]], requires_grad=True)
Parameter containing:
tensor([[-0.8436]], requires_grad=True)
After
Parameter containing:
tensor([[0.2384]], requires_grad=True)
Parameter containing:
tensor([[-0.8436]], requires_grad=True)
fc1.weight.grad  tensor([[0.]])
fc2.weight.grad  tensor([[-0.4842]])
Before 2nd optimization
Parameter containing:
tensor([[0.2384]], requires_grad=True)
Parameter containing:
tensor([[-0.8436]], requires_grad=True)
After
Parameter containing:
tensor([[0.9085]], requires_grad=True)
Parameter containing:
tensor([[0.1564]], requires_grad=True)

I’m not sure I understand the second question completely (I’m not really experienced in RL), but if you select the highest value e.g. via torch.max, only the max value will get a valid gradient:

x = torch.randn(1, 10, requires_grad=True)
out, idx = torch.max(x, 1)
out.backward()
print(x.grad, idx)
> tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]]) tensor([5])

Would that work or am I misunderstanding the question?

Firstly, thank you for the reply,

For the first, that is what I thought, but the example confuses me: in the second case you feed the input through only fc2, the gradient for fc1's weights is 0, but the weight for fc1 still changes? Is this a result of using Adam?

To put this another way, let's say I have a rewards tensor of shape N x T, where N is the batch size and T is the number of tasks. I have some input of states of shape N x (irrelevant) and output a value for these states. The input is fed through some shared layers, and then separate network head layers output values, so I have T heads each outputting an N x 1 tensor of values.
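To make the shapes concrete, here is a minimal sketch of what I'm describing (all the names and dimensions are made up for illustration):

import torch
import torch.nn as nn

N, T, state_dim = 8, 3, 4                # batch size, task count, state size (all made up)

class MultiHeadCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(state_dim, 16)
        # one value head per task, each producing an N x 1 output
        self.heads = nn.ModuleList([nn.Linear(16, 1) for _ in range(T)])

    def forward(self, states):
        features = torch.relu(self.shared(states))
        return [head(features) for head in self.heads]   # list of T tensors, each N x 1

critic = MultiHeadCritic()
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
states = torch.randn(N, state_dim)       # fake batch of states
rewards = torch.randn(N, T)              # fake N x T reward tensor
values = critic(states)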

Case 1:
I could then concatenate these outputs into an N x T tensor we'll call QVals.
If I now call critic_loss = nn.MSELoss()(QVals, rewards), then call critic_loss.backward() and step the optimiser, does this act as I wish it to, training each head according to the relevant loss?
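i.e., continuing the sketch above, roughly:

# Case 1 sketch: concatenate the T head outputs and compare to rewards with one MSE loss
QVals = torch.cat(values, dim=1)                      # N x T
critic_loss = nn.functional.mse_loss(QVals, rewards)
critic_loss.backward()
optimizer.step()
optimizer.zero_grad()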

Case 2:
If I instead do not concatenate these outputs, but output T tensors of shape N x 1 and compare them to individual reward tensors (T tensors of shape N x 1), again with MSE loss, I now have T loss values. If I were to call backward() on each of these individually and then step() once after all T backward() calls, would that be equivalent to case 1?
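Continuing the same sketch, something like:

# Case 2 sketch: T separate losses and backward() calls, then a single step()
values = critic(states)                               # fresh forward pass
optimizer.zero_grad()
for t in range(T):
    loss_t = nn.functional.mse_loss(values[t], rewards[:, t:t + 1])
    # retain_graph so the shared part of the graph survives the earlier backward() calls
    loss_t.backward(retain_graph=True)
optimizer.step()
optimizer.zero_grad()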

Case 3:
As case 2, but I call backward(), step(), and zero_grad() for each of the T losses. How does this differ from cases 1 and 2?
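Something like the following (still using the sketch above; I recompute the forward pass inside the loop on the assumption that stepping in between invalidates the old graph, since step() changes the weights in place):

# Case 3 sketch: a full backward()/step()/zero_grad() cycle per task
for t in range(T):
    values = critic(states)               # recompute, since step() changed the weights
    loss_t = nn.functional.mse_loss(values[t], rewards[:, t:t + 1])
    optimizer.zero_grad()
    loss_t.backward()
    optimizer.step()                      # note: the shared layers get stepped T times here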

Case 4:
I feed the same input through the network completely separately for each task, outputting from only one head at a time (I modify my network to also take the head to use as an argument). I can then calculate and backprop the losses again as per case 2 or case 3. How does this differ?
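Roughly, reusing the made-up names and shapes from the sketch above:

# Case 4 sketch: a separate forward pass per task, using only one head each time
class SingleHeadCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(state_dim, 16)
        self.heads = nn.ModuleList([nn.Linear(16, 1) for _ in range(T)])

    def forward(self, x, head_idx):
        features = torch.relu(self.shared(x))
        return self.heads[head_idx](features)          # a single N x 1 output

critic2 = SingleHeadCritic()
optimizer2 = torch.optim.Adam(critic2.parameters(), lr=1e-3)
for t in range(T):
    value_t = critic2(states, head_idx=t)              # exactly the same input each time
    loss_t = nn.functional.mse_loss(value_t, rewards[:, t:t + 1])
    optimizer2.zero_grad()
    loss_t.backward()
    optimizer2.step()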

I appreciate I’m asking some potentially difficult questions so thank you a lot for your help. If there is any material you could link me to that may help me understand that would be appreciated greatly.

For question 2 (this is unrelated to question 1): I'm simply asking, if I have a distribution dist, and normally I sample from dist with .sample(), how can I instead select the value with the highest probability from dist?
If I have p(a) = 0.3, p(b) = 0.6 and p(c) = 0.1, sample() is going to output a, b or c according to these probabilities, which is what I want for one model. But I also have a second model for which I want to reuse the same code, except that instead of calling sample() I wish to always output the value with the highest probability from dist, in this case always b.
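In code, this is what I'm after (assuming dist is a Categorical built from the softmax output; the probabilities are just the toy values above):

import torch
from torch.distributions import Categorical

probs = torch.tensor([0.3, 0.6, 0.1])        # p(a), p(b), p(c) from the example above
dist = Categorical(probs=probs)

stochastic_action = dist.sample()            # a, b or c according to the probabilities
greedy_action = dist.probs.argmax(dim=-1)    # always index 1 (i.e. b), the most likely outcome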