First of all, thank you for the reply.
For the first question, that matches what I thought, but the example confuses me: in the second case, where you feed through only fc2, the gradient for the fc1 weights is 0, yet the fc1 weights still change? Is this a result of using Adam?
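To illustrate what I mean, here is a minimal toy sketch (layer names fc1/fc2 follow your example; this is not my real code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(4, 4)
fc2 = nn.Linear(4, 1)
opt = torch.optim.Adam(list(fc1.parameters()) + list(fc2.parameters()), lr=1e-2)

x = torch.randn(8, 4)

# Step 1: feed through both layers, so fc1 gets a nonzero gradient and
# Adam builds up momentum statistics (exp_avg) for its parameters.
loss = fc2(fc1(x)).pow(2).mean()
opt.zero_grad(set_to_none=False)   # keep zero grad tensors rather than None
loss.backward()
opt.step()

before = fc1.weight.clone()

# Step 2: feed through fc2 only, so fc1's gradient is exactly zero...
loss = fc2(x).pow(2).mean()
opt.zero_grad(set_to_none=False)
loss.backward()
print(fc1.weight.grad.abs().sum())        # tensor(0.)
opt.step()

# ...yet fc1's weights still move, because Adam's momentum buffers
# from step 1 are nonzero.
print((fc1.weight - before).abs().sum())  # > 0
```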
To put this another way, let's say I have a rewards tensor of N x T, where N is the batch size and T is the number of tasks. I have some input of states of N x (irrelevant) and output a value for these states. The input is fed through some shared layers, and then separate network heads output values, so I have T heads, each outputting an N x 1 tensor of values.
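Concretely, the architecture I mean looks something like this (all names and sizes made up):

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    def __init__(self, state_dim, hidden_dim, num_tasks):
        super().__init__()
        # Shared trunk used by every task.
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        # One value head per task, each producing an N x 1 output.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, states):
        features = self.shared(states)
        # A list of T tensors, each of shape (N, 1).
        return [head(features) for head in self.heads]
```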
Case 1:
I could then concatenate the head outputs into an N x T tensor we'll call QVals.
If I now compute critic_loss = nn.MSELoss()(QVals, rewards) (equivalently F.mse_loss(QVals, rewards)), then call critic_loss.backward() and step the optimiser, does this act as I wish it to, training each part according to the relevant loss?
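In code, I mean roughly this (a sketch using the MultiHeadCritic above; the dimensions and learning rate are made up):

```python
import torch
import torch.nn.functional as F

N, T, state_dim, hidden_dim = 32, 3, 8, 16
model = MultiHeadCritic(state_dim, hidden_dim, T)   # from the sketch above
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

states = torch.randn(N, state_dim)
rewards = torch.randn(N, T)                         # placeholder rewards

values = model(states)                              # list of T tensors, each (N, 1)
QVals = torch.cat(values, dim=1)                    # (N, T)

critic_loss = F.mse_loss(QVals, rewards)            # default 'mean' averages over all N*T elements
optimiser.zero_grad()
critic_loss.backward()
optimiser.step()
```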
Case 2:
If I instead do not concatenate these outputs, but keep the T separate N x 1 value tensors and compare each to its own reward tensor (T tensors of shape N x 1) with MSE loss, I now have T loss values. If I were to call backward() on each of these individually and then call step() once after all T backward calls, would that be equivalent to Case 1?
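Roughly like this (continuing from the Case 1 sketch; I believe the intermediate backward() calls need retain_graph=True since all T losses share the trunk's graph):

```python
values = model(states)                    # one forward pass; list of T (N, 1) tensors

optimiser.zero_grad()
for t in range(T):
    loss_t = F.mse_loss(values[t], rewards[:, t:t + 1])   # per-head loss, mean over N
    # retain_graph so the shared trunk's graph survives the next backward;
    # the T calls accumulate into the same .grad buffers.
    loss_t.backward(retain_graph=(t < T - 1))
optimiser.step()                          # a single step after all T backward calls
```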
Case 3:
As Case 2, but I call backward(), step(), and zero_grad() for each of the T losses. How does this differ from Cases 1 and 2?
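Something like this (again continuing from the Case 1 setup; note I re-run the forward pass per task here, since I believe step() modifies the weights in place and the old graph can't be backpropagated through again):

```python
for t in range(T):
    # Fresh forward pass per task: once step() has modified the weights
    # in place, the previous graph can no longer be reused.
    values = model(states)
    loss_t = F.mse_loss(values[t], rewards[:, t:t + 1])

    optimiser.zero_grad()
    loss_t.backward()
    optimiser.step()   # the next task now sees the updated weights
```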
Case 4:
I feed through the network completely separately for each task, outputting from only one head at a time (I modify my network to also take the head to use as an argument). The input data is exactly the same. I can calculate/backprop losses again as per Case 2 or Case 3. How does this differ?
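For example, with a head argument (a hypothetical modification of the sketch above, backpropagating as per Case 3):

```python
class MultiHeadCriticV2(MultiHeadCritic):
    def forward(self, states, head_idx):
        features = self.shared(states)
        return self.heads[head_idx](features)   # only one head's (N, 1) output

model = MultiHeadCriticV2(state_dim, hidden_dim, T)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for t in range(T):
    value_t = model(states, t)                  # exact same input each time
    loss_t = F.mse_loss(value_t, rewards[:, t:t + 1])
    optimiser.zero_grad()
    loss_t.backward()
    optimiser.step()
```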
I appreciate I'm asking some potentially difficult questions, so thank you a lot for your help. If there is any material you could link me to that might help me understand, it would be greatly appreciated.
For question 2 (this is unrelated to question 1): if I have a distribution dist, and normally I sample from dist with .sample(), how can I instead select the value with the highest probability from dist?
If I have p(a) = 0.3, p(b) = 0.6 and p(c) = 0.1, sample() is going to draw at random, outputting a, b, or c according to these probabilities. That is what I want for one model, but I also have a second model for which I want to reuse the same code; instead of calling sample(), I wish to always output the value with the highest probability from dist, in this case always b.
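In code, what I'm after is something like this (assuming dist is a torch.distributions.Categorical; greedy is a flag I would add):

```python
import torch
from torch.distributions import Categorical

dist = Categorical(probs=torch.tensor([0.3, 0.6, 0.1]))  # p(a), p(b), p(c)

def select_action(dist, greedy=False):
    if greedy:
        # Deterministically pick the highest-probability entry.
        return dist.probs.argmax(dim=-1)
    return dist.sample()

print(select_action(dist))               # a, b or c according to the probabilities
print(select_action(dist, greedy=True))  # always tensor(1), i.e. b
```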