I have the following problem. I am trying to propagate multiple additional scalar outputs out of my network, for example the latency or memory consumption of the respective layers, in addition to the output itself. I would then like to add these scalars to the main loss, let's say cross-entropy.
With a single GPU, I am using a @dataclass to accumulate the respective scalar layer outputs, and I then add its contents to the main loss. However, I do have multiple GPUs that I could utilise for training, and I am not sure how to propagate the respective scalars and combine them such that I could call .backward(). Any help is much appreciated. Thanks.
If you are using nn.DataParallel, the model will be replicated to each GPU and each replica will get a chunk of your input batch.
The output will be gathered on the default device, so most likely you wouldn’t have to change anything.
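As a minimal sketch of this behaviour: if the forward pass returns the auxiliary scalar as a tensor alongside the prediction, DataParallel will gather it like any other output (with multiple GPUs the per-replica scalars come back stacked into a vector, so a reduction such as `.mean()` handles both cases). The model and the stand-in cost here are illustrative, not your actual setup:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        out = self.fc(x)
        # return the per-replica scalar as a tensor so it can be gathered;
        # this is just a stand-in for a real per-layer scalar
        cost = out.abs().mean()
        return out, cost

model = nn.DataParallel(Net())  # falls back to the plain module without GPUs
x = torch.randn(4, 8)
out, cost = model(x)

# with multiple GPUs, `cost` is gathered into a vector (one entry per
# replica) on the default device; .mean() reduces it either way
target = torch.randint(0, 4, (4,))
loss = nn.functional.cross_entropy(out, target) + cost.mean()
loss.backward()
```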
However, I’m not sure about the use case.
How are you calculating the memory consumption and is this operation differentiable?
I assume it's not differentiable, so your accumulated loss would in fact just be the plain cross-entropy loss shifted by a constant (the accumulated scalars would not contribute any gradients).
Thank you for getting back to me. I forgot to mention that the scalars are multiplied by a parameter that I would like to learn (I am experimenting with neural architecture search).
When I did some small-scale experiments, I did not observe any errors, so it has to be my implementation that is wrong. Nevertheless, thank you for your clarification.
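For the learnable-multiplier setup described above, one way to keep the cost term differentiable is to store each layer's measured latency as a constant buffer and scale it by a learnable parameter, then return the summed cost from forward so DataParallel can gather it. This is only a sketch under those assumptions; `LayerWithCost`, the latency constants, and `alpha` are illustrative names:

```python
import torch
import torch.nn as nn

class LayerWithCost(nn.Module):
    def __init__(self, in_f, out_f, latency):
        super().__init__()
        self.fc = nn.Linear(in_f, out_f)
        # measured, constant latency of this layer (not learned)
        self.register_buffer("latency", torch.tensor(latency))
        # learnable architecture weight scaling the latency term
        self.alpha = nn.Parameter(torch.ones(()))

    def forward(self, x):
        return self.fc(x), self.alpha * self.latency

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = LayerWithCost(8, 16, 0.5)
        self.layer2 = LayerWithCost(16, 4, 1.2)

    def forward(self, x):
        x, c1 = self.layer1(x)
        x, c2 = self.layer2(torch.relu(x))
        # return the accumulated cost as a tensor so it can be gathered
        return x, c1 + c2

model = nn.DataParallel(Net())  # no-op fallback on CPU / a single device
out, cost = model(torch.randn(4, 8))
target = torch.randint(0, 4, (4,))
# cost.mean() reduces the gathered per-replica costs (or the lone scalar)
loss = nn.functional.cross_entropy(out, target) + cost.mean()
loss.backward()  # gradients now flow into each layer's alpha
```

Since the latency values are constants, the gradient of the cost term with respect to each `alpha` is just that layer's latency, which is what makes the term usable in the loss.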