In my training I need to perform a forward pass, followed by multiple backward passes with `retain_graph=True`, and finally one last `backward()` call before the optimizer step.
I do this because I compute a heavy loss that causes a CUDA OOM before reaching the final backward if I instead accumulate the 'partial' losses and call `backward()` on everything together.
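To make the pattern concrete, here is a minimal sketch of what my loop does on a single GPU (the model, loss, and targets are just stand-ins; in my real code the loss computation is what blows up memory):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)                           # stand-in for my real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(16, 8)
targets = [torch.randn(16, 4) for _ in range(3)]  # stand-ins for the "partial" losses

output = model(x)                                 # one forward pass
for t in targets[:-1]:
    loss = criterion(output, t)                   # heavy partial loss in my real code
    loss.backward(retain_graph=True)              # keep the graph alive; grads accumulate
criterion(output, targets[-1]).backward()         # final backward frees the graph
optimizer.step()
optimizer.zero_grad()
```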
I now want to train on multiple GPUs, but from what I see here it seems that I can't use `no_sync` for my use case, as the forward pass also needs to be inside the context.
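For reference, this is the `no_sync` pattern I'm referring to, adapted from the DDP docs; each forward has to run inside the context for the sync skipping to apply, which is exactly what doesn't fit my single-forward, multiple-backward setup (`micro_batches` is a placeholder):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed is already initialized (e.g. launched via torchrun)
ddp_model = DDP(torch.nn.Linear(8, 4))
micro_batches = [torch.randn(16, 8) for _ in range(3)]

with ddp_model.no_sync():                          # gradient sync disabled...
    for x in micro_batches[:-1]:                   # ...but the forwards must happen here too
        ddp_model(x).sum().backward()              # grads accumulate locally, no all-reduce
ddp_model(micro_batches[-1]).sum().backward()      # this backward triggers the all-reduce
```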
Another point is that the number of backward passes before the final one won't be the same across the different inputs.
Is there a way for me to disable syncing until the final backward pass?