Hi,
In my training I need to perform a forward pass, followed by multiple backward passes with `retain_graph=True`, and finally one last call to `backward` before `optimizer.step()`.
I do this because I compute a heavy loss: if I accumulate the 'partial' losses and call `backward` on everything together, I hit a CUDA OOM before reaching that final backward.
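To make it concrete, here is a stripped-down sketch of the single-GPU loop (the model and losses are toy placeholders, not my real ones):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                     # stand-in for my real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch = torch.randn(4, 10)

optimizer.zero_grad()
output = model(batch)                         # single forward pass

# several 'partial' losses, each backwarded separately to keep peak memory low
partial_losses = [output[:, :5].pow(2).mean(), output[:, 5:].pow(2).mean()]
for loss in partial_losses:
    loss.backward(retain_graph=True)          # keep the graph alive for the next backward

final_loss = output.mean()                    # the last (heavy) term in my real code
final_loss.backward()                         # final backward, graph can now be freed
optimizer.step()
```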
I now want to train on multiple GPUs, but from what I see here it seems that I can't use `no_sync` for my use case, as the forward pass also needs to be inside the context.
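For reference, the usage pattern I see in the DDP docs looks roughly like this (assuming the usual process group setup), with the forward of the non-synced iterations happening inside the context, which doesn't match my single-forward / multiple-backward structure:

```python
import torch.nn as nn

# assuming the process group is already initialized and `model` is on the right device
ddp_model = nn.parallel.DistributedDataParallel(model)

with ddp_model.no_sync():
    for inp in inputs[:-1]:
        ddp_model(inp).sum().backward()   # forward + backward, no gradient sync
ddp_model(inputs[-1]).sum().backward()    # forward + backward, gradients are synced
optimizer.step()
```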
Another point is that the number of backward passes before the final one won't be the same across the different inputs.
Is there a way for me to disable syncing until the final backward pass?