Why do we need to set the gradients manually to zero in pytorch?

@albanD, I have some doubts about what the computation graph looks like in the case where we accumulate the loss manually.

Assume there are 256 batches in total for the dataset. In the normal case (case 1), for each batch in an epoch a new computation graph is created, and after the backward pass the graph is freed. So 256 computation graphs are created and freed during one epoch.

In the other case (case 2), since we only do a backward pass every 64 batches, does that mean only 4 graphs are created? Each of these graphs is created by composing 64 of the smaller graphs from case 1, and the root node of the bigger graph is total_loss. The 64 smaller graphs all share the same set of learnable parameters. If that is the case, the bigger graph will consume a lot of memory, since it contains 64 copies of the small graph.

Is that right? Do you have any ideas?

1 Like

Hi,

Indeed, in the first case you will create 256 graphs, each working with one input.
In the second case, you will create only 4 graphs, but each of these 4 graphs is actually composed of 64 copies of the graph above plus some Add operations at the end that sum the losses.

Indeed, in the second case you will use much more memory: over the 64 iterations you will create a single graph that just keeps growing, and so you will use more and more memory.
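
To make that concrete, here is a minimal sketch (the model, batch shapes and sizes are made up, and it assumes a CUDA device so the allocated memory is easy to track) that watches memory grow while the losses are accumulated and only drop after the single backward:

import torch
import torch.nn as nn

device = "cuda"  # assumes a GPU is available
net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1)).to(device)
crit = nn.MSELoss()

total_loss = 0
for i in range(64):
    x = torch.randn(32, 512, device=device)
    y = torch.randn(32, 1, device=device)
    total_loss = total_loss + crit(net(x), y)  # this iteration's graph stays alive
    if (i + 1) % 16 == 0:
        # keeps growing: every previous forward is still referenced by total_loss
        print(i + 1, torch.cuda.memory_allocated() // 1024, "KiB")

total_loss.backward()  # the whole accumulated graph is freed only here
print("after backward:", torch.cuda.memory_allocated() // 1024, "KiB")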

6 Likes

So we have to make sure the batch size is not too large, or we will run out of memory.

1 Like

Here are three equivalent pieces of code, with different runtime/memory consumption.
Assume that you want to run SGD with a batch size of 100.
(I didn't run the code below, so there might be some typos, sorry in advance.)

1: single batch of 100 (least runtime, more memory)

# some code
# Initialize dataset with batch size 100
for input, target in dataset:
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    opt.zero_grad()
    loss.backward()
    # graph is cleared here
    opt.step()

2: multiple small batches of 10 (more runtime, least memory)

# some code
# Initialize dataset with batch size 10
opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    loss.backward()
    # graph is cleared here
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.step()
        opt.zero_grad()

3: accumulate loss for multiple batches (more runtime, more memory)

# some code
# Initialize dataset with batch size 10
loss = 0
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    current_loss = crit(pred, target)
    # current graph is appended to existing graph
    loss = loss + current_loss
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.zero_grad()
        loss.backward()
        # huge graph is cleared here
        opt.step()
        loss = 0  # reset the accumulated loss so the freed graph is not reused

It should be clear that case 3 is not what you want.
The choice between case 1 and case 2 is a tradeoff between memory and speed, so it depends on what you want to do.
Note that if you can fit a batch size of 50 in memory, you can do a variation of case 2 with a batch size of 50 and an update every 2 iterations (sketched below).
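
For completeness, here is a self-contained sketch of that variation; the model, data and learning rate are made up, but the pattern is batches of 50 with an optimizer step every 2 iterations, i.e. an effective batch size of 100:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

net = nn.Linear(20, 1)
crit = nn.MSELoss()
opt = torch.optim.SGD(net.parameters(), lr=0.1)
# Initialize dataset with batch size 50
dataset = DataLoader(TensorDataset(torch.randn(1000, 20), torch.randn(1000, 1)),
                     batch_size=50)

opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    loss.backward()           # graph is freed here, gradients accumulate in .grad
    if (i + 1) % 2 == 0:
        # every 2 batches of size 50 -> effective batch size of 100
        opt.step()
        opt.zero_grad()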

91 Likes

In my use case, I am doing image retrieval with a siamese network with 2 branches, so a dataset sample contains two images and a label indicating whether they are similar or not.

I do not want to change the image aspect ratio, so randomly cropping the images to the same size is not a valid choice. As a result, the batch size is actually 1. Each time we process one image pair and accumulate the loss, and when the number of image pairs reaches the real batch size, we back-propagate the accumulated loss.

In case 2, each time a single loss is calculated, the loss (which should be divided by the real batch size) is immediately back-propagated and the graph is then freed, which is more memory efficient. I think the results of case 2 and case 3 should be the same. But in case 2, since we back-propagate many more times, the training speed is a lot slower (I have done some tests to confirm that).

I would prefer case 3 for its faster training speed, but we need to choose the real batch size carefully in order not to blow up the memory.
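
For reference, here is a minimal sketch of the per-pair scheme described above; the network, loss and image sizes are made up, but it shows batch size 1, each pair's loss divided by the real batch size, backward called immediately, and an optimizer step once a full "real batch" of pairs has been processed:

import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # shared branch that handles arbitrary image sizes via adaptive pooling
        self.branch = nn.Sequential(
            nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(4 * 4 * 4, 32))

    def forward(self, a, b):
        return self.branch(a), self.branch(b)

siamese_net = SiameseNet()
pair_loss = nn.CosineEmbeddingLoss()
opt = torch.optim.SGD(siamese_net.parameters(), lr=0.01)
real_batch_size = 64

# fake image pairs of varying sizes, each "batch" being a single pair
pairs = [(torch.randn(1, 1, 40 + i, 30 + i), torch.randn(1, 1, 40 + i, 30 + i),
          torch.tensor([1.0 if i % 2 else -1.0])) for i in range(128)]

opt.zero_grad()
for i, (img_a, img_b, label) in enumerate(pairs):
    feat_a, feat_b = siamese_net(img_a, img_b)
    loss = pair_loss(feat_a, feat_b, label) / real_batch_size  # scale the per-pair loss
    loss.backward()                     # graph is freed right away, grads accumulate
    if (i + 1) % real_batch_size == 0:
        opt.step()
        opt.zero_grad()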

1 Like

Follow-up: first I tried to accumulate 64 single losses and then do one backward pass, but without success (GPU out of memory). When I reduced the number of accumulated losses to 16, it worked. So right now the real batch size is 64, but I do a backward pass every 16 samples (4 backward passes for the whole batch).

Thanks a lot… I understand it clearly now.

Can you explain why #3 uses more memory than #2?
Why does calling loss.backward less often cause it to use more memory?

#3 uses more memory because you need to store the intermediate results of 10 forward passes to be able to do the backpropagation. In #2 you never keep more than the intermediate results of 1 forward pass.
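
One rough way to see this is to count the tensors the autograd engine stashes for the backward pass; the model and data below are made up, and it assumes a PyTorch version that provides torch.autograd.graph.saved_tensors_hooks (1.10 or later):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
crit = nn.MSELoss()
data = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(10)]

saved = []

def pack(t):
    saved.append(t)  # autograd wants to keep t around for the backward pass
    return t

def unpack(t):
    return t

# case 2 style: backward every iteration, only one forward's history alive at a time
saved.clear()
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    for x, y in data:
        crit(net(x), y).backward()
print("saved tensors per forward:", len(saved) // len(data))

# case 3 style: accumulate the loss, all 10 forwards' history alive until the one backward
saved.clear()
total = 0
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    for x, y in data:
        total = total + crit(net(x), y)
print("saved tensors kept until backward:", len(saved))  # roughly 10x the number above
total.backward()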

6 Likes

That makes sense.

Also, you wrote

# current graph is appended to existing graph
loss = loss + current_loss

I thought the loss would just be a scalar? But is it actually the entire graph?

loss here is a Variable containing a single element, and it has associated with it all the history of the computations that were made, so that it can backpropagate.
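
Here is a tiny sketch of what that history looks like (the model below is made up; in current PyTorch, Variable and Tensor have been merged): the loss is a single-element tensor, and the recorded operations hang off its grad_fn.

import torch
import torch.nn as nn

net = nn.Linear(5, 1)
x = torch.ones(8, 5)
loss = (net(x) - 1).pow(2).mean()

print(loss.shape)                     # torch.Size([]) -> a single element
print(loss.grad_fn)                   # the last operation recorded in the history
print(loss.grad_fn.next_functions)    # the operations that feed into it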

2 Likes

Where is this history stored exactly? It seems like it’s stored outside the variable. Let’s say I create two loss functions like so:

B = 8
linear = nn.Linear(5, 1)
x = Variable(torch.ones(B, 5))
y = linear(x)
loss_1 = 10 - y.sum()
loss_2 = 5 - y.sum()

Now as soon as I backpropagate loss_1, buffers are cleared.

loss_1.backward()

Backpropagating on loss_2 will give an error now:

loss_2.backward()  #gives an error

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

So it seems like the history is stored outside both loss_1 and loss_2, and is not deleted after calling backward() if retain_graph is True.

EDIT: Is it correct to assume that a new graph is created at the step y = linear(x)? In that case, can it be presumed that those buffers (or that history) reside in y and are referred to by subsequent Variables like loss_1 and loss_2?

1 Like

@nivter I made a short video a while back that might shed some light on your questions: https://www.youtube.com/watch?v=4F2LfiY8JLo

I am new to PyTorch so I might have misunderstood. The three approaches don't look equivalent to me. I did a small test with the first two: basically fitting four 1s to 7, where the difference is that the first script changes the weights every step while the second does so every 5 steps.

import torch
import torch.nn as nn

torch.manual_seed(1)

model = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 1, bias=False),
)

x = torch.ones(4)
y0 = torch.tensor(7.)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

optimizer.zero_grad()
for i in range(200):

    y = model(x)
    loss = loss_fn(y, y0)

    if loss.item() < 1e-5:
        print(f'after {i} steps')
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('y:', y)
for label, value in model.state_dict().items():
    print(label)
    print(value)

after 126 steps
y: tensor([6.9988], grad_fn=)
0.weight
tensor([[ 0.8411, 0.3628, 0.4865, 0.8181],
[-0.4707, 0.2999, -0.1029, 0.2544],
[-0.0641, -0.1948, 0.0051, -0.1089],
[ 0.1826, -0.1949, -0.0365, -0.0450],
[ 0.6402, 0.5657, 1.0048, 0.7233],
[-0.1862, -0.3020, -0.0838, -0.2157],
[ 0.4406, 0.6248, 0.8989, 0.8726],
[-0.6859, 0.1128, -0.0575, 0.2771]])
2.weight
tensor([[ 0.8684, -0.3221, -0.2251, -0.1705, 0.9041, -0.0589, 0.7641, 0.0078]])

import torch
import torch.nn as nn

torch.manual_seed(1)

model = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 1, bias=False),
)

x = torch.ones(4)
y0 = torch.tensor(7.)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

optimizer.zero_grad()
for i in range(5000):

    y = model(x)
    loss = loss_fn(y, y0)

    if loss.item() < 1e-5:
        print(f'after {i} steps')
        break

    loss.backward()

    if (i+1) % 5 == 0:
        optimizer.step()
        optimizer.zero_grad()

print('y:', y)
for label, value in model.state_dict().items():
    print(label)
    print(value)

after 630 steps
y: tensor([6.9988], grad_fn=)
0.weight
tensor([[ 0.8411, 0.3628, 0.4865, 0.8181],
[-0.4707, 0.2999, -0.1029, 0.2544],
[-0.0641, -0.1948, 0.0051, -0.1089],
[ 0.1826, -0.1949, -0.0365, -0.0450],
[ 0.6402, 0.5657, 1.0048, 0.7233],
[-0.1862, -0.3020, -0.0838, -0.2157],
[ 0.4406, 0.6248, 0.8989, 0.8726],
[-0.6859, 0.1128, -0.0575, 0.2771]])
2.weight
tensor([[ 0.8684, -0.3221, -0.2251, -0.1705, 0.9041, -0.0589, 0.7641, 0.0078]])

As you can see, the first program converges five times faster than the second one. The result would be similar if I changed the input to random numbers. It seems to me that with the second approach the weights are updated using averaged gradients, so the first approach is more efficient.

One more question: why does the third approach take more memory? Say,

loss_sum += loss

import sys
sys.getsizeof(loss)
sys.getsizeof(loss_sum)

I would get the same result for both (72 in my case). So where is the associated history stored? Is there a property or function to get it? Thanks

1 Like

Hi,

In your code, the first one effectively uses a batch size of 1 while the second one uses a batch size of 5.
In this particular case, where all the samples are the same, the batch size is useless, as the averaged gradients for the batch will be the same as the gradients for one sample, and so it's just faster to run with a batch size of 1.
Note that this is not true any more as soon as your inputs are actually different from each other, and working with batches allows you to get “better” gradients.

The last one uses more memory because it has to keep around all the history for all the elements in the batch, not just the last one.
This is saved by the autograd engine backend, and you cannot measure its size with sys.getsizeof.
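
A small sketch of that point (the model below is made up): sys.getsizeof only measures the Python wrapper object, while the tensor data lives in its own storage and the computation history is held by the autograd engine, reachable through grad_fn.

import sys
import torch
import torch.nn as nn

net = nn.Linear(1000, 1000)
x = torch.randn(64, 1000)

loss = net(x).sum()
loss_sum = loss + net(x).sum() + net(x).sum()

print(sys.getsizeof(loss), sys.getsizeof(loss_sum))  # same small number for both
print(loss.element_size() * loss.nelement())         # bytes of the scalar's own data: 4
print(loss_sum.grad_fn)                              # the history hangs off grad_fn instead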

1 Like

@albanD: Which option is the same as iter_size in Caffe, which is very popular in DeepLab? Thanks

You need to change the inner check from:

if (i+1)%10 == 0:

to

if (i+1)%iter_size == 0:

Thanks, but I meant: will option 1, option 2 or option 3 in your answer reproduce performance close to the iter_size option in Caffe?

All three options compute the exact same gradients, so they will give the same result as using Caffe with iter_size and a Caffe batch_size of batch_size / iter_size, in the same way that the batch size in examples 2 and 3 is reduced compared to example 1.
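
As a quick check with a made-up model, and assuming mean-reduction losses with each sub-batch loss divided by the number of sub-batches (which is roughly what Caffe's iter_size normalization amounts to), accumulated gradients match the gradient of one big batch:

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(8, 1)
crit = nn.MSELoss()
x, y = torch.randn(100, 8), torch.randn(100, 1)

# one big batch of 100
net.zero_grad()
crit(net(x), y).backward()
big_batch_grad = net.weight.grad.clone()

# 10 sub-batches of 10 with accumulated gradients
net.zero_grad()
for xb, yb in zip(x.chunk(10), y.chunk(10)):
    (crit(net(xb), yb) / 10).backward()
accumulated_grad = net.weight.grad.clone()

print(torch.allclose(big_batch_grad, accumulated_grad, atol=1e-6))  # True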

2 Likes

Hello, as you described above:

Indeed, in the second case you will use much more memory: over the 64 iterations you will create a single graph that just keeps growing, and so you will use more and more memory.

Why does the size of the size-64 computation graph keep growing? Shouldn't it be constant, since the computation and the input size stay constant?

1 Like