Are there best practices for avoiding out-of-memory errors when using DataParallel? For instance, I have 8 cards, but I think the gradients are all accumulated on one card (card 0 by default), so would it make sense to use 7 cards and set the 8th one as the output device? Or would that be a waste? I’ve tried experimenting with it, but the overhead is so high that I’m not sure I’d get there quickly.
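For reference, the setup being asked about could look roughly like the sketch below. It assumes PyTorch's `nn.DataParallel` with its `device_ids` and `output_device` arguments. The toy model and tensor sizes are hypothetical stand-ins, since the actual network isn't shown in the thread. On a machine with fewer than 3 GPUs it falls back to the default behaviour so that it still runs.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; stands in for the real (unknown) network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

n_gpus = torch.cuda.device_count()
device = torch.device("cuda:0" if n_gpus > 0 else "cpu")
model = model.to(device)  # parameters must live on device_ids[0]

if n_gpus >= 3:
    # Replicate the model on every card except the last one,
    # and gather the outputs on the last card instead of card 0.
    compute_ids = list(range(n_gpus - 1))
    dp_model = nn.DataParallel(model, device_ids=compute_ids,
                               output_device=n_gpus - 1)
else:
    # Fallback: default DataParallel behaviour (single GPU or CPU).
    dp_model = nn.DataParallel(model)

x = torch.randn(64, 128, device=device)  # batch is split along dim 0
out = dp_model(x)
print(out.shape, out.device)
```

Whether this helps depends on what is actually filling card 0, so it is worth checking per-device usage (for example with `torch.cuda.memory_allocated(i)`) before giving up a whole card.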
Using a card solely to accumulate gradients doesn’t sound so good. What does your model look like?
In many models the activations use more memory than the weights and gradients, so try decreasing your batch size if you’re running out of memory.
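To make that concrete, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: a hypothetical stack of three 3x3 convolutions on 224x224 inputs, float32 values, stride 1 with "same" padding, and one stored output per layer for the backward pass. It ignores optimizer state and framework overhead, so real numbers will differ.

```python
BYTES_PER_FLOAT32 = 4

def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one square-kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out

def weights_and_grads_bytes(layers, kernel=3):
    """Parameter memory plus one gradient per parameter (2x parameters)."""
    n_params = sum(conv_params(c_in, c_out, kernel) for c_in, c_out in layers)
    return 2 * n_params * BYTES_PER_FLOAT32

def activation_bytes(layers, height, width, batch_size):
    """Stored layer outputs; assumes spatial size is unchanged by each layer."""
    floats_per_sample = sum(c_out * height * width for _, c_out in layers)
    return batch_size * floats_per_sample * BYTES_PER_FLOAT32

# Hypothetical (in_channels, out_channels) pairs, not the poster's model.
layers = [(3, 64), (64, 64), (64, 64)]

fixed = weights_and_grads_bytes(layers)
print(f"weights + grads: {fixed / 2**20:.2f} MiB (independent of batch size)")
for batch_size in (8, 16, 32):
    act = activation_bytes(layers, 224, 224, batch_size)
    print(f"batch {batch_size:>2}: activations ~ {act / 2**20:.0f} MiB")
```

In this estimate the weight and gradient term stays fixed while the activation term grows linearly with the batch size, which is why halving the batch roughly halves the dominant part of the footprint.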
Did you ever solve this problem? If so, how did you solve it? Thank you.