Lower memory consumption with a larger batch in a multi-GPU setup

Isn't that related to what we discussed in Bug in DataParallel? Only works if the dataset device is cuda:0 - #12 by rasbt? I.e., that the results are gathered back onto one of the devices?


With a larger batch size, there is more output data to gather onto that device, which could explain why the memory difference becomes more pronounced as you increase the batch size.
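
A minimal sketch of what I mean (assuming at least two GPUs and a toy `nn.Linear` model, just for illustration): with `nn.DataParallel` the batch is split across the devices for the forward pass, but the per-replica outputs are gathered back onto `device_ids[0]` (by default `cuda:0`, unless `output_device` is set), so that one GPU ends up holding the outputs for the full batch.

```python
import torch
import torch.nn as nn

# Toy model; parameters must live on device_ids[0] before wrapping/using DataParallel.
model = nn.Linear(512, 10).to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1])

# A larger batch means more output data is gathered onto cuda:0 after the forward pass.
x = torch.randn(256, 512, device="cuda:0")
out = model(x)

print(out.shape)   # torch.Size([256, 10])
print(out.device)  # cuda:0 -- the full batch of outputs sits on the first device
```

If that's what's happening here, the extra memory on the first GPU should scale with the batch size, which would match the pattern in the plot.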