Model takes twice the memory footprint with distributed data parallel

You are most likely seeing the same effect described here.
