Is it expected for DistributedDataParallel to use more memory on 1 GPU in a 1GPU:1process setup?

This might be relevant to this post: if CUDA_VISIBLE_DEVICES is not restricted to one device per process, and the application calls torch.cuda.empty_cache() somewhere without an active device context, that call can initialize a CUDA context on device 0, which would show up as extra memory on that GPU. A sketch of the workaround is below.
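A minimal sketch of what I mean, assuming a torchrun-style launch where the launcher sets LOCAL_RANK per process (the env var name is an assumption, adjust for your launcher):

```python
import os
import torch

# Each process pins itself to its own GPU before doing any CUDA work.
local_rank = int(os.environ["LOCAL_RANK"])  # assumed to be set by the launcher
torch.cuda.set_device(local_rank)

# ... training step ...

# If you free the caching allocator, do it under this process's own device
# context so the call cannot create a CUDA context on device 0.
with torch.cuda.device(local_rank):
    torch.cuda.empty_cache()
```

Alternatively, export CUDA_VISIBLE_DEVICES so each process only sees its own GPU; then device 0 inside the process is always the right one.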
