Is it expected for DistributedDataParallel to use more memory on 1 GPU in a 1GPU:1process setup?

This might be relevant to this post: if CUDA_VISIBLE_DEVICES is not restricted to one device per process, and the application calls torch.cuda.empty_cache() somewhere without an active device context, that call can initialize a CUDA context on device 0, which would show up as extra memory on that GPU. A sketch of the workaround is below.
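A minimal sketch of what I mean, assuming a torchrun-style launch where the launcher sets LOCAL_RANK per process (the env var name is an assumption, adjust for your launcher):

```python
import os
import torch

# Each process pins itself to its own GPU before doing any CUDA work.
local_rank = int(os.environ["LOCAL_RANK"])  # assumed to be set by the launcher
torch.cuda.set_device(local_rank)

# ... training step ...

# If you free the caching allocator, do it under this process's own device
# context so the call cannot create a CUDA context on device 0.
with torch.cuda.device(local_rank):
    torch.cuda.empty_cache()
```

Alternatively, export CUDA_VISIBLE_DEVICES so each process only sees its own GPU; then device 0 inside the process is always the right one.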
