In DistributedDataParallel training, n-1 additional processes allocate memory on the first GPU, where n is the number of GPUs
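
A common cause of this symptom is that every rank ends up creating a CUDA context, or loading checkpoint tensors, on `cuda:0` instead of on its own device. The sketch below is a minimal illustration of how to avoid that by binding each process to its local GPU before any CUDA work and remapping checkpoint loads; it assumes a `torchrun`-style launch that sets `LOCAL_RANK`, and the model and checkpoint names are placeholders, not part of the original report.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Bind this process to its own GPU *before* any CUDA work; otherwise its
    # CUDA context (and later tensors) can silently land on cuda:0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # When restoring a checkpoint, remap storages onto this rank's GPU.
    # Without map_location, tensors are loaded onto the device they were
    # saved from (often cuda:0), another source of the extra memory.
    # checkpoint = torch.load("ckpt.pt", map_location=f"cuda:{local_rank}")
    # model.module.load_state_dict(checkpoint["model"])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```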