I am training a classification model on 4 GPUs. I see 3 extra processes running on GPU 0, so it won't fit a batch size larger than 2, while the others can still accept a batch size of 4. I don't understand what the 3 extra processes are, and is manually setting the batch size for each GPU correct?
It seems you might have created multiple CUDA contexts on the default device (GPU 0). Are you launching the script via torchrun? If so, make sure each process sets its proper device, e.g. via torch.cuda.set_device.
You can also set CUDA_VISIBLE_DEVICES so that each process only sees one device and won't unintentionally create a CUDA context on cuda:0.
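A minimal sketch of the first approach, assuming a torchrun launch (torchrun exports a LOCAL_RANK environment variable per worker); the `local_device` helper name is mine, not a PyTorch API:

```python
import os

def local_device(env=None):
    # torchrun sets LOCAL_RANK for each worker; fall back to 0 for
    # single-process runs so the same script works in both modes
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# In the training script, call this before any CUDA op or
# init_process_group, so no process touches cuda:0 by accident:
#   device = local_device()
#   torch.cuda.set_device(device)  # pin this process to its own GPU
#   model = model.to(device)

print(local_device({"LOCAL_RANK": "3"}))  # → cuda:3
```

The key point is ordering: the device must be pinned before the first CUDA call, otherwise the default context on cuda:0 is already created and the extra memory usage appears there.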