Suppose I have two GPUs, GPU-0 and GPU-1 (both of the same type), and I want to train a simple classification network (e.g. a ResNet) on them. For some special reasons, I would like GPU-0 to take on more of the memory load.
For example, with the batch size set to 64, I would like about 40 samples to be allocated to GPU-0 and the remaining 24 to GPU-1.
I am guessing this cannot be done via nn.DataParallel or nn.DistributedDataParallel, right? To do it, I think I would need to copy the model and data to GPU-0 and GPU-1 manually, then merge the computed losses, as in the sketch below.
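To make it concrete, here is a rough sketch of the manual approach I have in mind. I'm not at all sure it is correct or efficient; the resnet18, num_classes=10, the SGD settings, and the hard-coded 40/24 split are just placeholders for my real setup:

```python
import torch
import torch.nn as nn
import torchvision.models as models

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

# Two replicas of the same network, one per GPU, started from identical weights.
model0 = models.resnet18(num_classes=10).to(dev0)
model1 = models.resnet18(num_classes=10).to(dev1)
model1.load_state_dict(model0.state_dict())

# reduction="sum" so that shards of unequal size combine correctly.
criterion = nn.CrossEntropyLoss(reduction="sum")
optimizer = torch.optim.SGD(model0.parameters(), lr=0.1)

def train_step(images, labels):
    """images: (64, 3, 224, 224) and labels: (64,), both on the CPU."""
    # Uneven split: 40 samples to GPU-0, 24 to GPU-1.
    x0, y0 = images[:40].to(dev0), labels[:40].to(dev0)
    x1, y1 = images[40:].to(dev1), labels[40:].to(dev1)

    loss0 = criterion(model0(x0), y0)
    loss1 = criterion(model1(x1), y1)

    optimizer.zero_grad()
    loss0.backward()
    loss1.backward()

    # Merge gradients: add replica 1's grads into replica 0,
    # then average over the full batch of 64.
    with torch.no_grad():
        for p0, p1 in zip(model0.parameters(), model1.parameters()):
            p0.grad += p1.grad.to(dev0)
            p0.grad /= images.size(0)
            p1.grad = None

    optimizer.step()
    # Re-sync the second replica (this also overwrites its BatchNorm
    # running stats with replica 0's, which only saw 40 of the samples).
    model1.load_state_dict(model0.state_dict())
    return (loss0.item() + loss1.item()) / images.size(0)

# e.g.: train_step(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))
```

I suspect copying the whole state_dict back to GPU-1 on every step is wasteful, which is why I would prefer a built-in way if one exists.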
I am pretty unfamiliar with distributed training in PyTorch and have failed to find a proper tutorial. A related question was raised here, but its objective is quite different.
Could anyone illustrate this with an example? Thanks in advance.