Model parallelism in pytorch for large(r than 1 GPU) models?

Hi @apaszke, many thanks for your examples.

I noticed that when I split the whole model across 4 GPUs and run forward/backward, the first GPU uses much more memory than it should. For example, if the whole model uses 12GB on a single GPU, after splitting it across four GPUs the first GPU uses 11GB while the other three combined use about 11GB.
Is there an explanation of how GPU memory is allocated when using multiple GPUs for model parallelism?
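For context, my split looks roughly like this (a minimal sketch, not my actual model; the stage names and sizes are made up, and it falls back to CPU when a GPU is unavailable so it stays runnable):

```python
import torch
import torch.nn as nn

def pick(i):
    # use GPU i when it exists, otherwise fall back to CPU
    return torch.device(f"cuda:{i}") if torch.cuda.device_count() > i else torch.device("cpu")

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # one stage per device; the real layers are much larger
        self.stage0 = nn.Linear(256, 256).to(pick(0))
        self.stage1 = nn.Linear(256, 256).to(pick(1))
        self.stage2 = nn.Linear(256, 256).to(pick(2))
        self.stage3 = nn.Linear(256, 10).to(pick(3))

    def forward(self, x):
        # move the activation to each stage's device before running it
        x = self.stage0(x.to(pick(0)))
        x = self.stage1(x.to(pick(1)))
        x = self.stage2(x.to(pick(2)))
        return self.stage3(x.to(pick(3)))

model = SplitModel()
out = model(torch.randn(8, 256))
print(out.shape)
```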

Another question: during the forward pass with model parallelism, only one GPU at a time shows 100% Volatile GPU-Util while the others sit at 0%.
Is there any way to keep all four GPUs busy?
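For what it's worth, I wondered whether splitting each batch into micro-batches (GPipe-style) would help, so that the stages can overlap: since CUDA kernel launches are asynchronous, stage 0 could start on the next micro-batch while stage 1 works on the previous one. A rough sketch of what I mean (hypothetical stage sizes, CPU fallback so it runs anywhere):

```python
import torch
import torch.nn as nn

def dev(i):
    # fall back to CPU when GPU i is unavailable
    return torch.device(f"cuda:{i}") if torch.cuda.device_count() > i else torch.device("cpu")

stages = [nn.Linear(64, 64).to(dev(i)) for i in range(4)]

def forward_full(x):
    # plain sequential forward: only one device is active at a time
    for i, stage in enumerate(stages):
        x = stage(x.to(dev(i)))
    return x

def forward_pipelined(x, chunks=4):
    # feed micro-batches through the stages; because CUDA launches are
    # asynchronous, device i can start micro-batch k+1 while device i+1
    # is still busy with micro-batch k
    return torch.cat([forward_full(mb) for mb in x.chunk(chunks)])

x = torch.randn(16, 64)
# the pipelined result matches the full-batch result
print(torch.allclose(forward_full(x), forward_pipelined(x), atol=1e-6))
```

I'm not sure whether this is the intended way to overlap the stages, so any pointers would be appreciated.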
