https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
This page explains how we can split a big nn model across multiple GPUs. But I don't understand one thing: when an intermediate training result needs to be transferred from one GPU to another (by calling `.to('cuda:1')`), does the PyTorch runtime move the data to CPU memory first and then to the other GPU's memory, or does PyTorch use some direct GPU-to-GPU transfer technique like SLI?
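To make the question concrete, here is a minimal sketch in the spirit of that tutorial (the `ToyModelParallel` class, layer sizes, and batch size are just illustrative assumptions, and it presumes two GPUs `cuda:0` and `cuda:1` are available). The `x.to('cuda:1')` line in `forward` is the transfer I'm asking about:

```python
import torch
import torch.nn as nn

class ToyModelParallel(nn.Module):
    """Toy model split across two GPUs, following the tutorial's pattern."""
    def __init__(self):
        super().__init__()
        # First half of the model on cuda:0, second half on cuda:1.
        self.seq1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to('cuda:0')
        self.seq2 = nn.Sequential(nn.Linear(512, 10)).to('cuda:1')

    def forward(self, x):
        x = self.seq1(x.to('cuda:0'))
        # The intermediate activation is moved between GPUs here.
        x = x.to('cuda:1')
        return self.seq2(x)

model = ToyModelParallel()
out = model(torch.randn(32, 1024))  # output tensor lives on cuda:1
```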
@mrshenli Thanks! This really helps.
I have another question: does the data copy between GPUs in the DataParallel module use the same kernel?
Sorry I forgot to ask this before.