Parallelising models with variable size inputs across multiple GPUs

CompRhys · March 21, 2020, 7:33pm

I have written a module for small graphs that has a custom collate_batch function so that I have batches of 128 graphs at a time however the graphs are different sizes and so we have a variable number of nodes. I am using torch_scatter for some operations.

My aim is to parallelise my code over multiple GPUs to reduce the real time for training. Naively using nn.DataParallel(model) gives me an error as I believe it requires the sub-tensors it splits up the tensor into to all be the same size which is not the case here as while each GPU gets 64 graphs the number of nodes varies. Is there a way to do this currently across multiple GPUs or do I need to just take the hit and train on one GPU?

mrshenli · March 23, 2020, 2:18am

I am inclined to think this shouldn’t cause a problem for DataParallel. Because DataParallel would replicate your model and scatter your input on the first (batch) dimension. Even though different input element (graph) can contain different node, as long as your model forward function can handle it properly, it should not hit error.

Could you please share a min repro for the error you saw?