Hi there,
sorry if this has been asked before, but I could not find anything on this particular question:
I have a batch of data `x` (e.g. Nx1x28x28) which I want to evaluate on my model `net` using `M` forward passes, but with different random numbers in each run (e.g. for dropout at test time), and average the results at the end. What is the most efficient way to parallelize the forward passes across multiple GPUs? I could do something like:
```python
gpu_count = torch.cuda.device_count()
net = nn.DataParallel(net, list(range(gpu_count)))
x = x.repeat(gpu_count, 1, 1, 1)  # one copy of the batch per GPU
x = x.cuda()  # lands on GPU 0 first
outputs = []
for _ in range(M):
    outputs.append(net(x))
mean_output = torch.stack(outputs).mean(dim=0)  # average over the M passes
```
But this increases the memory demand on GPU 0 by a factor of `gpu_count`, since the tensor `x` has to be copied to GPU 0 first and is only then scattered in chunks to the other GPUs. Can this be done more efficiently?
Also: do the individual model replicas on each GPU use different random seeds for the forward passes?
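(For context, here is a small single-device sketch of what I mean by "different random numbers": each forward pass through a dropout layer in train mode draws a fresh mask from the RNG stream. My question is whether the per-GPU replicas inside `DataParallel` sample from independent streams in the same way.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed just so the sketch is reproducible
drop = nn.Dropout(p=0.5)
drop.train()  # dropout is only active in train mode

x = torch.ones(1000)
# Two consecutive passes consume the RNG stream twice, so the two
# dropout masks are expected to differ.
out1, out2 = drop(x), drop(x)
same_mask = torch.equal(out1, out2)
```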
Thanks for your help!
EDIT: It is also probably not very efficient to scatter the repeated `x` anew for every one of the `M` forward passes. Ideally, `x` would stay resident on each GPU individually for all `M` iterations until the next batch arrives.
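To make this concrete, here is a rough, CPU-runnable sketch of what I have in mind (the model, `M`, and the batch are just placeholders for my actual setup; on real GPUs one would presumably replace the inner loop with `torch.nn.parallel.parallel_apply` so the replicas actually run concurrently):

```python
import copy
import torch
import torch.nn as nn

# Placeholder model; keep dropout active at "test time" for the MC averaging.
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.Dropout(0.5))
net.train()
M = 4
x = torch.randn(8, 1, 28, 28)  # N x 1 x 28 x 28

if torch.cuda.is_available():
    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
else:
    devices = [torch.device("cpu")]  # fallback so the sketch runs anywhere

# Copy the batch and the model to every device ONCE, up front ...
xs = [x.to(d) for d in devices]
replicas = [copy.deepcopy(net).to(d) for d in devices]

# ... then reuse those resident copies for all M iterations.
outputs = []
for _ in range(M):
    for replica, xi in zip(replicas, xs):
        outputs.append(replica(xi).cpu())

# Average over all M * len(devices) stochastic forward passes.
mean_output = torch.stack(outputs).mean(dim=0)
```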