I have a trained model and 4 GPUs available. I want to split a dataset into 4 parts (processed with the same batch size on each GPU), run inference on each part independently, and essentially add up the results I get from each GPU. I did not get much wiser from, e.g., the thread How do I run Inference in parallel?. Please note that I am only doing inference with my model; no gradient computations etc. are required.
A minimal example of what I’m trying to do is this:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import os
from torch.nn.parallel import DistributedDataParallel as DDP
torch.random.manual_seed(123)
input_dim, out_dim = 10, 1
net = nn.Linear(input_dim, out_dim) # I load my model from a saved state_dict
m = torch.cuda.device_count()
n = 5 # number of data pts sent to each GPU
x = torch.rand((n,input_dim,m)) # full data set
# without parallel processing:
s0 = 0
for i in range(m):
    s0 += net(x[:, :, i]).sum()
print('s0', s0)
# with parallel processing:
def example(rank, world_size):
    print('rank', rank)
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = net.to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # forward pass
    x_local = x[:, :, rank].to(rank)
    outputs = ddp_model(x_local)
    print(outputs.sum())  # these add up to the desired value s0
    # but how do I return these values from each GPU process and add them?
def main():
    world_size = torch.cuda.device_count()
    mp.spawn(example,
             args=(world_size,),
             nprocs=world_size,
             join=True)
if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    main()
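From skimming the torch.distributed docs, my best guess (untested) is that dist.all_reduce is the collective I need: each rank computes its partial sum, and the call adds them in place across all processes. Replacing example above with something like the following is what I have in mind; note I also dropped the DDP wrapper, since as far as I understand it is only needed to synchronize gradients during training:

# untested sketch: sum the per-rank results with dist.all_reduce
# (net, x, n, input_dim are the module-level globals from the snippet above;
# since mp.spawn re-imports the module in each child and the seed is fixed,
# every rank should see the same x)
def example(rank, world_size):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = net.to(rank)
    x_local = x[:, :, rank].to(rank)
    with torch.no_grad():  # inference only, no autograd bookkeeping
        s_local = model(x_local).sum().reshape(1).cpu()  # 1-element CPU tensor, to be safe with gloo
    # in-place sum across all ranks; afterwards every rank holds the total
    dist.all_reduce(s_local, op=dist.ReduceOp.SUM)
    if rank == 0:
        print('total', s_local.item())  # should match s0
    dist.destroy_process_group()

If only rank 0 needs the result, I assume dist.reduce(s_local, dst=0, op=dist.ReduceOp.SUM) would work just as well.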
I’ve seen examples where people do all of the data/model preparation inside the example function, but I don’t know why.
Also, I feel like there should be a cleaner way of loading the data using a DataLoader, which would work more smoothly with DDP, so pointers for this are also very welcome (a rough sketch of what I imagine is below). I also posted this question to Stack Exchange but got no replies.
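For the record, here is roughly what I imagine the DataLoader version would look like, using DistributedSampler to shard the samples across ranks. Again untested, and I believe the sampler pads the dataset so every rank gets the same number of samples, which would duplicate samples and skew the sum whenever len(dataset) is not divisible by world_size (here 20 / 4 works out evenly):

# untested sketch of the DataLoader + DistributedSampler variant
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def example(rank, world_size):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # flatten x from (n, input_dim, m) into m*n samples of size input_dim
    dataset = TensorDataset(x.permute(2, 0, 1).reshape(-1, input_dim))
    # shuffle=False: rank r gets samples r, r+world_size, r+2*world_size, ...
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=n, sampler=sampler)
    model = net.to(rank)
    s_local = torch.zeros(1)
    with torch.no_grad():
        for (batch,) in loader:
            s_local += model(batch.to(rank)).sum().cpu()
    dist.all_reduce(s_local, op=dist.ReduceOp.SUM)
    if rank == 0:
        print('total', s_local.item())  # should again match s0
    dist.destroy_process_group()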