Hi Everyone,
I am using 4 GPUs to train a model that was previously trained on a single GPU, in order to leverage data parallelism and speed up training. I have set the batch size to 8 and expected that, when training on 4 GPUs, the data would be evenly distributed among them with a per-GPU batch size of 2. But I find that all the inputs are always placed on GPU 0.
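To make that expectation concrete, here is a toy sketch (TinyNet is a made-up module, not part of my actual code) of how I assumed nn.DataParallel would chunk a batch of 8 across the 4 GPUs inside the wrapped module's forward:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # hypothetical module, only used to show what each replica receives
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # with 4 GPUs and a batch of 8, I expected each replica to get a [2, 3, 480, 640] chunk
        print("replica device:", x.device, "chunk shape:", tuple(x.shape))
        return x * self.scale

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(TinyNet()).cuda()
    out = net(torch.randn(8, 3, 480, 640).cuda())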
Code for Data Parallelism
print ("GPU Count ", torch.cuda.device_count())
models["encoder"] = ResnetEncoder(
num_layers=options.resnet_num_layers, pretrained=True)
models["depth"] = DepthDecoder(
models["encoder"].num_ch_enc, scales=range(args.numScales), use_bn=args.use_bn)
if torch.cuda.device_count() > 1:
models["encoder"] = torch.nn.DataParallel(models["encoder"])
models["decoder"] = torch.nn.DataParallel(models["depth"])
models["encoder"].cuda()
models["depth"].cuda()
Data Loading Phase
for epoch in range(options.numEpochs):
    data_iterator = tqdm(dataloader)
    optimizer.zero_grad()
    for sampleIndex, inputs in enumerate(data_iterator):
        for k in inputs:
            print(inputs[k].size())
        # predict depth maps, for now only applied to the first frame
        outputs = {}
        for seq_index in range(options.numFrames):
            # images are supposed to be normalized to [0, 1] in the dataloader
            inputs['image', seq_index] = inputs['image', seq_index].cuda()
            # .cuda() is not in-place, so the result is assigned back
            inputs['extrinsic', seq_index] = inputs['extrinsic', seq_index].cuda()
            print(inputs['image', seq_index].device)
The inputs object is a dictionary whose values are tensors with the following sizes:
torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 6])
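All of these tensors live in a single dictionary keyed by tuples like ('image', seq_index). One way to move everything to the GPU in one step would be a small helper like the sketch below (to_device is a hypothetical name, not something from my code):

import torch

def to_device(inputs, device="cuda"):
    # Tensor.to()/Tensor.cuda() are not in-place, so the moved tensors must be assigned back
    return {k: (v.to(device) if torch.is_tensor(v) else v)
            for k, v in inputs.items()}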
The print statement print(inputs['image', seq_index].device) always prints cuda:0. I was expecting it to sometimes print cuda:1, cuda:2, or cuda:3 as well, but it looks like the data is not being split across all the GPUs. Am I missing something here?