Factors to help assess efficiency of parallelization

Processing 1K batches of data with torch.nn.DataParallel on 8 GPUs took 700+ seconds, but the same job on a single GPU took only 400+ seconds.
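
To put a number on this, here is a rough sketch of the speedup and parallel efficiency those timings imply. The 400 s and 700 s inputs are lower bounds taken from the approximate timings above, not exact measurements, and the helper name `parallel_efficiency` is just illustrative.

```python
from typing import Tuple


def parallel_efficiency(t_single: float, t_parallel: float, n_devices: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) for a job timed on 1 device and on n_devices.

    speedup    = t_single / t_parallel   (ideal: n_devices)
    efficiency = speedup / n_devices     (ideal: 1.0)
    """
    if t_single <= 0 or t_parallel <= 0 or n_devices < 1:
        raise ValueError("timings must be positive and n_devices >= 1")
    speedup = t_single / t_parallel
    efficiency = speedup / n_devices
    return speedup, efficiency


# Assumed inputs: ~400 s on 1 GPU, ~700 s on 8 GPUs (rounded from the "400+" / "700+" timings).
speedup, efficiency = parallel_efficiency(400.0, 700.0, 8)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
# prints "speedup: 0.57x, efficiency: 7.1%"
```

A speedup below 1x means the 8-GPU run is slower than the single-GPU run, so the overhead outweighs whatever compute time is saved by splitting the batches.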

I can imagine that parallelization adds bookkeeping and processing overhead. Could you share any guidelines or lessons learned on the factors that affect parallelization efficiency?

Thanks.